`matches` DataFrame

Each row in the matches dataframe corresponds to a single citation match – a specific location in a syllabus where the matching procedure identified the presence of a work from the OS catalog. At a structural level, this is essentially a many-to-many join table, associating syllabi with catalog records. Each row also includes metadata about the position and context of the match inside the syllabus.

The match objects in this table are the result of a two-step citation extraction process. We first run a simple keyword matching procedure that searches each syllabus for places where the tokens from the title and author fields of a catalog record appear in close proximity (within ~10 tokens). This produces a set of candidates with high recall but comparatively low precision: many false positives also make the cut, for example sentences that happen to contain both “politics” and “Aristotle” but where the Politics itself isn’t being assigned. We then apply a neural validation model to these candidate matches, which extracts features from the characters and tokens in the document contexts around the raw keyword matches and predicts whether or not the match is a legitimate reference to a text.
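The first step (keyword matching by proximity) can be illustrated with a minimal sketch. This is not the OS implementation: the function name `find_candidates`, the `max_gap` parameter, and the choice to treat title and author tokens as contiguous runs are all assumptions made for illustration.

```python
def find_candidates(doc_tokens, t_tokens, a_tokens, max_gap=10):
    """Return (title_start, author_start) index pairs where the title and
    author token runs occur within max_gap tokens of each other.

    Hypothetical helper: assumes tokens are already normalized and that
    each field must appear as a contiguous run in the document.
    """
    def starts(needle):
        # All positions where the needle run begins in the document.
        n = len(needle)
        return [i for i in range(len(doc_tokens) - n + 1)
                if doc_tokens[i:i + n] == needle]

    candidates = []
    for t in starts(t_tokens):
        for a in starts(a_tokens):
            # Number of tokens between the end of the earlier span
            # and the start of the later one.
            if t <= a:
                gap = a - (t + len(t_tokens))
            else:
                gap = t - (a + len(a_tokens))
            if 0 <= gap <= max_gap:
                candidates.append((t, a))
    return candidates


doc = "this week we read aristotle politics book one".split()
candidates = find_candidates(doc, ["politics"], ["aristotle"])
```

A sentence such as "the politics of the era owed little to aristotle" would pass this check as well, which is why a second validation step is needed.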

As of version 2.1 of the dataset, this model gets to ~90% accuracy on a held-out test set, meaning that most of these matches are valid, though some false-positives remain.

id

The OS-assigned unique identifier of the match.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

matcher_id

An identifier for the unique “signature” that produced the match, based on the title and author tokens (t_tokens, a_tokens).

doc_id

The id of the syllabus that contains the match.

work_id

An identifier that clusters together different editions / variants of the same “work”, as determined by OS. This corresponds to the work_id column of the Catalog dataframe and can be used to join the two.
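As a rough sketch of how work_id might be used to attach catalog metadata to matches with pandas: the toy data and the `title` column below are assumptions, and the deduplication step assumes the Catalog can hold several records (editions) per work_id.

```python
import pandas as pd

# Toy stand-ins for the real dataframes; values are invented.
matches = pd.DataFrame({
    "id": [1, 2, 3],
    "doc_id": [10, 10, 11],
    "work_id": [100, 101, 100],
})
catalog = pd.DataFrame({
    "work_id": [100, 100, 101],  # two editions share work_id 100
    "title": ["Politics", "The Politics", "Republic"],  # hypothetical column
})

# Keep one representative record per work so the join does not
# multiply match rows.
works = catalog.drop_duplicates(subset="work_id", keep="first")

# validate raises if the right side still has duplicate work_id values.
joined = matches.merge(works, on="work_id", how="left",
                       validate="many_to_one")
```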

doc_work_order

A ranking over matches of a work_id found in the same syllabus.

The same work can appear in multiple locations in a single syllabus. This field contains a 1-N ranking of these duplicate matches within each document, where 1 is the match closest to the top of the syllabus and higher values are further down.

To select a set of unique (syllabus, work) pairs from the Matches dataframe, filter to rows where doc_work_order is equal to 1.
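In pandas, that filter might look like the following sketch. The toy rows are invented, and the follow-on count of syllabi per work is only one example of what the deduplicated pairs could be used for.

```python
import pandas as pd

# Toy matches table; work 100 is matched twice in syllabus 10.
matches = pd.DataFrame({
    "doc_id": [10, 10, 10, 11],
    "work_id": [100, 100, 101, 100],
    "doc_work_order": [1, 2, 1, 1],
})

# One row per (syllabus, work) pair: the first occurrence in each document.
unique_pairs = matches[matches["doc_work_order"] == 1]

# Example use: number of distinct syllabi in which each work appears.
assignment_counts = unique_pairs.groupby("work_id")["doc_id"].nunique()
```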

t_tokens

Normalized tokens representing the title.

a_tokens

Normalized tokens representing the author surname.

logp

The joint log probability of all unigrams in the matcher signature from both the title and author fields, based on third-party frequency data from the wordfreq package. This value can be interpreted as a crude signal for the lexical “specificity” or “focus” of a text – and by extension, the likelihood that a raw keyword match on the title and author tokens will be a real reference to the work, and not just an incidental co-occurrence of the tokens.

The lower the value, the more lexically “focused” the title + author pair. The value is pushed lower by more tokens in the title + author fields, and by more infrequent tokens.
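To make the behavior of this score concrete, here is a minimal sketch of a joint log probability over unigrams. It does not call wordfreq; the frequency table is invented, and the use of natural logarithms, the independence assumption (summing per-token logs), and the `floor` value for unseen tokens are all assumptions rather than details confirmed by OS.

```python
import math

# Invented unigram probabilities, for illustration only.
TOY_FREQ = {"politics": 1e-4, "aristotle": 1e-6, "the": 5e-2}


def joint_logp(t_tokens, a_tokens, freq=TOY_FREQ, floor=1e-9):
    """Sum of log unigram probabilities over title and author tokens.

    Hypothetical helper: tokens missing from the table fall back to
    `floor` so that the logarithm is always defined.
    """
    tokens = list(t_tokens) + list(a_tokens)
    return sum(math.log(freq.get(tok, floor)) for tok in tokens)


base = joint_logp(["politics"], ["aristotle"])
longer = joint_logp(["the", "politics"], ["aristotle"])
```

Because every probability is below 1, each additional token contributes a negative term, and rarer tokens contribute larger negative terms, which matches the two effects described above.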

pvalid

A number in the range [0, 1] that represents the probability that the match is an actual text assignment, as estimated by a validation model trained on the characters and tokens in the document contexts before, between, and after the raw keyword matches from the title and author fields. This validation step is necessary to remove false-positive matches: places where the title and author of a text appear in close proximity, but where the text isn’t being cited. An example is a sentence where “politics” and “Aristotle” appear in the context of regular prose, but where Aristotle’s Politics isn’t being assigned or cited.

OS’s validation model is trained on a set of ~12k hand-labeled matches, sampled uniformly between logp -45 and -15, and achieves an accuracy of ~90%. Experimentally, we find that ~100% of matches with logp < -45 are valid, and ~100% with logp > -15 are invalid, so we focus the validation model on the interval between these values. After applying the model, we keep only matches where pvalid > 0.5, or where logp < -45.
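The retention rule in the last sentence can be written as a boolean mask. This is a sketch on invented candidate rows; it assumes every candidate carries both a logp and a pvalid value, which the text above does not confirm for matches outside the -45 to -15 interval.

```python
import pandas as pd

# Invented candidate matches.
candidates = pd.DataFrame({
    "logp":   [-50.0, -30.0, -30.0, -10.0],
    "pvalid": [0.20, 0.90, 0.30, 0.10],
})

# Keep a match if the model accepts it, or if the signature is specific
# enough (logp < -45) to be treated as valid regardless of the model.
keep = (candidates["pvalid"] > 0.5) | (candidates["logp"] < -45)
kept = candidates[keep]
```

Since the published table has already been filtered this way, a reader who wants higher precision could apply a stricter pvalid threshold of their own on top of it.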

snippet

The raw character spans corresponding to the title and author strings in the document, as well as the left, middle, and right contexts. These strings are used as the input to a validation model (LSTM over GloVe embeddings) that discards false-positive matches.

Note

This field is only available in full-text versions of the dataset.

left

The 200 characters to the left of the first match.

m1

The first match.

middle

The span between the first and second matches.

m2

The second match.

right

The 200 characters to the right of the second match.

title

The text of the title field. This is the same as either m1 or m2, depending on whether the title appeared first or second.

author

The text of the author field. This is the same as either m1 or m2, depending on whether the author appeared first or second.
