catalog DataFrame

The catalog dataframe contains bibliographic records for books and articles that are assigned in the syllabus corpus. Each record in this dataset represents a single expression of a resource and is derived from records in a number of input datasets. Currently OS aggregates resources from:

Under the hood, these source catalogs add up to a “known universe” of roughly 150M books and articles. During the citation extraction process, references to these resources are identified in the syllabi, and the final distribution data contains just the subset of books and articles that appear at least once in the corpus.


When working with this data, it’s important to remember that each row in this dataframe is a single bibliographic “expression,” which means that there can be multiple records that correspond to the same top-level “work.” In many cases, there will be just a single expression record for a given book or article, especially if the work is relatively recent and hasn’t been reprinted multiple times. For very well-known or canonical works, however, there are often multiple expression records in the dataset. For example, we might end up with 20-30 individual records that describe different editions of The Odyssey (or different copies of the same edition in different bibliographic databases). These individual expression records are grouped together into “work clusters” by the work_id field, which identifies the top-level entity that citations are matched against during the citation extraction process.

Because of this, when working with the citation graph, it generally makes sense to operate at the level of unique work_id values, not individual expression records. The simplest approach would be to keep just a single catalog record per work_id. In Spark, something like:
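As a minimal sketch of that deduplication, the logic is shown here in plain Python rather than Spark so that it runs without a cluster. The records are hypothetical dicts, and the `id` and `title` keys are illustrative assumptions, not confirmed column names; only `work_id` comes from the description above.

```python
def one_record_per_work(records):
    """Keep the first record seen for each work_id, preserving input order."""
    first_by_work = {}
    for record in records:
        # setdefault stores the record only if this work_id is not present yet.
        first_by_work.setdefault(record["work_id"], record)
    return list(first_by_work.values())


# Hypothetical catalog rows: two expressions of one work, plus a singleton.
catalog = [
    {"id": 1, "work_id": 10, "title": "The Odyssey"},
    {"id": 2, "work_id": 10, "title": "The Odyssey"},
    {"id": 3, "work_id": 11, "title": "War and Peace"},
]

unique_works = one_record_per_work(catalog)
print([r["id"] for r in unique_works])
```

In PySpark, `DataFrame.dropDuplicates(["work_id"])` performs a comparable deduplication, though which row survives for each work_id is not guaranteed.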


Or, for more advanced use cases, it might make sense to derive aggregated metadata from the individual expression records in each work cluster. In the future, we plan to change the structure of this dataframe to provide pre-aggregated work clusters.
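One possible shape for such an aggregation is sketched below in plain Python. The `year` and `publisher` keys and the choice of summary values (earliest year, distinct publishers) are assumptions made for illustration; they are not the pre-aggregated structure OS plans to ship.

```python
from collections import defaultdict


def aggregate_works(records):
    """Collapse expression records into one summary dict per work_id."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record["work_id"]].append(record)

    summaries = {}
    for work_id, members in clusters.items():
        years = [m["year"] for m in members if m.get("year") is not None]
        summaries[work_id] = {
            "work_id": work_id,
            "expression_count": len(members),
            # Earliest known publication year in the cluster, or None.
            "first_year": min(years) if years else None,
            # Distinct non-null publishers, sorted for stable output.
            "publishers": sorted(
                {m["publisher"] for m in members if m.get("publisher")}
            ),
        }
    return summaries


# Hypothetical rows with placeholder values.
catalog = [
    {"work_id": 10, "year": 2005, "publisher": "Publisher A"},
    {"work_id": 10, "year": 1990, "publisher": None},
    {"work_id": 11, "year": None, "publisher": "Publisher B"},
]

summaries = aggregate_works(catalog)
print(summaries[10])
```

Null values are skipped rather than treated as data, which matches the "Null if unknown" convention used for most fields in this dataframe.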


The OS-assigned unique identifier of the catalog record.


This unique identifier is not guaranteed to be consistent across versions of the dataset.


A slug that represents the source bibliographic database that the record was extracted from. Examples include viaf, doab, and gutenberg.


The original identifier of the bibliographic record in the source catalog. For example, War and Peace is book 2600 in the Project Gutenberg catalog, so the value here is 2600.


An identifier that clusters records that represent different copies or editions of the same work, as identified by OS.

For example, if we have 100 different editions of The Iliad in the catalog, each of these 100 records is assigned a common work_id, making it possible to operate on the group of records as a unit. If we only have a single bibliographic record for a given publication (the case for most resources published in the last ~10 years), the work_id will be unique to that record.

To operate on unique works, select a single record for each distinct work_id in the catalog.


The title of the work.


The subtitle of the work. Null if unknown.


The publisher of the work. Null if unknown.


A list of authors of the work.


The given name of the author (includes both first and middle names).


The surname of the author (if a person) or an organization name. Used by the citation matcher as the minimal lexical representation of the author.


The year the work was published. Null if unknown.


A list of any ISBNs associated with the work.


A list of any ISSNs associated with the work.


The DOI of the work. Null if unknown.


A list of URLs associated with the work – either a link to it or a link to information about it.


Whether or not the work is an open access work. Null if unknown.


The publication type of the work. The value is one of ‘book’, ‘book-chapter’, ‘article’, or ‘report’. Null if unknown.


Metadata specific to journal articles, book chapters, or other resources that are published as part of a larger “container”. Null if the work is a standalone publication.


The name of the “container” in which the article was published. Generally a journal name, conference name, or the title of an edited volume.


A journal volume number. Null if unknown or not applicable.


A journal issue number. Null if unknown or not applicable.


The first page of the article. Null if unknown or not applicable.


The last page of the article. Null if unknown or not applicable.


The full-text abstract. Null if unknown or unavailable.


The number of other records in the catalog that share a work_id with this record.
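To make the definition concrete, the sketch below recomputes this count from a list of records, assuming the value is simply the size of the record's work cluster minus one. That reading follows from the description above, but the way OS actually derives the field is not documented here.

```python
from collections import Counter


def count_other_expressions(records):
    """For each record, count the other records that share its work_id."""
    cluster_sizes = Counter(r["work_id"] for r in records)
    # Subtract one so that a record does not count itself.
    return [cluster_sizes[r["work_id"]] - 1 for r in records]


# Hypothetical rows: a three-record cluster and a singleton.
catalog = [
    {"work_id": 10},
    {"work_id": 10},
    {"work_id": 10},
    {"work_id": 11},
]

print(count_other_expressions(catalog))
```

Under this reading, a value of 0 means the record is the only expression of its work in the catalog.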


The total number of times that the record’s work cluster (identified by work_id) appeared in the syllabus corpus.


This number is an aggregate count based on OS’s internal corpus, which includes syllabi not represented in this dataset.


The set of ISBNs associated with all records that share a work_id with this record.


The set of ISSNs associated with all records that share a work_id with this record.


The set of DOIs associated with all records that share a work_id with this record.
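The three cluster-level identifier sets above can be thought of as a union over every record in a work cluster. The sketch below shows that union in plain Python; the `isbns` and `doi` keys and the placeholder identifier strings are hypothetical, and the helper accepts both list-valued fields (ISBNs, ISSNs) and scalar fields (DOI).

```python
from collections import defaultdict


def cluster_identifier_sets(records, field):
    """Union an identifier field across all records in each work cluster."""
    by_work = defaultdict(set)
    for record in records:
        value = record.get(field)
        if value is None:
            values = []          # null or missing: contributes nothing
        elif isinstance(value, (list, tuple, set)):
            values = value       # list-valued field, e.g. ISBNs
        else:
            values = [value]     # scalar field, e.g. a single DOI
        # Indexing by_work creates an empty set even when values is empty,
        # so every work_id appears in the result.
        by_work[record["work_id"]].update(values)
    return dict(by_work)


# Hypothetical rows with placeholder identifiers.
catalog = [
    {"work_id": 10, "isbns": ["isbn-a", "isbn-b"], "doi": None},
    {"work_id": 10, "isbns": ["isbn-b", "isbn-c"], "doi": "doi-x"},
    {"work_id": 11, "isbns": [], "doi": "doi-y"},
]

isbn_sets = cluster_identifier_sets(catalog, "isbns")
doi_sets = cluster_identifier_sets(catalog, "doi")
print(sorted(isbn_sets[10]), sorted(doi_sets[10]))
```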