.. _catalog:

:code:`catalog` DataFrame
=========================

The :code:`catalog` dataframe contains bibliographic records for books and articles that are assigned in the syllabus corpus. Each record in this dataset represents a single expression of a resource, derived from records in a number of input datasets. Currently OS aggregates resources from:

* `Crossref `_
* `The Library of Congress `_
* `Open Library `_
* `VIAF `_
* `Project Gutenberg `_
* `Directory of Open Access Books `_
* `Open Textbook Library `_
* `arXiv `_

Under the hood, these source catalogs add up to a "known universe" of ~150M books and articles. During the citation extraction process, references to these resources are identified in the syllabi, and the final distribution data contains just the subset of books and articles that appear at least once in the corpus.

.. note::

    When working with this data, it's important to remember that each row in this dataframe is a single bibliographic "expression," which means that there can be multiple records that correspond to the same top-level "work." In many cases, there will be just a single expression record for a given book or article, especially if the work is relatively recent and hasn't been reprinted multiple times. But for very well-known or canonical works, there are often multiple expression records in the dataset. For example, we might end up with 20-30 individual records that describe different editions of *The Odyssey* (or different copies of the same edition in different bibliographic databases).

    These individual expression records are grouped into "work clusters" by the :code:`work_id` field, which represents the top-level entities that are matched against during the citation extraction process. Because of this, when working with the citation graph, it generally makes sense to operate at the level of unique :code:`work_id` values, not individual expression records.
The simplest approach is to keep just a single catalog record per :code:`work_id`. In Spark, something like:

.. code-block:: python

    df.dropDuplicates(['work_id'])

Or, for more advanced use cases, it might make sense to select aggregated metadata from the individual expression records. In the future, we plan to change the structure of this dataframe to provide pre-aggregated work clusters.

id
**

The OS-assigned unique identifier of the catalog record.

.. note::

    This unique identifier is not guaranteed to be consistent across versions of the dataset.

source
******

A slug that identifies the source bibliographic database that the record was extracted from. For example: :code:`viaf`, :code:`doab`, :code:`gutenberg`.

source_id
*********

The original identifier of the bibliographic record in the source catalog. For example, `War and Peace `_ is book 2600 in the Project Gutenberg catalog, so the value here is :code:`2600`.

work_id
*******

An identifier that clusters records representing different copies or editions of the same work, as identified by OS. For example, if we have 100 different editions of *The Iliad* in the catalog, each of these 100 records is assigned a common `work_id`_, making it possible to operate on the group of records as a unit. If we only have a single bibliographic record for a given publication (generally the case for most resources published in the last ~10 years), the `work_id`_ will be unique to that record.

To operate on unique works, select a single record for each distinct `work_id`_ in the catalog.

title
*****

The title of the work.

subtitle
********

The subtitle of the work. Null if unknown.

publisher
*********

The publisher of the work. Null if unknown.

authors
*******

A list of authors of the work.

forenames
---------

The given name of the author (includes both first and middle names).

keyname
-------

The surname of the author (if a person) or an organization name.
Used by the citation matcher as the minimal lexical representation of the author.

year
****

The year the work was published. Null if unknown.

isbns
*****

A list of any ISBNs associated with the work.

issns
*****

A list of any ISSNs associated with the work.

doi
***

The DOI of the work. Null if unknown.

urls
****

A list of URLs associated with the work, either links to the work itself or links to information about it.

open_access
***********

Whether or not the work is open access. Null if unknown.

publication_type
****************

The type of publication. One of :code:`book`, :code:`book-chapter`, :code:`article`, or :code:`report`. Null if unknown.

article
*******

Metadata specific to journal articles, book chapters, or other resources published as part of a larger "container." Null if the work is a standalone publication.

venue
-----

The name of the "container" in which the article was published. Generally a journal name, conference name, or the title of an edited volume.

volume
------

A journal volume number. Null if unknown or not applicable.

issue
-----

A journal issue number. Null if unknown or not applicable.

page_start
----------

The first page of the article. Null if unknown or not applicable.

page_end
--------

The last page of the article. Null if unknown or not applicable.

abstract
--------

The full-text abstract. Null if unknown or unavailable.

work_cluster_size
*****************

The number of other records in the catalog that share a `work_id`_ with this record.

work_match_count
****************

The total number of times that the record's work cluster (identified by `work_id`_) appeared in the syllabus corpus.

.. note::

    This number is an aggregate count based on OS's internal corpus, which includes syllabi not represented in this dataset.

work_isbns
**********

The set of ISBNs associated with all records that share a `work_id`_ with this record.
work_issns
**********

The set of ISSNs associated with all records that share a `work_id`_ with this record.

work_dois
*********

The set of DOIs associated with all records that share a `work_id`_ with this record.
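To make the expression-vs-work distinction above concrete, here is a minimal, Spark-free sketch of collapsing expression records into one row per work cluster and merging their ISBNs. The records, field values, and the :code:`collapse_to_works` helper are all invented for illustration; against the real dataframe you would express the same logic with a Spark :code:`groupBy` on :code:`work_id`.

.. code-block:: python

    from collections import defaultdict

    # Hypothetical expression records. The field names mirror the catalog
    # schema, but the values are made up for this example.
    records = [
        {"id": 1, "work_id": "w1", "title": "The Odyssey", "isbns": ["111"]},
        {"id": 2, "work_id": "w1", "title": "The Odyssey", "isbns": ["222"]},
        {"id": 3, "work_id": "w2", "title": "Recent Monograph", "isbns": ["333"]},
    ]

    def collapse_to_works(records):
        """Group expression records by work_id into one row per work."""
        clusters = defaultdict(list)
        for rec in records:
            clusters[rec["work_id"]].append(rec)
        works = []
        for work_id, exprs in clusters.items():
            works.append({
                "work_id": work_id,
                # Keep metadata from an arbitrary representative expression.
                "title": exprs[0]["title"],
                # Union the ISBNs across the cluster, analogous to work_isbns.
                "isbns": sorted({i for e in exprs for i in e["isbns"]}),
                # Number of expression records in this cluster.
                "n_expressions": len(exprs),
            })
        return works

    works = collapse_to_works(records)

Here the two *Odyssey* expressions collapse into a single work row carrying both ISBNs, while the single-expression monograph passes through unchanged.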