Changelog
=========

Version 2.8
***********

* Add ~5M new syllabi.
* Add a pipeline for mining syllabi out of the CommonCrawl dataset. For 2.8, we applied the syllabus classifier to ~50% of the WET (extracted text) records in CommonCrawl as of March 2022, as well as 100% of the WET records from major English-language educational TLDs (:code:`edu`, :code:`ac.uk`, :code:`edu.au`, :code:`ca`). This yielded ~3M new documents in the final dataset.
* Add a new :code:`school_name` field to the top-level span extraction model.
* Fix a bug in the Spark pipeline from 2.7 -- the document URLs weren't getting passed into the document parser, which caused the institution match rate to fall significantly, since the document URL is often the basis for the match.
* Refactor and generally improve the institution matcher; add matching based on the new extracted :code:`school_name` field. Combined with the bug fix, this boosted the institution match rate to ~96%, a ~5% improvement over 2.7.
* Expand the underlying institution metadata -- add :code:`wikipedia_url`, :code:`description`, and :code:`image_url`.
* In the bibliographic database that backs the citation matcher, remove the :code:`open_access=True` flag on records from Project Gutenberg. Though these books are indeed free to read, they aren't OA-licensed in the narrow sense of the concept -- they're out of copyright and in the public domain, as opposed to works that are in copyright but published under open licenses. In the future, we might model this distinction explicitly in the data.
* Expand the training dataset for the syllabus classifier (~18k new documents), driven by the need for a very high-precision model when probing the CommonCrawl dataset.
* Expand the training dataset for the top-level span extractor -- ~5k new labeled documents for :code:`code`, :code:`title`, :code:`date`, :code:`description`, and :code:`section`; ~6k labeled documents for the new :code:`school_name` field.
* Simplify the structure of the :code:`urls` field in the parser output. Instead of exposing separate fields for :code:`hyperlinks`, :code:`text_urls`, and :code:`combined_urls`, just provide a single :code:`urls` field with a flat list of normalized URLs.
* Improve the near-duplicate clustering in the Spark pipeline. Namely, we now compute LSH buckets for each of the major full-text fields from the parser individually, instead of using the full document text. This gives a cleaner fingerprint for the document, since the extracted fields home in better on the relevant document content and skip over boilerplate / navigation chrome. (A sketch of the idea follows below.)
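As an illustrative sketch -- not the production pipeline code; the :code:`datasketch` library and the field choices here are assumptions -- per-field LSH bucketing might look like this:

.. code-block:: python

    from datasketch import MinHash, MinHashLSH

    # Fingerprint each extracted field separately, so boilerplate outside
    # the fields can't pollute the signature. Field names follow the span
    # extractor fields mentioned above.
    FIELDS = ["title", "description", "section"]

    def field_minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    # One LSH index per field; documents that land in the same bucket on
    # a field become candidate near-duplicates.
    indexes = {f: MinHashLSH(threshold=0.9, num_perm=128) for f in FIELDS}

    def add_document(doc_id, doc):
        for f in FIELDS:
            if doc.get(f):
                indexes[f].insert(doc_id, field_minhash(doc[f]))

    def near_duplicates(doc):
        """Candidate near-duplicates: keys sharing a bucket on any field."""
        hits = set()
        for f in FIELDS:
            if doc.get(f):
                hits.update(indexes[f].query(field_minhash(doc[f])))
        return hits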
Version 2.7
***********

* Add ~650k new syllabi.
* Switch to a single-dataframe distribution format -- instead of providing the data as three entity types (:code:`syllabi`, :code:`works`, :code:`citations`), denormalize everything onto a single :code:`syllabi` dataframe, which includes all of the core document metadata as well as joined metadata for the institution and for book-and-article citations.

Version 2.6
***********

* Add ~700k new syllabi.

Version 2.5
***********

* Add ~750k new syllabi.
* In the syllabus text content, we now convert unicode line separators (:code:`\u2028`) to regular line breaks (:code:`\n`), which fixes a class of errors that can occur when reading JSON lines data with standard Python utilities. (A short illustration follows below.)
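A minimal, standard-library illustration of the failure mode this fixes:

.. code-block:: python

    import json

    # A record whose text contains U+2028 (LINE SEPARATOR). The JSON is
    # valid, but str.splitlines() -- which some JSON lines readers use --
    # also splits on U+2028, shearing the record into non-JSON fragments.
    raw = json.dumps({"text": "Week 1\u2028Week 2"}, ensure_ascii=False)

    for fragment in raw.splitlines():
        try:
            json.loads(fragment)
        except json.JSONDecodeError:
            print("broken fragment:", fragment)

    # The 2.5 fix: normalize U+2028 to a regular newline before serializing.
    # json.dumps() escapes "\n" as a two-character sequence, so the record
    # stays on a single physical line.
    safe = json.dumps({"text": "Week 1\u2028Week 2".replace("\u2028", "\n")},
                      ensure_ascii=False)
    assert len(safe.splitlines()) == 1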
Version 2.4
***********

* Add ~1M new syllabi.
* Extract 7 new span-level metadata fields -- :code:`section`, :code:`learning_outcomes`, :code:`topic_outline`, :code:`required_reading`, :code:`assessment_strategy`, :code:`grading_rubric`, and :code:`assignment_schedule`.
* When deduplicating documents, include the course section in the composite key used to identify a unique document.

Version 2.3
***********

* Implement a new "fuzzy" deduplication step that identifies documents that are substantively the same even if they have slight differences -- for example, when the same page is crawled twice and includes an auto-generated timestamp.
* Improve date parsing. Previously we based the :code:`term` and :code:`year` fields just on the date span extracted from the text content, but 2.3 expands this to also look for dates in the document URL and in the text of the original HTML link that pointed to the page on the web. (A sketch follows after this list.)
* Update institution metadata -- we now pull from the most recent releases of the underlying IPEDS, Carnegie, and GRID sources.
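As a hypothetical sketch of the URL-based date mining -- the actual patterns the pipeline matches aren't documented here, so the regex and helper below are illustrative:

.. code-block:: python

    import re

    # Illustrative pattern: a term word optionally followed by a separator
    # and a four-digit year, anywhere in the URL.
    TERM_YEAR = re.compile(
        r"(?P<term>spring|summer|fall|autumn|winter)?[-_/]?"
        r"(?P<year>(19|20)\d{2})",
        re.IGNORECASE,
    )

    def term_year_from_url(url):
        """Return a (term, year) guess mined from a document URL."""
        match = TERM_YEAR.search(url)
        if not match:
            return None, None
        term = (match.group("term") or "").lower() or None
        return term, int(match.group("year"))

    print(term_year_from_url("https://example.edu/eng101/fall2019.html"))
    # -> ('fall', 2019)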
Version 2.2
***********

Syllabi dataframe
-----------------

* Add ~1.8 million new syllabi.
* Update the syllabus, field, and date classifiers for performance improvements.
* Reorganize the syllabi schema into a series of nested groups based on the matching and classification routines run over the syllabi. Fields directly related to the "raw" syllabus, like :code:`syllabus_probability`, are available at the top level. Then there are the following groups:

  * The :code:`date` group contains output from the OS date classifier.
  * The :code:`field` group contains output from the OS academic field classifier.
  * The :code:`institution` group contains output from the OS institution matcher.
  * The :code:`extracted_metadata` group contains sub-groups representing output from several OS course classifiers. Current sub-groups are :code:`code`, :code:`title`, :code:`date`, and :code:`description`.

* Drop the :code:`field_score` field. The OS academic field classifier no longer provides meaningful or reliable output for this field.
* Update institutions:

  * Update all data sources to recent (2018 or 2019) versions.
  * Reorganize the institution matcher fields (now all nested under :code:`institution`, per the above) for clarity and consistency with other fields:

    * All field names are now lower-case.
    * Fields that represent info copied over from another dataset are now prefixed with the name of that dataset. For example, the APPLCN data from IPEDS is now :code:`ipeds_applcn`. Fields that are not prefixed with the name of a dataset are generally aggregated from information across multiple datasets.

  * Add :code:`institution.enrollment`.
  * Add :code:`institution.term`.
  * Add :code:`institution.wikidata_id`.

* Change policy on removing academic field classifications. In previous versions of the OS dataset, certain academic fields considered poor quality (based on performance on a test set) were nulled after classification. With this version of the dataset, OS is no longer nulling academic field classifications. As a consequence, every syllabus is assigned an academic field, and several more academic fields are available. Users can decide which fields they trust; a new value, :code:`field.label_precision`, has been added to help with that decision.
* Change policy on manually marking documents as syllabi. OS occasionally marks certain groups of documents as syllabi even when the syllabus classifier didn't identify them as syllabi. In previous versions of the dataset, documents that bypassed the syllabus classifier in this way were assigned a :code:`syllabus_probability` of 1.0, overwriting whatever value the syllabus classifier had assigned. In this version of the dataset, :code:`syllabus_probability` values are never overwritten. This means that some syllabi in the syllabi dataframe have a :code:`syllabus_probability` of less than 0.5.
* Drop :code:`language`.

Matches dataframe
-----------------

* Add ~25 million new citation matches.
* Drop :code:`m1` and :code:`m2`.

Catalog dataframe
-----------------

* Reorganize the catalog schema:

  * Rename :code:`match_count` to :code:`work_match_count`.
  * Drop the :code:`title` array. Each catalog record now contains only a single :code:`title` and :code:`subtitle` field.
  * Treat normalized citation data as the primary citation data. Drop un-normalized citation data and remove :code:`normalized_` from field names:

    * Rename :code:`normalized_title` to :code:`title`.
    * Rename :code:`normalized_subtitle` to :code:`subtitle`.
    * Rename :code:`normalized_publisher` to :code:`publisher`.
    * Rename :code:`normalized_authors` to :code:`authors`.

  * Remove :code:`position` from author arrays.
  * Add the :code:`source` and :code:`source_id` fields, which describe catalog record provenance.
  * Drop :code:`matcher_pairs`.
  * Re-introduce the :code:`publication_type` column, with a broader set of possible values.

Version 2.1
***********

Syllabi dataframe
-----------------

* Add ~1 million new syllabi.

Matches dataframe
-----------------

* Add ~8 million new citation matches.

Catalog dataframe
-----------------

* Expand the set of bibliographic datasets that are used as sources for the catalog. The underlying database of work expressions increased to ~150M, up from ~65M in v2.0.
* Reorganize and simplify the catalog schema, to better accommodate the wider range of input sources. The details of the changes are best examined in the schema documentation, but as a summary:

  * :code:`title` is now a list of known titles and subtitles.
  * Content related to journal articles (or other content published in a "container") is now nested in an :code:`article` field.
  * Records contain a list of :code:`urls` instead of a single string :code:`url`.
  * Renamed:

    * :code:`publication_year` to :code:`year`
    * :code:`authors.given_name` to :code:`authors.forenames`
    * :code:`authors.surname` to :code:`authors.keyname`
    * :code:`journal_title` to :code:`article.venue`
    * :code:`first_page` to :code:`article.page_start`
    * :code:`last_page` to :code:`article.page_end`

  * Dropped:

    * :code:`language`
    * :code:`original_language`
    * :code:`medium`
    * :code:`series`
    * :code:`translator`
    * :code:`journal_isbns` (rolled into :code:`isbns`)
    * :code:`journal_issns` (rolled into :code:`issns`)
    * :code:`edition_number`
    * :code:`publication_type`
    * :code:`matcher_pairs.logp`

Version 2.0
***********

Syllabi dataframe
-----------------

* Add a heuristic to the date classifier that nulls clearly incorrect :code:`year` values.
* Rename all cases of Timor-Leste in :code:`country_name` to East Timor.
* Add the Philippines to the global country blacklist. Syllabi identified as being from schools in the Philippines are no longer included in the dataset.
* Improve coverage of the institution matcher. In version 1.9, there were ~1.4 million syllabi that did not have institutions matched to them. With 2.0, there are ~180 thousand.

Matches dataframe
-----------------

* Add the :code:`pvalid` column. (A usage sketch follows at the end of this entry.)
* Improve the quality of matches with a validation classifier run over the document contexts around the raw keyword matches, trained on 12k hand-labeled examples.

Catalog dataframe
-----------------

* Drop the :code:`display_priority` field. This ranking was originally meant, per the version 1.9 documentation, to represent the "'quality' or 'completeness' of the metadata on each record", where the top-ranked record was "considered by OS to be the best 'representative' record for the work cluster". OS no longer uses such a ranking, and instead selects representative citation metadata for works -- such as the data displayed on the `Open Syllabus Explorer`_ -- based on aggregations across work clusters.
* Improve the quality of the :code:`normalized_title`, :code:`normalized_subtitle`, and :code:`normalized_authors` fields.
* Improve the quality of the :code:`normalized_publisher` field.
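For downstream users, a hypothetical sketch of how :code:`pvalid` might be used to trade recall for precision -- we're assuming pandas, an illustrative parquet file name, and that the column is a [0, 1] classifier score; the threshold is an arbitrary example:

.. code-block:: python

    import pandas as pd

    # Load the matches data (illustrative file name).
    matches = pd.read_parquet("matches.parquet")

    # Keep only matches the validation classifier is confident about.
    # 0.9 is an example threshold, not a recommended value.
    high_precision = matches[matches["pvalid"] >= 0.9]
    print(f"kept {len(high_precision)} of {len(matches)} matches")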