Changes in 2.5
Introduce ~750k new syllabi into the underlying dataset.
In the syllabus text content, we now convert Unicode line breaks (\u2028) to regular line breaks (\n), which fixes a class of errors that can occur when reading JSON lines data using standard Python utilities.
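One plausible instance of this class of errors is sketched below. It assumes the JSON lines data was written with raw (unescaped) non-ASCII characters and is later split with `str.splitlines()`, which treats U+2028 as a line boundary. The record layout and the `normalize_line_breaks` helper are hypothetical and only illustrate the idea.

```python
import json

# A hypothetical record whose text contains a Unicode line separator (U+2028).
record = {"id": 1, "text": "Week 1\u2028Week 2"}

# With ensure_ascii=False the separator is written as a raw character.
line = json.dumps(record, ensure_ascii=False)
payload = line + "\n"

# str.splitlines() splits on U+2028 as well as on "\n", so the single
# JSON line is cut in two and neither piece parses as JSON.
pieces = payload.splitlines()

# Splitting on "\n" only keeps the record intact.
safe_pieces = payload.split("\n")[:-1]


def normalize_line_breaks(text: str) -> str:
    """Replace Unicode line separators with regular newlines."""
    return text.replace("\u2028", "\n")


# After normalization the newline is escaped by the JSON encoder, so the
# serialized record stays on one physical line.
fixed = json.dumps(
    {"id": 1, "text": normalize_line_breaks(record["text"])},
    ensure_ascii=False,
)
```

Readers working with older releases could apply the same replacement themselves, or split strictly on `"\n"` when reading.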
Changes in 2.4
Introduce ~1M new syllabi into the underlying dataset.
Extract 7 new span-level metadata fields.
When deduplicating documents, include the course section in the composite key used to identify a unique document.
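The effect of adding the course section to the composite key can be illustrated with a minimal sketch. The other key components shown here (`institution`, `code`, `year`) and the record layout are assumptions made for the example; the release notes do not list the actual key fields.

```python
def dedupe(docs, key_fields):
    """Keep the first document seen for each composite key."""
    seen = set()
    unique = []
    for doc in docs:
        key = tuple(doc.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Hypothetical documents: two sections of the same course, one repeated.
docs = [
    {"institution": "Example U", "code": "HIST 101", "year": 2019, "section": "001"},
    {"institution": "Example U", "code": "HIST 101", "year": 2019, "section": "002"},
    {"institution": "Example U", "code": "HIST 101", "year": 2019, "section": "001"},
]

# Without the section, both sections collapse into one document.
without_section = dedupe(docs, ["institution", "code", "year"])

# With the section in the key, each section is kept once.
with_section = dedupe(docs, ["institution", "code", "year", "section"])
```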
Changes in 2.3
Implement a new “fuzzy” deduplication strategy that identifies documents that are substantively the same even if they have slight differences (for example, if the same page is crawled twice and each copy includes an auto-generated timestamp).
Improve date parsing. Previously we based the year fields just on the date span extracted from the text content, but 2.3 expands this to also look for dates in the document URLs and the text of the original HTML link that pointed to the page on the web.
Update institution metadata – we now pull from the most recent releases of the underlying IPEDS, Carnegie, and Grid sources.
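To give a sense of how fuzzy deduplication of this kind can work, here is a minimal sketch based on word shingles and Jaccard similarity. This is a generic near-duplicate technique chosen for illustration, not necessarily the strategy used for the dataset; the function names, shingle size, and threshold are all assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in a text (lower-cased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.7):
    """Treat two texts as duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


# The same hypothetical page crawled twice, differing only in a timestamp.
base = (
    "Introduction to Modern History. This course surveys major political, "
    "social and economic developments from 1789 to the present. Weekly "
    "readings and two essays are required. Office hours are held on "
    "Tuesday afternoons."
)
crawl_1 = base + " Page generated 2019-03-01 10:15:02"
crawl_2 = base + " Page generated 2019-03-02 23:41:57"
other = "Organic Chemistry II. Laboratory safety rules and the grading policy."
```

An exact-match comparison would treat `crawl_1` and `crawl_2` as different documents, while the similarity check groups them together and still keeps `other` separate.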
Changes in 2.2
Introduce ~1.8 million more syllabi to the underlying dataset.
Update the syllabus, field and date classifiers for performance improvements.
Reorganize the syllabi schema into a series of nested groups based on the matching and classification routines run over the syllabi. Fields directly related to the “raw” syllabus, like syllabus_probability, are available at the top level. Then there are the following groups:
The date group contains output from the OS date classifier.
The field group contains output from the OS academic field classifier.
The institution group contains output from the OS institution matcher.
The extracted_metadata group contains sub-groups representing output from several OS course classifiers. Current sub-groups are code, title, date and description.
Drop the field_score field. The OS academic field classifier no longer provides meaningful or reliable output for this field.
Update all data sources to recent (2018 or 2019) versions.
Reorganize the institution matcher fields (now all nested under institution, per the above) for clarity and consistency with other fields:
All field names are now lower-case.
Fields that represent info copied over from another dataset are now prefixed with the name of that dataset. For example, the APPLCN data from IPEDS is now ipeds_applcn. Fields that are not prefixed with the name of a dataset are generally aggregated from information about multiple datasets.
Change policy on removing academic field classification. In previous versions of the OS dataset, certain academic fields considered poor quality (based on performance on a test set) were nulled after classification. With this version of the dataset, OS is no longer nulling academic field classifications. As a consequence, every syllabus is assigned an academic field, and several more academic fields are available. Users can decide which fields they trust; a new value, field.label_precision, has been added to help with that decision.
Change policy on manually marking documents as syllabi. OS occasionally marks certain groups of documents as syllabi even if they weren’t identified by the syllabus classifier as being syllabi. In previous versions of the dataset, documents that bypassed the syllabus classifier in this way were assigned a syllabus_probability of 1.0, overwriting whatever syllabus_probability was assigned to them by the syllabus classifier. In this version of the dataset, syllabus_probability values are never overwritten. This means that some syllabi in the syllabi dataframe have a syllabus_probability of less than 0.5.
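Because academic fields are no longer nulled and `syllabus_probability` is no longer overwritten, filtering is left to the user. A minimal sketch of such a filter follows, assuming each syllabus is available as a nested dictionary that mirrors the schema described above. The threshold values and the `keep_syllabus` helper are hypothetical and should be tuned to the use case.

```python
def keep_syllabus(record, min_syllabus_probability=0.5, min_label_precision=0.8):
    """Decide whether a syllabus record passes user-chosen quality thresholds."""
    # syllabus_probability lives at the top level of the record.
    probability = record.get("syllabus_probability")
    # label_precision is nested inside the field group.
    field = record.get("field") or {}
    precision = field.get("label_precision")

    if probability is None or probability < min_syllabus_probability:
        return False
    if precision is None or precision < min_label_precision:
        return False
    return True


# Hypothetical records with only the values relevant to the filter.
records = [
    {"syllabus_probability": 0.97, "field": {"label_precision": 0.91}},
    {"syllabus_probability": 0.97, "field": {"label_precision": 0.42}},
    {"syllabus_probability": 0.31, "field": {"label_precision": 0.95}},
]

kept = [r for r in records if keep_syllabus(r)]
```

Note that a probability threshold of 0.5 would also drop documents that were manually marked as syllabi but scored low by the classifier; lowering `min_syllabus_probability` keeps them.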
Introduce ~25 million more matches to the underlying dataset.
Drop m1 and m2.
Reorganize the catalog schema:
Rename match_count to work_match_count.
Drop the title array. Each catalog record now contains only a single title and subtitle field.
Treat normalized citation data as the primary citation data. Drop un-normalized citation data and remove normalized_ from field names.
Rename normalized_title to title.
Rename normalized_subtitle to subtitle.
Rename normalized_publisher to publisher.
Rename normalized_authors to authors.
Remove position from author arrays.
Add the source and source_id fields, which describe catalog record provenance.
Re-introduce the publication_type column, with a broader set of possible values.
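For code written against the previous catalog schema, the renames above can be approximated with a small migration helper. This is only a sketch: the sample record is hypothetical, the helper is not part of any OS tooling, and it covers the renames and drops listed above but not the new `source`, `source_id`, or `publication_type` fields.

```python
PREFIX = "normalized_"
RENAMED = {"match_count": "work_match_count"}


def migrate_catalog_record(old):
    """Map an old-style catalog record onto the reorganized field names."""
    # Values from normalized_* fields become the primary citation data.
    normalized = {
        key[len(PREFIX):]: value
        for key, value in old.items()
        if key.startswith(PREFIX)
    }
    new = {}
    for key, value in old.items():
        # Skip normalized_* keys (handled above) and the un-normalized
        # fields they replace, such as the old title array.
        if key.startswith(PREFIX) or key in normalized:
            continue
        new[RENAMED.get(key, key)] = value
    new.update(normalized)

    # Author entries no longer carry a position.
    if new.get("authors"):
        new["authors"] = [
            {k: v for k, v in author.items() if k != "position"}
            for author in new["authors"]
        ]
    return new


# Hypothetical old-style record.
old_record = {
    "match_count": 12,
    "title": ["Capital", "Das Kapital"],
    "normalized_title": "Capital",
    "normalized_subtitle": None,
    "normalized_publisher": "Penguin",
    "normalized_authors": [{"forenames": "Karl", "keyname": "Marx", "position": 0}],
    "year": 1867,
}
new_record = migrate_catalog_record(old_record)
```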
Changes in 2.1
Introduce ~1 million more syllabi to the underlying dataset.
Introduce ~8 million more matches to the underlying dataset.
Greatly expand the set of bibliographic datasets that are used as sources for the catalog. The underlying database of work expressions increased to ~150M, up from ~65M in v2.0.
Reorganize and simplify the catalog schema, to better accommodate the wider range of input sources. The details of the changes are best examined in the schema documentation, but as a summary of changes:
title is now a list of known titles and subtitles.
Content related to journal articles (or other content published in a “container”) is now nested in an article field.
Records contain a list of urls instead of a single string url.
Rename publication_year to year.
Rename authors.given_name to authors.forenames.
Rename authors.surname to authors.keyname.
Rename journal_title to article.venue.
Rename first_page to article.page_start.
Rename last_page to article.page_end.
Roll journal_isbns into isbns.
Roll journal_issns into issns.
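The field changes listed above involve moving some flat fields into a nested article group and merging the journal identifier lists. A hedged sketch of that mapping follows; it assumes records are plain dictionaries, handles only the fields named in this list (not the title or urls changes), and uses helper names that are invented for the example.

```python
TOP_LEVEL = {
    "publication_year": "year",
    "journal_title": "article.venue",
    "first_page": "article.page_start",
    "last_page": "article.page_end",
}
AUTHOR_FIELDS = {"given_name": "forenames", "surname": "keyname"}
MERGED = {"journal_isbns": "isbns", "journal_issns": "issns"}


def set_path(record, dotted, value):
    """Set a value at a dotted path, creating nested dicts as needed."""
    *parents, leaf = dotted.split(".")
    node = record
    for parent in parents:
        node = node.setdefault(parent, {})
    node[leaf] = value


def migrate(old):
    """Apply the field moves listed above to a v2.0-style record."""
    new = {}
    for key, value in old.items():
        if key in MERGED:
            continue  # merged into another list below
        if key in TOP_LEVEL:
            set_path(new, TOP_LEVEL[key], value)
        elif key == "authors":
            new["authors"] = [
                {AUTHOR_FIELDS.get(f, f): v for f, v in author.items()}
                for author in (value or [])
            ]
        else:
            new[key] = value
    for old_key, target in MERGED.items():
        extra = old.get(old_key) or []
        if extra:
            combined = list(new.get(target) or [])
            combined += [item for item in extra if item not in combined]
            new[target] = combined
    return new


# Hypothetical v2.0-style record.
old_record = {
    "publication_year": 1999,
    "authors": [{"given_name": "Ada", "surname": "Lovelace"}],
    "journal_title": "Example Journal",
    "first_page": "10",
    "last_page": "25",
    "issns": ["1234-5678"],
    "journal_issns": ["1234-5678", "8765-4321"],
}
new_record = migrate(old_record)
```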
Changes in 2.0
Add a heuristic to the date classifier that nulls clearly incorrect year values.
Rename all cases of Timor-Leste in country_name to East Timor.
Add the Philippines to the global country blacklist. All syllabi identified as being from schools in the Philippines are no longer included in the dataset.
Improve coverage of the institution matcher. In version 1.9, there were ~1.4 million syllabi that did not have institutions matched to them. With 2.0, there are ~180 thousand.
Add the pvalid column.
Improve quality of matches with a validation classifier over the document contexts around the raw keyword matches, trained on 12k hand-labeled examples.
Drop the display_priority field. This ranking was originally meant to, per version 1.9 documentation, represent the “‘quality’ or ‘completeness’ of the metadata on each record”, where the top ranked record was “considered by OS to be the best ‘representative’ record for the work cluster”. OS no longer uses such a ranking, and instead selects representative citation metadata for works – such as the data displayed on the Open Syllabus Explorer – based on aggregations across work clusters.
Improve the quality of the normalized_title, normalized_subtitle and normalized_authors fields.
Greatly improve the quality of the normalized_publisher field.
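The first item in this release mentions a heuristic that nulls clearly incorrect year values. The release notes do not describe the rule, so the following is only a sketch of what such a heuristic could look like: a plausibility window whose bounds (`min_year`, `max_year`) are assumptions picked for illustration.

```python
from datetime import date
from typing import Optional


def null_implausible_year(
    year: Optional[int],
    min_year: int = 1990,
    max_year: Optional[int] = None,
) -> Optional[int]:
    """Return the year if it falls in a plausible window, otherwise None."""
    if max_year is None:
        # Assumed upper bound: syllabi may be posted for the coming year.
        max_year = date.today().year + 1
    if year is None or not (min_year <= year <= max_year):
        return None
    return year


# Hypothetical classifier outputs, including two obviously wrong values.
raw_years = [2015, 1066, 9999, None]
cleaned = [null_implausible_year(y) for y in raw_years]
```

Nulling rather than clamping keeps downstream counts honest: a missing year is easier to handle than a confidently wrong one.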