Changelog
Version 2.8
Add ~5M new syllabi.
Add a pipeline for mining syllabi out of the CommonCrawl dataset. For 2.8, we applied the syllabus classifier to ~50% of the WET (extracted text) records in CommonCrawl as of March 2022, as well as 100% of the WETs from major English-language educational TLDs (edu, ac.uk, edu.au, ca). This yielded ~3M new documents in the final dataset.
Add a new school_name field to the top-level span extraction model.
Fix a bug in the Spark pipeline from 2.7 – the document URLs weren’t getting passed into the document parser, which caused the institution match rate to fall significantly, since the document URL is often the basis for the match.
Refactor and generally improve the institution matcher; add matching based on the new extracted school_name field. Combined with the bug fix, this boosted the institution match rate to ~96%, a ~5-point improvement from before.
Expand the underlying institution metadata – add wikipedia_url, description, and image_url.
In the bibliographic database that backs the citation matcher, remove the open_access=True flag on records from Project Gutenberg. Though these books are indeed free to read, they aren’t OA-licensed in a narrow sense of the concept – they’re out of copyright and in the public domain, as opposed to works that are in copyright but published under open licenses. In the future, we might model this distinction explicitly in the data.
Expand the training dataset for the syllabus classifier, driven by the need for a very high-precision model when probing the CommonCrawl dataset. (~18k new documents.)
Expand the training dataset for the top-level span extractor – ~5k new labeled documents for code, title, date, description, and section; ~6k labeled documents for the new school_name field.
Simplify the structure of the urls field in the parser output. Instead of exposing separate fields for hyperlinks, text_urls, and combined_urls, just provide a single urls field with a flat list of normalized URLs.
Improve the near-duplicate clustering in the Spark pipeline. Namely, we now compute LSH buckets for each of the major full-text fields from the parser individually, instead of using the full document text. This gives a cleaner fingerprint for the document, since the extracted fields home in better on the relevant document content and skip over boilerplate / navigation chrome.
Version 2.7
Add ~650k new syllabi.
Switch to a single-dataframe distribution format – instead of providing the data as 3x entity types (syllabi, works, citations), denormalize everything onto a single syllabi dataframe, which includes all of the core document metadata as well as joined metadata for the institution and book-and-article citations.
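The denormalization can be sketched with plain Python records. The miniature rows and field names below are illustrative, not the actual schema:

```python
# Illustrative miniatures of the three pre-2.7 entity types.
syllabi = [{"syllabus_id": 1, "title": "Intro Sociology"},
           {"syllabus_id": 2, "title": "Calculus I"}]
citations = [{"syllabus_id": 1, "work_id": 10},
             {"syllabus_id": 1, "work_id": 11},
             {"syllabus_id": 2, "work_id": 12}]
works = {10: {"title": "Suicide"},
         11: {"title": "Outsiders"},
         12: {"title": "Calculus"}}

def denormalize(syllabi, citations, works):
    """Join work metadata onto citations, then nest each syllabus's
    citation list under the syllabus row -- one self-contained record
    per syllabus, no cross-dataframe joins needed by the user."""
    by_syllabus = {}
    for c in citations:
        by_syllabus.setdefault(c["syllabus_id"], []).append(
            {"work_id": c["work_id"], **works[c["work_id"]]}
        )
    return [{**s, "citations": by_syllabus.get(s["syllabus_id"], [])}
            for s in syllabi]
```

The trade-off is redundancy (work metadata is repeated across syllabi) in exchange for a distribution that can be consumed as one flat file.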
Version 2.6
Add ~700k new syllabi.
Version 2.5
Add ~750k new syllabi.
In the syllabus text content, we now convert unicode line breaks (\u2028) to regular line breaks (\n), which fixes a class of errors that can occur when reading JSON lines data using standard Python utilities.
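The failure mode is easy to reproduce: U+2028 (LINE SEPARATOR) is legal inside a JSON string, but Python's str.splitlines() treats it as a line boundary, so a reader that splits a .jsonl file with splitlines() tears such a record in half. A minimal sketch of the fix:

```python
import json

def to_jsonl_record(text):
    """Serialize one record for a JSON-lines file, replacing U+2028
    so the record is guaranteed to occupy exactly one line even when
    read with universal-newline tools like str.splitlines()."""
    return json.dumps({"text": text.replace("\u2028", "\n")},
                      ensure_ascii=False)
```

With the replacement, the embedded break is escaped by json.dumps as \n; without it, the raw U+2028 survives serialization (under ensure_ascii=False) and splits the line.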
Version 2.4
Add ~1M new syllabi.
Extract 7 new span-level metadata fields – section, learning_outcomes, topic_outline, required_reading, assessment_strategy, grading_rubric, and assignment_schedule.
When deduplicating documents, include the course section in the composite key used to identify a unique document.
Version 2.3
Implement a new “fuzzy” deduplication step that identifies documents that are substantively the same even if they have slight differences. (For example, when the same page is crawled twice and each copy includes an auto-generated timestamp.)
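One simple way to sketch this kind of fuzzy matching is Jaccard similarity over character shingles – shown here as an illustration of the idea, not the actual deduplication step:

```python
def shingles(text, n=5):
    """Character n-gram shingles, ignoring case and runs of whitespace."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(max(1, len(norm) - n + 1))}

def near_duplicates(a, b, threshold=0.9):
    """Jaccard similarity over shingles: two crawls of the same page that
    differ only in an auto-generated timestamp still score close to 1.0,
    while unrelated pages score close to 0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) >= threshold
```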
Improve date parsing. Previously we based the term and year fields just on the date span extracted from the text content, but 2.3 expands this to also look for dates in the document URLs and in the text of the original HTML link that pointed to the page on the web.
Update institution metadata – we now pull from the most recent releases of the underlying IPEDS, Carnegie, and GRID sources.
Version 2.2
Syllabi dataframe
Add ~1.8 million new syllabi.
Update the syllabus, field, and date classifiers for performance improvements.
Reorganize the syllabi schema into a series of nested groups based on the matching and classification routines run over the syllabi. Fields directly related to the “raw” syllabus, like syllabus_probability, are available at the top level. Then there are the following groups:
The date group contains output from the OS date classifier.
The field group contains output from the OS academic field classifier.
The institution group contains output from the OS institution matcher.
The extracted_metadata group contains sub-groups representing output from several OS course classifiers. Current sub-groups are code, title, date and description.
Drop the field_score field. The OS academic field classifier no longer provides meaningful or reliable output for this field.
Update institutions:
Update all data sources to recent (2018 or 2019) versions.
Reorganize the institution matcher fields (now all nested under institution, per the above) for clarity and consistency with other fields:
All field names are now lower-case.
Fields that represent info copied over from another dataset are now prefixed with the name of that dataset. For example, the APPLCN data from IPEDS is now ipeds_applcn. Fields that are not prefixed with the name of a dataset are generally aggregated from information about multiple datasets.
Add institution.enrollment.
Add institution.term.
Add institution.wikidata_id.
Change policy on removing academic field classification. In previous versions of the OS dataset, certain academic fields considered poor quality (based on performance on a test set) were nulled after classification. With this version of the dataset, OS is no longer nulling academic field classifications. As a consequence, every syllabus is assigned an academic field, and several more academic fields are available. Users can decide which fields they trust; a new value, field.label_precision, has been added to help with that decision.
Change policy on manually marking documents as syllabi. OS occasionally marks certain groups of documents as syllabi even if they weren’t identified by the syllabus classifier as being syllabi. In previous versions of the dataset, documents that bypassed the syllabus classifier in this way were assigned a syllabus_probability of 1.0, overwriting whatever syllabus_probability was assigned to them by the syllabus classifier. In this version of the dataset, syllabus_probability values are never overwritten. This means that some syllabi in the syllabi dataframe have a syllabus_probability of less than 0.5.
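A practical consequence: users who want only classifier-vetted documents now need to filter explicitly. A minimal sketch, assuming plain-dict rows with the syllabus_probability field:

```python
def filter_syllabi(rows, min_probability=0.5):
    """Keep only rows the classifier itself scored as likely syllabi.
    Manually-included documents with low scores are dropped here --
    relax min_probability (or skip the filter) to keep them."""
    return [r for r in rows
            if r.get("syllabus_probability", 0.0) >= min_probability]
```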
Drop language.
Matches dataframe
Add ~25 million new citation matches.
Drop m1 and m2.
Catalog dataframe
Reorganize the catalog schema:
Rename match_count to work_match_count.
Drop the title array. Each catalog record now contains only a single title and subtitle field.
Treat normalized citation data as the primary citation data. Drop un-normalized citation data and remove normalized_ from field names.
Rename normalized_title to title.
Rename normalized_subtitle to subtitle.
Rename normalized_publisher to publisher.
Rename normalized_authors to authors.
Remove position from author arrays.
Add the source and source_id fields, which describe catalog record provenance.
Drop matcher_pairs.
Re-introduce the publication_type column, with a broader set of possible values.
Version 2.1
Syllabi dataframe
Add ~1 million new syllabi.
Matches dataframe
Add ~8 million new citation matches.
Catalog dataframe
Expand the set of bibliographic datasets that are used as sources for the catalog. The underlying database of work expressions increased to ~150M, up from ~65M in v2.0.
Reorganize and simplify the catalog schema, to better accommodate the wider range of input sources. The details of the changes are best examined in the schema documentation, but as a summary of changes:
title is now a list of known titles and subtitles.
Content related to journal articles (or other content published in a “container”) is now nested in an article field.
Records contain a list of urls instead of a single string url.
Renamed:
publication_year to year
authors.given_name to authors.forenames
authors.surname to authors.keyname
journal_title to article.venue
first_page to article.page_start
last_page to article.page_end
Dropped:
language
original_language
medium
series
translator
journal_isbns (rolled into isbns)
journal_issns (rolled into issns)
edition_number
publication_type
matcher_pairs.logp
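Code written against the 2.0 catalog schema can be updated mechanically from the rename list above. A simplified sketch over plain-dict records (drops and the author-name renames are omitted):

```python
# Rename map from the 2.1 changelog; tuple targets are nested under
# the new article group.
RENAMES = {
    "publication_year": "year",
    "journal_title": ("article", "venue"),
    "first_page": ("article", "page_start"),
    "last_page": ("article", "page_end"),
}

def migrate_record(old):
    """Apply the 2.1 renames to a 2.0-style catalog record."""
    new = {}
    for key, value in old.items():
        target = RENAMES.get(key, key)
        if isinstance(target, tuple):
            group, field = target
            new.setdefault(group, {})[field] = value
        else:
            new[target] = value
    return new
```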
Version 2.0
Syllabi dataframe
Add a heuristic to the date classifier that nulls clearly incorrect year values.
Rename all cases of Timor-Leste in country_name to East Timor.
Add the Philippines to the global country blacklist. All syllabi identified as being from schools in the Philippines are no longer included in the dataset.
Improve coverage of the institution matcher. In version 1.9, there were ~1.4 million syllabi that did not have institutions matched to them. With 2.0, there are ~180 thousand.
Matches dataframe
Add the pvalid column.
Improve quality of matches with a validation classifier over the document contexts around the raw keyword matches, trained on 12k hand-labeled examples.
Catalog dataframe
Drop the display_priority field. This ranking was originally meant to, per version 1.9 documentation, represent the “‘quality’ or ‘completeness’ of the metadata on each record”, where the top ranked record was “considered by OS to be the best ‘representative’ record for the work cluster”. OS no longer uses such a ranking, and instead selects representative citation metadata for works – such as the data displayed on the Open Syllabus Explorer – based on aggregations across work clusters.
Improve the quality of the normalized_title, normalized_subtitle and normalized_authors fields.
Improve the quality of the normalized_publisher field.