Architecture + models

At a technical level, Open Syllabus consists of three main sub-projects:

  1. A web crawling architecture for collecting syllabi. We use a combination of broad crawl strategies that scan for syllabi under a large set of seeds (harvested in bulk from Common Crawl and curated manually) and a set of custom scrapers that target specific sites.

  2. A suite of machine learning models that extract structured metadata from the raw documents – course code, title, year, semester, field, institution, course description, book and article assignments, learning objectives, and more.

  3. A Spark pipeline that does ETL on the source datasets and runs model inference.

Pipeline

_images/pipeline-flowchart.png
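The stages shown in the flowchart are orchestrated in Spark (item 3 in the list above). As a rough sketch of what a model-inference stage can look like – the paths, column names, and load_model() helper below are hypothetical, not the actual pipeline code:

   # Sketch of a Spark inference stage. `load_model()` is a hypothetical
   # helper returning an object with predict(texts) -> labels; loading it
   # inside mapPartitions amortizes the cost across each partition.
   from pyspark.sql import Row, SparkSession

   spark = SparkSession.builder.appName("os-inference").getOrCreate()
   docs = spark.read.parquet("s3://bucket/docs.parquet")  # hypothetical path

   def infer_partition(rows):
       model = load_model()  # hypothetical model wrapper
       rows = list(rows)
       preds = model.predict([r["text"] for r in rows])
       for row, pred in zip(rows, preds):
           yield Row(id=row["id"], label=pred)

   docs.rdd.mapPartitions(infer_partition).toDF() \
       .write.parquet("s3://bucket/preds.parquet")  # hypothetical path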

Syllabus classification

Syllabi are quite distinctive at a lexical level, and can be identified with simple bag-of-words classifiers. We use a logistic regression over ngram features, trained on a class-balanced set of 13,743 documents, which reaches 97% accuracy on test data.
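As a rough sketch of this kind of setup (not the production code – the feature settings and the texts/labels variables here are assumptions), a classifier along these lines can be assembled with scikit-learn:

   # Minimal sketch of an ngram bag-of-words syllabus classifier.
   # `texts` / `labels` are hypothetical: document strings and 1/0
   # syllabus labels. The ngram range and regularization are assumptions.
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import train_test_split
   from sklearn.pipeline import make_pipeline

   clf = make_pipeline(
       TfidfVectorizer(ngram_range=(1, 2), min_df=5),
       LogisticRegression(max_iter=1000),
   )

   X_train, X_test, y_train, y_test = train_test_split(
       texts, labels, test_size=0.2, stratify=labels, random_state=1)
   clf.fit(X_train, y_train)
   print("test accuracy:", clf.score(X_test, y_test))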

It’s worth noting that there’s some ambiguity about what does and doesn’t constitute a syllabus. For example, university websites often contain short “stub” pages about courses that sit in a middle ground between a full-blown syllabus and a course catalog entry – they generally include a course code, title, date, and sometimes a description, but not the rich detail you’d get in a long-format syllabus document. When labeling training data, we currently define syllabi fairly loosely as a “document that describes a specific instance in which a course is taught.” Depending on your needs, though, this might mean that the number of documents that meet your requirements is somewhat lower than the overall count – for example, if you were only interested in syllabi with detailed week-by-week assignment sequences, or with long-format course description content.

In the future, we may train more granular models that can separate these different types of syllabi and syllabus-like documents – full-blown syllabi, short syllabi, course catalog entries, course stub pages on department websites, etc.

Course metadata extraction (“Syllaparse”)

Most syllabi are distributed as raw PDF / DOCX / HTML files. Some institutions use course management systems that represent syllabi in structured ways internally, but when scraped from the web, we just see the rendered HTML. So in practice, when working with syllabi at scale across hundreds of institutions, we’re effectively dealing with a big heap of unstructured text blobs.

To convert these into a format that can be easily analyzed, Open Syllabus uses a suite of information extraction models to map the documents into a structured schema of sections and entities. The core of this is a sequence labeling architecture (“Syllaparse”) that extracts spans of various types from the documents. At a high level, we follow the token-classification (CoNLL NER) setup from the original BERT paper – a transformer forms contextual representations of the document tokens (we use DistilBERT in production, for speed), and each token state is passed to a classification head that predicts a tag for that token; the tag predictions can then be clustered together to form span predictions.

[Figure: BERT token tagging – the input “CS 280, Machine Learning for Natural Language” is encoded by BERT, and each token is assigned a tag such as B-CODE, I-CODE, B-TITLE, I-TITLE, or O.]
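As an illustrative sketch (using the Hugging Face transformers library rather than the production code, and with a pared-down label set), the token-classification setup looks roughly like this:

   # Sketch of DistilBERT token classification. The label set is a small
   # illustrative subset of the Syllaparse schema; the head below is
   # untrained, so the predicted tags are meaningless until fine-tuning.
   import torch
   from transformers import AutoTokenizer, AutoModelForTokenClassification

   labels = ["O", "B-CODE", "I-CODE", "B-TITLE", "I-TITLE"]
   tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
   model = AutoModelForTokenClassification.from_pretrained(
       "distilbert-base-uncased", num_labels=len(labels))

   text = "CS 280, Machine Learning for Natural Language"
   enc = tokenizer(text, return_tensors="pt")
   with torch.no_grad():
       logits = model(**enc).logits              # (1, num_tokens, num_labels)
   tags = [labels[i] for i in logits.argmax(-1)[0].tolist()]
   print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()), tags)))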

One challenge in applying BERT-like models in this setting is the length of the documents, which are almost always longer than the standard 512-token BERT input size. Syllabi vary in length, but the average document length in the Open Syllabus corpus is around 3,000 words, and many of the richest and most interesting syllabi are even longer, sometimes 5,000 words or more. Unlike classification tasks, where it often works to just truncate documents at a certain size, for this kind of span extraction task – where an entity could appear anywhere in the document – every token has to get tagged. So, even with the newer sparse attention architectures like Longformer and BigBird, which push the input size to 4,096 tokens, we’d still need some kind of chunking strategy to handle long documents.

We first split documents into a set of chunks that overlap by N≈100 tokens at the break points. For example, with a chunk size of 500, the first sequence would be tokens 0-500, the second 400-900, then 800-1300, etc. This way, every token in the document is guaranteed to be encoded at least once with no fewer than 50 tokens of context on both sides (except at the very beginning and end of the document), which avoids low-context “sharp edges” at the chunk boundaries. In the real model, we use a chunk size of 510 tokens, which fills the 512-token BERT window once [CLS] and [SEP] are added.

[Figure: a document split into overlapping chunks (chunk 1, chunk 2, chunk 3, …), with each chunk sharing an overlap region with its neighbors around the break points.]
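A minimal sketch of this chunking scheme (the function and argument names are ours, not the pipeline’s):

   # Split a token sequence into overlapping chunks: start points advance by
   # chunk_size - overlap, so adjacent chunks share `overlap` tokens.
   def split_into_chunks(tokens, chunk_size=510, overlap=100):
       step = chunk_size - overlap
       return [
           (start, tokens[start:start + chunk_size])
           for start in range(0, max(len(tokens) - overlap, 1), step)
       ]

   # With chunk_size=500 / overlap=100: ranges 0-500, 400-900, 800-1300, ...
   for start, chunk in split_into_chunks(list(range(1300)), chunk_size=500):
       print(start, start + len(chunk))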

Once the documents are split into these overlapping chunks, we treat them as standalone sub-documents and train the classification heads over the tokens. At inference time, we split the documents into chunks, group the chunks into fixed-size batches, run the model forward pass, re-group the chunks by document, and merge the token predictions to get a single sequence of tag predictions for each document. These are then clustered together to form multi-token span predictions, which can then be mapped back to character spans in the original documents.
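A sketch of the merge-and-cluster step (the merge rule here – prefer the prediction from the chunk in which a token sits furthest from a chunk edge – is one simple strategy, not necessarily the production one):

   # Merge per-chunk tag predictions back into one tag per document token,
   # preferring the prediction from the chunk where the token sits furthest
   # from a chunk edge, then collapse BIO tags into spans.
   def merge_chunk_tags(doc_len, chunk_preds):
       """chunk_preds: list of (chunk_start, [tag, ...]) pairs."""
       best = [("O", -1)] * doc_len          # (tag, distance-from-edge)
       for start, tags in chunk_preds:
           for i, tag in enumerate(tags):
               dist = min(i, len(tags) - 1 - i)
               if dist > best[start + i][1]:
                   best[start + i] = (tag, dist)
       return [tag for tag, _ in best]

   def tags_to_spans(tags):
       """Collapse BIO tags into (label, start_token, end_token) spans."""
       spans, open_span = [], None
       for i, tag in enumerate(tags + ["O"]):
           if open_span and not tag.startswith("I-"):
               spans.append((open_span[0], open_span[1], i))
               open_span = None
           if tag.startswith("B-"):
               open_span = (tag[2:], i)
       return spans

   print(tags_to_spans(["B-CODE", "I-CODE", "O", "B-TITLE", "I-TITLE"]))
   # -> [('CODE', 0, 2), ('TITLE', 3, 5)]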

In 2.7 we provide:

  • Course code – The identifier for the course that appears in the institutional catalog.

  • Course section – The identifier for the specific section of the course that the syllabus corresponds to.

  • Course title – The name of the course.

  • Date – The calendar year and semester in which the course was taught.

  • Course description – Narrative description of the course content.

  • Learning outcomes – Lists of skills or competencies that students are expected to acquire.

  • Topic outline – Lists of topics covered in the course.

  • Required readings – Citations for books, articles, or other materials that are required for the course. (As opposed to just cited or recommended.)

  • Assessment strategy – How grades are assigned.

  • Grading rubric – The type and range of grades that can be given.

  • Assignment schedule – Week-by-week sequences of readings, assignments, and topics.

We evaluate the predictions using traditional NER metrics at the span level, and also in terms of the character-level edit distance between the gold and predicted spans – a more flexible metric that captures cases where the model predicts a span that overlaps with the gold annotation but doesn’t exactly match it. For example, if the labeler marked 10 description sentences and the model tags 9, this is a fairly good result, though it would count as a miss under a standard NER evaluation. The character-level comparisons also capture cases where an entity appears multiple times in the document (e.g., the course title), and the model extracts the correct value but pulls it from a different location in the document than the one the labeler marked.
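As a sketch of this kind of character-level comparison (the exact normalization behind the reported score is an assumption on our part; here it is 1 minus the Levenshtein distance divided by the longer string’s length):

   # Character-level similarity between a gold span and a predicted span,
   # in [0, 1]: 1 - normalized Levenshtein distance.
   def levenshtein(a, b):
       prev = list(range(len(b) + 1))
       for i, ca in enumerate(a, 1):
           cur = [i]
           for j, cb in enumerate(b, 1):
               cur.append(min(prev[j] + 1,          # deletion
                              cur[j - 1] + 1,       # insertion
                              prev[j - 1] + (ca != cb)))  # substitution
           prev = cur
       return prev[-1]

   def char_similarity(gold, pred):
       if not gold and not pred:
           return 1.0
       return 1 - levenshtein(gold, pred) / max(len(gold), len(pred))

   print(char_similarity("An introduction to machine learning.",
                         "introduction to machine learning"))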

In 2.7, the model’s predictions score 0.815 on this character-level edit-distance metric against the gold annotations, averaging the individual scores across all 11 fields.

This model is the newest piece of the Open Syllabus pipeline, and still a work in progress. For v2.7, we’re exploring a number of new directions at the modeling level, including more sophisticated strategies for chunking the documents and merging overlapping predictions, as well as modified transformer architectures that can accommodate longer sequences.

Citation extraction

Syllabi are full of references to different kinds of educational resources – mainly books and articles, but also websites, YouTube videos, Wikipedia articles, movies, paintings, musical scores, and more. From an information extraction standpoint, though, assignment strings in syllabi are tricky. There’s a large literature about extracting bibliographic data from scientific papers, where there’s a fairly high level of structure – citations are listed out as standard references at the bottom of the PDF, formatted in one of a handful of major citation styles.

Syllabi tend to be messier and more ad hoc. Sometimes – mostly in the sciences – you’ll see well-structured citations like those you find in research papers:

_images/stem-citations.png

Fig. 1 https://www.mit.edu/~9.520/fall19/

But, it’s also common to find more abbreviated styles, often with just the title of the work and the author name:

_images/just-title-author.png

Fig. 2 http://web.stanford.edu/class/cs224n/

Sometimes the instructor will specify a particular edition or translation of the text:

_images/with-publisher.png

Fig. 3 http://www.melus.org/syllabi/African_American_Literature_Syllabus__vernacular_emphasis_.pdf

But again, generally we can only count on a title and the surname of the (first) author. And, even assuming that just those two elements are present, there’s lots of variation in how the reference is formatted. Sometimes, author first and then title; other times title then author; sometimes with words interspersed (“by”), other times with just a comma or dash; sometimes with the author listed once, and then multiple works by the author; and so on.

Citations to the same work can also appear multiple times with different formatting within the same syllabus. For example, one fairly common pattern is for a syllabus to list out a set of required texts somewhere in the document, maybe near the beginning:

_images/ds-top-list.png

Fig. 4 https://directedstudies.yale.edu/syllabi/syllabi/literature-syllabus

And then later on, in a section with week-by-week reading assignments, these texts will be referenced with some kind of abbreviated / short-hand format:

_images/week-by-week.png

Fig. 5 https://directedstudies.yale.edu/syllabi/syllabi/literature-syllabus

If possible, we want to capture all of these individual citation occurrences, because in some cases the positions inside the document can be interesting. For example, the chronologically-ordered assignment blocks encode information about the sequence or dependency among texts, which is potentially very rich information.

So, how to reliably extract these citation strings? We leverage the fact that, unlike with unconstrained free-text entities like course codes and titles, with citation extraction we’re linking against a closed (though very large) set of target entities – in effect, all known books and articles. This makes it possible to treat the problem as an entity linking task, split into two parts:

First – starting with a knowledge base of ~150M bibliographic records, we surface a set of candidate matches by scanning the documents for keywords that constitute what can be thought of as a “minimal lexical footprint” of a book or article. In the context of traditional entity linking tasks, this is comparable to the step of locating “mentions” to knowledge base entities, often implemented as a simple lookup table that maps keyword(s) to sets of candidate entities. In the case of book and article citations, though, this step is more complex in that a “mention” is defined by the presence of two entities, not just one – the tokens for the title of the work, and the surname of the (first) author. And, these two keyword sequences can be separated in the document by an arbitrary number of gap tokens. This also has to be done at significant scale – we need to probe for 150 million title/author pairs in 25 billion words of full-text data.

To do this, we use a procedure similar to the Aho-Corasick algorithm for multiple-pattern sequence matching, adapted to accommodate variable-length gaps between the title and author keywords. This allows us to search for all title/author pairs (in either order) with a single pass through the tokens in the corpus, taking all pairs that are separated by up to 10 tokens.

_images/title-author-matches.png

Fig. 6 https://directedstudies.yale.edu/syllabi/syllabi/literature-syllabus
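A greatly simplified sketch of the candidate-matching idea (scanning one tokenized document against a small record list, rather than the single Aho-Corasick-style pass over the whole corpus; the record fields and function name are ours):

   # Find places where a record's title tokens and (first) author surname
   # both occur, in either order, within `max_gap` tokens of each other.
   def find_candidates(tokens, records, max_gap=10):
       tokens = [t.lower() for t in tokens]
       hits = []
       for rec in records:
           title = [t.lower() for t in rec["title"].split()]
           author = rec["author"].lower()
           title_pos = [i for i in range(len(tokens) - len(title) + 1)
                        if tokens[i:i + len(title)] == title]
           author_pos = [i for i, t in enumerate(tokens) if t == author]
           for ti in title_pos:
               for ai in author_pos:
                   # tokens between the end of one span and the start of the other
                   gap = ai - (ti + len(title)) if ai > ti else ti - (ai + 1)
                   if 0 <= gap <= max_gap:
                       hits.append((rec["id"], min(ti, ai)))
       return hits

   records = [{"id": 1, "title": "Attention Is All You Need", "author": "Vaswani"}]
   doc = "Read Attention Is All You Need by Vaswani et al for Tuesday".split()
   print(find_candidates(doc, records))  # -> [(1, 1)]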

Second – we apply a binary validation model to prune out false-positive matches. In many cases, the combined tokens in the author name and title are so rare that the raw keyword match is almost guaranteed to represent a legitimate citation. For example, for a pair like attention is all you need / vaswani – if those two sequences appear within 10 tokens of each other, the raw joint probability of those tokens is so low that the match is virtually certain to be a reference to the paper.

But this breaks down for a relatively small (but important) subset of books and articles with short titles and common word forms in the title and author fields, which includes many highly canonical and frequently-assigned works. For example, if the words “politics” and “aristotle” appear within a 10-token radius, this might be a reference to Aristotle’s Politics, or it might just be those words being used in ordinary prose – as here, where there’s no actual Politics:

_images/politics-invalid.png

Fig. 7 https://www.academia.edu/1746188

To remove these, we use an LSTM over the document contexts around the keyword matches that predicts whether the match is a real citation or just an incidental co-occurrence of terms.

_images/match-segments.png

This model is trained on 12,316 hand-labeled examples, and gets to ~90% test accuracy. When run on the 2.7 corpus, this results in 57,248,314 validated citations.
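As a sketch of what a validator along these lines can look like (in PyTorch; the dimensions, vocabulary handling, and class name are placeholders rather than the production configuration):

   # Binary classifier over the context window around a title/author match:
   # embedding layer -> bidirectional LSTM -> linear head over the final
   # hidden states, producing a logit for "real citation" vs "incidental".
   import torch
   import torch.nn as nn

   class MatchValidator(nn.Module):
       def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
           super().__init__()
           self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
           self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
           self.head = nn.Linear(2 * hidden_dim, 1)

       def forward(self, token_ids):
           x = self.embed(token_ids)              # (batch, seq, embed)
           _, (h, _) = self.lstm(x)               # h: (2, batch, hidden)
           h = torch.cat([h[0], h[1]], dim=-1)    # concat both directions
           return self.head(h).squeeze(-1)        # one logit per window

   model = MatchValidator(vocab_size=50_000)
   logits = model(torch.randint(1, 50_000, (4, 40)))  # 4 windows of 40 tokens
   probs = torch.sigmoid(logits)                      # P(real citation)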

Note

It’s important to note that we currently just match at the level of the “work,” and don’t disambiguate among different “expressions” of the same work – publishers, editions, translations, etc. For example, the current version of the data can say that Paul Krugman’s textbook International Economics: Theory and Policy is assigned 2,002 times in the corpus, but we don’t distinguish among the 10 editions of the book. Likewise, for highly canonical works like the Iliad by Homer – in the underlying knowledge base we might have 50-60 records corresponding to different editions or translations, but the current data can only say that some combination of those editions was assigned 3,352 times. In future releases, we plan to disambiguate citation matches at the level of publisher, translator, and edition.

Near-duplicate clustering

Open Syllabus harvests syllabi from institutional repositories, course catalogs, faculty homepages, and departmental websites, and we regularly re-crawl the same sites to pick up new documents as they come online. This means that we often pick up multiple copies of the same documents. It’s easy to drop exact duplicates with checksum comparisons, but, as with any web crawling effort, there’s a long tail of ways that pages can change subtly even when the document content is unchanged at a substantive level – site redesigns can cause menu chrome to change; date widgets can print out unique timestamps every time the page is requested; a typo in a course description can get fixed; etc.

Typically the way to handle this is to do some kind of near-duplicate clustering over the document content. Syllabi are somewhat unusual, though, in that certain differences between documents might be very small in terms of the raw diff – maybe just a few words or characters – but still constitute a meaningful distinction between versions that we want to preserve in the corpus. For example, say a course is taught for five years in a row, and each year the syllabus is essentially identical, with a couple of small exceptions – the date changes each year, and in the fourth and fifth years, the required textbook is changed. Each of the five versions of the syllabus might be ~99% identical at a lexical level, with just the date string and the 4-5 word textbook citation differing. But, when aggregated across millions of courses, these types of changes can be interesting – for example, if you’re tracking shifts in book assignment trends over time. So, we wouldn’t want to discard these versions as duplicates, even though they’re very similar.

To get around this, we dedupe syllabi on a composite key formed from the combination of an LSH index bucket and a subset of the span-level metadata that we extract from the documents (a sketch of the key construction follows the list):

  • LSH bucket (trigram shingles, Jaccard similarity threshold of >0.95)

  • Course code

  • Title

  • Year

  • Semester

  • Email addresses

  • Assigned books + articles
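A much-simplified sketch of the composite key (the “bucket” below is a crude MinHash-style stand-in for a real LSH index bucket, and the field names are assumptions; the point is just that the key combines a content fingerprint with the extracted metadata, so two documents only collapse when every component agrees):

   # Composite dedupe key: a content fingerprint over character trigrams plus
   # the extracted span-level fields. `doc` is a hypothetical dict of fields.
   import hashlib

   def trigram_shingles(text):
       text = " ".join(text.lower().split())
       return {text[i:i + 3] for i in range(len(text) - 2)} or {text}

   def content_bucket(text, num_hashes=8):
       shingles = trigram_shingles(text)
       return tuple(
           min(hashlib.md5(f"{seed}:{s}".encode()).hexdigest() for s in shingles)
           for seed in range(num_hashes)
       )

   def dedupe_key(doc):
       return (
           content_bucket(doc["text"]),
           doc.get("course_code"),
           doc.get("title"),
           doc.get("year"),
           doc.get("semester"),
           tuple(sorted(doc.get("emails", []))),
           tuple(sorted(doc.get("citation_ids", []))),
       )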

In 2.7, this removes ~5M duplicates out of an initial set of 15M documents that are classified as syllabi and pass the initial checksum dedupe, leaving 10M deduplicated syllabi in the final corpus.

Field classification

The organization of fields and departments varies significantly across different institutions. To abstract over some of these differences and make it easy to facet syllabi by field across different schools, Open Syllabus classifies syllabi into a curated set of field labels based on the US Department of Education CIP codes (2010 version), but rolled up in places to avoid granular distinctions that aren’t consistently reflected in the department structure at many institutions.

As with the syllabus classification, simple bag-of-words ngram models do fairly well here. We use the LinearSVC from scikit-learn, trained on 19,882 documents. The current model is 84% accurate on test data across all 70 fields. It’s important to note that there’s significant class imbalance in both the training data and the overall corpus – there are far more business courses in the world than “Transportation” courses – and it’s possible that the model generalizes less well for these smaller fields when run against the full set of 7.5M syllabi. There are also a handful of fields in the current label set that seem to be fundamentally ill-defined (Public Administration, Career Skills, Basic Skills), which in future releases we may drop or merge into higher-order fields.
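A sketch of this kind of setup (the feature settings and the train/test variables are hypothetical, not the production configuration):

   # Sketch of an ngram TF-IDF + LinearSVC field classifier.
   # `train_texts` / `train_fields` / `test_texts` / `test_fields` are
   # hypothetical labeled splits of syllabus text and field labels.
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.metrics import classification_report
   from sklearn.pipeline import make_pipeline
   from sklearn.svm import LinearSVC

   field_clf = make_pipeline(
       TfidfVectorizer(ngram_range=(1, 2), min_df=5),
       LinearSVC(),
   )
   field_clf.fit(train_texts, train_fields)
   print(classification_report(test_fields, field_clf.predict(test_texts)))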

We plan to expand the training set for this model in spring / summer 2020, with the goal of getting overall F1 to >0.90.

Table 1 Per-field P / R / F1

Academic Field                 Precision  Recall  F1 score
Accounting                     0.96       0.90    0.93
Agriculture                    0.83       0.67    0.74
Anthropology                   0.93       0.86    0.89
Architecture                   0.81       0.75    0.78
Astronomy                      0.82       0.60    0.69
Atmospheric Sciences           0.88       1.00    0.94
Basic Computer Skills          0.75       0.76    0.76
Basic Skills                   0.77       0.62    0.69
Biology                        0.85       0.92    0.88
Business                       0.77       0.85    0.80
Career Skills                  0.47       0.32    0.38
Chemistry                      0.94       0.89    0.92
Chinese                        0.92       0.97    0.94
Classics                       0.67       0.67    0.67
Computer Science               0.74       0.82    0.78
Construction                   0.69       0.53    0.60
Cosmetology                    0.98       0.98    0.98
Criminal Justice               0.74       0.71    0.73
Criminology                    0.60       0.50    0.55
Culinary Arts                  0.81       0.97    0.88
Dance                          0.90       0.98    0.94
Dentistry                      1.00       0.97    0.99
Earth Sciences                 0.91       0.80    0.85
Economics                      0.93       0.92    0.92
Education                      0.87       0.87    0.87
Engineering                    0.66       0.70    0.68
Engineering Technician         0.58       0.57    0.57
English Literature             0.87       0.91    0.89
Film and Photography           0.82       0.72    0.77
Fine Arts                      0.84       0.86    0.85
Fitness and Leisure            0.83       0.75    0.79
French                         0.97       0.98    0.97
Geography                      0.87       0.89    0.88
German                         0.90       0.95    0.93
Health Technician              0.71       0.70    0.71
Hebrew                         0.87       0.93    0.90
History                        0.90       0.91    0.90
Japanese                       0.95       1.00    0.98
Journalism                     0.93       0.99    0.96
Law                            0.82       0.82    0.82
Liberal Arts                   0.78       0.65    0.71
Library Science                0.93       0.90    0.92
Linguistics                    0.97       0.87    0.92
Marketing                      0.82       0.82    0.82
Mathematics                    0.92       0.97    0.95
Mechanic / Repair Tech         0.81       0.69    0.74
Media / Communications         0.83       0.77    0.80
Medicine                       0.74       0.74    0.74
Military Science               0.91       0.91    0.91
Music                          0.78       0.89    0.83
Natural Resource Management    0.70       0.57    0.63
Nursing                        0.93       0.86    0.89
Nutrition                      0.88       0.81    0.85
Philosophy                     0.83       0.83    0.83
Physics                        0.83       0.89    0.86
Political Science              0.87       0.93    0.90
Psychology                     0.86       0.92    0.89
Public Administration          0.83       0.33    0.48
Public Safety                  0.71       0.76    0.74
Religion                       0.85       0.80    0.82
Sign Language                  0.98       1.00    0.99
Social Work                    0.90       0.86    0.88
Sociology                      0.92       0.74    0.82
Spanish                        0.94       0.94    0.94
Theatre Arts                   0.86       0.81    0.84
Theology                       0.94       0.81    0.87
Transportation                 0.69       0.53    0.60
Veterinary Medicine            0.72       0.81    0.76
Women’s Studies                0.80       0.91    0.85

Table 2 Overall performance

                Precision  Recall  F1
Accuracy                           0.84
Macro avg       0.83       0.81    0.82
Weighted avg    0.84       0.84    0.84