Pipeline and models

At a technical level, Open Syllabus consists of essentially three sub-projects:

  1. A web crawling architecture for collecting syllabi. We use a combination of broad crawl strategies that scan for syllabi under a large set of seeds (harvested in bulk from Common Crawl and curated manually) and a set of custom scrapers that target specific sites.

  2. A suite of machine learning models that extract structured metadata from the raw documents – course code, title, year, semester, field, institution, course description, book and article assignments, learning objectives, and more.

  3. A Spark pipeline that glues everything together – ETL on the ~40 TB repository of web crawl data, model inference, and final dataset shaping. (A minimal sketch of this glue layer follows the diagram below.)

_images/pipeline.png
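
To make the shape of that glue layer concrete, here is a minimal PySpark sketch of the classification stage. The paths, column names, and the keyword-based scoring stand-in are illustrative placeholders, not the production pipeline:

    # Minimal PySpark sketch of the glue layer: read flattened crawl text, score
    # each document, keep the ones that look like syllabi. Paths, column names,
    # and the scoring stand-in are illustrative, not the production pipeline.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("open-syllabus-sketch").getOrCreate()

    # Assume the crawl has already been flattened to (url, text) Parquet records.
    docs = spark.read.parquet("s3://bucket/crawl-text/")

    @F.udf(DoubleType())
    def syllabus_score(text):
        # Stand-in for model inference; in practice this would call a trained
        # classifier broadcast to the workers.
        keywords = ("syllabus", "course description", "office hours")
        hits = sum(k in (text or "").lower() for k in keywords)
        return hits / len(keywords)

    syllabi = (
        docs
        .withColumn("p_syllabus", syllabus_score("text"))
        .filter(F.col("p_syllabus") >= 0.5)
    )

    syllabi.write.mode("overwrite").parquet("s3://bucket/syllabi/")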

Syllabus classification

Syllabi are quite distinctive documents at a lexical level, and can be reliably identified with basic bag-of-words classifiers. We use a simple logistic regression over ngram features (10,000 most frequent, size 1-3, tf-idf weighted) and train on a class-balanced set of 13,743 documents. On a held-out test set of 4,581 documents, the current model is 97% accurate.
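
A minimal sketch of this kind of classifier, using scikit-learn – the exact hyperparameters and train/test handling here are illustrative, not the production configuration:

    # Sketch of a bag-of-words syllabus classifier along the lines described
    # above: tf-idf weighted 1-3 grams, 10k vocabulary, logistic regression.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    def train_syllabus_classifier(texts, labels):
        """texts: list of document strings; labels: 1 = syllabus, 0 = other."""
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.25, stratify=labels, random_state=0
        )
        model = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 3), max_features=10_000),
            LogisticRegression(max_iter=1000),
        )
        model.fit(X_train, y_train)
        print("held-out accuracy:", model.score(X_test, y_test))
        return model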

In practice, when run on the full set of ~350 million candidate documents in our web crawl corpus, the real performance is probably slightly worse, due to the continually shifting distribution of the data in the corpus as we expand the crawling architecture and incorporate data from new sources.

Separate from the model itself, there’s also some fundamental ambiguity about what does and doesn’t constitute a syllabus. For example, university websites often contain short “stub” pages about courses that sit in a middle ground between a full-blown syllabus and a course catalog entry – they generally include a course code, title, date, and sometimes a description, but not the rich course content you’d find in the long-format syllabus document an instructor would hand out to students in the class. When labeling training data, we currently define syllabi fairly loosely as a “document that describes a specific instance in which a course is taught.” Depending on your needs, though, this might mean that the number of documents that meet your requirements is somewhat lower than the overall count – for example, if you were only interested in syllabi with detailed week-by-week assignment sequences, or with long-format course descriptions.

In the future, we may train more granular models that can separate these different types of syllabi and syllabus-like documents – full-blown syllabi, short syllabi, course catalog entries, course stub pages on department websites, etc.

Course metadata extraction (“Syllaparse”)

Syllabi are an interesting mix of structured, “semi-structured,” and totally unstructured content. Most include the core information about the course that might show up in course catalog data – course code, title, semester, year, department, and usually some kind of narrative description of the course content, if only a couple of sentences. Beyond this, many syllabi include a set of more loosely structured sections that appear with some consistency – lists of required books, week-by-week assignment sequences, learning objectives, grading rubrics, assessment strategies, etc.

The difficulty, though, is that this is all locked inside of unstructured text documents. Most syllabi are distributed as raw PDF / DOCX / HTML files. Some institutions use course management systems that represent syllabi in structured ways internally, but when scraped from the web, we just see the rendered HTML. So in practice, when working with syllabi at scale across hundreds of institutions, we’re effectively dealing with a big heap of unstructured text blobs.

To convert these into a structured format that can be easily analyzed, Open Syllabus uses a suite of information extraction models (“Syllaparse”) to extract span-level metadata from the documents. We approach this as a sequence labeling task, in which each individual word in a document is assigned a tag, and the individual tag predictions are then clustered together into entities. At a high level, we follow the architecture described in the original BERT paper for the CoNLL task – a pre-trained transformer (we use DistilBERT in production, for speed) forms contextual representations of the document tokens, and each token state is passed to a classification head that predicts a tag for that token.

[Figure: token classification with BERT over an example course header – “CS 280, Machine Learning for Natural Language” – with the wordpieces mapped to BIO tags (B-CODE, I-CODE, B-TITLE, I-TITLE, O).]
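
The wiring for this kind of setup, using the Hugging Face transformers library, looks roughly like the sketch below. The BIO tag set is a simplified, illustrative subset of the full label set, and the freshly initialized head would of course need to be fine-tuned on labeled syllabi before the predictions mean anything:

    # Sketch of the token-classification setup: a pre-trained DistilBERT
    # encoder with a per-token classification head. Tag set is illustrative.
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-CODE", "I-CODE", "B-TITLE", "I-TITLE"]

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForTokenClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=len(labels)
    )

    text = "CS 280, Machine Learning for Natural Language"
    inputs = tokenizer(text, return_tensors="pt")
    logits = model(**inputs).logits            # (1, seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1)[0]        # one tag id per wordpiece
    tags = [labels[i] for i in pred_ids.tolist()]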

One challenge in applying BERT-like models in this setting is the length of the documents – syllabi vary in size, but the average document in the Open Syllabus corpus runs around 3,000 words, almost always over the 512-token max sequence length in BERT. To get around this, we first split docs into a set of chunks that overlap by 50 tokens on both sides of the break points. For example, with a chunk size of 400, the first sequence would be tokens 0-450, the second 350-850, then 750-1250, etc. This way, every word in the document is guaranteed to be encoded at least once with no fewer than 50 words of context on both sides, which avoids low-context “sharp edges” at the chunk boundaries. In the real model, we use a chunk size of 510 tokens, which fills the 512-token BERT window once [CLS] and [SEP] are added.

[Figure: a document split into overlapping chunks, with each chunk sharing a fixed window of words with its neighbors on either side of the break points.]
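
In code, the chunking step looks something like this – a sketch using the 400/50 numbers from the example above, rather than the larger production chunk size:

    # Sketch of the overlapping-chunk split described above. With chunk_size=400
    # and overlap=50, a document yields token windows 0-450, 350-850, 750-1250,
    # and so on, so every token is seen with surrounding context on both sides.
    def split_into_chunks(tokens, chunk_size=400, overlap=50):
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + chunk_size, len(tokens))
            lo = max(start - overlap, 0)
            hi = min(end + overlap, len(tokens))
            # (lo, hi) are document offsets; keep them so predictions can be
            # mapped back to document positions later.
            chunks.append((lo, hi, tokens[lo:hi]))
            start = end
        return chunks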

Once the documents are split into these overlapping chunks, we treat them as standalone sub-documents and train standard classification heads over the tokens. At inference time, we split the docs into chunks, group them into fixed-size minibatches, run the model forward pass, and then re-group the chunks by document to assemble a single set of tag predictions for each document.

One issue at this step is that we end up with two predictions for tokens that fall inside the “overlap” regions of the documents. In the 400-token chunk example, tokens 0-350 would have just one prediction, from the first chunk; but tokens 350-450 would each have two predictions, one from chunk 1 and another from chunk 2. Ultimately, though, we want a single tag sequence for the document, so these overlap predictions need to be merged somehow. There are various ways to approach this, ranging from simple heuristics up to more complicated techniques that involve learning a function to interpolate between the two states. (Though interestingly, Joshi et al. found that this didn’t do better than a baseline with no overlap at all.) Currently, we just take the prediction that is farthest from a chunk edge – the prediction with the larger “support” inside its chunk.
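
A sketch of that merge rule – for each token with two predictions, keep the one whose chunk gives it the most interior support:

    # Sketch of the merge rule described above: when a token gets predictions
    # from two overlapping chunks, keep the one made farthest from a chunk edge.
    def merge_chunk_predictions(doc_len, chunk_preds):
        """chunk_preds: list of (lo, hi, tags), where tags[i] is the prediction
        for document token lo + i, as produced by the chunking sketch above."""
        best_tag = [None] * doc_len
        best_support = [-1] * doc_len
        for lo, hi, tags in chunk_preds:
            for offset, tag in enumerate(tags):
                pos = lo + offset
                # Distance to the nearest edge of this chunk.
                support = min(offset, (hi - lo - 1) - offset)
                if support > best_support[pos]:
                    best_support[pos] = support
                    best_tag[pos] = tag
        return best_tag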

This results in a single tag sequence per document, patched together from the predictions made on the individual chunks. We then cluster together the individual token tags to form multi-word entities, and then map these token sequences back to character spans in the original documents.

In 2.5 we provide:

  • Course code – The identifier for the course that appears in the institutional catalog.

  • Course section – The identifier for the specific section of the course that the syllabus corresponds to.

  • Course title – The name of the course.

  • Date – The calendar year and semester in which the course was taught.

  • Course description – Narrative description of the course content.

  • Learning outcomes – Lists of skills or competencies that students are expected to acquire.

  • Topic outline – Lists of topics covered in the course.

  • Required readings – Citations for books, articles, or other materials that are required for the course. (As opposed to just cited or recommended.)

  • Assessment strategy – How grades are assigned.

  • Grading rubric – The type and range of grades that can be given.

  • Assignment schedule – Week-by-week sequences of readings, assignments, and topics.

We evaluate the model in terms of the number of perfect matches (where the predicted span is an exact character match with the gold annotation) and also the character-level overlap ratio between the gold and predicted spans, a more forgiving metric that captures the fairly common case where the model predicts a subset or superset of the gold annotation. For example, if the labeler marked 10 description sentences and the model tags 9, this is a fairly good result, though it would count as a miss under a standard NER-style evaluation.
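
One reasonable way to compute a character-level overlap score for a pair of spans is to treat each span as a character range and take intersection over union – the evaluation’s exact definition may differ slightly, but the idea is the same:

    # Sketch of a character-level overlap metric for two spans, treated as
    # half-open character ranges [start, end). The production metric may be
    # defined slightly differently; this is one reasonable formulation.
    def char_overlap(gold, pred):
        """gold, pred: (start, end) character offsets into the same document."""
        inter = max(0, min(gold[1], pred[1]) - max(gold[0], pred[0]))
        union = (gold[1] - gold[0]) + (pred[1] - pred[0]) - inter
        return inter / union if union else 0.0

    # A prediction that covers 9 of 10 gold sentences scores high here, even
    # though it would count as a miss under exact-match NER scoring.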

In 2.5, the predictions from the model have a 0.815 character-level overlap with the gold annotations, when averaging the individual scores across all 11 fields.

This model is the newest piece of the Open Syllabus pipeline, and still very much a work in progress. For future releases, we’re exploring a number of new directions at the modeling level, including more sophisticated strategies for chunking the documents and merging overlapping predictions, as well as modified transformer architectures that can accommodate longer sequences.

Citation extraction

Syllabi are full of references to different kinds of educational resources – mainly books and articles, but also websites, YouTube videos, Wikipedia articles, movies, paintings, musical scores, and more. These citations are an incredibly rich source of data, in that they represent millions of expert-level judgements about what materials are best to learn a given subject.

From an information extraction standpoint, though, assignment strings in syllabi are tricky. There’s a large literature on extracting bibliographic citations from scientific papers, where there’s a high level of structure – citations are listed out as formal references at the bottom of the PDF, formatted in one of a handful of major citation styles. In syllabi, by contrast, there’s no standardization in how resources are cited. Occasionally – mostly in the sciences – you’ll see well-structured citations like the ones you find in research papers.

But this is the exception, not the rule. More common is a more casual style, often with just the title of the work and the author’s name.

Sometimes the instructor will specify a particular edition or translation of the text.

But again, this is relatively uncommon; generally you can only count on a title and the surname of the (first) author. And, even assuming that just those two elements are present, there’s wide variation in how the reference is formatted. Sometimes, author first and then title; other times title then author; sometimes with words interspersed (“by”), other times with just a comma or dash; sometimes with the author listed once, and then multiple works by the author; and so on.

Citations to the same work can also appear multiple times with different formatting within the same syllabus. For example, one fairly common pattern is for a syllabus to list out a set of required texts somewhere in the document, maybe near the beginning.

And then later on, in a section with week-by-week reading assignments, those texts will be referenced in some kind of abbreviated / short-hand format.

But where possible, we very much want to capture all of these individual citation occurrences, because the specific positions inside the document can be interesting in their own right. Chronologically-ordered assignment blocks, for example, encode information about the sequence of and dependencies among texts – potentially very rich information. By modeling those sequences across millions of docs, you could induce a kind of massive dependency graph over books and articles.

So, how to reliably extract these kinds of highly unstructured citation strings? There are a few ways to go at this; we take an approach that tries to both maximize accuracy and also minimize the amount of manual data labeling. We leverage the fact that, unlike with unconstrained free-text entities like course codes and titles, with citation extraction we’re linking against a closed (though very large) set of target entities – in effect, all known books and articles. This makes it possible to treat the problem as an entity linking task, split into two parts:

First – starting with a knowledge base of ~150M bibliographic records, we surface a high-recall / low precision set of candidate matches by scanning the documents for keywords that constitute what can be thought of as a “minimal lexical footprint” of a book or article. In the context of traditional entity linking tasks, this is comparable to the step of locating “mentions” to knowledge base entities, often implemented as a simple lookup table that maps keyword(s) to sets of candidate entities. In the case of book and article citations, though, this step is more complex in that a “mention” is defined by the presence of two entities, not just one – the tokens for the title of the work, and the surname of the (first) author. And, these two keyword sequences can be separated in the document by an arbitrary number of gap tokens. This also has to be done at significant scale – we need to probe for 150 million title/author pairs in 25 billion words of full-text data.

To do this, we use a procedure similar to the Aho-Corasick algorithm for multiple-pattern string matching, adapted to accommodate variable-length gaps between the title and author keywords. This is written in Rust using a custom trie implementation optimized around the matching procedure, which allows us to index the full set of 150M query sequences in memory on a worker node. We then search for all title/author pairs (in either order) with a single linear pass through tokens in the corpus, taking all pairs that are separated by up to 10 tokens.
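
The production matcher is the Rust trie described above; the core matching logic, stripped of the Aho-Corasick machinery and written against a toy in-memory knowledge base, looks roughly like this:

    # Simplified sketch of the candidate-matching step: find places where the
    # title tokens and the author surname of a known work both occur, in either
    # order, separated by at most `max_gap` tokens. The production version does
    # this with a custom Rust trie over ~150M records; this is just the logic.
    def find_candidate_citations(tokens, works, max_gap=10):
        """tokens: lower-cased document tokens.
        works: list of (work_id, title_tokens, author_surname)."""
        matches = []
        for work_id, title, author in works:
            n = len(title)
            title_hits = [
                i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == title
            ]
            author_hits = [i for i, tok in enumerate(tokens) if tok == author]
            for t in title_hits:
                for a in author_hits:
                    # Gap between the end of one span and the start of the other.
                    gap = a - (t + n) if a >= t + n else t - (a + 1)
                    if 0 <= gap <= max_gap:
                        matches.append((work_id, t, a))
        return matches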

Second – we apply a binary validation model to prune out false-positive matches. In many cases, the combined tokens in the author name and title are sufficiently rare that the raw keyword match is almost guaranteed to represent a legitimate citation. For example, for a pair like attention is all you need / vaswani – if those two sequences appear within 10 tokens, the raw joint probability over the tokens is so low that it’s virtually certain to be a reference to the paper.

But this breaks down for a relatively small (but important) subset of books and articles with short titles and common word forms in the title and author fields, which includes many highly canonical and frequently-assigned works. For example, if the words “politics” and “aristotle” appear within a 10-token radius, this might be a reference to Aristotle’s Politics, or it might just be those words appearing in ordinary prose, with no actual reference to the Politics at all.

Also, at a more practical level – in large knowledge bases like the catalog of books and articles that we use, there are always some bibliographic records that are essentially junk, but have tokens in the title and author fields that appear frequently in the corpus. (For example – highly general stub records like “introduction” by “editor,” or cases where fields have been populated incorrectly – “los angeles” by “california,” etc.) Many of these can be excluded up front with simple heuristics, but invariably some slip through and end up producing large numbers of bogus matches that don’t actually represent real citations to books or articles.

To remove these, we use a bidirectional LSTM over the document contexts around the keyword matches that predicts whether the match represents a real reference / citation / assignment, or just an incidental co-occurrence of terms. We pass in the document snippet that contains the title and author keyword matches:

_images/match-segments.png

This could be encoded as a single sequence, but, because the salient features for the task tend to be concentrated around the boundaries between the segments (i.e. – Is the first letter of the author span capitalized? Is there a comma immediately after the last title token?), we find that the best performance comes from first encoding the segments independently and then concatenating together the embeddings of the individual pieces to form a representation of the full span, which essentially forces the model to attend to the known structure in the sequence.
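
A sketch of this segment-wise encoder in PyTorch – the five-segment split (left context, title, gap, author, right context), the dimensions, and the other details are illustrative assumptions rather than the production configuration:

    # Sketch of the validation model: encode each segment with a shared BiLSTM,
    # concatenate the per-segment embeddings, and classify the whole match.
    import torch
    import torch.nn as nn

    class MatchValidator(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, n_segments=5):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(
                embed_dim, hidden_dim, batch_first=True, bidirectional=True
            )
            self.clf = nn.Linear(2 * hidden_dim * n_segments, 2)

        def encode_segment(self, token_ids):
            # token_ids: (batch, seg_len) -> (batch, 2 * hidden_dim)
            _, (h_n, _) = self.lstm(self.embed(token_ids))
            return torch.cat([h_n[0], h_n[1]], dim=-1)

        def forward(self, segments):
            # segments: list of 5 padded tensors, one per segment, (batch, seg_len).
            encoded = [self.encode_segment(s) for s in segments]
            return self.clf(torch.cat(encoded, dim=-1))  # (batch, 2) logits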

This model is trained on 12,316 hand-labeled examples, and gets to ~90% test accuracy. When run on the 2.5 corpus, this results in 57,248,314 validated citations.

Note

It’s important to note that we currently just match at the level of the “work,” and don’t disambiguate among different “expressions” of the same work – publishers, editions, translations, etc. For example, the current version of the data can say that Paul Krugman’s textbook International Economics: Theory and Policy is assigned 2,002 times in the corpus, but we don’t distinguish among the 10 editions of the book. Likewise, for highly canonical works like the Iliad by Homer – in the underlying knowledge base we might have 50-60 records corresponding to different editions or translations, but the current data can only say that some combination of those editions was assigned 3,352 times. In future releases, we plan to disambiguate citation matches at the level of publisher, translator, and edition.

Near-duplicate clustering

Open Syllabus harvests syllabi from institutional repositories, course catalogs, faculty homepages, and departmental websites, and does so on a repeated basis – roughly once per semester – to pick up new documents as they come online. Because we regularly re-crawl the same sites, we’re susceptible to pulling in duplicate copies of the same documents. We can trivially drop exact duplicates via checksum comparisons, but, as with any web crawling effort, there’s a long tail of ways that pages can change in small and subtle ways even though the document content is unchanged at a substantive level – site redesigns can cause menu chrome to change; date widgets can print out unique timestamps every time the page is requested; a typo in a course description can get fixed; etc.

Typically the way to handle this is to do some kind of near-duplicate clustering over the document content, generally via something like minhash / LSH indexing. Syllabi are somewhat unusual though, in the sense that there are certain types of differences between documents that might be very small in terms of the size of the raw diff – maybe just a few words or characters – but that constitute a meaningful distinction between different versions of the documents that we want to preserve in the corpus. For example, say a course is taught for five years in a row, and each year the syllabus is essentially identical, with a couple small exceptions – the date changes each year, and in the fourth and fifth years, the required textbook is changed. Each of the five versions of the syllabus might be ~99% similar, with just the date string and the 4-5 word textbook citation differing. But, when aggregated across millions of courses, these types of changes can be valuable in various analytical ways – for example, if you’re interested in tracking shifts in book assignment trends over time. So we wouldn’t want to treat these documents as duplicates, even though they’re virtually identical and would likely hash to the same bucket in an LSH index.

To get around this, we dedupe syllabi on a composite key formed from the combination of an LSH index bucket and a subset of the span-level metadata that we extract from the documents (see the sketch after this list):

  • LSH bucket (trigram shingles, Jaccard similarity threshold of >0.95)

  • Course code

  • Title

  • Year

  • Semester

  • Email addresses

  • Assigned books + articles
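
A sketch of how such a composite key might be assembled – the field names and the lsh_bucket argument are placeholders for the real extraction output and LSH index lookup:

    # Sketch of the composite dedupe key: two documents count as duplicates only
    # if they fall in the same LSH bucket *and* agree on the extracted metadata.
    # `lsh_bucket` stands in for the real MinHash/LSH index lookup.
    import hashlib
    import json

    def dedupe_key(doc, lsh_bucket):
        parts = {
            "bucket": lsh_bucket,                      # from the LSH index
            "code": doc.get("course_code"),
            "title": doc.get("course_title"),
            "year": doc.get("year"),
            "semester": doc.get("semester"),
            "emails": sorted(doc.get("emails", [])),
            "citations": sorted(doc.get("citation_ids", [])),
        }
        blob = json.dumps(parts, sort_keys=True, ensure_ascii=False)
        return hashlib.sha1(blob.encode("utf-8")).hexdigest()

    # Documents sharing a key are collapsed to one; a syllabus that differs only
    # in its year or in one assigned textbook gets a different key and survives.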

In 2.5, this removes 2.5M duplicates out of an initial set of 10M documents that are classified as syllabi and pass the initial checksum dedupe, leaving 7.5M deduplicated syllabi in the final corpus.

Field classification

The organization of fields and departments varies significantly across different institutions. To abstract over some of these differences and make it easy to facet syllabi by field across different schools, Open Syllabus classifies syllabi into a curated set of field labels based on the US Department of Education CIP codes (2010 version), but rolled up in places to avoid granular distinctions that aren’t consistently reflected in the department structure at many institutions.

As with the syllabus classifier, simple bag-of-words ngram models do fairly well here. We use the LinearSVC from scikit-learn, trained on 19,882 documents. The current model is 84% accurate on test data across all 69 fields. It’s important to note that there’s significant class imbalance in both the training data and the overall corpus – there are far more Business courses in the world than “Transportation” courses – and it’s possible that the model generalizes less well for these smaller fields when run against the full set of 7.5M syllabi. There are also a handful of fields in the current label set that seem to be fundamentally ill-defined (Public Administration, Career Skills, Basic Skills), which in future releases we may drop or merge into higher-order fields.
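
A sketch of this setup, including the classification_report call that produces per-field precision / recall / F1 numbers like those in the tables below – the feature settings are illustrative, not the production configuration:

    # Sketch of the field classifier: tf-idf ngrams into a linear SVM, with a
    # per-class precision / recall / F1 report on the held-out split.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def train_field_classifier(texts, fields):
        X_train, X_test, y_train, y_test = train_test_split(
            texts, fields, test_size=0.25, stratify=fields, random_state=0
        )
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
        model.fit(X_train, y_train)
        # Per-class P / R / F1, plus macro and weighted averages.
        print(classification_report(y_test, model.predict(X_test)))
        return model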

We plan to expand the training set for this model in spring / summer 2020, with the goal of getting overall F1 to >0.90.

Table 1 Per-field P / R / F1

Academic Field                Precision   Recall   F1 score
Accounting                    0.96        0.90     0.93
Agriculture                   0.83        0.67     0.74
Anthropology                  0.93        0.86     0.89
Architecture                  0.81        0.75     0.78
Astronomy                     0.82        0.60     0.69
Atmospheric Sciences          0.88        1.00     0.94
Basic Computer Skills         0.75        0.76     0.76
Basic Skills                  0.77        0.62     0.69
Biology                       0.85        0.92     0.88
Business                      0.77        0.85     0.80
Career Skills                 0.47        0.32     0.38
Chemistry                     0.94        0.89     0.92
Chinese                       0.92        0.97     0.94
Classics                      0.67        0.67     0.67
Computer Science              0.74        0.82     0.78
Construction                  0.69        0.53     0.60
Cosmetology                   0.98        0.98     0.98
Criminal Justice              0.74        0.71     0.73
Criminology                   0.60        0.50     0.55
Culinary Arts                 0.81        0.97     0.88
Dance                         0.90        0.98     0.94
Dentistry                     1.00        0.97     0.99
Earth Sciences                0.91        0.80     0.85
Economics                     0.93        0.92     0.92
Education                     0.87        0.87     0.87
Engineering                   0.66        0.70     0.68
Engineering Technician        0.58        0.57     0.57
English Literature            0.87        0.91     0.89
Film and Photography          0.82        0.72     0.77
Fine Arts                     0.84        0.86     0.85
Fitness and Leisure           0.83        0.75     0.79
French                        0.97        0.98     0.97
Geography                     0.87        0.89     0.88
German                        0.90        0.95     0.93
Health Technician             0.71        0.70     0.71
Hebrew                        0.87        0.93     0.90
History                       0.90        0.91     0.90
Japanese                      0.95        1.00     0.98
Journalism                    0.93        0.99     0.96
Law                           0.82        0.82     0.82
Liberal Arts                  0.78        0.65     0.71
Library Science               0.93        0.90     0.92
Linguistics                   0.97        0.87     0.92
Marketing                     0.82        0.82     0.82
Mathematics                   0.92        0.97     0.95
Mechanic / Repair Tech        0.81        0.69     0.74
Media / Communications        0.83        0.77     0.80
Medicine                      0.74        0.74     0.74
Military Science              0.91        0.91     0.91
Music                         0.78        0.89     0.83
Natural Resource Management   0.70        0.57     0.63
Nursing                       0.93        0.86     0.89
Nutrition                     0.88        0.81     0.85
Philosophy                    0.83        0.83     0.83
Physics                       0.83        0.89     0.86
Political Science             0.87        0.93     0.90
Psychology                    0.86        0.92     0.89
Public Administration         0.83        0.33     0.48
Public Safety                 0.71        0.76     0.74
Religion                      0.85        0.80     0.82
Sign Language                 0.98        1.00     0.99
Social Work                   0.90        0.86     0.88
Sociology                     0.92        0.74     0.82
Spanish                       0.94        0.94     0.94
Theatre Arts                  0.86        0.81     0.84
Theology                      0.94        0.81     0.87
Transportation                0.69        0.53     0.60
Veterinary Medicine           0.72        0.81     0.76
Women’s Studies               0.80        0.91     0.85

Table 2 Overall performance

                Precision   Recall   F1
Accuracy                             0.84
Macro avg       0.83        0.81     0.82
Weighted avg    0.84        0.84     0.84