syllabi DataFrame

The syllabi dataframe is the core of the Open Syllabus data: each row corresponds to a single syllabus document. In addition to the raw document data, each row is annotated with metadata extracted by a suite of information extraction and entity linking models, including institution, field, date, course code, course title, and description text.

_id

An OS-assigned surrogate identifier for the syllabus.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

md5

The md5 checksum of the raw document bytes.
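As a rough illustration, a checksum of this kind can be computed with Python's standard hashlib module. The function name md5_of_bytes is hypothetical, and the assumption that the stored value is the lowercase hexadecimal digest is ours, not something the dataset documentation states.

```python
import hashlib


def md5_of_bytes(raw_bytes):
    """Return the MD5 digest of raw document bytes as a hex string.

    Assumption: the md5 field holds a lowercase hex digest.
    """
    return hashlib.md5(raw_bytes).hexdigest()


# Made-up document bytes, for illustration only.
example_digest = md5_of_bytes(b"Course syllabus: Introduction to Statistics")
```

Because the checksum depends only on the bytes, two rows with the same md5 value were built from byte-identical source documents.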

doc_type

The underlying source document type. Possible values are html, plain, pdf, doc, docx, rtf.

syllabus_probability

The probability that the document is a syllabus, per the syllabus classifier. The classifier is trained and tested around a 0.5 threshold: every document assigned a score above 0.5 is considered a syllabus. The majority of documents in the syllabi dataframe have a score greater than 0.5, but occasionally OS will manually identify certain groups of documents as syllabi, regardless of the output of the syllabus classifier.

Filtering the syllabi dataframe with a threshold higher than 0.5 will return a set of documents with higher precision, at the cost of recall. At the 0.5 threshold, the model is 97% accurate on held-out test data. (See Pipeline and models for details.)
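The thresholding described above might look like the following minimal sketch. It assumes the rows have been loaded as plain Python dicts; the sample values, the helper name filter_by_probability, and the 0.9 cutoff are all made up for illustration.

```python
# Toy rows standing in for the syllabi dataframe; all values are invented.
rows = [
    {"_id": 1, "syllabus_probability": 0.55},
    {"_id": 2, "syllabus_probability": 0.97},
    {"_id": 3, "syllabus_probability": 0.42},  # e.g. a manually identified syllabus
    {"_id": 4, "syllabus_probability": 0.88},
]


def filter_by_probability(rows, threshold=0.5):
    """Keep rows whose classifier score is strictly above the threshold."""
    return [row for row in rows if row["syllabus_probability"] > threshold]


default_set = filter_by_probability(rows)       # the classifier's own 0.5 cutoff
strict_set = filter_by_probability(rows, 0.9)   # hypothetical stricter cutoff
```

Note that any score filter, even at 0.5, drops documents that were manually identified as syllabi despite a low classifier score (row 3 in the toy data).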

field

The academic field assigned to the document, as determined by the OS field classifier. See Pipeline and models for details about the model.

_id

An OS-assigned surrogate identifier for the field.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

cip_codes

A list of one or more IPEDS CIP codes representing the academic field(s) that make up the field.

OS’s field classifier draws heavily from the IPEDS 2010 CIP taxonomy in order to determine the academic field best associated with each syllabus. CIP codes come in two-, four-, and six-digit lengths, where two-digit codes represent a discipline, four-digit codes a subdivision of that discipline, and six-digit codes a further subdivision of the four-digit subdivision. For example, the two-digit CIP code ‘01’ is the code for all ‘Agriculture, Agriculture Operations, and Related Sciences’ courses; within that, the four-digit CIP code ‘01.01’ is the subdivision for all ‘Agricultural Business and Management’ courses, and within that, ‘01.0103’ is the subdivision for all ‘Agricultural Economics’ courses.

Our field classifier is trained and tested on a subset of the CIP taxonomy that we find most useful for describing syllabi. In some cases, we have combined codes, though we have generally done so only within the same two-digit branch of the taxonomy. In those cases, the codes are separated by a forward-slash (‘/’). For example, the code ‘45.09/45.10’ is a combination of ‘International Relations and National Security Studies’ and ‘Political Science and Government’, which are both subdivisions of code ‘45’, ‘Social Sciences’; we combine them into a field that we call “Political Science” (see name).
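The code structure described above can be unpacked with a few string operations. This is a sketch only: the helper names are hypothetical, and it assumes codes are formatted exactly as in the examples (‘01’, ‘01.01’, ‘01.0103’, with ‘/’ joining combined codes).

```python
def split_cip_code(code):
    """Split a possibly combined code such as '45.09/45.10' into its parts."""
    return [part.strip() for part in code.split("/")]


def cip_level(code):
    """Return 2, 4, or 6 depending on how many digits a single code has."""
    digits = code.replace(".", "")
    if not digits.isdigit() or len(digits) not in (2, 4, 6):
        raise ValueError(f"not a two-, four-, or six-digit CIP code: {code!r}")
    return len(digits)


def cip_ancestors(code):
    """List the code's branch from discipline down to the code itself."""
    digits = code.replace(".", "")
    level = cip_level(code)
    branch = [digits[:2]]
    if level >= 4:
        branch.append(f"{digits[:2]}.{digits[2:4]}")
    if level == 6:
        branch.append(f"{digits[:2]}.{digits[2:6]}")
    return branch


parts = split_cip_code("45.09/45.10")
disciplines = {cip_ancestors(part)[0] for part in parts}
```

For the ‘45.09/45.10’ example, both parts resolve to the same two-digit discipline, ‘45’, which matches the statement that codes are generally combined only within one two-digit branch.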

name

The field, as chosen by OS.

language

The language code for the document, as inferred by pycld2.

institution

The college or university where the syllabus was taught, as determined by the OS institution matcher. Null if no institution was matched.

_id

An OS-assigned surrogate identifier for the institution.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

grid_id

The GRID unique identifier of the institution.

wikidata_id

The Wikidata unique identifier of the institution.

unitid

The IPEDS unique identifier of the institution. Only defined for institutions within the United States.

city

The city the syllabus was taught in.

name

The name of the institution.

lat

The latitude of the institution location.

lng

The longitude of the institution location.

url

The URL to the home webpage of the institution.

country_code

The ISO 3166-1 alpha-2 code of the institution country.

country

The full English name of the country corresponding to country_code.

state_code

The ISO 3166-2 region code of the region (state, parish, district, etc.) the syllabus was taught in.

state

The English name of the state corresponding to state_code.

enrollment

The number of total students enrolled at the institution, as aggregated from several data sources. This data is the most recent available, usually the 2018-2019 school year.

two_year

Whether or not the institution is primarily a two-year institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [1, 14].

four_year

Whether or not the institution is a four-year institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 32].

graduate

Whether or not the institution is a graduate institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 20].
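The three flags above reduce to range checks on the carnegie_basic2018 code. The sketch below assumes the code is an integer and that the stated ranges are inclusive at both ends; the function name carnegie_flags is hypothetical.

```python
def carnegie_flags(carnegie_basic2018):
    """Derive the two_year, four_year, and graduate flags from a Carnegie code.

    Assumption: the ranges given in the field descriptions are inclusive.
    """
    return {
        "two_year": 1 <= carnegie_basic2018 <= 14,
        "four_year": 15 <= carnegie_basic2018 <= 32,
        "graduate": 15 <= carnegie_basic2018 <= 20,
    }


community_college = carnegie_flags(10)
doctoral_university = carnegie_flags(18)
baccalaureate_college = carnegie_flags(25)
```

Because the graduate range [15, 20] sits inside the four_year range [15, 32], every institution flagged graduate under these rules is also flagged four_year.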

research

Whether or not the institution is an R1 or R2 research institution.

date

Date information about when the syllabus was or will be taught.

year

The academic year the syllabus was or will be taught.

term

The academic term of the syllabus. Possible values are winter, spring, summer and fall.

urls

URLs extracted from the document.

text_urls

A list of URLs that were extracted directly from the plain text of the document.

combined_urls

A merged list of unique, canonicalized URLs formed from hyperlinks and text_urls. When the URL of the underlying document is known (e.g., for a syllabus scraped from the web), relative hyperlinks are expanded into absolute URLs in this list.
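The merge can be pictured roughly as follows. This is not the OS implementation: the canonicalization rules are not described here, so the sketch only expands relative links with urllib.parse.urljoin and removes exact duplicates. The function name combine_urls, the base_url parameter, and the example URLs are all hypothetical.

```python
from urllib.parse import urljoin


def combine_urls(hyperlinks, text_urls, base_url=None):
    """Merge two URL lists, expanding relative links and dropping duplicates.

    Assumption: order of first appearance is preserved; real canonicalization
    (case folding, tracking-parameter removal, etc.) is not attempted.
    """
    combined = []
    seen = set()
    for url in list(hyperlinks) + list(text_urls):
        absolute = urljoin(base_url, url) if base_url else url
        if absolute not in seen:
            seen.add(absolute)
            combined.append(absolute)
    return combined


merged = combine_urls(
    hyperlinks=["readings/week1.pdf", "https://example.org/book"],
    text_urls=["https://example.org/book"],
    base_url="https://example.edu/courses/cs101/syllabus.html",
)
```

When base_url is None (the source URL of the document is unknown), relative links are left as they were found.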

extracted_metadata

Structured text fields extracted from the syllabus by a token-level sequence tagging model. (We use DistilBERT via the transformers library, finetuned on ~35,000 annotated syllabi.)

For each field, we provide a list of document spans extracted from the syllabus. Each individual span includes fields for:

  • text - The raw text span extracted from the document, without any postprocessing.

  • mean_proba - The average of the probabilities assigned by the model to the start and end “boundary” tokens that define the span in the document. (Or, for single-token spans, just the probability assigned to the one token.) This can be used in a comparative sense as a signal for the confidence of the model on the prediction.

  • ci1 - The index of the first character in the span.

  • ci2 - The index of the last character in the span.

  • ti1 - The index of the first token in the span.

  • ti2 - The index of the last token in the span.
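A small sketch of how these span fields might be used: choosing the most confident span for a field, and recovering the text from the character offsets. The sample spans and document text are invented, the token indices are placeholders (the real tokenization is not shown here), and the sketch assumes ci2 is an inclusive index, as the wording "last character in the span" suggests.

```python
# Invented document text and spans, for illustration only.
document_text = "CS224n Fall 2009"

date_spans = [
    {"text": "Fall 2009", "mean_proba": 0.91, "ci1": 7, "ci2": 15, "ti1": 1, "ti2": 2},
    {"text": "2009", "mean_proba": 0.40, "ci1": 12, "ci2": 15, "ti1": 2, "ti2": 2},
]


def best_span(spans):
    """Return the span with the highest mean_proba, or None if there are none."""
    if not spans:
        return None
    return max(spans, key=lambda span: span["mean_proba"])


def slice_span(text, span):
    """Recover a span's text from character offsets (ci2 assumed inclusive)."""
    return text[span["ci1"]: span["ci2"] + 1]


top_date = best_span(date_spans)
recovered = slice_span(document_text, top_date)
```

Since mean_proba is described as a comparative signal, ranking spans within one field (as best_span does) is a safer use than treating the value as a calibrated probability.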

title

The name of the course. E.g., Statistical Learning Theory and Applications.

code

The identifier that appears in the institutional course catalog. Generally (but not always) a combination of a department code and course number. E.g., CS224n or 9.520.

section

The identifier for the specific section of the course that the syllabus corresponds to. Most common for lower-level courses with large enrollments.

date

The raw, un-normalized representation of the course date, as it appears in the document. Generally a term + year pair, e.g., Fall 2009.
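To show how a raw span of this kind relates to the normalized year and term values under the top-level date key, here is a simple regular-expression sketch. It is not the normalization OS uses, which is not described here; the function name parse_raw_date and the pattern are assumptions for illustration.

```python
import re

# Matches one of the four documented term names followed, after any
# non-digit text, by a four-digit year starting with 19 or 20.
_TERM_YEAR = re.compile(
    r"\b(winter|spring|summer|fall)\b\D*?\b((?:19|20)\d{2})\b",
    re.IGNORECASE,
)


def parse_raw_date(raw):
    """Parse a 'term + year' string into a dict, or return None if no match."""
    match = _TERM_YEAR.search(raw)
    if match is None:
        return None
    return {"term": match.group(1).lower(), "year": int(match.group(2))}


parsed = parse_raw_date("Fall 2009")
```

Raw spans that do not follow the term + year pattern (for example, a bare month or a date range) return None under this sketch and would need separate handling.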

class_days

The days of the week when the class meets.

class_time

The time when the class meets.

class_location

The location of the class meetings.

instructor

The name of the instructor(s).

instructor_phone

The instructor’s phone number.

office_hours_days

The days of the week when office hours are held.

office_location

The location of the instructor’s office.

office_hours_times

The time when office hours are held.

credits

The credits earned by taking the course.

description

Narrative description of course content. Often 1-2 paragraphs at the beginning of the syllabus.

learning_outcomes

Lists of competencies or skills that students are expected to acquire in the course. Often structured as verb phrases. These will often overlap with description and topic_outline. E.g.:

  • Articulate the relationship between derivatives and integrals using the Fundamental Theorem of Calculus

  • Sharpen and develop new research methodology skills.

citations

Book and article citations in the syllabus. These are the raw document spans that are taken in by the citation parsing and linking models, the output of which is provided under the top-level citations key.

required_reading

The citation(s) for books, articles, or other resources that are required for the course.

grading_rubric

The description of the range of grades that can be assigned. Often a mapping between letter grades and numeric ranges. E.g., A+ = 90-100, A = 85-89, ...

assessment_strategy

Descriptions of how grades are calculated – the types of assignments, the weighting or percentage assigned to each, etc.

topic_outline

Lists of topics that are covered in the course. Compared to learning_outcomes, these lists tend to be more narrowly focused on the course material itself (as opposed to the student interaction with the material), and are often structured like a table of contents in a book.

assignment_schedule

A chronologically-ordered sequence of assignments, readings, or topics. Often structured as an ordered list where each item corresponds to a week or class meeting.

citations

doc_span

parsed_citation

title

subtitle

author

editor

publisher

isbn

catalog_key

clean_title

clean_author

title_key

author_key

catalog_record

_id

work_cluster_size

sources

title

subtitle

authors

forenames
keyname

publisher

year

dois

isbns

issns

urls

publication_type

open_access

article

venue
volume
issue
page_start
page_end
abstract