Syllabus schema

Open Syllabus uses a unified document format to represent the metadata extracted from a syllabus:

|-- _id: long
|-- md5: string
|-- doc_type: string
|-- text: string
|-- syllabus_probability: float
|-- field: struct
|    |-- _id: long
|    |-- cip_codes: array
|    |    |-- element: string
|    |-- name: string
|-- language: string
|-- institution: struct
|    |-- _id: long
|    |-- grid_id: string
|    |-- wikidata_id: string
|    |-- unitid: long
|    |-- city: string
|    |-- name: string
|    |-- lat: float
|    |-- lng: float
|    |-- url: string
|    |-- country_code: string
|    |-- country: string
|    |-- state_code: string
|    |-- state: string
|    |-- enrollment: long
|    |-- two_year: boolean
|    |-- four_year: boolean
|    |-- graduate: boolean
|    |-- research: boolean
|    |-- wikipedia_url: string
|    |-- description: string
|    |-- image_url: string
|-- date: struct
|    |-- term: string
|    |-- year: long
|-- urls: array
|    |-- element: string
|-- extracted_sections: struct
|    |-- title: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- code: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- section: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- date: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- class_days: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- class_time: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- class_location: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- instructor: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- instructor_phone: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- office_hours_days: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- office_location: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- office_hours_times: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- credits: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- description: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- learning_outcomes: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- citations: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- required_reading: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- grading_rubric: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- assessment_strategy: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- topic_outline: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- assignment_schedule: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |-- school_name: array
|    |    |-- element: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|-- citations: array (nullable = false)
|    |-- element: struct (containsNull = false)
|    |    |-- doc_span: struct
|    |    |    |-- text: string
|    |    |    |-- mean_proba: float
|    |    |    |-- ci1: long
|    |    |    |-- ci2: long
|    |    |    |-- ti1: long
|    |    |    |-- ti2: long
|    |    |-- parsed_citation: struct
|    |    |    |-- title: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- text: string
|    |    |    |    |    |-- mean_proba: float
|    |    |    |    |    |-- ci1: long
|    |    |    |    |    |-- ci2: long
|    |    |    |    |    |-- ti1: long
|    |    |    |    |    |-- ti2: long
|    |    |    |-- subtitle: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- text: string
|    |    |    |    |    |-- mean_proba: float
|    |    |    |    |    |-- ci1: long
|    |    |    |    |    |-- ci2: long
|    |    |    |    |    |-- ti1: long
|    |    |    |    |    |-- ti2: long
|    |    |    |-- author: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- text: string
|    |    |    |    |    |-- mean_proba: float
|    |    |    |    |    |-- ci1: long
|    |    |    |    |    |-- ci2: long
|    |    |    |    |    |-- ti1: long
|    |    |    |    |    |-- ti2: long
|    |    |    |-- editor: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- text: string
|    |    |    |    |    |-- mean_proba: float
|    |    |    |    |    |-- ci1: long
|    |    |    |    |    |-- ci2: long
|    |    |    |    |    |-- ti1: long
|    |    |    |    |    |-- ti2: long
|    |    |    |-- publisher: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- text: string
|    |    |    |    |    |-- mean_proba: float
|    |    |    |    |    |-- ci1: long
|    |    |    |    |    |-- ci2: long
|    |    |    |    |    |-- ti1: long
|    |    |    |    |    |-- ti2: long
|    |    |    |-- isbn: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- text: string
|    |    |    |    |    |-- mean_proba: float
|    |    |    |    |    |-- ci1: long
|    |    |    |    |    |-- ci2: long
|    |    |    |    |    |-- ti1: long
|    |    |    |    |    |-- ti2: long
|    |    |-- catalog_key: struct
|    |    |    |-- clean_title: string
|    |    |    |-- clean_author: struct
|    |    |    |    |-- forenames: string
|    |    |    |    |-- keyname: string
|    |    |    |-- title_key: string
|    |    |    |-- author_key: string
|    |    |-- catalog_record: struct
|    |    |    |-- _id: long
|    |    |    |-- work_cluster_size: long
|    |    |    |-- sources: map
|    |    |    |    |-- key: string
|    |    |    |    |-- value: array (valueContainsNull = true)
|    |    |    |    |    |-- element: string
|    |    |    |-- title: string
|    |    |    |-- subtitle: string
|    |    |    |-- authors: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- forenames: string
|    |    |    |    |    |-- keyname: string
|    |    |    |-- publisher: string
|    |    |    |-- year: long
|    |    |    |-- description: string
|    |    |    |-- image_urls: array
|    |    |    |    |-- element: string
|    |    |    |-- dois: array
|    |    |    |    |-- element: string
|    |    |    |-- isbns: array
|    |    |    |    |-- element: string
|    |    |    |-- issns: array
|    |    |    |    |-- element: string
|    |    |    |-- urls: array
|    |    |    |    |-- element: string
|    |    |    |-- publication_type: string
|    |    |    |-- open_access: boolean
|    |    |    |-- article: struct
|    |    |    |    |-- venue: string
|    |    |    |    |-- volume: string
|    |    |    |    |-- issue: string
|    |    |    |    |-- page_start: string
|    |    |    |    |-- page_end: string
|    |    |    |    |-- abstract: string

Syllabus

The top-level document schema in the Open Syllabus dataset. Each row in the syllabi dataframe is a single Syllabus instance.

_id

An OS-assigned surrogate identifier for the syllabus.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

md5

The md5 checksum of the raw document bytes.

Note

This field is removed in some distributions of the dataset.

doc_type

The underlying source document type. Possible values are html, plain, pdf, doc, docx, rtf.

Note

This field is removed in some distributions of the dataset.

text

The plain text extracted from the document.

Note

This field is removed in some distributions of the dataset.

syllabus_probability

The probability that the document is a syllabus, per the syllabus classifier. The classifier is trained and tested around a 0.5 threshold: every document assigned a score above 0.5 is considered a syllabus. The majority of documents in the syllabi dataframe have a score greater than 0.5, but occasionally OS will manually identify certain groups of documents as being syllabi, regardless of the output of the syllabus classifier.

Filtering the syllabi dataframe with a threshold higher than 0.5 will return a set of documents with higher precision, at the cost of recall. At the 0.5 threshold, the model is 97% accurate on held-out test data. (See Pipeline and models for details.)
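As a rough illustration of the trade-off described above, the sketch below filters rows by syllabus_probability at two thresholds. It uses plain Python dictionaries rather than a real dataframe; the sample rows and the helper name filter_by_probability are hypothetical.

```python
# Hypothetical sample rows; real rows follow the Syllabus schema above.
rows = [
    {"_id": 1, "syllabus_probability": 0.97},
    {"_id": 2, "syllabus_probability": 0.62},
    # A low score can still appear in the data, e.g. for a document
    # that was manually identified as a syllabus.
    {"_id": 3, "syllabus_probability": 0.41},
]


def filter_by_probability(rows, threshold=0.5):
    """Keep rows whose classifier score is strictly above the threshold."""
    return [r for r in rows if r["syllabus_probability"] > threshold]


# Default threshold: matches the classifier's own decision boundary.
default_ids = [r["_id"] for r in filter_by_probability(rows)]

# Stricter threshold: fewer documents, higher expected precision.
strict_ids = [r["_id"] for r in filter_by_probability(rows, threshold=0.9)]
```

Note that a strict filter at or above 0.5 also drops any manually identified syllabi whose classifier score is low.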

field

Instance of Field – The academic field assigned to the document, as determined by the OS field classifier. See Pipeline and models for details about the model.

language

The language code for the document, as inferred by pycld2.

institution

Instance of Institution – The college or university where the syllabus was taught, as determined by the OS institution matcher. Null if no institution was matched.

date

Instance of Date – The year and semester of the course.

urls

Instance of URLs – URLs extracted from the document.

Note

This field is removed in some distributions of the dataset.

extracted_sections

Instance of ExtractedSections – Normalized document sections extracted by the parser.

citations

List of Citation – Parsed and linked citations extracted from the document.

Field

_id

An OS-assigned surrogate identifier for the field.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

cip_codes

A list of one or more IPEDS CIP codes representing the academic field(s) that make up the field.

OS’s field classifier draws heavily from the IPEDS 2010 CIP taxonomy in order to determine the academic field best associated with each syllabus. CIP codes come in lengths of two, four, and six digits, where two-digit codes represent a discipline, four-digit codes a subdivision of that discipline, and six-digit codes a further subdivision of that subdivision. For example, the two-digit CIP code ‘01’ is the code for all ‘Agriculture, Agriculture Operations, and Related Sciences’ courses; within that, the four-digit CIP code ‘01.01’ is the subdivision for all ‘Agricultural Business and Management’ courses, and within that, ‘01.0103’ is the subdivision for all ‘Agricultural Economics’ courses.

Our field classifier is trained and tested on a subset of the CIP taxonomy that we find most useful for describing syllabi. In some cases, we have combined codes, though we have generally done so only within the same two-digit branch of the taxonomy. For example, the codes ['45.09', '45.10'] are a combination of ‘International Relations and National Security Studies’ and ‘Political Science and Government’, which are both subdivisions of code ‘45’, ‘Social Sciences’; we combine them into a field that we call “Political Science” (see name).
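The two-, four-, and six-digit structure of CIP codes can be unpacked with a few string operations. The sketch below is only an illustration: the helper names cip_level, cip_ancestors, and shared_discipline are hypothetical and not part of any OS tooling, and the code assumes the dotted formatting used in the examples above.

```python
def cip_level(code):
    """Return the number of digits in a CIP code: 2, 4, or 6."""
    digits = code.replace(".", "")
    if len(digits) not in (2, 4, 6) or not digits.isdigit():
        raise ValueError(f"unexpected CIP code: {code!r}")
    return len(digits)


def cip_ancestors(code):
    """Return the chain of codes from the two-digit discipline down to `code`."""
    digits = code.replace(".", "")
    level = cip_level(code)
    chain = [digits[:2]]
    if level >= 4:
        chain.append(f"{digits[:2]}.{digits[2:4]}")
    if level == 6:
        chain.append(f"{digits[:2]}.{digits[2:6]}")
    return chain


def shared_discipline(codes):
    """Return the two-digit code shared by all `codes`, or None if they differ."""
    roots = {code[:2] for code in codes}
    return roots.pop() if len(roots) == 1 else None


agri_chain = cip_ancestors("01.0103")
poli_sci_root = shared_discipline(["45.09", "45.10"])
```

Here agri_chain walks from ‘01’ through ‘01.01’ to ‘01.0103’, and poli_sci_root confirms that the combined “Political Science” codes sit in the same two-digit branch.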

name

The field, as chosen by OS.

Institution

_id

An OS-assigned surrogate identifier for the institution.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

grid_id

The GRID unique identifier of the institution.

wikidata_id

The Wikidata unique identifier of the institution.

unitid

The IPEDS unique identifier of the institution. Only defined for institutions within the United States.

city

The city the syllabus was taught in.

name

The name of the institution.

lat

The latitude of the institution location.

lng

The longitude of the institution location.

url

The URL to the home webpage of the institution.

country_code

The ISO 3166-1 alpha-2 code of the institution country.

country

The full English name of the country corresponding to country_code.

state_code

The ISO 3166-2 region code of the region (state, parish, district, etc.) the syllabus was taught in.

state

The English name of the state corresponding to state_code.

enrollment

The total number of students enrolled at the institution, as aggregated from several data sources. This data is the most recent available, usually the 2018-2019 school year.

two_year

Whether or not the institution is primarily a two-year institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [1, 14].

four_year

Whether or not the institution is a four-year institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 32].

graduate

Whether or not the institution is a graduate institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 20].
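The three ranges given for two_year, four_year, and graduate can be written as a single mapping from the Carnegie code to the boolean flags. This is a minimal sketch of the stated rules, not the OS implementation: carnegie_basic2018 is not itself a field in the schema above, so the integer input and the function name carnegie_flags are assumptions made for illustration.

```python
def carnegie_flags(carnegie_basic2018):
    """Derive the institution flags from a Carnegie basic 2018 code.

    The ranges are inclusive, as documented for each field:
    two_year [1, 14], four_year [15, 32], graduate [15, 20].
    """
    code = carnegie_basic2018
    return {
        "two_year": 1 <= code <= 14,
        "four_year": 15 <= code <= 32,
        "graduate": 15 <= code <= 20,
    }


community_college = carnegie_flags(7)
doctoral_university = carnegie_flags(18)
baccalaureate_college = carnegie_flags(25)
```

Because the graduate range is contained in the four_year range, every institution flagged as graduate under these rules is also flagged as four_year.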

research

Whether or not the institution is an R1 or R2 research institution.

wikipedia_url

The URL of the institution’s Wikipedia article.

description

A paragraph-length description of the institution, extracted from Wikipedia.

image_url

A URL to a “profile” image for the school (a logo, seal, or wordmark), extracted from Wikipedia.

Date

year

The academic year the syllabus was or will be taught.

term

The academic term of the syllabus. Possible values are winter, spring, summer and fall.

ExtractedSections

Normalized document sections extracted by the parser. Each field is a list of Span instances.
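In the schema tree, each ExtractedSections field is an array of Span structs, so a section may hold zero, one, or several candidate spans. One simple way to pick a single value is to take the span with the highest mean_proba. The sketch below assumes rows have been converted to plain Python dictionaries; the helper name best_span and the sample values are hypothetical.

```python
def best_span(spans):
    """Return the span with the highest mean_proba, or None if there are none."""
    if not spans:
        return None
    return max(spans, key=lambda span: span["mean_proba"])


# Hypothetical extracted_sections value with two candidate titles and no code.
extracted_sections = {
    "title": [
        {"text": "Statistical Learning Theory", "mean_proba": 0.71},
        {"text": "Statistical Learning Theory and Applications", "mean_proba": 0.93},
    ],
    "code": [],
}

best_title = best_span(extracted_sections["title"])
best_code = best_span(extracted_sections.get("code"))
```

Treating a missing or empty list as None keeps downstream code from failing on syllabi where the parser found nothing for a section.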

title

The name of the course. E.g., Statistical Learning Theory and Applications.

code

The identifier that appears in the institutional course catalog. Generally (but not always) a combination of a department code and course number. E.g., CS224n or 9.520.

section

The identifier for the specific section of the course that the syllabus corresponds to. Most common for lower-level courses with large enrollments.

date

The raw, un-normalized representation of the course date, as it appears in the document. Generally a term + year pair – e.g., Fall 2009.

class_days

The days of the week when the class meets.

class_time

The time when the class meets.

class_location

The location of the class meetings.

instructor

The name of the instructor(s).

Note

This field is removed in anonymized distributions of the dataset.

instructor_phone

The instructor’s phone number.

Note

This field is removed in anonymized distributions of the dataset.

office_hours_days

The days of the week when office hours are held.

office_location

The location of the instructor’s office.

office_hours_times

The time when office hours are held.

credits

The credits earned by taking the course.

description

Narrative description of course content. Often 1-2 paragraphs at the beginning of the syllabus.

learning_outcomes

Lists of competencies or skills that students are expected to acquire in the course. Often structured as verb phrases. These will often overlap with description and topic_outline. E.g.:

Articulate the relationship between derivatives and integrals using the Fundamental Theorem of Calculus
Sharpen and develop new research methodology skills.

citations

Book and article citations in the syllabus. These are the raw document spans that are taken in by the citation parsing and linking models, the output of which is provided under the top-level citations key.

required_reading

The citation(s) for books, articles, or other resources that are required for the course.

grading_rubric

The description of the range of grades that can be assigned. Often a mapping between letter grades and numeric ranges. E.g., A+ = 90-100, A = 85-89, ...

assessment_strategy

Descriptions of how grades are calculated – the types of assignments, the weighting or percentage assigned to each, etc.

topic_outline

Lists of topics that are covered in the course. Compared to learning_outcomes, these lists tend to be more narrowly focused on the course material itself (as opposed to the student interaction with the material), and are often structured like a table of contents in a book.

assignment_schedule

A chronologically-ordered sequence of assignments, readings, or topics. Often structured as an ordered list where each item corresponds to a week or class meeting.

school_name

The name of the institution where the course was taught. When present, this value is passed into the institution matcher and used as the basis for one of the entity linking routines used to match the syllabus with an institution.

Citation

doc_span

Instance of Span – The raw span from the document where the citation appears.

parsed_citation

Instance of ParsedCitation – A segmented version of the citation, with title, subtitle, author, editor, publisher, and ISBN extracted as structured items.

catalog_key

Instance of CatalogKey – The normalized title and author values used to link the citation with a canonical bibliographic record.

catalog_record

Instance of CatalogRecord – The linked bibliographic record. Null if no record is found.
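Since catalog_record is null when no record is found, code that walks the top-level citations list needs to skip unlinked citations. The sketch below shows one way to do that on a syllabus represented as a plain Python dictionary; the sample data and the helper name linked_titles are hypothetical.

```python
def linked_titles(syllabus):
    """Return catalog titles for citations that were linked to a record."""
    titles = []
    for citation in syllabus.get("citations", []):
        record = citation.get("catalog_record")
        if record is None:
            # The citation was parsed but not matched to a bibliographic record.
            continue
        titles.append(record["title"])
    return titles


# Hypothetical syllabus with one linked and one unlinked citation.
syllabus = {
    "citations": [
        {
            "doc_span": {"text": "Example Author, Example Work (2001)"},
            "catalog_record": {"_id": 10, "title": "Example Work"},
        },
        {
            "doc_span": {"text": "Unmatched handout, week 3"},
            "catalog_record": None,
        },
    ]
}

titles = linked_titles(syllabus)
```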

ParsedCitation

Normalized sections extracted by the citation parser. Each field is a list of Span instances.

title

The title of the work.

subtitle

The subtitle of the work.

author

The author of the work.

editor

The editor of the work.

publisher

The publisher of the work.

isbn

The ISBN of the work.

CatalogKey

clean_title

A lightly normalized version of the title string.

clean_author

A lightly normalized version of the parsed author, stored as a struct with forenames and keyname strings.

title_key

A heavily normalized version of the title string. Used for lookups into the bibliographic database.

author_key

A heavily normalized version of the author string. Used for lookups into the bibliographic database.

CatalogRecord

_id

An OS-assigned surrogate identifier for the record.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

work_cluster_size

The number of individual “expression” records that were clustered together to form this work record.

sources

Third-party identifiers of the work in the source catalogs. A dictionary of {source_name: [id1, id2, id3, ...]}.
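The sources dictionary maps each source catalog name to a list of identifiers, and the schema marks the map values as nullable. A small sketch of how one might count identifiers per source while tolerating null values follows; the catalog names and identifiers are made up for illustration and do not correspond to real OS source names.

```python
def count_source_ids(sources):
    """Count the identifiers per source catalog, treating null lists as empty."""
    return {name: len(ids or []) for name, ids in sources.items()}


# Hypothetical sources value; real keys are the names of the source catalogs.
sources = {
    "catalog_a": ["a-1", "a-2"],
    "catalog_b": ["b-9"],
    "catalog_c": None,
}

id_counts = count_source_ids(sources)
total_ids = sum(id_counts.values())
```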

title

The title of the work.

subtitle

The subtitle of the work.

authors

List of Author – the authors of the work.

publisher

The publisher of the work.

year

The year the work was published.

description

A free-text description of the work, intended for display.

image_urls

URLs to cover images for the work.

dois

A list of DOIs for the work, derived from lists in the source catalogs.

isbns

A list of ISBNs for the work, derived from lists in the source catalogs.

issns

A list of ISSNs for the work, derived from lists in the source catalogs.

urls

A list of URLs for the work, derived from lists in the source catalogs.

publication_type

The type of publication the work is. Possible values are book, book-chapter, article, or report.

open_access

Whether or not the work is an open access work.

article

Instance of Article – Metadata specific to article records.

Article

venue

The name of the journal or conference.

volume

The journal volume.

issue

The journal issue.

page_start

The starting page number.

page_end

The ending page number.

abstract

The abstract of the article.

Author

forenames

A single string containing the author's first and/or middle names, or their initials.

keyname

The last name of the author.

Span

text

The raw text span extracted from the document, without any postprocessing.

mean_proba

The average of the probabilities assigned by the model to the start and end “boundary” tokens that define the span in the document. (Or, for single-token spans, just the probability assigned to the one token.) This can be used in a comparative sense as a signal for the confidence of the model on the prediction.
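The definition above can be restated as a short function. This is a sketch of the arithmetic only, assuming the per-token probabilities for a span are available as a list ordered from first to last token; the function name and input format are hypothetical, since the dataset itself only exposes the resulting mean_proba value.

```python
def mean_proba(token_probas):
    """Average the probabilities of the first and last tokens of a span.

    For a single-token span, the result is that token's probability.
    Interior tokens do not contribute to the score.
    """
    if not token_probas:
        raise ValueError("a span must contain at least one token")
    if len(token_probas) == 1:
        return token_probas[0]
    return (token_probas[0] + token_probas[-1]) / 2


multi_token_score = mean_proba([0.75, 0.1, 0.5])
single_token_score = mean_proba([0.25])
```

The low interior value in the first example has no effect on the result, which is why the score is best read comparatively rather than as a calibrated probability for the whole span.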

ci1

The index of the first character in the span.

ci2

The index of the last character in the span.

ti1

The index of the first token in the span.

ti2

The index of the last token in the span.
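Since ci2 is described as the index of the last character in the span, a Python slice needs an end offset of ci2 + 1 to include it. The sketch below assumes that ci1 and ci2 are zero-based offsets into the top-level text field, which the definitions above do not state explicitly; comparing the slice against the span's own text value is a cheap way to check that assumption on real data. The helper names and the sample document are hypothetical.

```python
def slice_span(text, span):
    """Slice the characters of a span out of the document text.

    Assumes zero-based offsets, with ci2 pointing at the last
    character of the span (inclusive).
    """
    return text[span["ci1"]: span["ci2"] + 1]


def span_matches_text(text, span):
    """Check that the offsets reproduce the span's recorded text."""
    return slice_span(text, span) == span["text"]


# Hypothetical document text and a span covering the course code.
text = "CS101 Intro to Computing"
span = {"text": "CS101", "ci1": 0, "ci2": 4}

sliced = slice_span(text, span)
offsets_ok = span_matches_text(text, span)
```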