syllabi DataFrame

The syllabi dataframe is the core of the Open Syllabus data: each row corresponds to a single syllabus document. In addition to the raw document data, each row is annotated with metadata extracted by a suite of information extraction and entity linking models: institution, field, date, course code, course title, and description text.

  from pyspark.sql import types as T

  T.StructType([
      T.StructField('id', T.LongType()),
      T.StructField('syllabus_probability', T.FloatType()),
      T.StructField('date', T.StructType([
          T.StructField('year', T.IntegerType()),
          T.StructField('term', T.StringType()),
      ])),
      T.StructField('field', T.StructType([
          T.StructField('code', T.StringType()),
          T.StructField('name', T.StringType()),
          T.StructField('label_precision', T.FloatType()),
          T.StructField('label_recall', T.FloatType()),
          T.StructField('label_f1', T.FloatType()),
      ])),
      T.StructField('institution', T.StructType([
          T.StructField('id', T.LongType()),
          T.StructField('grid_id', T.StringType()),
          T.StructField('wikidata_id', T.StringType()),
          T.StructField('unitid', T.StringType()),
          T.StructField('name', T.StringType()),
          T.StructField('url', T.StringType()),
          T.StructField('lat', T.FloatType()),
          T.StructField('lng', T.FloatType()),
          T.StructField('country_code', T.StringType()),
          T.StructField('country_name', T.StringType()),
          T.StructField('state_code', T.StringType()),
          T.StructField('state_name', T.StringType()),
          T.StructField('city', T.StringType()),
          T.StructField('ipeds_applcn', T.IntegerType()),
          T.StructField('ipeds_control', T.ShortType()),
          T.StructField('ipeds_hbcu', T.BooleanType()),
          T.StructField('ipeds_tribal', T.BooleanType()),
          T.StructField('carnegie_basic2018', T.ShortType()),
          T.StructField('carnegie_ugprofile2018', T.ShortType()),
          T.StructField('carnegie_womens', T.BooleanType()),
          T.StructField('two_year', T.BooleanType()),
          T.StructField('four_year', T.BooleanType()),
          T.StructField('graduate', T.BooleanType()),
          T.StructField('research', T.BooleanType()),
          T.StructField('enrollment', T.IntegerType()),
      ])),
      T.StructField('extracted_metadata', T.StructType([
          T.StructField('code', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('section', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('title', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('date', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('description', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('learning_outcomes', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('topic_outline', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('required_reading', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('assessment_strategy', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('grading_rubric', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
          T.StructField('assignment_schedule', T.ArrayType(T.StructType([
              T.StructField('text', T.StringType()),
              T.StructField('clean_text', T.StringType()),
              T.StructField('mean_logp', T.FloatType()),
              T.StructField('ci1', T.IntegerType()),
              T.StructField('ci2', T.IntegerType()),
          ]))),
      ])),
      T.StructField('text_md5', T.StringType()),
      T.StructField('mime_type', T.StringType()),
      T.StructField('text', T.StringType()),
      T.StructField('anonymized_text', T.StringType()),
  ])

id

The OS-assigned unique identifier of the syllabus.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

syllabus_probability

A number in the range [0.0, 1.0] representing the certainty that the document is a syllabus.

Every document analyzed by OS is assigned a score in the range [0.0, 1.0] by our syllabus classifier; the closer the score is to 1.0, the greater the certainty that the document is a syllabus. The classifier is trained and tested around a 0.5 threshold: every document assigned a score above 0.5 is considered a syllabus. The majority of documents in the syllabi dataframe have a score greater than 0.5, but occasionally OS will manually identify certain groups of documents as syllabi, regardless of the output of the syllabus classifier.

Filtering the syllabi dataframe by a threshold greater than 0.5 will return a set of documents with higher precision, at the cost of recall. At the 0.5 threshold, the model is 97% accurate on held-out test data. (See Pipeline and models for details.)
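As a rough illustration of the precision/recall trade-off described above, the following sketch filters rows by a stricter cutoff. It uses plain Python dicts in place of dataframe rows; the function name filter_by_probability and the sample rows are hypothetical and not part of the dataset or its tooling.

```python
def filter_by_probability(rows, threshold=0.5):
    """Keep rows whose syllabus_probability is strictly above `threshold`."""
    return [
        row for row in rows
        if row.get("syllabus_probability") is not None
        and row["syllabus_probability"] > threshold
    ]


# Hypothetical sample rows; real rows carry many more fields.
rows = [
    {"id": 1, "syllabus_probability": 0.98},
    {"id": 2, "syllabus_probability": 0.62},
    {"id": 3, "syllabus_probability": 0.41},  # e.g. a manually included document
]

strict = filter_by_probability(rows, threshold=0.9)   # higher precision, lower recall
default = filter_by_probability(rows)                 # the 0.5 threshold
```

With PySpark, a comparable filter would presumably be written as a column expression on syllabus_probability, for example df.filter(F.col('syllabus_probability') > 0.9), where F is pyspark.sql.functions and df is the loaded syllabi dataframe.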

date

Information about when the syllabus was taught. This data is parsed from the content of extracted_metadata.date.clean_text. Null if unknown.

year

The academic year the syllabus was or will be taught. Null if unknown.

OS only considers years valid if they fall in the range 1990-2022.

term

The academic term of the syllabus.

Possible values are ‘winter’, ‘spring’, ‘summer’ and ‘fall’. Null if unknown.
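To make the shape of the date struct concrete, here is a minimal sketch of turning a raw date string into a year and term. This is not the parser OS uses; the function parse_date and its regular expressions are assumptions for illustration, and only the four term values and the 1990-2022 range come from the text above.

```python
import re

VALID_TERMS = ("winter", "spring", "summer", "fall")
MIN_YEAR, MAX_YEAR = 1990, 2022  # valid year range stated above


def parse_date(clean_text):
    """Return {'year': int or None, 'term': str or None} for a raw date string."""
    lowered = clean_text.lower()
    # First term name that appears as a whole word, if any.
    term = next(
        (t for t in VALID_TERMS if re.search(rf"\b{t}\b", lowered)), None
    )
    # First four-digit number that falls inside the valid range, if any.
    year = None
    for match in re.finditer(r"\b(\d{4})\b", lowered):
        candidate = int(match.group(1))
        if MIN_YEAR <= candidate <= MAX_YEAR:
            year = candidate
            break
    return {"year": year, "term": term}
```

Under these assumptions, a string such as "Fall 2009" yields a year of 2009 and a term of 'fall', while a year outside the valid range is left as None.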

field

Academic field information, as determined by the OS academic field classifier. See Pipeline and models for details about the model.

code

A string containing one or more IPEDS CIP codes identifying the academic field(s) assigned to the syllabus.

OS’s field classifier draws heavily from the IPEDS 2010 CIP taxonomy in order to determine the academic field best associated with each syllabus. CIP codes come in lengths of two, four, and six digits, where two-digit codes represent a discipline, four-digit codes a subdivision of that discipline, and six-digit codes a further subdivision of the previous subdivision. For example, the two-digit CIP code ‘01’ is the code for all ‘Agriculture, Agriculture Operations, and Related Sciences’ courses; within that, the four-digit CIP code ‘01.01’ is the subdivision for all ‘Agricultural Business and Management’ courses, and within that, ‘01.0103’ is the subdivision for all ‘Agricultural Economics’ courses.

Our field classifier is trained and tested on a subset of the CIP taxonomy that we find most useful for describing syllabi. In some cases, we have combined codes, though we have generally done so only within the same two-digit branch of the taxonomy. In those cases, the codes are separated by a forward-slash (‘/’). For example, the code ‘45.09/45.10’ is a combination of ‘International Relations and National Security Studies’ and ‘Political Science and Government’, which are both subdivisions of code ‘45’, ‘Social Sciences’; we combine them into a field that we call “Political Science” (see name).
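Because combined codes are joined with a forward slash, a consumer of this field may want to split them apart and recover the two-digit branch. The sketch below shows one way to do that; the helper names split_cip_codes and cip_families are hypothetical.

```python
def split_cip_codes(code):
    """Split a field code such as '45.09/45.10' into its component CIP codes."""
    return [part.strip() for part in code.split("/") if part.strip()]


def cip_families(code):
    """Return the sorted two-digit CIP prefixes that occur in a field code."""
    return sorted({part.split(".")[0] for part in split_cip_codes(code)})


political_science = "45.09/45.10"
components = split_cip_codes(political_science)  # the two combined subdivisions
families = cip_families(political_science)       # their shared two-digit branch
```

Since codes are generally combined only within the same two-digit branch, cip_families will usually return a single element, but returning a list keeps the helper safe for any exceptions.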

name

The name representing the field, as chosen by OS.

label_precision

The precision score of the field, measured against a held-out test set.

label_recall

The recall score of the field, measured against a held-out test set.

label_f1

The F1 score of the field, measured against a held-out test set.

institution

The college or university where the syllabus was taught, as determined by the OS institution matcher. Null if no institution was matched.

id

The OS-assigned unique identifier for the institution matched to this syllabus.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

grid_id

The GRID unique identifier of the institution. Null if unknown.

wikidata_id

The Wikidata unique identifier of the institution. Null if unknown.

unitid

The IPEDS unique identifier of the institution. Null if unknown. Only defined for institutions within the United States.

name

The name of the institution.

url

The URL to the home webpage of the institution.

lat

The latitude of the institution location.

This data is taken from GRID. Null if unknown.

lng

The longitude of the institution location.

This data is taken from GRID. Null if unknown.

country_code

The ISO 3166-1 alpha-2 code of the institution country.

country_name

The full English name of the country corresponding to country_code.

state_code

The ISO 3166-2 region code of the region (state, parish, district, etc.) the syllabus was taught in. Null if unknown.

state_name

The English name of the state corresponding to state_code. Null if unknown.

city

The city the syllabus was taught in. Null if unknown.

ipeds_applcn

The number of applicants to the school who applied, were admitted and enrolled, per IPEDS.

This data is for the 2018-2019 school year.

Null if unknown. Only defined for institutions in the United States.

ipeds_control

The ‘Control’ of the institution, per IPEDS.

‘Control’ is

A classification of whether an institution is operated by publicly elected or appointed officials (public control) or by privately elected or appointed officials and derives its major source of funds from private sources (private control).

(Source: https://surveys.nces.ed.gov/ipeds/VisGlossaryAll.aspx.)

This field is equal to

  • 1, if a public institution;

  • 2, if a private not-for-profit institution;

  • 3, if a private for-profit institution.

Null if unknown. Only defined for institutions in the United States.
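For display purposes, the three numeric values can be mapped back to labels. A minimal sketch follows, with null represented as None; the names IPEDS_CONTROL_LABELS and describe_control, and the 'unrecognized' fallback, are illustrative assumptions.

```python
# Labels taken from the list of values above.
IPEDS_CONTROL_LABELS = {
    1: "public",
    2: "private not-for-profit",
    3: "private for-profit",
}


def describe_control(value):
    """Map an ipeds_control value to a label; a null (None) stays None."""
    if value is None:
        return None
    # Fallback label for any value not documented above (an assumption).
    return IPEDS_CONTROL_LABELS.get(value, "unrecognized")
```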

ipeds_hbcu

Whether or not the institution is a historically black college or university, per IPEDS. Null if unknown. Only defined for institutions in the United States.

ipeds_tribal

Whether or not the institution is a tribal college or university, per IPEDS. Null if unknown. Only defined for institutions in the United States.

carnegie_basic2018

The Carnegie Basic Classification for 2018.

Possible values for this field are described in Table 3.

Null if unknown. Only defined for institutions in the United States.

Table 3 Possible values for carnegie_basic2018

Value  Meaning
0      (Not classified)
1      Associate’s Colleges: High Transfer-High Traditional
2      Associate’s Colleges: High Transfer-Mixed Traditional/Nontraditional
3      Associate’s Colleges: High Transfer-High Nontraditional
4      Associate’s Colleges: Mixed Transfer/Career & Technical-High Traditional
5      Associate’s Colleges: Mixed Transfer/Career & Technical-Mixed Traditional/Nontraditional
6      Associate’s Colleges: Mixed Transfer/Career & Technical-High Nontraditional
7      Associate’s Colleges: High Career & Technical-High Traditional
8      Associate’s Colleges: High Career & Technical-Mixed Traditional/Nontraditional
9      Associate’s Colleges: High Career & Technical-High Nontraditional
10     Special Focus Two-Year: Health Professions
11     Special Focus Two-Year: Technical Professions
12     Special Focus Two-Year: Arts & Design
13     Special Focus Two-Year: Other Fields
14     Baccalaureate/Associate’s Colleges: Associate’s Dominant
15     Doctoral Universities: Very High Research Activity
16     Doctoral Universities: High Research Activity
17     Doctoral/Professional Universities
18     Master’s Colleges & Universities: Larger Programs
19     Master’s Colleges & Universities: Medium Programs
20     Master’s Colleges & Universities: Small Programs
21     Baccalaureate Colleges: Arts & Sciences Focus
22     Baccalaureate Colleges: Diverse Fields
23     Baccalaureate/Associate’s Colleges: Mixed Baccalaureate/Associate’s
24     Special Focus Four-Year: Faith-Related Institutions
25     Special Focus Four-Year: Medical Schools & Centers
26     Special Focus Four-Year: Other Health Professions Schools
27     Special Focus Four-Year: Engineering Schools
28     Special Focus Four-Year: Other Technology-Related Schools
29     Special Focus Four-Year: Business & Management Schools
30     Special Focus Four-Year: Arts, Music & Design Schools
31     Special Focus Four-Year: Law Schools
32     Special Focus Four-Year: Other Special Focus Institutions
33     Tribal Colleges

carnegie_ugprofile2018

The Carnegie Undergraduate Profile Classification for 2018.

Possible values for this field are described in Table 4.

Null if unknown. Only defined for institutions in the United States.

Table 4 Possible values for carnegie_ugprofile2018

Value  Meaning
0      Not classified (Exclusively Graduate)
1      Two-year, higher part-time
2      Two-year, mixed part/full-time
3      Two-year, medium full-time
4      Two-year, higher full-time
5      Four-year, higher part-time
6      Four-year, medium full-time, inclusive, lower transfer-in
7      Four-year, medium full-time, inclusive, higher transfer-in
8      Four-year, medium full-time, selective, lower transfer-in
9      Four-year, medium full-time, selective, higher transfer-in
10     Four-year, full-time, inclusive, lower transfer-in
11     Four-year, full-time, inclusive, higher transfer-in
12     Four-year, full-time, selective, lower transfer-in
13     Four-year, full-time, selective, higher transfer-in
14     Four-year, full-time, more selective, lower transfer-in
15     Four-year, full-time, more selective, higher transfer-in

carnegie_womens

Whether or not the institution is a women’s college, per Carnegie. Null if unknown. Only defined for institutions in the United States.

two_year

Whether or not the institution is primarily a two-year institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [1, 14].

four_year

Whether or not the institution is a four-year institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 32].

graduate

Whether or not the institution is a graduate institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 20].

research

Whether or not the institution is an R1 or R2 research institution.

This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 16].
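The four derived flags (two_year, four_year, graduate, research) can be summarized as range checks on carnegie_basic2018. The sketch below restates the ranges given above; the function name derive_flags is hypothetical, and returning None for every flag when the classification is null is an assumption, since the text does not say how nulls are handled.

```python
def derive_flags(carnegie_basic2018):
    """Recompute the derived institution flags from a Carnegie Basic code."""
    names = ("two_year", "four_year", "graduate", "research")
    if carnegie_basic2018 is None:
        # Assumption: a null classification leaves every flag null.
        return dict.fromkeys(names)
    code = carnegie_basic2018
    return {
        "two_year": 1 <= code <= 14,
        "four_year": 15 <= code <= 32,
        "graduate": 15 <= code <= 20,
        "research": 15 <= code <= 16,  # R1 and R2
    }
```

Note that under these ranges the values 0 (not classified) and 33 (Tribal Colleges) set none of the four flags.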

enrollment

The total number of students enrolled at the institution, as aggregated from several data sources. This data is the most recent available, usually the 2018-2019 school year. Null if unknown.

extracted_metadata

Structured text fields extracted from the syllabus by a token-level sequence tagging model. (We use DistilBERT via the transformers library, fine-tuned on ~16,000 annotated syllabi.)

For each field, we provide a list of document spans extracted from the syllabus. Each individual span includes fields for:

  • text - The raw text span extracted from the document, without any postprocessing.

  • clean_text - A minimally cleaned version of the raw text, often more suitable for display. We normalize encoding and remove extraneous whitespace characters.

  • mean_logp - The average log-probability assigned to the predicted tag for each token in the match. This can be used in a comparative sense as a rough indication of the “confidence” of the model on the prediction.

  • ci1 - The index of the first character in the span.

  • ci2 - The index of the last character in the span.
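Since each field holds a list of spans, a common need is to pick a single representative span. The sketch below selects the span with the highest mean_logp, using it comparatively as suggested above. The helper names and the sample spans are hypothetical, and the conversion to a probability assumes that mean_logp is a natural-log value, which the text does not confirm.

```python
import math


def best_span(spans):
    """Return the span with the highest mean_logp, or None for an empty list."""
    if not spans:
        return None
    return max(spans, key=lambda span: span["mean_logp"])


def geometric_mean_probability(span):
    """exp of the mean log-probability, i.e. the geometric mean of the
    per-token probabilities (assuming natural logarithms)."""
    return math.exp(span["mean_logp"])


# Hypothetical spans for the 'title' field of one syllabus.
title_spans = [
    {"text": "Intro  to Biology", "clean_text": "Intro to Biology",
     "mean_logp": -0.05, "ci1": 10, "ci2": 26},
    {"text": "Biology Dept.", "clean_text": "Biology Dept.",
     "mean_logp": -1.20, "ci1": 40, "ci2": 52},
]

chosen = best_span(title_spans)
```

Choosing by mean_logp is only a heuristic; depending on the field, the first span in document order may be an equally reasonable choice.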

code

The identifier that appears in the institutional course catalog. Generally (but not always) a combination of a department code and course number. E.g., CS224n or 9.520.

section

The identifier for the specific section of the course that the syllabus corresponds to. Most common for lower-level courses with large enrollments.

title

The name of the course. E.g., Statistical Learning Theory and Applications.

date

The raw, un-normalized representation of the course date, as it appears in the document. Generally a term + year pair, e.g., Fall 2009.

description

Narrative description of course content. Often 1-2 paragraphs at the beginning of the syllabus.

learning_outcomes

Lists of competencies or skills that students are expected to acquire in the course. Often structured as verb phrases. These will often overlap with description and topic_outline. E.g.:

  • Articulate the relationship between derivatives and integrals using the Fundamental Theorem of Calculus

  • Sharpen and develop new research methodology skills.

topic_outline

Lists of topics that are covered in the course. Compared to learning_outcomes, these lists tend to be more narrowly focused on the course material itself (as opposed to the student interaction with the material), and are often structured like a table of contents in a book.

required_reading

The citation(s) for books, articles, or other resources that are required for the course.

Note

Unlike the structured citation graph data provided by the catalog and matches dataframes, these citations are provided as raw strings extracted from the documents.

assessment_strategy

Descriptions of how grades are calculated – the types of assignments, the weighting or percentage assigned to each, etc.

grading_rubric

The description of the range of grades that can be assigned. Often a mapping between letter grades and numeric ranges. E.g., A+ = 90-100, A = 85-89, ...

assignment_schedule

A chronologically-ordered sequence of assignments, readings, or topics. Often structured as an ordered list where each item corresponds to a week or class meeting.

text_md5

The MD5 checksum of the text.

Note

This field is only available in full-text versions of the dataset.
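One plausible use of text_md5 is to check that a row's text is intact or to spot documents with identical text. The sketch below computes a checksum with Python's hashlib; it assumes the text is encoded as UTF-8 and that the stored value is a lowercase hexadecimal digest, neither of which is stated above. The function names are hypothetical.

```python
import hashlib


def compute_text_md5(text):
    """MD5 hex digest of the text, assuming UTF-8 encoding."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def matches_stored_md5(row):
    """Compare a row's stored text_md5 with a freshly computed digest."""
    return compute_text_md5(row["text"]) == row["text_md5"]


# Hypothetical row with only the two relevant fields.
row = {"text": "hello", "text_md5": "5d41402abc4b2a76b9719d911017c592"}
intact = matches_stored_md5(row)
```

If the computed digests do not match the stored values on real rows, the encoding or digest format assumed here is likely the wrong one.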

mime_type

The MIME type of the document from which the text was extracted.

We extract text from HTML, PDF, DOC, DOCX and RTF files.

Note

This field is only available in full-text versions of the dataset.

text

The extracted text of the syllabus.

Note

This field is only available in full-text versions of the dataset.

anonymized_text

The extracted text of the syllabus, anonymized to remove person names, email addresses and phone numbers.

Note

This field is only available in full-text versions of the dataset.