Data format

The Open Syllabus dataset is distributed as a single partitioned JSON-lines dataset called syllabi.json. Each row (one JSON object per line) represents one syllabus.

Open Syllabus uses Apache Spark for ETL, model inference, and distribution packaging. Raw datasets are distributed as JSON-lines files produced by Spark's standard JSON dataframe writer. For full-size datasets, we use gzip compression and split the dataframe into 100 partition files, each containing roughly 1% of the full data.
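To make the file layout concrete, here is a minimal sketch of reading one partition file with the Python standard library. It assumes only what is described above: the file is gzip-compressed and holds one JSON object per line. The function name read_partition is hypothetical, and no particular field names are assumed.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def read_partition(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per syllabus from a gzip-compressed JSON-lines file.

    `path` is a hypothetical local path to a single downloaded partition.
    """
    # "rt" opens the gzip stream in text mode, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Because the function is a generator, rows are decoded one at a time and the partition never has to fit in memory as a whole.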

The partition files can be downloaded individually or in small batches for inspection and testing. The complete dataset can be downloaded in bulk using a tool like rclone.
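Once some or all partitions are on local disk, they can be streamed as one sequence of rows. The sketch below is one way to do that in Python, under two assumptions that are not stated above: the partitions sit together in one local directory, and their names end in .json.gz. The names iter_syllabi, dataset_dir, and max_partitions are hypothetical.

```python
import glob
import gzip
import json
import os
from typing import Any, Dict, Iterator, Optional


def iter_syllabi(
    dataset_dir: str, max_partitions: Optional[int] = None
) -> Iterator[Dict[str, Any]]:
    """Yield every row from the partition files found in `dataset_dir`.

    Assumes partition files end in ".json.gz" (an assumption about naming).
    Set `max_partitions` to read only the first few files for a quick look.
    """
    paths = sorted(glob.glob(os.path.join(dataset_dir, "*.json.gz")))
    # Slicing with None keeps the whole list, so the default reads everything.
    for path in paths[:max_partitions]:
        with gzip.open(path, "rt", encoding="utf-8") as lines:
            for line in lines:
                if line.strip():  # ignore blank lines
                    yield json.loads(line)
```

Passing max_partitions=1 reads a single partition, which suits inspection and testing; leaving it unset walks every partition present in the directory.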