Details

This page describes how the dataset is derived and what it contains.

Data Sources

The citations are obtained from the following sources:

  1. CrossRef via DOI obtained from Open Academic Graph

  2. JSTOR Sample Dataset (no longer accessible)

  3. PubMed (2019 Baseline)

CSL Styles

In total, 17 styles have been employed. The table below summarises the number of reference strings available in the dataset for each style.

Number of Strings By CSL Style

CSL Style                                 Number of Reference Strings
Annual Reviews                            placeholder
APA 6th edition                           placeholder
Cambridge University Press                placeholder
Chicago                                   placeholder
Current Opinion                           placeholder
Elsevier (Harvard)                        placeholder
Elsevier (Vancouver)                      placeholder
IEEE                                      placeholder
MLA 7th edition                           placeholder
Nature                                    placeholder
University of New South Wales (Oxford)    placeholder
Springer Humanities                       placeholder
Springer MathPhys                         placeholder
Springer (Vancouver)                      placeholder
Taylor and Francis (Harvard)              placeholder
Wiley-VCH Books                           placeholder

BibTeX Entry Types

The table below summarises the number of reference strings available for each BibTeX entry type.

Number of Strings By BibTeX Entry Types

Entry Type       Number of Reference Strings
article          placeholder
book             placeholder
inbook           placeholder
incollection     placeholder
inproceedings    placeholder
misc             placeholder
phdthesis        placeholder
techreport       placeholder

Data Format

Each file stores the data in JSON Lines format: every line is one JSON object representing a citation rendered in a specified CSL style, together with its annotated sequence.

An example record (annotated by segment), with its metadata: the source, the document type and the style in which the citation is rendered:
{
   "style": "apa",
   "doc_type": "article",
   "source": "crossref",
   "data": "<author>Watson, J. D., & Crick, F. H. C.</author> <year>(1953).</year> <title>Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid.</title> <container-title>Nature,</container-title> <volume>171</volume> <issue>(4356),</issue> <page>737\u2013738.</page> <DOI>https://doi.org/10.1038/171737a0</DOI>"
}
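
Because every line is a standalone JSON object, a file can be read one record at a time without loading it whole. The following is a minimal sketch in Python, not part of the dataset tooling: the helper name iter_records is hypothetical, and an in-memory string stands in for a dataset file because the file names are not given here.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(handle: TextIO) -> Iterator[dict]:
    """Yield one citation record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


# In-memory stand-in for one dataset file (assumption: real files are
# opened with open(path, encoding="utf-8") and passed in the same way).
sample = io.StringIO(
    '{"style": "apa", "doc_type": "article", "source": "crossref", '
    '"data": "<author>Watson, J. D.</author> <year>(1953).</year>"}\n'
)

records = list(iter_records(sample))
for record in records:
    print(record["style"], record["doc_type"], record["source"])
```

Reading lazily with a generator keeps memory use flat, which matters when a file holds many reference strings.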

Important

Not all tokens are enclosed within tags. Tokens that fall outside every tag should be labelled as O (outside), following the tagging scheme.
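
The labelling rule can be sketched as follows. This is an illustration under stated assumptions, not the dataset's own preprocessing: tokens are split on whitespace, tags are assumed not to nest, and the tag name is used directly as the label because the exact tagging scheme is not spelled out here. The function name label_tokens is hypothetical.

```python
import re
from typing import List, Tuple

# Matches one tagged segment such as <title>...</title>; tag names may
# contain letters and hyphens (e.g. container-title, DOI).
TAGGED = re.compile(r"<([A-Za-z-]+)>(.*?)</\1>", re.DOTALL)


def label_tokens(annotated: str) -> List[Tuple[str, str]]:
    """Return (token, label) pairs; tokens outside every tag get "O"."""
    pairs: List[Tuple[str, str]] = []
    pos = 0
    for match in TAGGED.finditer(annotated):
        # Text between the previous segment and this one is untagged.
        pairs += [(tok, "O") for tok in annotated[pos:match.start()].split()]
        # Text inside the segment takes the tag name as its label.
        pairs += [(tok, match.group(1)) for tok in match.group(2).split()]
        pos = match.end()
    # Any trailing text after the last segment is untagged as well.
    pairs += [(tok, "O") for tok in annotated[pos:].split()]
    return pairs


example = "<author>Smith, J.</author> In: <title>A study.</title>"
for token, label in label_tokens(example):
    print(token, label)
```

In this made-up example string, "In:" sits between two tagged segments and is therefore labelled O, while every other token takes the name of its enclosing tag.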