Details
=======

This page describes how the dataset is derived and what it contains.

Data Sources
------------

The citations are obtained from the following sources:

#. CrossRef, queried via :abbr:`DOI (Document Object Identifier)` values obtained from Open Academic Graph
#. JSTOR Sample Dataset (no longer accessible)
#. PubMed (2019 Baseline)

CSL Styles
----------

In total, 17 styles have been employed. The table below summarises the number of reference strings available in the dataset for each style.

.. csv-table:: Number of Strings By CSL Style
   :file: ../stats/csl-styles-strings-count.csv
   :widths: 50, 80
   :header-rows: 1

BibTeX Entry Types
------------------

The table below summarises the number of reference strings available for each BibTeX entry type.

.. csv-table:: Number of Strings By BibTeX Entry Types
   :file: ../stats/bibtex-entry-types-strings-count.csv
   :widths: 50, 80
   :header-rows: 1

Data Format
-----------

Each file stores the data in JSON Lines format: every line is a JSON object representing one citation rendered in a specific CSL style, together with its corresponding annotated sequence.

.. code-block:: json
   :caption: An example (annotated by segment) with its metadata, such as the source, the document type and the style in which the citation is rendered

   {
       "style": "apa",
       "doc_type": "article",
       "source": "crossref",
       "data": "Watson, J. D., & Crick, F. H. C. (1953). Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid. Nature, 171 (4356), 737\\u2013738. https://doi.org/10.1038/171737a0"
   }

.. important::

   Not all tokens are enclosed within tags. Tokens that fall outside every tag should be labelled as **O** (according to the tagging scheme).
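
The format described above can be consumed with a few lines of Python. The following is a minimal sketch, not part of the dataset tooling. It assumes that segments in the ``data`` field are wrapped in XML-like tags such as ``<author>...</author>``; this page does not specify the tag syntax or the tag names, so both are assumptions here. The helper names ``read_records`` and ``label_tokens`` are illustrative, and the labels are plain tag names without any prefix scheme.

.. code-block:: python

   import json
   import re

   # Assumption: a segment looks like <tag>text</tag>. Anything that is not
   # inside such a pair is treated as an untagged token.
   SEGMENT_RE = re.compile(
       r"<(?P<tag>[\w-]+)>(?P<text>.*?)</(?P=tag)>|(?P<outside>[^<\s]+)",
       re.DOTALL,
   )


   def read_records(lines):
       """Yield one dict per non-empty line of a JSON Lines source."""
       for line in lines:
           line = line.strip()
           if line:
               yield json.loads(line)


   def label_tokens(annotated):
       """Return (token, label) pairs for one annotated citation string.

       Tokens are split on whitespace. Tokens inside a tag take the tag
       name as their label; tokens outside every tag are labelled "O".
       """
       pairs = []
       for match in SEGMENT_RE.finditer(annotated):
           if match.group("tag"):
               for token in match.group("text").split():
                   pairs.append((token, match.group("tag")))
           else:
               pairs.append((match.group("outside"), "O"))
       return pairs

Because ``read_records`` accepts any iterable of strings, an open file handle can be passed to it directly, and each record's ``data`` value can then be handed to ``label_tokens``. Whitespace tokenisation is a simplification; a real pipeline would likely use the tokeniser of the downstream model.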