Reference Strings Dataset

Quick Introduction

The Reference Strings Dataset is a collection of synthetically generated bibliographies in which every token is annotated. For example, a citation rendered in APA style:

Plain text render of a citation
Watson, J. D., & Crick, F. H. C. (1953). Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid. Nature, 171(4356), 737–738. https://doi.org/10.1038/171737a0

The corresponding annotated form encloses each contiguous segment of tokens in XML-like tags:

Plain text render of a citation with annotation
<author>Watson, J. D., & Crick, F. H. C.</author> <year>(1953).</year> <title>Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid.</title> <container-title>Nature,</container-title> <volume>171</volume> <issue>(4356),</issue> <page>737–738.</page> <DOI>https://doi.org/10.1038/171737a0</DOI>

These XML-like tags are based on CSL Variables.
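
As a rough illustration of how the annotated form can be read, the sketch below splits an annotated citation into (label, text) segments with a regular expression. The function name `parse_segments` is hypothetical and not part of the dataset's tooling. A regular expression is used instead of an XML parser on the assumption that the text is not escaped as XML (the example above contains a bare `&`).

```python
import re

# One <tag>text</tag> segment. The backreference \1 requires the closing
# tag to match the opening tag. Tag names may contain letters and hyphens
# (e.g. "container-title", "DOI").
SEGMENT = re.compile(r"<([A-Za-z-]+)>(.*?)</\1>", re.DOTALL)


def parse_segments(annotated):
    """Return (label, text) pairs in the order they appear (hypothetical helper)."""
    return [(m.group(1), m.group(2)) for m in SEGMENT.finditer(annotated)]


sample = (
    "<author>Watson, J. D., & Crick, F. H. C.</author> <year>(1953).</year> "
    "<title>Molecular structure of nucleic acids: A structure for deoxyribose "
    "nucleic acid.</title> <container-title>Nature,</container-title> "
    "<volume>171</volume> <issue>(4356),</issue> <page>737–738.</page> "
    "<DOI>https://doi.org/10.1038/171737a0</DOI>"
)

segments = parse_segments(sample)
for label, text in segments:
    print(f"{label}: {text}")
```

Text that falls outside any tag (such as the spaces between segments) is ignored by this sketch.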

For more information about how the dataset is synthesized and about the data format, see Details.

Use Cases

Sequence tagging/labeling

Assigning a categorical label to each member of a sequence, for example labeling every token of a citation as author, year, title, and so on.
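
To make the sequence labeling setup concrete, here is a minimal sketch that turns an annotated citation into (token, label) pairs, the usual input for a sequence tagger. It assumes whitespace tokenization, which may differ from the tokenization the dataset actually uses (see Details), and the function name `token_labels` is hypothetical.

```python
import re

# One <tag>text</tag> segment; \1 forces the closing tag to match the opening tag.
TAGGED = re.compile(r"<([A-Za-z-]+)>(.*?)</\1>", re.DOTALL)


def token_labels(annotated):
    """Return one (token, label) pair per whitespace-separated token.

    Assumption: tokens are split on whitespace. Text outside tags is skipped.
    """
    pairs = []
    for match in TAGGED.finditer(annotated):
        label = match.group(1)
        for token in match.group(2).split():
            pairs.append((token, label))
    return pairs


example = (
    "<author>Watson, J. D.</author> <year>(1953).</year> "
    "<container-title>Nature,</container-title>"
)

for token, label in token_labels(example):
    print(f"{token}\t{label}")
```

Each token inherits the label of the segment that encloses it, so a tagger trained on such pairs predicts one label per token.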

How to obtain the dataset

Visit the Downloads page for instructions.

Citing

This dataset is part of a Master's project at NUS.

If you are using the dataset for scientific work, please cite the following:
