References
CSL
- class dataset.csl.Variable(value)
Bases: enum.Enum
An enumeration of CSL 1.0.1 variables.
Note
other is the only non-standard value; it is assigned to tokens that fall outside any segment enclosed by tags.
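As a rough illustration of the note above, the sketch below defines a small enumeration with a few CSL 1.0.1 variable names plus the non-standard other value. The class name VariableSketch, the member names, and the chosen subset are assumptions made for this example; they are not taken from dataset.csl.Variable itself.

```python
from enum import Enum


class VariableSketch(Enum):
    """Hypothetical stand-in for an enumeration of CSL variables."""

    # An illustrative subset of standard CSL 1.0.1 variable names.
    AUTHOR = "author"
    TITLE = "title"
    ISSUED = "issued"
    ACCESSED = "accessed"
    # Non-standard value for tokens outside any tagged segment.
    OTHER = "other"


# Enum members can be looked up by their string value.
title_member = VariableSketch("title")
fallback_member = VariableSketch("other")
```

Looking a member up by value, as in VariableSketch("title"), is standard enum.Enum behaviour; a value that is not defined raises ValueError.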
Datasets
- class dataset.TextLineDataset(file_path: Union[pathlib.Path, str], encoding: str = 'utf-8', lazy_load: bool = False, deserializer: Optional[Callable[[str], Sequence[Tuple[str, str]]]] = None)
Bases: object
Read a text-based file with a given encoding and deserialise each line with deserializer (if provided).
- Parameters
file_path (Union[Path, str]) – The path of the text file
encoding (str, optional) – The encoding of the file. Defaults to “utf-8”.
lazy_load (bool, optional) – Use lazy loading. Defaults to False.
deserializer (Callable[[str], Sequence[Tuple[str, str]]], optional) – A callable that takes a line and returns its deserialised form. If None, str is used.
- Raises
FileNotFoundError – If file_path does not exist.
IndexError – If the index into the dataset is greater than len() - 1.
- Returns
A dataset object
- Return type
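The behaviour described for TextLineDataset can be sketched with a small self-contained class. This is not the library's implementation: TextLineDatasetSketch, its private helpers, and the way lazy loading defers the file read until first access are all assumptions chosen to match the documented parameters and exceptions.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Union


class TextLineDatasetSketch:
    """Hypothetical stand-in that mimics the documented behaviour."""

    def __init__(
        self,
        file_path: Union[Path, str],
        encoding: str = "utf-8",
        lazy_load: bool = False,
        deserializer: Optional[Callable[[str], object]] = None,
    ) -> None:
        self._path = Path(file_path)
        if not self._path.is_file():
            raise FileNotFoundError(f"{self._path} does not exist")
        self._encoding = encoding
        # Assumption: without a deserializer, each line comes back as a str.
        self._deserializer = deserializer or str
        # Eager mode reads every line now; lazy mode waits for first access.
        self._lines: Optional[List[str]] = None if lazy_load else self._read()

    def _read(self) -> List[str]:
        with self._path.open(encoding=self._encoding) as handle:
            return [line.rstrip("\n") for line in handle]

    def _ensure_loaded(self) -> List[str]:
        if self._lines is None:
            self._lines = self._read()
        return self._lines

    def __len__(self) -> int:
        return len(self._ensure_loaded())

    def __getitem__(self, index: int):
        lines = self._ensure_loaded()
        if index > len(lines) - 1:
            raise IndexError(f"index {index} is out of range")
        return self._deserializer(lines[index])


# Usage: write a two-line file, then read it back lazily.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "refs.txt"
    sample.write_text("first line\nsecond line\n", encoding="utf-8")
    dataset = TextLineDatasetSketch(sample, lazy_load=True, deserializer=str.split)
    first_item = dataset[0]
    size = len(dataset)
```

Here str.split stands in for a real deserializer; in practice a callable such as a parser's parse method, which returns (token, label) pairs, would fit the documented signature.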
Parsers
- class dataset.parsers.CSLParser(delimiter: str = ' ')
Bases: object
This class uses regular expressions to parse an annotated string in which XML-like tags (based on CSL variables) enclose one or more delimited tokens.
- Parameters
delimiter (str) – The delimiter used for tokenization. Defaults to a single space (' ').
- Returns
A Parser object
- Return type
- parse(s: str) -> List[Tuple[str, str]]
Parse a given string into a list of tuples.
- Parameters
s (str) – the annotated string to be parsed
- Returns
A sequence of tuples; the first element of each tuple is the token and the second element is its label.
If a token is not enclosed by any tag, it is labelled other.
If a token is nested inside more than one tag, its label joins the tag names with periods, from the outermost tag to the innermost, e.g. accessed.year
- Return type
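The labelling rules above can be illustrated with a minimal regex-based sketch. The function parse_sketch and the pattern it uses are hypothetical and written only for this example; they assume well-formed, properly nested tags and are not the code behind CSLParser.parse.

```python
import re
from typing import List, Tuple

# Matches an opening tag, a closing tag, or a run of text between tags.
_PIECE = re.compile(r"<(/?)([\w-]+)>|([^<]+)")


def parse_sketch(s: str, delimiter: str = " ") -> List[Tuple[str, str]]:
    """Label each token with the dotted path of the tags that enclose it."""
    stack: List[str] = []  # currently open tags, outermost first
    result: List[Tuple[str, str]] = []
    for match in _PIECE.finditer(s):
        closing, tag, text = match.groups()
        if tag is not None:
            if closing:
                stack.pop()  # assumes tags are well formed and balanced
            else:
                stack.append(tag)
            continue
        # Text outside every tag is labelled "other".
        label = ".".join(stack) if stack else "other"
        for token in text.split(delimiter):
            if token:  # skip empty strings produced by repeated delimiters
                result.append((token, label))
    return result


annotated = (
    "<author>Doe J.</author> <title>Parsing citations</title> . "
    "<accessed>Accessed <year>2020</year></accessed>"
)
pairs = parse_sketch(annotated)
```

With this input, the token 2020 sits inside both accessed and year, so it receives the label accessed.year, while the standalone period between the tagged segments is labelled other.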