References

CSL

class dataset.csl.Variable(value)

Bases: enum.Enum

Note

other is the only non-standard value; it is assigned to tokens that fall outside the segments enclosed by tags.

Datasets

class dataset.TextLineDataset(file_path: Union[pathlib.Path, str], encoding: str = 'utf-8', lazy_load: bool = False, deserializer: Optional[Callable[[str], Sequence[Tuple[str, str]]]] = None)

Bases: object

Read a text-based file with a given encoding and deserialise each line with deserializer (if provided).

Parameters

file_path (Union[Path, str]) – The path of the text file
encoding (str, optional) – The encoding of the file. Defaults to “utf-8”.
lazy_load (bool, optional) – Use lazy loading. Defaults to False.
deserializer (Callable[[str], Sequence[Tuple[str, str]]]) – A callable that takes a line and returns its deserialised form. If None, str is used.

Raises

FileNotFoundError – If file_path does not exist
IndexError – If the requested index is greater than len() - 1, i.e. past the last line

Returns

A dataset object

Return type

TextLineDataset

root

The path of the text file

Type: Union[Path, str]

encoding

The encoding of the text file

Type: str

lazy_load

Whether to load each line lazily when __getitem__() is called; otherwise linecache is used

Type: bool

deserializer

A function to deserialise each line

Type: Callable[[str], Sequence[Tuple[str, str]]]
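The constructor arguments and exceptions described above can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: SketchTextLineDataset, split_pairs and the "token|label" line format are hypothetical names and conventions invented for the example. The lazy path here simply re-reads the file on each access, and negative indices are rejected, both of which are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Sequence, Tuple, Union

Pair = Tuple[str, str]


class SketchTextLineDataset:
    """Hypothetical stand-in that mirrors the documented interface."""

    def __init__(
        self,
        file_path: Union[Path, str],
        encoding: str = "utf-8",
        lazy_load: bool = False,
        deserializer: Optional[Callable[[str], Sequence[Pair]]] = None,
    ) -> None:
        self.root = Path(file_path)
        if not self.root.is_file():
            raise FileNotFoundError(f"{self.root} does not exist")
        self.encoding = encoding
        self.lazy_load = lazy_load
        # Fall back to str when no deserializer is given.
        self.deserializer = deserializer if deserializer is not None else str
        with self.root.open(encoding=encoding) as fh:
            lines = [line.rstrip("\n") for line in fh]
        self._length = len(lines)
        # Assumption: eager mode keeps the lines in memory, lazy mode does not.
        self._lines = None if lazy_load else lines

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: int):
        # Assumption: negative indices are treated as out of range.
        if not 0 <= index < self._length:
            raise IndexError(f"index {index} is out of range")
        if self._lines is not None:
            line = self._lines[index]
        else:
            # Lazy path: scan the file until the requested line is reached.
            with self.root.open(encoding=self.encoding) as fh:
                for i, raw in enumerate(fh):
                    if i == index:
                        line = raw.rstrip("\n")
                        break
        return self.deserializer(line)


def split_pairs(line: str) -> List[Pair]:
    """Hypothetical deserializer for space-separated "token|label" items."""
    pairs: List[Pair] = []
    for item in line.split():
        token, label = item.split("|", 1)
        pairs.append((token, label))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(
        "Smith|author 2020|year\nNature|container-title\n", encoding="utf-8"
    )
    dataset = SketchTextLineDataset(path, lazy_load=True, deserializer=split_pairs)
    first = dataset[0]
    size = len(dataset)
```

With the sample file above, first holds the token and label pairs from the first line and size is the number of lines in the file.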

Parsers

class dataset.parsers.CSLParser(delimiter: str = ' ')

Bases: object

This class allows the parsing of an annotated string that uses XML-like tags (based on CSL variables) that enclose delimited token(s) using regular expressions.

Parameters: delimiter (str) – The delimiter used for tokenization. Defaults to a single space.
Returns: A Parser object
Return type: CSLParser

parse(s: str) → List[Tuple[str, str]]

Parse a given string into a list of tuples

Parameters

s (str) – the annotated string to be parsed

Returns

A sequence of tuples; the first element of the tuple is the token and the second element is the label.

If the token is not enclosed by any tag, it will be labelled as other.

If the token is nested within more than one tag, its label joins the tag names with periods in hierarchical order, e.g. accessed.year

Return type

List[Tuple[str, str]]
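The labelling rules above can be sketched with a small regular-expression scanner and a stack of open tags. This is an illustrative reimplementation under assumptions, not the code behind CSLParser.parse: sketch_parse, TAG_RE and the exact tag syntax (<name> and </name>) are hypothetical, and the real parser may treat malformed input differently.

```python
import re
from typing import List, Tuple

# Assumed tag syntax: <name> opens a segment, </name> closes it.
TAG_RE = re.compile(r"<(/?)([\w-]+)>")


def sketch_parse(s: str, delimiter: str = " ") -> List[Tuple[str, str]]:
    """Return (token, label) pairs for an annotated string."""
    pairs: List[Tuple[str, str]] = []
    stack: List[str] = []  # names of the currently open tags, outermost first

    def emit(text: str) -> None:
        # Tokens outside every tag are labelled "other";
        # nested tags are joined with periods.
        label = ".".join(stack) if stack else "other"
        for token in text.split(delimiter):
            if token:
                pairs.append((token, label))

    pos = 0
    for match in TAG_RE.finditer(s):
        emit(s[pos:match.start()])
        closing, name = match.group(1), match.group(2)
        if closing:
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected closing tag </{name}>")
            stack.pop()
        else:
            stack.append(name)
        pos = match.end()
    emit(s[pos:])
    if stack:
        raise ValueError(f"unclosed tag <{stack[-1]}>")
    return pairs


annotated = "<author>Smith J</author> retrieved <accessed><year>2020</year></accessed>"
parsed = sketch_parse(annotated)
# parsed == [("Smith", "author"), ("J", "author"),
#            ("retrieved", "other"), ("2020", "accessed.year")]
```

Keeping a stack makes the hierarchical label a simple join of the open tag names at the moment each token is emitted.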

class dataset.parsers.IOBTag(value)

Bases: enum.Enum

An enumeration of IOB tags
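The members of IOBTag are not listed here, so the following is only a sketch of how an inside/outside/beginning enumeration is commonly applied to (token, label) pairs. SketchIOBTag, its member names and to_iob are hypothetical, and treating the other label as the outside class is an assumption.

```python
from enum import Enum
from typing import List, Optional, Sequence, Tuple


class SketchIOBTag(Enum):
    """Hypothetical enumeration of the three IOB prefixes."""

    BEGIN = "B"
    INSIDE = "I"
    OUTSIDE = "O"


def to_iob(
    pairs: Sequence[Tuple[str, str]], outside_label: str = "other"
) -> List[Tuple[str, str]]:
    """Prefix each label with B- or I-, and map the outside label to O."""
    tagged: List[Tuple[str, str]] = []
    previous: Optional[str] = None
    for token, label in pairs:
        if label == outside_label:
            tagged.append((token, SketchIOBTag.OUTSIDE.value))
            previous = None
            continue
        # A repeated label continues the current segment; a new one starts it.
        prefix = SketchIOBTag.INSIDE if label == previous else SketchIOBTag.BEGIN
        tagged.append((token, f"{prefix.value}-{label}"))
        previous = label
    return tagged


iob = to_iob(
    [("Smith", "author"), ("J", "author"), ("retrieved", "other"), ("2020", "accessed.year")]
)
# iob == [("Smith", "B-author"), ("J", "I-author"),
#         ("retrieved", "O"), ("2020", "B-accessed.year")]
```

One limitation of this sketch is that two adjacent segments with the same label are merged into one, because only the label of the previous token is tracked.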