References

CSL

class dataset.csl.Variable(value)

Bases: enum.Enum

An enumeration of CSL 1.0.1 variables.

Note

other is the only non-standard value. It is intended for tokens that fall outside the segments enclosed by the tags.
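As a rough illustration of how the non-standard other value might sit alongside the standard variables, here is a minimal sketch. VariableSketch and label_for are hypothetical names, and the members shown are only a small subset of CSL variable names chosen for the example; the actual members of dataset.csl.Variable are not listed here.

```python
from enum import Enum
from typing import Optional


class VariableSketch(Enum):
    """Hypothetical subset of CSL variable names plus the non-standard 'other'."""

    AUTHOR = "author"
    TITLE = "title"
    ISSUED = "issued"
    ACCESSED = "accessed"
    OTHER = "other"  # non-standard: tokens outside every tagged segment


def label_for(tag: Optional[str]) -> VariableSketch:
    """Return the member for a tag name, or OTHER when the token has no tag."""
    if tag is None:
        return VariableSketch.OTHER
    return VariableSketch(tag)  # lookup by value, mirroring Variable(value)
```

Looking a member up by value follows the usual enum.Enum behaviour, so an unknown name raises ValueError in this sketch.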

Datasets

class dataset.TextLineDataset(file_path: Union[pathlib.Path, str], encoding: str = 'utf-8', lazy_load: bool = False, deserializer: Optional[Callable[[str], Sequence[Tuple[str, str]]]] = None)

Bases: object

Read a text-based file with a given encoding and deserialise each line with deserializer (if provided).

Parameters
  • file_path (Union[Path, str]) – The path of the text file

  • encoding (str, optional) – The encoding of the file. Defaults to “utf-8”.

  • lazy_load (bool, optional) – Use lazy loading. Defaults to False.

  • deserializer (Callable[[str], Sequence[Tuple[str, str]]], optional) – A callable that takes a line and returns its deserialised form. If None, each line is kept as a str.

Returns

A dataset object

Return type

TextLineDataset

root

The path of the text file

Type

Union[Path, str]

encoding

The encoding of the text file

Type

str

lazy_load

Whether to load each line lazily when __getitem__() is called; otherwise linecache is used

Type

bool

deserializer

A function to deserialise each line

Type

Callable[[str], Sequence[Tuple[str, str]]]
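To make the role of deserializer concrete, the following is a minimal sketch under stated assumptions, not the library's implementation. LineDatasetSketch is a hypothetical eager stand-in that does not model lazy loading or linecache, its __len__ is an assumption, and split_pairs assumes an invented line format of space-separated token|label items.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Sequence, Tuple, Union

Pair = Tuple[str, str]


def split_pairs(line: str) -> List[Pair]:
    """Hypothetical deserializer: 'tok|label tok|label' -> [(tok, label), ...]."""
    pairs = []
    for item in line.split():
        token, _, label = item.rpartition("|")
        pairs.append((token, label))
    return pairs


class LineDatasetSketch:
    """Eager stand-in for illustration only; lazy loading is not modelled."""

    def __init__(
        self,
        file_path: Union[Path, str],
        encoding: str = "utf-8",
        deserializer: Optional[Callable[[str], Sequence[Pair]]] = None,
    ) -> None:
        self.root = file_path
        self.encoding = encoding
        self.deserializer = deserializer
        with open(file_path, encoding=encoding) as handle:
            self._lines = [line.rstrip("\n") for line in handle]

    def __len__(self) -> int:
        return len(self._lines)

    def __getitem__(self, index: int):
        line = self._lines[index]
        # Without a deserializer the raw line is returned unchanged.
        return self.deserializer(line) if self.deserializer else line


def demo() -> Tuple[str, List[Pair]]:
    """Write a one-line sample file and read it back raw and deserialised."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_text("Smith|author 2020|issued\n", encoding="utf-8")
        raw = LineDatasetSketch(path)[0]
        parsed = LineDatasetSketch(path, deserializer=split_pairs)[0]
    return raw, parsed
```

Calling demo() returns the first line both as a plain string and as a list of (token, label) pairs, which is the shape the deserializer type annotation describes.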

Parsers

class dataset.parsers.CSLParser(delimiter: str = ' ')

Bases: object

This class uses regular expressions to parse an annotated string in which XML-like tags (based on CSL variables) enclose one or more delimited tokens.

Parameters

delimiter (str) – The delimiter used for tokenization. Defaults to a single space (' ').

Returns

A Parser object

Return type

CSLParser

parse(s: str) → List[Tuple[str, str]]

Parse a given string into a list of tuples.

Parameters

s (str) – the annotated string to be parsed

Returns

A sequence of tuples; the first element of the tuple is the token and the second element is the label.

If the token is not enclosed by any tag, it will be labelled as other.

If the token is nested inside more than one tag, its label is formed by joining the tag names with periods in hierarchical order, from outermost to innermost, e.g. accessed.year

Return type

List[Tuple[str, str]]
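The labelling rules described above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the code behind CSLParser.parse: parse_sketch, the TAG_OR_TEXT pattern, and the error handling for mismatched tags are all hypothetical.

```python
import re
from typing import List, Tuple

# Matches an opening/closing tag such as <year> or </year>, or a run of plain text.
TAG_OR_TEXT = re.compile(r"<(/?)([\w-]+)>|([^<]+)")


def parse_sketch(s: str, delimiter: str = " ") -> List[Tuple[str, str]]:
    """Return (token, label) pairs; nested tags are joined with periods."""
    stack: List[str] = []  # currently open tags, outermost first
    pairs: List[Tuple[str, str]] = []
    for match in TAG_OR_TEXT.finditer(s):
        closing, tag, text = match.groups()
        if text is not None:
            # Tokens outside every tag get the non-standard label "other".
            label = ".".join(stack) if stack else "other"
            pairs.extend((tok, label) for tok in text.split(delimiter) if tok)
        elif closing:
            if not stack or stack[-1] != tag:
                raise ValueError(f"unexpected closing tag </{tag}>")
            stack.pop()
        else:
            stack.append(tag)
    if stack:
        raise ValueError(f"unclosed tag <{stack[-1]}>")
    return pairs


example = "<author>Smith J.</author> <accessed><year>2020</year></accessed> Retrieved"
result = parse_sketch(example)
# result == [("Smith", "author"), ("J.", "author"),
#            ("2020", "accessed.year"), ("Retrieved", "other")]
```

Keeping a stack of open tags is one simple way to obtain the hierarchical label; the real parser may track nesting differently.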

class dataset.parsers.IOBTag(value)

Bases: enum.Enum

An enumeration of IOB tags
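For readers unfamiliar with the scheme, here is a hedged sketch of how (token, label) pairs could be mapped onto IOB tags. IOBTagSketch, its member names, and to_iob are hypothetical; the members of dataset.parsers.IOBTag and the way the library applies them are not documented here.

```python
from enum import Enum
from typing import List, Sequence, Tuple


class IOBTagSketch(Enum):
    """Hypothetical members following the conventional B/I/O letters."""

    BEGIN = "B"
    INSIDE = "I"
    OUTSIDE = "O"


def to_iob(
    pairs: Sequence[Tuple[str, str]], outside_label: str = "other"
) -> List[Tuple[str, str]]:
    """Prefix each label with B- or I-; tokens labelled `outside_label` become O."""
    tagged: List[Tuple[str, str]] = []
    previous = None
    for token, label in pairs:
        if label == outside_label:
            tagged.append((token, IOBTagSketch.OUTSIDE.value))
        elif label == previous:
            # Same label as the preceding token: continue the current span.
            tagged.append((token, f"{IOBTagSketch.INSIDE.value}-{label}"))
        else:
            tagged.append((token, f"{IOBTagSketch.BEGIN.value}-{label}"))
        previous = label
    return tagged
```

Note that this simplification treats consecutive tokens with the same label as one span, so two adjacent segments sharing a label would not be separated.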