`medcat.utils.relation_extraction.tokenizer`

Module Contents

Classes

BaseTokenizerWrapper_RelationExtraction

Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).

Attributes

logger

medcat.utils.relation_extraction.tokenizer.logger

class medcat.utils.relation_extraction.tokenizer.BaseTokenizerWrapper_RelationExtraction(hf_tokenizers=None, max_seq_length=None, add_special_tokens=False)

Bases: transformers.PreTrainedTokenizerFast

Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).

Inherits from transformers.tokenization_utils_base.PreTrainedTokenizerBase.

Handles all the shared methods for tokenization and special tokens, the methods for downloading, caching, and loading pretrained tokenizers, and the methods for adding tokens to the vocabulary.

This class also contains the added tokens in a unified way on top of all tokenizers so we don’t have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece…).

Parameters:

max_seq_length (Optional[int]) –
add_special_tokens (Optional[bool]) –

name = 'base_tokenizer_wrapper_rel'

__init__(hf_tokenizers=None, max_seq_length=None, add_special_tokens=False)

Parameters:

max_seq_length (Optional[int]) –
add_special_tokens (Optional[bool]) –

get_size()

token_to_id(token)

get_pad_id()
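
This page lists get_size(), token_to_id(token) and get_pad_id() without describing what they return. The sketch below is only an illustration of one plausible reading: it assumes each accessor forwards to the tokenizer passed in as hf_tokenizers. ToyTokenizer and ToyWrapper are hypothetical stand-ins written for this example, not MedCAT or HuggingFace classes, so the snippet runs without either library installed.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a wrapped tokenizer (not a real library class)."""

    def __init__(self, vocab, pad_token="[PAD]"):
        self.vocab = dict(vocab)
        self.pad_token_id = self.vocab[pad_token]

    def convert_tokens_to_ids(self, token):
        return self.vocab[token]


class ToyWrapper:
    """Hypothetical wrapper that mirrors the method names documented above."""

    def __init__(self, hf_tokenizers=None, max_seq_length=None, add_special_tokens=False):
        self.hf_tokenizers = hf_tokenizers
        self.max_seq_length = max_seq_length
        self.add_special_tokens = add_special_tokens

    def get_size(self):
        # Assumption: "size" means the number of entries in the wrapped vocabulary.
        return len(self.hf_tokenizers.vocab)

    def token_to_id(self, token):
        # Assumption: the lookup is forwarded to the wrapped tokenizer.
        return self.hf_tokenizers.convert_tokens_to_ids(token)

    def get_pad_id(self):
        # Assumption: the padding id is read from the wrapped tokenizer.
        return self.hf_tokenizers.pad_token_id


toy_vocab = {"[PAD]": 0, "patient": 1, "fever": 2}
wrapper = ToyWrapper(ToyTokenizer(toy_vocab), max_seq_length=512)

size = wrapper.get_size()
fever_id = wrapper.token_to_id("fever")
pad_id = wrapper.get_pad_id()
```

Keeping the accessors as thin pass-throughs would let calling code stay unaware of which concrete tokenizer sits underneath; whether the real class does exactly this should be checked against the MedCAT source.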

__call__(text, truncation=True)

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.

Parameters:

text (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_target (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair_target (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
truncation (Optional[bool]) –

save(dir_path)

Parameters:

dir_path (str) –

classmethod load(tokenizer_path, relcat_config, **kwargs)

Parameters:

tokenizer_path (str) –
relcat_config (medcat.config_rel_cat.ConfigRelCAT) –

Return type:

BaseTokenizerWrapper_RelationExtraction
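
The save(dir_path) and load(tokenizer_path, relcat_config, **kwargs) signatures above suggest a round trip through a directory on disk. The following is a minimal sketch of that pattern under stated assumptions: that state is stored in a subfolder named after the class attribute name, and that load rebuilds a wrapper from that subfolder. ToyWrapper and the vocab.json file name are hypothetical and exist only for this example; the toy version also ignores relcat_config.

```python
import json
import os
import tempfile


class ToyWrapper:
    """Hypothetical stand-in that mirrors the documented save/load signatures."""

    name = "base_tokenizer_wrapper_rel"  # class attribute listed on this page

    def __init__(self, hf_tokenizers=None):
        # In this sketch the "tokenizer" is just a plain dict vocabulary.
        self.hf_tokenizers = hf_tokenizers

    def save(self, dir_path):
        # Assumption: state is written to a subfolder named after `name`.
        path = os.path.join(dir_path, self.name)
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "vocab.json"), "w", encoding="utf-8") as f:
            json.dump(self.hf_tokenizers, f)

    @classmethod
    def load(cls, tokenizer_path, relcat_config=None, **kwargs):
        # relcat_config and kwargs are accepted but unused in this toy version.
        path = os.path.join(tokenizer_path, cls.name)
        with open(os.path.join(path, "vocab.json"), encoding="utf-8") as f:
            return cls(hf_tokenizers=json.load(f))


with tempfile.TemporaryDirectory() as tmp:
    ToyWrapper({"[PAD]": 0, "fever": 1}).save(tmp)
    saved_entries = os.listdir(tmp)
    restored = ToyWrapper.load(tmp, relcat_config=None)
```

Using the class-level name as the subfolder would let several tokenizer wrappers share one model directory without overwriting each other; this is a design guess, not a statement about the actual implementation.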
