medcat.utils.relation_extraction.tokenizer

Module Contents

Classes

TokenizerWrapperBERT

Wrapper around a huggingface BERT tokenizer so that it works with the RelCAT models.

class medcat.utils.relation_extraction.tokenizer.TokenizerWrapperBERT(hf_tokenizers=None, max_seq_length=None, add_special_tokens=False)

Bases: transformers.models.bert.tokenization_bert_fast.BertTokenizerFast

Wrapper around a huggingface BERT tokenizer so that it works with the RelCAT models.

Parameters:
  • hf_tokenizers (transformers.models.bert.tokenization_bert_fast.BertTokenizerFast) – A huggingface fast BERT tokenizer (BertTokenizerFast) to wrap.

  • max_seq_length (Optional[int]) – Maximum length of an encoded sequence.

  • add_special_tokens (Optional[bool]) – Whether special tokens are added when encoding text.

name = 'bert-tokenizer'
__init__(hf_tokenizers=None, max_seq_length=None, add_special_tokens=False)
Parameters:
  • max_seq_length (Optional[int]) – Maximum length of an encoded sequence.

  • add_special_tokens (Optional[bool]) – Whether special tokens are added when encoding text.

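Example (a minimal construction sketch; the checkpoint name bert-base-uncased is an illustrative assumption, any checkpoint loadable by BertTokenizerFast works):

  from transformers import BertTokenizerFast
  from medcat.utils.relation_extraction.tokenizer import TokenizerWrapperBERT

  # Wrap a standard huggingface fast BERT tokenizer for use with RelCAT.
  hf_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
  tokenizer = TokenizerWrapperBERT(hf_tokenizers=hf_tok,
                                   max_seq_length=512,
                                   add_special_tokens=True)
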
__call__(text, truncation=True)

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.

Parameters:
  • text (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (a pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

  • text_pair (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded, in the same format as text.

  • text_target (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded as target texts, in the same format as text.

  • text_pair_target (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded as target texts, in the same format as text.

  • truncation (Optional[bool]) – Whether to truncate sequences longer than the maximum length. Defaults to True.

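Example (continuing from the construction sketch above; the exact structure of the returned value is version-dependent, so it is inspected rather than assumed):

  # Tokenize a single sequence; truncation defaults to True.
  out = tokenizer("Patient was started on metformin for type 2 diabetes.")

  # Inspect the output rather than assuming a fixed schema.
  print(type(out))
  if isinstance(out, dict):
      print(sorted(out.keys()))
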
save(dir_path)

Save the wrapped huggingface tokenizer to dir_path.

classmethod load(dir_path, **kwargs)

Load a tokenizer previously saved with save() from dir_path and return a TokenizerWrapperBERT.

get_size()

Return the size of the tokenizer vocabulary.

token_to_id(token)

Return the vocabulary id for the given token.

get_pad_id()

Return the id of the padding token.
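
Example (continuing from the sketches above; the directory name relcat_tokenizer is an illustrative assumption, and the helper behaviour follows the descriptions above):

  # Persist the wrapped tokenizer and restore it later.
  tokenizer.save("relcat_tokenizer")
  restored = TokenizerWrapperBERT.load("relcat_tokenizer")

  # Vocabulary helpers.
  print(restored.get_size())            # vocabulary size
  print(restored.get_pad_id())          # id of the padding token
  print(restored.token_to_id("[CLS]"))  # id of a single token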