:py:mod:`medcat.utils.relation_extraction.tokenizer`
====================================================

.. py:module:: medcat.utils.relation_extraction.tokenizer


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.relation_extraction.tokenizer.TokenizerWrapperBERT


.. py:class:: TokenizerWrapperBERT(hf_tokenizers=None, max_seq_length = None, add_special_tokens = False)

   Bases: :py:obj:`transformers.models.bert.tokenization_bert_fast.BertTokenizerFast`

   Wrapper around a Hugging Face BERT tokenizer so that it works with the RelCAT models.

   :param hf_tokenizers: A Hugging Face fast BERT tokenizer.
   :type hf_tokenizers: `transformers.models.bert.tokenization_bert_fast.BertTokenizerFast`

   .. py:attribute:: name
      :value: 'bert-tokenizer'

   .. py:method:: __init__(hf_tokenizers=None, max_seq_length = None, add_special_tokens = False)

   .. py:method:: __call__(text, truncation = True)

      Tokenize and prepare one or several sequences, or one or several pairs of sequences, for the model.

      :param text: The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                   (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set
                   `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
      :type text: `str`, `List[str]`, `List[List[str]]`, *optional*
      :param text_pair: The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                        (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set
                        `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
      :type text_pair: `str`, `List[str]`, `List[List[str]]`, *optional*
      :param text_target: The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
                          list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized),
                          you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
      :type text_target: `str`, `List[str]`, `List[List[str]]`, *optional*
      :param text_pair_target: The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
                               list of strings (pretokenized string). If the sequences are provided as a list of strings
                               (pretokenized), you must set `is_split_into_words=True` (to lift the ambiguity with a batch of
                               sequences).
      :type text_pair_target: `str`, `List[str]`, `List[List[str]]`, *optional*

   .. py:method:: save(dir_path)

   .. py:method:: load(dir_path, **kwargs)
      :classmethod:

   .. py:method:: get_size()

   .. py:method:: token_to_id(token)

   .. py:method:: get_pad_id()
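
The helper methods listed above (``save``, ``load``, ``get_size``, ``token_to_id``, ``get_pad_id``) are documented by name only. The sketch below shows one plausible reading of the wrapper pattern, assuming each helper simply delegates to the wrapped tokenizer. ``ToyTokenizer`` and ``WrapperSketch`` are hypothetical stand-ins written for this illustration, not MedCAT or Hugging Face classes, and the on-disk layout used by ``save`` and ``load`` here is an assumption.

```python
import json
import os
import tempfile


class ToyTokenizer:
    """Hypothetical stand-in for a Hugging Face fast BERT tokenizer."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self.pad_token_id = self.vocab["[PAD]"]
        self.unk_token_id = self.vocab["[UNK]"]

    def convert_tokens_to_ids(self, token):
        # Unknown tokens fall back to the [UNK] id.
        return self.vocab.get(token, self.unk_token_id)

    def encode(self, text, max_length=None, truncation=True):
        # Lowercased whitespace split; a real BERT tokenizer uses WordPiece.
        ids = [self.convert_tokens_to_ids(t) for t in text.lower().split()]
        if truncation and max_length is not None:
            ids = ids[:max_length]
        return ids


class WrapperSketch:
    """Illustrative wrapper only; not MedCAT's TokenizerWrapperBERT."""

    name = "toy-tokenizer"  # assumed to name the save sub-directory

    def __init__(self, hf_tokenizers=None, max_seq_length=None):
        self.hf_tokenizers = hf_tokenizers
        self.max_seq_length = max_seq_length

    def __call__(self, text, truncation=True):
        ids = self.hf_tokenizers.encode(text, self.max_seq_length, truncation)
        return {"input_ids": ids, "length": len(ids)}

    def get_size(self):
        # Vocabulary size of the wrapped tokenizer.
        return len(self.hf_tokenizers.vocab)

    def token_to_id(self, token):
        return self.hf_tokenizers.convert_tokens_to_ids(token)

    def get_pad_id(self):
        return self.hf_tokenizers.pad_token_id

    def save(self, dir_path):
        # Write the vocabulary to <dir_path>/<name>/vocab.json.
        path = os.path.join(dir_path, self.name)
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "vocab.json"), "w", encoding="utf-8") as fh:
            json.dump(self.hf_tokenizers.vocab, fh)

    @classmethod
    def load(cls, dir_path, **kwargs):
        # Rebuild the wrapper from the directory written by save().
        path = os.path.join(dir_path, cls.name, "vocab.json")
        with open(path, encoding="utf-8") as fh:
            vocab = json.load(fh)
        return cls(hf_tokenizers=ToyTokenizer(vocab), **kwargs)


vocab = {"[PAD]": 0, "[UNK]": 1, "aspirin": 2, "treats": 3, "headache": 4}
wrapper = WrapperSketch(ToyTokenizer(vocab), max_seq_length=2)
encoded = wrapper("Aspirin treats headache")  # truncated to two tokens

with tempfile.TemporaryDirectory() as tmp:
    wrapper.save(tmp)
    restored = WrapperSketch.load(tmp, max_seq_length=2)
```

Keeping the backend in a single attribute (``hf_tokenizers``, matching the constructor argument documented above) means callers can query the vocabulary size or padding id without knowing which tokenizer sits underneath.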