medcat.tokenizers.meta_cat_tokenizers

Module Contents

Classes

TokenizerWrapperBase

Helper class that provides a standard way to create an ABC using inheritance.

TokenizerWrapperBPE

Wrapper around a huggingface tokenizer so that it works with the MetaCAT models.

TokenizerWrapperBERT

Wrapper around a huggingface BERT tokenizer so that it works with the MetaCAT models.

class medcat.tokenizers.meta_cat_tokenizers.TokenizerWrapperBase(hf_tokenizer=None)

Bases: abc.ABC

Helper class that provides a standard way to create an ABC using inheritance.

Parameters:

hf_tokenizer (Optional[tokenizers.Tokenizer]) –

name: str
__init__(hf_tokenizer=None)
Parameters:

hf_tokenizer (Optional[tokenizers.Tokenizer]) –

Return type:

None

__call__(text: str) → Dict
__call__(text: List[str]) → List[Dict]
abstract save(dir_path)
Parameters:

dir_path (str) –

Return type:

None

abstract classmethod load(dir_path, model_variant='', **kwargs)
Parameters:
  • dir_path (str) –

  • model_variant (Optional[str]) –

Return type:

tokenizers.Tokenizer

abstract get_size()
Return type:

int

abstract token_to_id(token)
Parameters:

token (str) –

Return type:

Union[int, List[int]]

abstract get_pad_id()
Return type:

Union[Optional[int], List[int]]

ensure_tokenizer()
Return type:

tokenizers.Tokenizer
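
Example: a minimal sketch of a custom subclass that fills in the abstract interface above. The class name and the toy whitespace vocabulary are purely illustrative; a real wrapper would delegate to a huggingface tokenizer, as the two subclasses below do.

    from typing import Dict, List, Optional, Union

    from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBase


    class WhitespaceTokenizerWrapper(TokenizerWrapperBase):
        # Hypothetical example class; 'name' follows the pattern of the wrappers below.
        name = 'whitespace-example'

        def __init__(self, hf_tokenizer=None):
            super().__init__(hf_tokenizer)
            self.vocab = {'<pad>': 0}  # toy vocabulary; id 0 reserved for padding

        def __call__(self, text: Union[str, List[str]]) -> Union[Dict, List[Dict]]:
            if isinstance(text, str):
                tokens = text.split()
                return {'tokens': tokens,
                        'input_ids': [self.token_to_id(t) for t in tokens],
                        'offset_mapping': []}  # real wrappers also report character offsets
            return [self(t) for t in text]

        def save(self, dir_path: str) -> None:
            pass  # a real wrapper would persist self.vocab under dir_path

        @classmethod
        def load(cls, dir_path: str, model_variant: Optional[str] = '', **kwargs):
            return cls()  # a real wrapper would rebuild itself from dir_path

        def get_size(self) -> int:
            return len(self.vocab)

        def token_to_id(self, token: str) -> int:
            # Toy behaviour: unknown tokens are added to the vocabulary on lookup.
            return self.vocab.setdefault(token, len(self.vocab))

        def get_pad_id(self) -> Optional[int]:
            return self.vocab['<pad>']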

class medcat.tokenizers.meta_cat_tokenizers.TokenizerWrapperBPE(hf_tokenizers=None)

Bases: TokenizerWrapperBase

Wrapper around a huggingface tokenizer so that it works with the MetaCAT models.

Parameters:

hf_tokenizers (Optional[tokenizers.ByteLevelBPETokenizer]) – A huggingface BBPE tokenizer.

name = 'bbpe'
__init__(hf_tokenizers=None)
Parameters:

hf_tokenizers (Optional[tokenizers.ByteLevelBPETokenizer]) –

Return type:

None

__call__(text: str) → Dict
__call__(text: List[str]) → List[Dict]

Tokenize the given text or texts.

Parameters:

text (Union[str, List[str]]) – Text/texts to be tokenized.

Returns:

Union[Dict, List[Dict]] – A dictionary (or a list of dictionaries, one per input text) containing offset_mapping, input_ids and tokens for the input text(s).

Raises:

Exception – If the input is neither a string nor a list of strings.
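
Example: tokenizing text with the BPE wrapper. The vocabulary and merges paths are placeholders for a trained byte-level BPE tokenizer.

    from tokenizers import ByteLevelBPETokenizer
    from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBPE

    # "vocab.json" and "merges.txt" stand in for the files of a trained BBPE vocabulary.
    bbpe = ByteLevelBPETokenizer("vocab.json", "merges.txt")
    tokenizer = TokenizerWrapperBPE(hf_tokenizers=bbpe)

    single = tokenizer("Patient denies chest pain.")   # Dict
    batch = tokenizer(["Chest pain.", "No fever."])    # List[Dict]
    print(single["tokens"], single["input_ids"], single["offset_mapping"])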

save(dir_path)
Parameters:

dir_path (str) –

Return type:

None

classmethod load(dir_path, model_variant='', **kwargs)
Parameters:
  • dir_path (str) –

  • model_variant (Optional[str]) –

Return type:

TokenizerWrapperBPE
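
Example: a save/load round trip. The file paths and directory name are placeholders; load rebuilds a TokenizerWrapperBPE from the files written by save.

    import os

    from tokenizers import ByteLevelBPETokenizer
    from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBPE

    tokenizer = TokenizerWrapperBPE(ByteLevelBPETokenizer("vocab.json", "merges.txt"))
    save_dir = "bpe_tokenizer_dir"
    os.makedirs(save_dir, exist_ok=True)

    tokenizer.save(save_dir)                       # persist the wrapped BBPE tokenizer
    restored = TokenizerWrapperBPE.load(save_dir)  # rebuild the wrapper from disk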

get_size()
Return type:

int

token_to_id(token)
Parameters:

token (str) –

Return type:

Union[int, List[int]]

get_pad_id()
Return type:

Union[int, List[int]]
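
Example: inspecting the wrapped vocabulary with the three accessors above. Concrete values depend on the trained vocabulary behind the wrapper; the file paths are placeholders.

    from tokenizers import ByteLevelBPETokenizer
    from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBPE

    tokenizer = TokenizerWrapperBPE(ByteLevelBPETokenizer("vocab.json", "merges.txt"))

    vocab_size = tokenizer.get_size()        # number of entries in the BBPE vocabulary
    pad_id = tokenizer.get_pad_id()          # id (or ids) used when padding sequences
    pain_id = tokenizer.token_to_id("pain")  # an int, or a list of ids, per the types above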

class medcat.tokenizers.meta_cat_tokenizers.TokenizerWrapperBERT(hf_tokenizers=None)

Bases: TokenizerWrapperBase

Wrapper around a huggingface BERT tokenizer so that it works with the MetaCAT models.

Parameters:

hf_tokenizers (Optional[transformers.models.bert.tokenization_bert_fast.BertTokenizerFast]) – A huggingface Fast BERT tokenizer.

name = 'bert-tokenizer'
__init__(hf_tokenizers=None)
Parameters:

hf_tokenizers (Optional[transformers.models.bert.tokenization_bert_fast.BertTokenizerFast]) –

Return type:

None

__call__(text: str) → Dict
__call__(text: List[str]) → List[Dict]
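
Example: wrapping a pretrained huggingface BertTokenizerFast. The 'bert-base-uncased' checkpoint is only a stand-in for whichever BERT variant the MetaCAT model uses.

    from transformers import BertTokenizerFast
    from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBERT

    hf = BertTokenizerFast.from_pretrained("bert-base-uncased")
    tokenizer = TokenizerWrapperBERT(hf_tokenizers=hf)

    single = tokenizer("Patient denies chest pain.")   # Dict
    batch = tokenizer(["Chest pain.", "No fever."])    # List[Dict]
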
save(dir_path)
Parameters:

dir_path (str) –

Return type:

None

classmethod load(dir_path, model_variant='', **kwargs)
Parameters:
  • dir_path (str) –

  • model_variant (Optional[str]) –

Return type:

TokenizerWrapperBERT
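
Example: restoring a previously saved wrapper. The directory and the model_variant value are placeholders for whatever the wrapper was saved with.

    from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBERT

    restored = TokenizerWrapperBERT.load("bert_tokenizer_dir",
                                         model_variant="bert-base-uncased")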

get_size()
Return type:

int

token_to_id(token)
Parameters:

token (str) –

Return type:

Union[int, List[int]]

get_pad_id()
Return type:

Optional[int]
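
Example: padding a batch to a common length with the wrapper's pad id. This sketch assumes the __call__ output carries an input_ids key, as documented for the BPE wrapper above, and falls back to 0 if get_pad_id() returns None.

    from transformers import BertTokenizerFast
    from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBERT

    tokenizer = TokenizerWrapperBERT(BertTokenizerFast.from_pretrained("bert-base-uncased"))

    encoded = tokenizer(["Chest pain.", "No fever or cough."])
    pad_id = tokenizer.get_pad_id()
    if pad_id is None:  # the return type above allows None
        pad_id = 0
    max_len = max(len(e["input_ids"]) for e in encoded)
    padded = [e["input_ids"] + [pad_id] * (max_len - len(e["input_ids"])) for e in encoded]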