medcat.tokenizers.transformers_ner

Module Contents

Classes

TransformersTokenizerNER

class medcat.tokenizers.transformers_ner.TransformersTokenizerNER(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)

Bases: object

Args:

  hf_tokenizer:
    Must be able to return token offsets.

  max_len:
    Max sequence length; longer inputs are split into multiple examples.

  id2type:
    Can be ignored in most cases. A map from token to 'start' or 'sub',
    indicating whether the token is a subword or the start of (or a full)
    word. For BERT, 'start' is every token that does not begin with ##.

  cui2name:
    Map from CUI to the full name used for labels.

Parameters:
  • hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase])
  • max_len (int)
  • id2type (Optional[Dict])
  • cui2name (Optional[Dict])

__init__(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)
Parameters:
  • hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase])
  • max_len (int)
  • id2type (Optional[Dict])
  • cui2name (Optional[Dict])

Return type:

None
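A minimal construction sketch, assuming a BERT-style HuggingFace tokenizer. The model name, the id2type derivation, and the cui2name entry are illustrative, not taken from MedCAT. Note also that, although the docstring says the map is from "token", the name id2type suggests it is keyed by token id, which is what this sketch assumes:

    from transformers import AutoTokenizer

    from medcat.tokenizers.transformers_ner import TransformersTokenizerNER

    # Hypothetical model; any fast tokenizer that returns token offsets works.
    hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # For BERT vocabularies, subword pieces begin with "##"; everything else
    # is the start of a word (or a full word).
    id2type = {
        tok_id: ("sub" if token.startswith("##") else "start")
        for token, tok_id in hf_tokenizer.get_vocab().items()
    }

    ner_tokenizer = TransformersTokenizerNER(
        hf_tokenizer=hf_tokenizer,
        max_len=512,
        id2type=id2type,
        cui2name={"C0011849": "Diabetes Mellitus"},  # illustrative CUI -> name
    )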

calculate_label_map(dataset)
Return type:

None
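No description is attached to this method; judging by the name and the None return type, it presumably builds the tokenizer's internal label map in place from an annotated dataset. A hypothetical call, assuming dataset is a HuggingFace DatasetDict in the medcat_ner format:

    # Assumed usage: populates the internal label maps; returns None.
    ner_tokenizer.calculate_label_map(dataset["train"])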

encode(examples, ignore_subwords=False)

Used with the HuggingFace datasets map function to convert a medcat_ner dataset into the form required for NER with BERT. Long text segments are split into sequences of at most max_len tokens.

Parameters:
  • examples (Dict) – Stream of examples.

  • ignore_subwords (bool) – If set to True, the subwords of any token are given the special label X.

Returns:

The same dict, modified.

Return type:

Dict
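A sketch of the intended use with datasets.Dataset.map in batched mode. Since long texts can be split into several sequences, the mapped batch may contain more rows than the input batch, so the original columns are dropped here; the column handling is an assumption, not documented behaviour:

    encoded = dataset["train"].map(
        lambda examples: ner_tokenizer.encode(examples, ignore_subwords=False),
        batched=True,
        remove_columns=dataset["train"].column_names,
    )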

save(path)
Parameters:

path (str)

Return type:

None
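A one-line usage sketch; the file name is illustrative:

    ner_tokenizer.save("tokenizer_ner.dat")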

ensure_tokenizer()
Return type:

transformers.tokenization_utils_base.PreTrainedTokenizerBase
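Presumably an accessor that returns the wrapped HF tokenizer and fails if none was set (an assumption from the name and return type):

    hf = ner_tokenizer.ensure_tokenizer()  # assumed non-None on return
    print(type(hf).__name__)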

classmethod load(path)
Parameters:

path (str)

Return type:

TransformersTokenizerNER
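A hypothetical round trip using the path from the save() example above. Whether the wrapped HF tokenizer is serialised along with the maps is not stated here; if it is not, re-attach it via the hf_tokenizer attribute before use (an assumption based on the constructor signature):

    restored = TransformersTokenizerNER.load("tokenizer_ner.dat")
    assert isinstance(restored, TransformersTokenizerNER)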