medcat.tokenizers.transformers_ner
Module Contents
Classes
- class medcat.tokenizers.transformers_ner.TransformersTokenizerNER(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)
Bases: object
Args:
- hf_tokenizer:
Must be able to return token offsets.
- max_len:
Maximum sequence length; longer inputs are split into multiple examples.
- id2type:
Can be ignored in most cases. A map from token to ‘start’ or ‘sub’, indicating whether the token is a subword or the start of a (full) word. For BERT, ‘start’ is every token that does not begin with ##.
- cui2name:
Map from CUI to full name for labels.
- Parameters:
hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase]) –
max_len (int) –
id2type (Optional[Dict]) –
cui2name (Optional[Dict]) –
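The id2type mapping described above can be sketched for a BERT-style wordpiece vocabulary, where subword pieces carry the ## prefix. The vocabulary below is a toy fragment for illustration, not a real BERT vocabulary:

```python
# Illustrative sketch: derive an id2type-style map from a BERT-like
# wordpiece vocabulary. Subword pieces begin with "##"; every other
# token starts (or is) a full word. The vocab here is a toy fragment.
vocab = {"immun": 0, "##ization": 1, "chronic": 2, "##s": 3, "pain": 4}

id2type = {
    tok_id: ("sub" if token.startswith("##") else "start")
    for token, tok_id in vocab.items()
}

print(id2type)  # {0: 'start', 1: 'sub', 2: 'start', 3: 'sub', 4: 'start'}
```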
- __init__(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)
- Parameters:
hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase]) –
max_len (int) –
id2type (Optional[Dict]) –
cui2name (Optional[Dict]) –
- Return type:
None
- calculate_label_map(dataset)
- Return type:
None
- encode(examples, ignore_subwords=False)
Used with the Hugging Face datasets map function to convert a medcat_ner dataset into the form required for NER with BERT. Long text segments are split into sequences of at most max_len tokens.
- Parameters:
examples (Dict) – Stream of examples.
ignore_subwords (bool) – If set to True, subwords of any token are given the special label X.
- Returns:
Dict – The same dict, modified.
- Return type:
Dict
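The splitting behaviour of encode can be sketched in plain Python. chunk_sequence below is a hypothetical helper for illustration, not part of the medcat API:

```python
def chunk_sequence(token_ids, max_len=512):
    """Split a token-ID sequence into pieces of at most max_len items.

    This mirrors how encode() turns one long document into several
    examples; the function itself is illustrative, not medcat code.
    """
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# A 1,100-token document becomes three examples under the default max_len.
chunks = chunk_sequence(list(range(1100)))
print([len(c) for c in chunks])  # [512, 512, 76]
```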
- save(path)
- Parameters:
path (str) –
- Return type:
None
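The save(path)/load(path) pair follows the usual persist-then-restore pattern. A minimal pure-Python stand-in (not medcat code; the real class serializes more state, and its on-disk format is internal) shows the shape of the round trip:

```python
import json
import os
import tempfile

class ToyTokenizerState:
    """Stand-in illustrating the save(path)/load(path) round trip.

    The real TransformersTokenizerNER persists more than cui2name;
    this toy class only demonstrates the path-based API shape.
    """

    def __init__(self, cui2name=None):
        self.cui2name = cui2name or {}

    def save(self, path):
        # Write state to the given path.
        with open(path, "w") as f:
            json.dump({"cui2name": self.cui2name}, f)

    @classmethod
    def load(cls, path):
        # Restore an instance from a previously saved path.
        with open(path) as f:
            state = json.load(f)
        return cls(cui2name=state["cui2name"])

# Round trip through a temporary file.
tok = ToyTokenizerState(cui2name={"C0011849": "Diabetes mellitus"})
path = os.path.join(tempfile.mkdtemp(), "tok.json")
tok.save(path)
restored = ToyTokenizerState.load(path)
print(restored.cui2name)  # {'C0011849': 'Diabetes mellitus'}
```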
- ensure_tokenizer()
- Return type:
transformers.tokenization_utils_base.PreTrainedTokenizerBase
- classmethod load(path)
- Parameters:
path (str) –
- Return type: