:py:mod:`medcat.tokenizers.transformers_ner`
============================================

.. py:module:: medcat.tokenizers.transformers_ner


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.tokenizers.transformers_ner.TransformersTokenizerNER


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.tokenizers.transformers_ner.logger


.. py:data:: logger


.. py:class:: TransformersTokenizerNER(hf_tokenizer = None, max_len = 512, id2type = None, cui2name = None)

   Bases: :py:obj:`object`

   Args:
       hf_tokenizer:
           Must be able to return token offsets.
       max_len:
           Maximum sequence length. Longer inputs are split into multiple examples.
       id2type:
           Can be ignored in most cases. Should be a map from token to 'start' or 'sub',
           indicating whether the token is a subword or the start of a word (or a full word).
           For BERT, 'start' is every token that does not begin with ##.
       cui2name:
           Map from CUI to the full name used for labels.

   .. py:method:: __init__(hf_tokenizer = None, max_len = 512, id2type = None, cui2name = None)


   .. py:method:: calculate_label_map(dataset)


   .. py:method:: encode(examples, ignore_subwords = False)

      Used with the Hugging Face datasets ``map`` function to convert a medcat_ner dataset
      into the form required for NER with BERT. Long text segments are split into
      sequences no longer than ``max_len`` (chunking).

      :param examples: Stream of examples.
      :type examples: Dict
      :param ignore_subwords: If ``True``, subwords of any token get the special label ``X``.
      :type ignore_subwords: bool

      :Returns: **Dict** -- The same dict, modified.


   .. py:method:: save(path)


   .. py:method:: ensure_tokenizer()


   .. py:method:: load(path)
      :classmethod:
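
The ``id2type`` map, the ``X`` subword label and the ``max_len`` chunking described above can be illustrated with a small standalone sketch. It is not MedCAT's implementation and does not call MedCAT or Hugging Face. The helper names (``build_id2type``, ``mask_subwords``, ``chunk``), the toy vocabulary and the ``CUI_A`` label are hypothetical. Keying ``id2type`` by token ID is an assumption based on the parameter name, and real chunking may treat special tokens differently.

```python
from typing import Dict, List


def build_id2type(vocab: Dict[str, int]) -> Dict[int, str]:
    """Hypothetical helper: mark WordPiece continuations ('##...') as 'sub',
    everything else as 'start'. Keyed by token ID (an assumption)."""
    return {
        token_id: "sub" if token.startswith("##") else "start"
        for token, token_id in vocab.items()
    }


def mask_subwords(ids: List[int], labels: List[str],
                  id2type: Dict[int, str], sub_label: str = "X") -> List[str]:
    """Replace the label of every subword token with the special label."""
    return [
        sub_label if id2type[token_id] == "sub" else label
        for token_id, label in zip(ids, labels)
    ]


def chunk(ids: List[int], max_len: int) -> List[List[int]]:
    """Split a sequence into consecutive pieces of at most max_len items."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]


# Toy BERT-style vocabulary, invented for this example.
vocab = {"[CLS]": 0, "heart": 1, "##burn": 2, "pain": 3, "[SEP]": 4}
id2type = build_id2type(vocab)

ids = [0, 1, 2, 3, 4]
labels = ["O", "CUI_A", "CUI_A", "O", "O"]  # 'CUI_A' is a placeholder label

masked = mask_subwords(ids, labels, id2type)
chunks = chunk(ids, max_len=2)

print(masked)
print(chunks)
```

With ``ignore_subwords`` semantics as sketched here, only the first piece of a word keeps its label, so a model is scored once per word rather than once per subword.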