:py:mod:`medcat.preprocessing.tokenizers`
=========================================

.. py:module:: medcat.preprocessing.tokenizers


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.preprocessing.tokenizers.WordpieceTokenizer
   medcat.preprocessing.tokenizers.SpacyHFTok
   medcat.preprocessing.tokenizers.SpacyHFDoc
   medcat.preprocessing.tokenizers.TokenizerWrapperBPE
   medcat.preprocessing.tokenizers.TokenizerWrapperBERT


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.preprocessing.tokenizers.spacy_extended
   medcat.preprocessing.tokenizers.spacy_split_all


.. py:function:: spacy_extended(nlp)


.. py:function:: spacy_split_all(nlp, config)


.. py:class:: WordpieceTokenizer(vocab, unk_token = '[UNK]', max_input_chars_per_word = 200)

   Bases: :py:obj:`object`

   Runs WordPiece tokenization.

   .. py:method:: __init__(vocab, unk_token = '[UNK]', max_input_chars_per_word = 200)


   .. py:method:: tokenize(text)

      Tokenizes a piece of text into its word pieces.

      This uses a greedy longest-match-first algorithm to perform tokenization
      using the given vocabulary.

      For example:
        input = "unaffable"
        output = ["un", "##aff", "##able"]

      :param text: A single token or whitespace-separated tokens. This should have
                   already been passed through `BasicTokenizer`.
      :type text: str

      :Returns: **List** -- A list of wordpiece tokens.



.. py:class:: SpacyHFTok(w2v)

   Bases: :py:obj:`object`

   .. py:method:: __init__(w2v)


   .. py:method:: encode(text)


   .. py:method:: token_to_id(tok)



.. py:class:: SpacyHFDoc(doc)

   Bases: :py:obj:`object`

   .. py:method:: __init__(doc)



.. py:class:: TokenizerWrapperBPE(hf_tokenizers)

   Bases: :py:obj:`object`

   .. py:method:: __init__(hf_tokenizers)


   .. py:method:: __call__(text)


   .. py:method:: save(dir_path, name='bbpe')


   .. py:method:: load(dir_path, name='bbpe', **kwargs)
      :classmethod:



.. py:class:: TokenizerWrapperBERT(hf_tokenizers=None)

   Bases: :py:obj:`object`

   .. py:method:: __init__(hf_tokenizers=None)


   .. py:method:: __call__(text)


   .. py:method:: save(dir_path, name = 'bert')


   .. py:method:: load(dir_path, name = 'bert', **kwargs)
      :classmethod:
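
The greedy longest-match-first procedure described for ``WordpieceTokenizer.tokenize`` can be illustrated with a small standalone sketch. It does not call MedCAT. The function name ``wordpiece_tokenize`` is hypothetical, and the handling of over-long or unmatched words (emitting ``unk_token``) follows the widely used BERT reference behaviour, which is assumed rather than confirmed to match this class. The default values mirror the constructor signature documented above.

```python
def wordpiece_tokenize(text, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    """Hypothetical standalone sketch of greedy longest-match-first WordPiece.

    Assumption: unmatched or over-long words map to ``unk_token``, as in the
    BERT reference tokenizer. This is not MedCAT's actual implementation.
    """
    output_tokens = []
    for word in text.split():
        # Words beyond the length limit are not split at all.
        if len(word) > max_input_chars_per_word:
            output_tokens.append(unk_token)
            continue

        pieces = []
        start = 0
        matched_all = True
        while start < len(word):
            end = len(word)
            current = None
            # Try the longest remaining substring first, then shrink from the right.
            while start < end:
                candidate = word[start:end]
                if start > 0:
                    # Non-initial pieces carry the "##" continuation prefix.
                    candidate = "##" + candidate
                if candidate in vocab:
                    current = candidate
                    break
                end -= 1
            if current is None:
                # No vocabulary entry covers this position.
                matched_all = False
                break
            pieces.append(current)
            start = end

        if matched_all:
            output_tokens.extend(pieces)
        else:
            output_tokens.append(unk_token)
    return output_tokens


vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Because the match is greedy, a word is split into the longest vocabulary prefix first, even when a different split would use fewer or more frequent pieces.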