medcat.preprocessing.tokenizers
Module Contents
Classes

WordpieceTokenizer — Runs WordPiece tokenization.
SpacyHFTok
SpacyHFDoc
TokenizerWrapperBPE
TokenizerWrapperBERT

Functions

spacy_extended
spacy_split_all
- medcat.preprocessing.tokenizers.spacy_extended(nlp)
- Parameters:
nlp (spacy.language.Language) –
- Return type:
spacy.tokenizer.Tokenizer
- medcat.preprocessing.tokenizers.spacy_split_all(nlp, config)
- Parameters:
nlp (spacy.language.Language) –
config (medcat.config.Config) –
- Return type:
spacy.tokenizer.Tokenizer
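Both factory functions above return a configured spacy.tokenizer.Tokenizer and need a live spaCy pipeline to run. The aggressive splitting behaviour a "split all" tokenizer aims for can be illustrated in plain Python; this is a hedged sketch of that behaviour, not medcat's actual spaCy rules:

```python
import re

def split_all(text):
    """Illustrative 'split everything' rule: alphanumeric runs stay
    together, and every other non-space character becomes its own
    token. A sketch only, not medcat's spaCy implementation."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

print(split_all("anti-CCP+ (RA)"))
# → ['anti', '-', 'CCP', '+', '(', 'RA', ')']
```

Splitting on all punctuation like this is useful for clinical text, where tokens such as "anti-CCP+" would otherwise hide concept mentions inside a single token.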
- class medcat.preprocessing.tokenizers.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)
Bases:
object
Runs WordPiece tokenization.
- Parameters:
vocab (Any) –
unk_token (str) –
max_input_chars_per_word (int) –
- __init__(vocab, unk_token='[UNK]', max_input_chars_per_word=200)
- Parameters:
vocab (Any) –
unk_token (str) –
max_input_chars_per_word (int) –
- Return type:
None
- tokenize(text)
Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
- Parameters:
text (str) – A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.
- Returns:
A list of wordpiece tokens.
- Return type:
List
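The greedy longest-match-first algorithm described above can be sketched in plain Python. The toy vocabulary below is an assumption for illustration only; the real tokenizer takes its vocabulary from the `vocab` argument:

```python
def wordpiece(token, vocab, unk_token="[UNK]", max_chars=200):
    """Greedy longest-match-first WordPiece sketch.
    Non-initial pieces carry the '##' continuation prefix."""
    if len(token) > max_chars:
        return [unk_token]
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        cur = None
        # Try the longest remaining substring first, shrinking from the
        # right until a vocabulary entry matches.
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:  # no piece matched: the whole token is unknown
            return [unk_token]
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary, an illustrative assumption
vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", vocab))
# → ['un', '##aff', '##able']
```

Note that a single unmatched character makes the whole token map to the unknown token, which is the standard WordPiece behaviour the class description implies.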
- class medcat.preprocessing.tokenizers.SpacyHFTok(w2v)
Bases:
object
- Parameters:
w2v (Any) –
- __init__(w2v)
- Parameters:
w2v (Any) –
- Return type:
None
- encode(text)
- Parameters:
text (str) –
- Return type:
- token_to_id(tok)
- Parameters:
tok (Any) –
- Return type:
Any
- class medcat.preprocessing.tokenizers.SpacyHFDoc(doc)
Bases:
object
- Parameters:
doc (spacy.tokens.Doc) –
- __init__(doc)
- Parameters:
doc (spacy.tokens.Doc) –
- Return type:
None
- class medcat.preprocessing.tokenizers.TokenizerWrapperBPE(hf_tokenizers)
Bases:
object
- Parameters:
hf_tokenizers (Any) –
- __init__(hf_tokenizers)
- Parameters:
hf_tokenizers (Any) –
- Return type:
None
- __call__(text)
- Parameters:
text (str) –
- Return type:
Dict
- save(dir_path, name='bbpe')
- classmethod load(dir_path, name='bbpe', **kwargs)
- class medcat.preprocessing.tokenizers.TokenizerWrapperBERT(hf_tokenizers=None)
Bases:
object
- __init__(hf_tokenizers=None)
- __call__(text)
- Parameters:
text (str) –
- Return type:
Dict
- save(dir_path, name='bert')
- Parameters:
dir_path (str) –
name (str) –
- Return type:
None
- classmethod load(dir_path, name='bert', **kwargs)
- Parameters:
dir_path (str) –
name (str) –
- Return type:
Any
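TokenizerWrapperBPE and TokenizerWrapperBERT share the same shape: hold an underlying Hugging Face tokenizer and expose `__call__(text) -> Dict` plus `save`/`load`. A minimal sketch of that callable-wrapper pattern, using a trivial whitespace tokenizer as a stand-in (both the stand-in and the dict keys here are illustrative assumptions, not medcat's actual output schema):

```python
class TokenizerWrapperSketch:
    """Hedged sketch of the wrapper pattern: delegate to an underlying
    tokenizer and return a Dict from __call__. The stand-in tokenizer
    and the dict keys are assumptions for illustration."""

    def __init__(self, hf_tokenizer=None):
        # Stand-in: a whitespace splitter instead of a real
        # Hugging Face tokenizer.
        self.hf_tokenizer = hf_tokenizer or (lambda text: text.split())

    def __call__(self, text):
        tokens = self.hf_tokenizer(text)
        return {
            "tokens": tokens,
            "input_ids": list(range(len(tokens))),  # placeholder ids
        }

out = TokenizerWrapperSketch()("cancer of the lung")
print(out["tokens"])
# → ['cancer', 'of', 'the', 'lung']
```

Keeping the wrapper's interface identical across BPE and BERT backends is what lets downstream medcat components swap tokenizers without code changes.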