medcat.preprocessing.tokenizers

Module Contents

Classes

WordpieceTokenizer

Runs WordPiece tokenization.

SpacyHFTok

SpacyHFDoc

TokenizerWrapperBPE

TokenizerWrapperBERT

Functions

spacy_extended(nlp)

spacy_split_all(nlp, config)

medcat.preprocessing.tokenizers.spacy_extended(nlp)
Parameters:

nlp (spacy.language.Language) –

Return type:

spacy.tokenizer.Tokenizer

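Example: a minimal sketch of swapping the returned tokenizer into a spaCy pipeline. The model name en_core_web_md is illustrative, not part of this API.

import spacy
from medcat.preprocessing.tokenizers import spacy_extended

nlp = spacy.load("en_core_web_md")  # any spaCy pipeline; the name is illustrative

# Replace the pipeline's default tokenizer with the extended one.
nlp.tokenizer = spacy_extended(nlp)

doc = nlp("Patient was prescribed 5mg of aspirin.")
print([t.text for t in doc])
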
medcat.preprocessing.tokenizers.spacy_split_all(nlp, config)
Parameters:
  • nlp (spacy.language.Language) –

  • config (medcat.config.Config) –

Return type:

spacy.tokenizer.Tokenizer

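Example: the same pattern as above, assuming config is a medcat.config.Config instance (which configuration fields the function reads is not shown here).

import spacy
from medcat.config import Config
from medcat.preprocessing.tokenizers import spacy_split_all

nlp = spacy.load("en_core_web_md")  # name is illustrative
config = Config()                   # assumed: a default MedCAT config

# Swap in the split-on-everything tokenizer.
nlp.tokenizer = spacy_split_all(nlp, config)

doc = nlp("HTN+T2DM, on metformin 500mg")
print([t.text for t in doc])
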
class medcat.preprocessing.tokenizers.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)

Bases: object

Runs WordPiece tokenization.

Parameters:
  • vocab (Any) –

  • unk_token (str) –

  • max_input_chars_per_word (int) –

__init__(vocab, unk_token='[UNK]', max_input_chars_per_word=200)
Parameters:
  • vocab (Any) –

  • unk_token (str) –

  • max_input_chars_per_word (int) –

Return type:

None

tokenize(text)

Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example:

input = "unaffable"
output = ["un", "##aff", "##able"]

Parameters:

text (str) – A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.

Returns:

List – A list of wordpiece tokens.

Return type:

List

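To make the greedy longest-match-first algorithm concrete, here is a self-contained toy sketch of the same idea (not the class above; the tiny vocabulary and the standard ## continuation prefix are assumptions for illustration):

def wordpiece(token, vocab, unk_token="[UNK]"):
    # Greedy longest-match-first: repeatedly take the longest prefix
    # found in the vocabulary; non-initial pieces are looked up with
    # the "##" continuation marker.
    pieces, start = [], 0
    while start < len(token):
        end, cur = len(token), None
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk_token]  # no piece matched: whole token is unknown
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}  # toy vocabulary for illustration
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
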
class medcat.preprocessing.tokenizers.SpacyHFTok(w2v)

Bases: object

Parameters:

w2v (Any) –

__init__(w2v)
Parameters:

w2v (Any) –

Return type:

None

encode(text)
Parameters:

text (str) –

Return type:

SpacyHFDoc

token_to_id(tok)
Parameters:

tok (Any) –

Return type:

Any

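Example: a hedged sketch. The w2v parameter is typed Any; a plain token-to-id mapping (in the spirit of gensim KeyedVectors) is assumed here purely for illustration.

from medcat.preprocessing.tokenizers import SpacyHFTok

w2v = {"aspirin": 0, "patient": 1}  # assumption: any token -> id mapping

tok = SpacyHFTok(w2v)
doc = tok.encode("patient was given aspirin")  # returns a SpacyHFDoc
print(tok.token_to_id("aspirin"))
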
class medcat.preprocessing.tokenizers.SpacyHFDoc(doc)

Bases: object

Parameters:

doc (spacy.tokens.Doc) –

__init__(doc)
Parameters:

doc (spacy.tokens.Doc) –

Return type:

None

class medcat.preprocessing.tokenizers.TokenizerWrapperBPE(hf_tokenizers)

Bases: object

Parameters:

hf_tokenizers (Any) –

__init__(hf_tokenizers)
Parameters:

hf_tokenizers (Any) –

Return type:

None

__call__(text)
Parameters:

text (str) –

Return type:

Dict

save(dir_path, name='bbpe')
Parameters:
  • dir_path (str) –

  • name (str) –

Return type:

None

classmethod load(dir_path, name='bbpe', **kwargs)
Parameters:
  • dir_path (str) –

  • name (str) –

Return type:

Any
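
Example: a hedged sketch. hf_tokenizers is typed Any; a Hugging Face tokenizers object such as ByteLevelBPETokenizer is one plausible input, and the exact keys of the returned Dict depend on the wrapper implementation.

from tokenizers import ByteLevelBPETokenizer
from medcat.preprocessing.tokenizers import TokenizerWrapperBPE

bpe = ByteLevelBPETokenizer()  # untrained here; load or train a real vocab in practice
wrapper = TokenizerWrapperBPE(bpe)

out = wrapper("Patient was given aspirin")  # returns a Dict
print(out.keys())

wrapper.save("/tmp/tok")  # stored under name='bbpe' by default
wrapper2 = TokenizerWrapperBPE.load("/tmp/tok")
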
class medcat.preprocessing.tokenizers.TokenizerWrapperBERT(hf_tokenizers=None)

Bases: object

__init__(hf_tokenizers=None)

__call__(text)
Parameters:

text (str) –

Return type:

Dict

save(dir_path, name='bert')
Parameters:
  • dir_path (str) –

  • name (str) –

Return type:

None

classmethod load(dir_path, name='bert', **kwargs)
Parameters:
  • dir_path (str) –

  • name (str) –

Return type:

Any
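
Example: a hedged sketch. hf_tokenizers is assumed to be a Hugging Face fast tokenizer; the checkpoint name below is illustrative, and the exact keys of the returned Dict depend on the wrapper implementation.

from transformers import BertTokenizerFast
from medcat.preprocessing.tokenizers import TokenizerWrapperBERT

hf_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")  # name is illustrative
wrapper = TokenizerWrapperBERT(hf_tok)

out = wrapper("Patient was given aspirin")  # returns a Dict
print(out.keys())

# Round-trip through save/load; files are stored under name='bert' by default.
wrapper.save("/tmp/bert_tok")
wrapper2 = TokenizerWrapperBERT.load("/tmp/bert_tok")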