medcat.utils.normalizers

Module Contents

Classes

BasicSpellChecker

TokenNormalizer

Will normalize all tokens in a spacy document.

Attributes

CONTAINS_NUMBER

medcat.utils.normalizers.CONTAINS_NUMBER
class medcat.utils.normalizers.BasicSpellChecker(cdb_vocab, config, data_vocab=None)

Bases: object

__init__(cdb_vocab, config, data_vocab=None)
P(word)

Probability of word.

Parameters:

word (str) – The word in question.

Returns:

float – The probability.

Return type:

float
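
A minimal usage sketch, assuming cdb_vocab is a plain word-to-count mapping and Config is the standard medcat.config.Config; the vocabulary and counts below are illustrative only:

    from medcat.config import Config
    from medcat.utils.normalizers import BasicSpellChecker

    # Toy vocabulary; in practice this comes from a MedCAT CDB / Vocab.
    cdb_vocab = {"kidney": 120, "disease": 300, "chronic": 95}
    config = Config()

    spell_checker = BasicSpellChecker(cdb_vocab=cdb_vocab, config=config)

    spell_checker.P("kidney")    # probability of the word, as a float
    "disease" in spell_checker   # membership test via __contains__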

__contains__(word)
fix(word)

Most probable spelling correction for word.

Parameters:

word (str) – The word.

Returns:

Optional[str] – Fixed word, or None if no fixes were applied.

Return type:

Optional[str]
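
Continuing the sketch above, fix can be used to correct a single misspelled token; the example words are illustrative:

    fixed = spell_checker.fix("kidny")
    if fixed is not None:
        print(fixed)                  # the most probable correction, e.g. "kidney"
    else:
        print("no correction found")  # None means no fix was applied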

candidates(word)

Generate possible spelling corrections for word.

Parameters:

word (str) – The word.

Returns:

Iterable[str] – The candidate words.

Return type:

Iterable[str]

known(words)

The subset of the given words that appear in the known vocabulary.

Parameters:

words (Iterable[str]) – The words.

Returns:

Set[str] – The subset of known words.

Return type:

Set[str]
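
candidates and known can also be called directly, for example to inspect why a particular correction was (or was not) chosen; again assuming the checker from the sketch above:

    list(spell_checker.candidates("diseaze"))  # e.g. ["disease"] if a one-edit fix is in the vocabulary
    spell_checker.known(["kidney", "kydnee"])  # {"kidney"} -- only in-vocabulary words survive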

edits1(word)

All edits that are one edit away from word.

Parameters:

word (str) – The word.

Returns:

Set[str] – The set of all one-edit variants.

Return type:

Set[str]

edits2(word)

All edits that are two edits away from word.

Parameters:

word (str) – The word to start from.

Returns:

Iterator[str] – All 2-away edits.

Return type:

Iterator[str]
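
The edits1 and edits2 helpers expose the raw edit-distance candidate generation (presumably Norvig-style deletes, transposes, replaces, and inserts; not confirmed by this page). Note that edits2 returns a lazy iterator rather than a set:

    one_away = spell_checker.edits1("dise")  # Set[str], every string one edit away
    two_away = spell_checker.edits2("dise")  # Iterator[str], generated lazily

    len(one_away)
    sum(1 for _ in two_away)                 # consume the iterator to count items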

edits3(word)

All edits that are three edits away from word.

class medcat.utils.normalizers.TokenNormalizer(config, spell_checker=None)

Bases: medcat.pipeline.pipe_runner.PipeRunner

Will normalize all tokens in a spacy document.

Parameters:
  • config

  • spell_checker

name = 'token_normalizer'
__init__(config, spell_checker=None)
__call__(doc)
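
In normal use this component is created and registered by MedCAT's own pipeline, which also sets up the custom token extensions it writes to; the sketch below only illustrates construction and the __call__ interface, assuming a blank English spaCy model and the spell checker from the earlier sketch:

    import spacy
    from medcat.config import Config
    from medcat.utils.normalizers import TokenNormalizer

    nlp = spacy.blank("en")
    config = Config()
    normalizer = TokenNormalizer(config=config, spell_checker=spell_checker)

    doc = nlp("Pateint has chronic kidny disease")
    doc = normalizer(doc)  # __call__ normalizes (and optionally spell-fixes) tokens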