medcat.utils.normalizers

Module Contents

Classes

BasicSpellChecker

TokenNormalizer

Normalizes all tokens in a spaCy document.

Functions

get_all_edits_n(word, use_diacritics, n[, return_ordered])

Get all N-th order edits of a word.

Attributes

CONTAINS_NUMBER

medcat.utils.normalizers.CONTAINS_NUMBER
class medcat.utils.normalizers.BasicSpellChecker(cdb_vocab, config, data_vocab=None)

Bases: object

__init__(cdb_vocab, config, data_vocab=None)
P(word)

Probability of word.

Parameters:

word (str) – The word in question.

Returns:

float – The probability.

Return type:

float
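As a rough illustration of what a word probability means here, the sketch below estimates it as a relative frequency. This is an assumption, not a description of how BasicSpellChecker computes P internally; the names word_probability and counts are hypothetical.

```python
from collections import Counter


def word_probability(word: str, counts: Counter) -> float:
    """Relative frequency of `word` among all counted words (hypothetical helper)."""
    total = sum(counts.values())
    if total == 0:
        # An empty vocabulary gives every word probability zero.
        return 0.0
    # Counter returns 0 for unseen words, so they get probability 0.0.
    return counts[word] / total


# Toy counts, invented purely for illustration.
counts = Counter({"patient": 6, "kidney": 3, "renal": 1})
print(word_probability("patient", counts))
print(word_probability("unseen", counts))
```

Under this assumption, unseen words have zero probability, which is why a spell checker of this kind only proposes corrections that already exist in its vocabulary.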

__contains__(word)
fix(word)

Most probable spelling correction for word.

Parameters:

word (str) – The word.

Returns:

Optional[str] – Fixed word, or None if no fixes were applied.

Return type:

Optional[str]

candidates(word)

Generate possible spelling corrections for word.

Parameters:

word (str) – The word.

Returns:

Iterable[str] – The list of candidate words.

Return type:

Iterable[str]
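The relationship between candidates, known and fix can be sketched in the style of Peter Norvig's well-known spelling corrector. This is a minimal sketch under stated assumptions: the fallback order, the free functions, the vocab dictionary and the toy_edits helper are all hypothetical and are not taken from the MedCAT source.

```python
from typing import Callable, Dict, Iterable, Optional, Set


def known(words: Iterable[str], vocab: Dict[str, int]) -> Set[str]:
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in vocab}


def candidates(word: str, vocab: Dict[str, int],
               edits_fn: Callable[[str], Set[str]]) -> Set[str]:
    """Prefer the word itself, then known one-edit variants, else the word unchanged."""
    return known([word], vocab) or known(edits_fn(word), vocab) or {word}


def fix(word: str, vocab: Dict[str, int],
        edits_fn: Callable[[str], Set[str]]) -> Optional[str]:
    """Return the most frequent candidate, or None when nothing changes."""
    best = max(candidates(word, vocab, edits_fn), key=lambda w: vocab.get(w, 0))
    return None if best == word else best


def toy_edits(word: str) -> Set[str]:
    """Stand-in edit generator: deletions and adjacent transpositions only."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    return deletes | transposes


# Toy vocabulary with made-up counts.
vocab = {"kidney": 5, "kidneys": 2}
print(fix("kindey", vocab, toy_edits))  # kidney
print(fix("kidney", vocab, toy_edits))  # None
```

Returning None for a word that is already known, or that has no known variants, mirrors the documented contract of fix: None means no fix was applied.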

known(words)

The subset of words that appear in the dictionary of WORDS.

Parameters:

words (Iterable[str]) – The words.

Returns:

Set[str] – The set of candidates.

Return type:

Set[str]

edits1(word)
Parameters:

word (str) – The word.

Return type:

Set[str]

classmethod get_edits1(word, use_diacritics)

All edits that are one edit away from word.

Parameters:
  • word (str) – The word.

  • use_diacritics (bool) – Whether to use diacritics or not.

Returns:

Set[str] – The set of all edits.

Return type:

Set[str]
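To make "one edit away" concrete, the following sketch generates deletions, transpositions, replacements and insertions, as in Norvig's classic spelling corrector. It is an illustration only: the function name one_edit_away and the DIACRITICS alphabet are assumptions, and the letters that get_edits1 actually uses when use_diacritics is set may differ.

```python
import string
from typing import Set

# Hypothetical extra alphabet; the real set of diacritics may differ.
DIACRITICS = "àáâãäåçèéêëìíîïñòóôõöùúûüý"


def one_edit_away(word: str, use_diacritics: bool = False) -> Set[str]:
    """All strings reachable from `word` with a single edit operation."""
    letters = string.ascii_lowercase + (DIACRITICS if use_diacritics else "")
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


edits = one_edit_away("cta")
print("cat" in edits)  # True, via transposing "t" and "a"
print("café" in one_edit_away("cafe", use_diacritics=True))  # True
```

Note that the result includes the original word, because replacing a letter with itself counts as a replacement in this sketch.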

edits2(word)

All edits that are two edits away from word.

Parameters:

word (str) – The word to start from.

Returns:

Iterator[str] – All 2-away edits.

Return type:

Iterator[str]

edits3(word)

All edits that are three edits away from word.

medcat.utils.normalizers.get_all_edits_n(word, use_diacritics, n, return_ordered=False)

Get all N-th order edits of a word.

The output can optionally be ordered. Ordering is useful when run-to-run reproducibility is a concern, but it should be avoided where possible because it adds overhead and limits the operations permitted on the returned value (e.g. for distance 1, in the unordered case you get a set).

Parameters:
  • word (str) – The original word.

  • use_diacritics (bool) – Whether or not to use diacritics.

  • n (int) – The number of edits to allow.

  • return_ordered (bool) – Whether to order the output. Defaults to False.

Raises:

ValueError – If the number of edits is smaller than 0.

Yields:

Iterator[str] – The generator of the various edits.

Return type:

Iterator[str]
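The idea of N-th order edits can be sketched by applying a one-edit generator repeatedly. The code below is a simplified stand-in, not the library implementation: edits_n and one_edit_away are hypothetical names, the alphabet is limited to ASCII lowercase, and the behaviour for n equal to 0 (yielding the word itself) is an assumption.

```python
import string
from typing import Iterator, Set


def one_edit_away(word: str, letters: str = string.ascii_lowercase) -> Set[str]:
    """Deletions, transpositions, replacements and insertions of one character."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits_n(word: str, n: int, ordered: bool = False) -> Iterator[str]:
    """Lazily yield every string reachable by applying n single edits in sequence."""
    if n < 0:
        raise ValueError("The number of edits must not be negative.")
    if n == 0:
        # Assumption: zero edits yields just the word itself.
        yield word
        return
    first = sorted(one_edit_away(word)) if ordered else one_edit_away(word)
    if n == 1:
        yield from first
        return
    for edited in first:
        # Recurse on each one-edit variant; duplicates are not removed.
        yield from edits_n(edited, n - 1, ordered)


# Sorting each level makes the output order reproducible between runs.
print(list(edits_n("ab", 1, ordered=True))[:3])
print("abcd" in set(edits_n("ab", 2)))  # True, via two insertions
```

Because this is a generator, the ValueError for a negative n surfaces only once iteration begins. Sorting every level is what makes the ordered variant slower, which matches the overhead trade-off described above.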

class medcat.utils.normalizers.TokenNormalizer(config, spell_checker=None)

Bases: medcat.pipeline.pipe_runner.PipeRunner

Normalizes all tokens in a spaCy document.

Parameters:
  • config

  • spell_checker

name = 'token_normalizer'
__init__(config, spell_checker=None)
__call__(doc)