medcat.utils.normalizers
Module Contents
Classes
- BasicSpellChecker
- TokenNormalizer: Will normalize all tokens in a spacy document.
Functions
- get_all_edits_n(word, use_diacritics, n, return_ordered=False): Get all N-th order edits of a word.
Attributes
- medcat.utils.normalizers.CONTAINS_NUMBER
- class medcat.utils.normalizers.BasicSpellChecker(cdb_vocab, config, data_vocab=None)
Bases: object
- __init__(cdb_vocab, config, data_vocab=None)
- P(word)
Probability of word.
- Parameters:
word (str) – The word in question.
- Returns:
float – The probability.
- Return type:
float
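As an illustration of what a word probability of this kind typically means, here is a minimal sketch that computes a relative frequency from a table of word counts. The helper `word_probability` and the `toy_counts` table are hypothetical; how `BasicSpellChecker` actually derives counts from `cdb_vocab` and `data_vocab` is not described on this page.

```python
from collections import Counter


def word_probability(word: str, counts: Counter) -> float:
    """Relative frequency of `word` in a count table (hypothetical helper)."""
    total = sum(counts.values())
    if total == 0:
        # An empty table gives every word zero probability.
        return 0.0
    # Counter returns 0 for missing keys, so unseen words get 0.0.
    return counts[word] / total


# Toy counts, invented for illustration only.
toy_counts = Counter({"fever": 6, "cough": 3, "rash": 1})
print(word_probability("fever", toy_counts))
print(word_probability("unknown", toy_counts))
```

Under this reading, a more frequent word is a more probable correction, which is what lets `fix` rank competing candidates.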
- __contains__(word)
- fix(word)
Most probable spelling correction for word.
- Parameters:
word (str) – The word.
- Returns:
Optional[str] – Fixed word, or None if no fixes were applied.
- Return type:
Optional[str]
- candidates(word)
Generate possible spelling corrections for word.
- Parameters:
word (str) – The word.
- Returns:
Iterable[str] – The list of candidate words.
- Return type:
Iterable[str]
- known(words)
The subset of words that appear in the dictionary of WORDS.
- Parameters:
words (Iterable[str]) – The words.
- Returns:
Set[str] – The set of candidates.
- Return type:
Set[str]
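The relationship between `known`, `candidates`, and `fix` can be sketched with plain functions: generate edits, keep those found in the vocabulary, then pick the most frequent one. Everything below is an assumption made for illustration, including the names `known_words`, `fix_word`, and `deletions`, the toy vocabulary, and the choice to return None for a word that is already known. It is not the MedCAT implementation.

```python
from typing import Callable, Dict, Iterable, Optional, Set


def deletions(word: str) -> Set[str]:
    """Toy edit generator: every string obtained by deleting one character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def known_words(words: Iterable[str], vocab: Dict[str, int]) -> Set[str]:
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in vocab}


def fix_word(
    word: str,
    vocab: Dict[str, int],
    gen_edits: Callable[[str], Iterable[str]] = deletions,
) -> Optional[str]:
    """Return the most frequent known edit of `word`, or None if nothing applies."""
    if word in vocab:
        # Assumption: an already-known word needs no fix.
        return None
    candidates = known_words(gen_edits(word), vocab)
    if not candidates:
        return None
    # Rank candidates by their count in the vocabulary.
    return max(candidates, key=lambda w: vocab[w])


toy_vocab = {"fever": 6, "fewer": 2}
print(fix_word("feever", toy_vocab))  # a deletion recovers "fever"
print(fix_word("fever", toy_vocab))   # already known, so None
```

A real checker would use a richer edit generator than single deletions; the point here is only the filter-then-rank flow.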
- edits1(word)
- Parameters:
word (str) –
- Return type:
Set[str]
- classmethod get_edits1(word, use_diacritics)
All edits that are one edit away from word.
- Parameters:
word (str) – The word.
use_diacritics (bool) – Whether to use diacritics or not.
- Returns:
Set[str] – The set of all edits.
- Return type:
Set[str]
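For readers unfamiliar with the idea, a distance-one edit set is commonly built from four operations: deletions, transpositions of adjacent characters, replacements, and insertions. The sketch below follows that common recipe. The function name `edits1_sketch` and the `EXTRA_LETTERS` alphabet are hypothetical; the letters that `get_edits1` adds when `use_diacritics` is true are not listed on this page.

```python
import string
from typing import Set

# Hypothetical accented letters, chosen only for this example.
EXTRA_LETTERS = "áéíóú"


def edits1_sketch(word: str, use_diacritics: bool = False) -> Set[str]:
    """All strings one edit away from `word` (illustrative sketch)."""
    letters = string.ascii_lowercase
    if use_diacritics:
        letters += EXTRA_LETTERS
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


print("fever" in edits1_sketch("fevre"))                      # transposition
print("café" in edits1_sketch("cafe", use_diacritics=True))   # replacement
```

Returning a set removes the duplicates that different operations produce for the same string.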
- edits2(word)
All edits that are two edits away from word.
- Parameters:
word (str) – The word to start from.
- Returns:
Iterator[str] – All 2-away edits.
- Return type:
Iterator[str]
- edits3(word)
All edits that are three edits away from word.
- medcat.utils.normalizers.get_all_edits_n(word, use_diacritics, n, return_ordered=False)
Get all N-th order edits of a word.
The output can optionally be ordered. This is useful when run-to-run reproducibility is a concern. By default, ordering should be avoided where possible, since it adds overhead and limits the operations permitted on the returned value (e.g., for distance 1, the unordered case gives you a set).
- Parameters:
word (str) – The original word.
use_diacritics (bool) – Whether or not to use diacritics.
n (int) – The number of edits to allow.
return_ordered (bool) – Whether to order the output. Defaults to False.
- Raises:
ValueError – If the number of edits is smaller than 0.
- Yields:
Iterator[str] – The generator of the various edits.
- Return type:
Iterator[str]
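The idea of N-th order edits can be sketched by applying a one-edit function repeatedly. The example below is a simplified stand-in, not the library code: `edits_n_sketch` and the deletion-only `deletions` helper are hypothetical, the sketch builds each level eagerly rather than lazily, and yielding the word itself for `n == 0` is an assumption.

```python
from typing import Callable, Iterable, Iterator, Set


def deletions(word: str) -> Set[str]:
    """Toy one-edit function: delete a single character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def edits_n_sketch(
    word: str,
    n: int,
    one_edit: Callable[[str], Iterable[str]] = deletions,
    return_ordered: bool = False,
) -> Iterator[str]:
    """Yield strings reachable by applying `one_edit` exactly `n` times."""
    if n < 0:
        raise ValueError(f"Number of edits must not be negative, got {n}")
    current: Iterable[str] = [word]
    for _ in range(n):
        # Expand every string of the previous level by one more edit.
        current = {e for w in current for e in one_edit(w)}
    if return_ordered:
        # Sorting makes the output reproducible across runs, at extra cost.
        current = sorted(current)
    yield from current


print(list(edits_n_sketch("abc", 1, return_ordered=True)))
print(list(edits_n_sketch("abc", 2, return_ordered=True)))
```

Because this is a generator, the `ValueError` for a negative `n` surfaces when iteration starts, not when the function is called.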
- class medcat.utils.normalizers.TokenNormalizer(config, spell_checker=None)
Bases: medcat.pipeline.pipe_runner.PipeRunner
Will normalize all tokens in a spacy document.
- Parameters:
config –
spell_checker –
- name = 'token_normalizer'
- __init__(config, spell_checker=None)
- __call__(doc)