:py:mod:`medcat.utils.normalizers`
==================================

.. py:module:: medcat.utils.normalizers


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.normalizers.BasicSpellChecker
   medcat.utils.normalizers.TokenNormalizer


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.utils.normalizers.get_all_edits_n


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.utils.normalizers.CONTAINS_NUMBER


.. py:data:: CONTAINS_NUMBER


.. py:class:: BasicSpellChecker(cdb_vocab, config, data_vocab=None)

   Bases: :py:obj:`object`

   .. py:method:: __init__(cdb_vocab, config, data_vocab=None)


   .. py:method:: P(word)

      Probability of `word`.

      :param word: The word in question.
      :type word: str

      :Returns: **float** -- The probability.


   .. py:method:: __contains__(word)


   .. py:method:: fix(word)

      Most probable spelling correction for `word`.

      :param word: The word.
      :type word: str

      :Returns: **Optional[str]** -- Fixed word, or None if no fixes were applied.


   .. py:method:: candidates(word)

      Generate possible spelling corrections for `word`.

      :param word: The word.
      :type word: str

      :Returns: **Iterable[str]** -- The candidate words.


   .. py:method:: known(words)

      The subset of `words` that appear in the dictionary of WORDS.

      :param words: The words.
      :type words: Iterable[str]

      :Returns: **Set[str]** -- The set of candidates.


   .. py:method:: edits1(word)


   .. py:method:: get_edits1(word, use_diacritics)
      :classmethod:

      All edits that are one edit away from `word`.

      :param word: The word.
      :type word: str
      :param use_diacritics: Whether to use diacritics or not.
      :type use_diacritics: bool

      :Returns: **Set[str]** -- The set of all edits.


   .. py:method:: edits2(word)

      All edits that are two edits away from `word`.

      :param word: The word to start from.
      :type word: str

      :Returns: **Iterator[str]** -- All 2-away edits.


   .. py:method:: edits3(word)

      All edits that are three edits away from `word`.


.. py:function:: get_all_edits_n(word, use_diacritics, n, return_ordered = False)

   Get all N-th order edits of a word.

   The output can optionally be ordered, which is useful when run-to-run
   consistency is a concern. By default, ordering should be avoided where
   possible, since it adds overhead and limits the operations permitted on
   the returned value (e.g. for distance 1, the unordered case gives you a set).

   :param word: The original word.
   :type word: str
   :param use_diacritics: Whether or not to use diacritics.
   :type use_diacritics: bool
   :param n: The number of edits to allow.
   :type n: int
   :param return_ordered: Whether to order the output. Defaults to False.
   :type return_ordered: bool

   :raises ValueError: If the number of edits is smaller than 0.

   :Yields: *Iterator[str]* -- The generator of the various edits.


.. py:class:: TokenNormalizer(config, spell_checker=None)

   Bases: :py:obj:`medcat.pipeline.pipe_runner.PipeRunner`

   Normalizes all tokens in a spaCy document.

   :param config:
   :param spell_checker:

   .. py:attribute:: name
      :value: 'token_normalizer'


   .. py:method:: __init__(config, spell_checker=None)


   .. py:method:: __call__(doc)
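
To illustrate what "N-th order edits" means for ``get_edits1``, ``edits2`` and ``get_all_edits_n``, here is a minimal standalone sketch. It does not import MedCAT and is not its implementation. The names ``sketch_edits1`` and ``sketch_edits_n`` are hypothetical, and the sketch assumes a plain lowercase ASCII alphabet (roughly the ``use_diacritics=False`` case) with no output ordering.

```python
import string
from typing import Iterator, Set


def sketch_edits1(word: str, letters: str = string.ascii_lowercase) -> Set[str]:
    """Hypothetical helper: every string exactly one edit away from `word`.

    An edit is a deletion, a transposition of two adjacent characters,
    a replacement, or an insertion. The alphabet is an assumption.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + char + right[1:]
        for left, right in splits
        if right
        for char in letters
    ]
    inserts = [left + char + right for left, right in splits for char in letters]
    return set(deletes + transposes + replaces + inserts)


def sketch_edits_n(word: str, n: int) -> Iterator[str]:
    """Hypothetical helper: lazily yield strings reachable by n chained edits.

    Duplicates may be yielded, and the order is not stable between runs
    because it iterates over sets.
    """
    if n < 0:
        raise ValueError(f"Number of edits must be non-negative, got {n}")
    if n == 0:
        yield word
        return
    for edit in sketch_edits1(word):
        yield from sketch_edits_n(edit, n - 1)
```

Because ``sketch_edits_n`` is a generator, the ``ValueError`` is raised only once iteration starts, and the number of candidates grows quickly with ``n``. That growth is why consuming the edits lazily, and skipping any ordering step, keeps the overhead down.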