:py:mod:`medcat.config`
=======================

.. py:module:: medcat.config


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.config.FakeDict
   medcat.config.ValueExtractor
   medcat.config.MixingConfig
   medcat.config.VersionInfo
   medcat.config.CDBMaker
   medcat.config.AnnotationOutput
   medcat.config.CheckPoint
   medcat.config.UsageMonitor
   medcat.config.General
   medcat.config.Preprocessing
   medcat.config.Ner
   medcat.config._DefPartial
   medcat.config.LinkingFilters
   medcat.config.Linking
   medcat.config.Config


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.config.workers
   medcat.config._set_value_or_alt
   medcat.config._wrapper


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.config.logger
   medcat.config._EMPTY_DICT_2_EMPTY_SET
   medcat.config._DEFAULT_EXTRACTOR
   medcat.config._DEFAULT_PARTIAL
   medcat.config._waf_advice


.. py:data:: logger

   
.. py:function:: workers(workers_override = None)


.. py:class:: FakeDict


   FakeDict that allows the use of the ``__getitem__`` and ``__setitem__`` methods for legacy access.

   .. py:method:: __getitem__(arg)


   .. py:method:: __setattr__(arg, val)

      Implement setattr(self, name, value).


   .. py:method:: __setitem__(arg, val)


   .. py:method:: get(key, default=None)
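
   The legacy access pattern can be illustrated with a minimal sketch. ``FakeDictSketch`` is a
   hypothetical stand-in written for this example, not MedCAT's implementation; it only assumes
   that item access is forwarded to attribute access.

   .. code-block:: python

      class FakeDictSketch:
          """Hypothetical stand-in: forwards dict-style access to attributes."""

          def __getitem__(self, arg):
              try:
                  return getattr(self, arg)
              except AttributeError as err:
                  # Mimic dict behaviour for unknown keys
                  raise KeyError(arg) from err

          def __setitem__(self, arg, val):
              setattr(self, arg, val)

          def get(self, key, default=None):
              try:
                  return self[key]
              except KeyError:
                  return default


      conf = FakeDictSketch()
      conf["min_name_len"] = 3             # legacy, dict-style write
      legacy_read = conf["min_name_len"]   # legacy, dict-style read
      attr_read = conf.min_name_len        # attribute-style read of the same value
      missing = conf.get("not_set", "fallback")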


.. py:data:: _EMPTY_DICT_2_EMPTY_SET
   :type: Callable[[str, Any], Optional[set]]

   
.. py:class:: ValueExtractor(alt_generators = [_EMPTY_DICT_2_EMPTY_SET])


   The current example config has a value for an empty set as '{}'.
   However, that evaluates to an empty dictionary instead.
   In case there are other such examples, this allows adding other alternatives as well.

   .. py:method:: __init__(alt_generators = [_EMPTY_DICT_2_EMPTY_SET])


   .. py:method:: extract(rhs)

      Extracts value and its alternatives based on the alternative generators defined.

      :param rhs: The parsable right hand side
      :type rhs: str

      :Returns: **Tuple[str, List[str]]** -- The main value and the (potentially many) alternatives
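
      The idea of alternative generators can be sketched as follows. The function names are
      hypothetical, and ``ast.literal_eval`` is swapped in here for evaluating the right-hand
      side; this is an illustration of the ``'{}'`` case described above, not the library's code.

      .. code-block:: python

         import ast
         from typing import Any, List, Optional, Tuple


         def empty_dict_to_empty_set(rhs: str, value: Any) -> Optional[set]:
             # '{}' parses as an empty dict; offer an empty set as an alternative
             return set() if rhs.strip() == "{}" and value == {} else None


         def extract_sketch(rhs: str,
                            alt_generators=(empty_dict_to_empty_set,)) -> Tuple[Any, List[Any]]:
             value = ast.literal_eval(rhs)
             alternatives = []
             for generator in alt_generators:
                 alt = generator(rhs, value)
                 if alt is not None:
                     alternatives.append(alt)
             return value, alternatives


         main_value, alts = extract_sketch("{}")
         plain_value, no_alts = extract_sketch("0.25")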


.. py:data:: _DEFAULT_EXTRACTOR

   
.. py:function:: _set_value_or_alt(conf, key, value, alt_values, err = None)


.. py:class:: MixingConfig


   Bases: :py:obj:`FakeDict`

   Config that can be saved to and loaded from disk, as well as mixed with other configs.
   It is not intended to be initialised directly, and it is assumed that instances also inherit from
   pydantic's BaseModel.

   .. py:method:: save(save_path)

      Save the config into a .json file

      :param save_path: Where to save the created json file
      :type save_path: str


   .. py:method:: merge_config(config_dict)

      Merge a config_dict with the existing config object.

      :param config_dict: A dictionary whose key/value pairs should be added to this config.
      :type config_dict: Dict


   .. py:method:: parse_config_file(path, extractor = _DEFAULT_EXTRACTOR)

      Parses a configuration file in text format. Must be like:
              cat.<variable>.<key> = <value>
              ...

          - variable: linking, general, ner, ...
          - key: a key in the config dict e.g. subsample_after for linking
          - value: the value for the key, will be parsed with `eval`

      :param path: the path to the config file
      :type path: str
      :param extractor: (Default value = _DEFAULT_EXTRACTOR)
      :type extractor: ValueExtractor

      :raises ValueError: In case of unknown attribute.
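
      The line format above can be illustrated with a small parser for a single line. This is a
      sketch only: ``parse_line_sketch`` is a hypothetical helper, and it uses
      ``ast.literal_eval`` as a stand-in for the ``eval`` call mentioned above.

      .. code-block:: python

         import ast


         def parse_line_sketch(line: str):
             """Split 'cat.<variable>.<key> = <value>' into its three parts."""
             lhs, rhs = line.split("=", 1)
             parts = lhs.strip().split(".")
             if len(parts) != 3 or parts[0] != "cat":
                 raise ValueError(f"Unexpected config line: {line!r}")
             _, variable, key = parts
             return variable, key, ast.literal_eval(rhs.strip())


         parsed = parse_line_sketch("cat.linking.subsample_after = 30000")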


   .. py:method:: rebuild_re()


   .. py:method:: _calc_hash(hasher = None)


   .. py:method:: get_hash(hasher = None)


   .. py:method:: __str__()

      Return str(self).


   .. py:method:: load(save_path)
      :classmethod:

      Load config from a JSON file. Note that fields that
      did not exist in the old config but do exist in the current
      version of the ConfigMetaCAT class will be kept.

      :param save_path: Path to the json file to load
      :type save_path: str

      :Returns: **MixingConfig** -- The loaded config


   .. py:method:: from_dict(config_dict)
      :classmethod:

      Generate a MixingConfig (of an extending type) from a dictionary.

      :param config_dict: The dictionary to create the config from
      :type config_dict: Dict

      :Returns: **MixingConfig** -- The resulting config


   .. py:method:: asdict()

      Get the config as a dictionary.

      :Returns: **Dict[str, Any]** -- The dictionary associated with this config


   .. py:method:: fields()

      Get the fields associated with this config.

      :Returns: **dict** -- The dictionary of the field names and fields


.. py:class:: VersionInfo(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The version info part of the config

   .. py:class:: Config


      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: history
      :type: list
      :value: []

      Populated automatically

   .. py:attribute:: meta_cats
      :type: Any

      Populated automatically

   .. py:attribute:: cdb_info
      :type: dict

      Populated automatically, output from cdb.print_stats

   .. py:attribute:: performance
      :type: dict

      NER general performance, meta should be:
      ``{'meta': {'model_name': {'f1': <>, 'p': <>, ...}, ...}}``

   .. py:attribute:: description
      :type: str
      :value: 'No description'

      General description and what it was trained on

   .. py:attribute:: id
      :type: Any

      Will be: hash of most things

   .. py:attribute:: last_modified
      :type: Optional[Union[int, datetime.datetime, str]]

      
   .. py:attribute:: location
      :type: Optional[str]

      Path, URL, or other reference to where this CDB is located

   .. py:attribute:: ontology
      :type: Optional[Union[str, List[str]]]

      What was used to build the CDB, e.g. SNOMED_202009

   .. py:attribute:: medcat_version
      :type: Optional[str]

      Which version of medcat was used to build the CDB


.. py:class:: CDBMaker(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The Context Database (CDB) making part of the config

   .. py:class:: Config


      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: name_versions
      :type: list
      :value: ['LOWER', 'CLEAN']

      Name versions to be generated.

   .. py:attribute:: multi_separator
      :type: str
      :value: '|'

      If multiple names or type_ids for a concept are present in one row of a CSV, they are separated
      by this character.

   .. py:attribute:: remove_parenthesis
      :type: int
      :value: 5

      Should preferred names with parentheses be cleaned? 0 means no; otherwise names are cleaned if longer than or equal to this value,
      e.g. Head (Body part) -> Head

   .. py:attribute:: min_letters_required
      :type: int
      :value: 2

      Minimum number of letters required in a name to be accepted for a concept


.. py:class:: AnnotationOutput(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The annotation output part of the config

   .. py:class:: Config


      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: doc_extended_info
      :type: bool
      :value: False

      
   .. py:attribute:: context_left
      :type: int

      
   .. py:attribute:: context_right
      :type: int

      
   .. py:attribute:: lowercase_context
      :type: bool
      :value: True

      
   .. py:attribute:: include_text_in_output
      :type: bool
      :value: False

      
.. py:class:: CheckPoint(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The checkpoint part of the config

   .. py:class:: Config


      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: output_dir
      :type: str
      :value: 'checkpoints'

      When doing training this is the name of the directory where checkpoints will be saved

   .. py:attribute:: steps
      :type: Optional[int]

      When training, how often to save a checkpoint (one step represents one document). If None, no checkpoints will be created.

   .. py:attribute:: max_to_keep
      :type: int
      :value: 1

      When training, the maximum number of checkpoints that will be kept on disk
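
   How ``steps`` and ``max_to_keep`` interact can be sketched with a small simulation. The
   function and the checkpoint names below are hypothetical and only mirror the descriptions
   above (one step per document, oldest checkpoints dropped first is an assumption).

   .. code-block:: python

      from collections import deque
      from typing import List, Optional


      def simulate_checkpoints(num_docs: int, steps: Optional[int],
                               max_to_keep: int) -> List[str]:
          if steps is None:
              return []  # no checkpoints are created
          kept = deque(maxlen=max_to_keep)  # older entries fall off automatically
          for doc_index in range(1, num_docs + 1):
              if doc_index % steps == 0:
                  kept.append(f"checkpoint-{doc_index}")
          return list(kept)


      kept_two = simulate_checkpoints(num_docs=10, steps=3, max_to_keep=2)
      kept_none = simulate_checkpoints(num_docs=10, steps=None, max_to_keep=2)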


.. py:class:: UsageMonitor(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   Config that can be saved to and loaded from disk, as well as mixed with other configs.
   It is not intended to be initialised directly, and it is assumed that instances also inherit from
   pydantic's BaseModel.

   .. py:attribute:: enabled
      :type: Literal[True, False, 'auto']
      :value: False

      Whether usage monitoring is enabled (True), disabled (False), or automatic ('auto').

      If set to False, no logging is performed.
      If set to True, logs are saved in the location specified by `log_folder`.
      If set to 'auto', logs will be automatically enabled or disabled based on
      an environment variable (`MEDCAT_LOGS` - setting it to False or 0 disables logging)
      and distributed according to the OS preferred logs location (`MEDCAT_LOGS_LOCATION`).
      The defaults for the location are:
       - For Linux: ~/.local/share/medcat/logs/
       - For Windows: C:\Users\%USERNAME%\.cache\medcat\logs\
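
      The three settings can be sketched as a small resolver. ``usage_logging_enabled`` is a
      hypothetical helper; in particular, treating an unset ``MEDCAT_LOGS`` as "enabled" and
      comparing case-insensitively are assumptions of this sketch, not documented behaviour.

      .. code-block:: python

         from typing import Mapping, Union


         def usage_logging_enabled(enabled: Union[bool, str],
                                   environ: Mapping[str, str]) -> bool:
             if enabled == "auto":
                 # Assumption: only an explicit False/0 switches logging off
                 value = environ.get("MEDCAT_LOGS", "").strip().lower()
                 return value not in ("false", "0")
             return bool(enabled)


         explicit_off = usage_logging_enabled(False, {})
         auto_disabled = usage_logging_enabled("auto", {"MEDCAT_LOGS": "0"})
         auto_default = usage_logging_enabled("auto", {})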

   .. py:attribute:: batch_size
      :type: int
      :value: 100

      Number of logged events to write at once.

   .. py:attribute:: file_prefix
      :type: str
      :value: 'usage_'

      The prefix for logged files. The suffix will be the model hash.

   .. py:attribute:: log_folder
      :type: str
      :value: '.'

      The folder which contains the usage logs. In certain situations,
      it may make sense to keep this separate from the overall logs.

      NOTE: Does not take effect if `enabled` is set to 'auto'


.. py:class:: General(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The general part of the config

   .. py:class:: Config


      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: spacy_disabled_components
      :type: list
      :value: ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler',...

      The list of spacy components that will be disabled.

      NB! For these changes to take effect, the pipe would need to be recreated.

   .. py:attribute:: checkpoint
      :type: CheckPoint

      Checkpointing config

   .. py:attribute:: usage_monitor
      :type: UsageMonitor

      

   .. py:attribute:: log_level
      :type: int

      Logging config for everything | 'tagger' can be disabled, but will cause a drop in performance

   .. py:attribute:: log_format
      :type: str
      :value: '%(levelname)s:%(name)s: %(message)s'

      
   .. py:attribute:: log_path
      :type: str
      :value: './medcat.log'

      
   .. py:attribute:: spacy_model
      :type: str
      :value: 'en_core_web_md'

      What model will be used for tokenization

   .. py:attribute:: separator
      :type: str
      :value: '~'

      Separator that will be used to merge tokens of a name. Once a CDB is built this should
      always stay the same.

   .. py:attribute:: spell_check
      :type: bool
      :value: True

      Should we check spelling - note that this makes things much slower, use only if necessary. The only thing necessary
      for the spell checker to work is vocab.dat and cdb.dat built with concepts in the respective language.

   .. py:attribute:: diacritics
      :type: bool
      :value: False

      Should we process diacritics - for languages other than English, symbols such as 'é, ë, ö' can be relevant.
      Note that this makes spell_check slower.

   .. py:attribute:: spell_check_deep
      :type: bool
      :value: False

      If True the spell checker will try harder to find mistakes, this can slow down
      things drastically.

   .. py:attribute:: spell_check_len_limit
      :type: int
      :value: 7

      Spelling will not be checked for words with length less than this

   .. py:attribute:: show_nested_entities
      :type: bool
      :value: False

      If set to True functions like get_entities and get_json will return nested_entities and overlaps

   .. py:attribute:: full_unlink
      :type: bool
      :value: False

      When unlinking a name from a concept should we do full_unlink (means unlink a name from all concepts, not just the one in question)

   .. py:attribute:: workers
      :type: int

      Number of workers used by a parallelizable pipeline component

   .. py:attribute:: make_pretty_labels
      :type: Optional[str]

      Should the labels of entities (shown in displacy) be pretty or just 'concept'. This slows down the annotation pipeline and
      should not be used when annotating millions of documents. If `None` it will be the string "concept", if `short` it will be CUI,
      if `long` it will be CUI | Name | Confidence
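
      The three label styles can be sketched as below. ``make_label`` is a hypothetical helper,
      and the two-decimal confidence formatting is an assumption made for the example.

      .. code-block:: python

         from typing import Optional


         def make_label(mode: Optional[str], cui: str, name: str, confidence: float) -> str:
             if mode == "short":
                 return cui
             if mode == "long":
                 return f"{cui} | {name} | {confidence:.2f}"
             return "concept"  # mode is None


         plain = make_label(None, "C0018787", "Heart", 0.87)
         short = make_label("short", "C0018787", "Heart", 0.87)
         long_label = make_label("long", "C0018787", "Heart", 0.87)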

   .. py:attribute:: map_cui_to_group
      :type: bool
      :value: False

      If the cdb.addl_info['cui2group'] is provided and this option enabled, each CUI will be mapped to the group

   .. py:attribute:: simple_hash
      :type: bool
      :value: False

      Whether to use a simple hash.

      NOTE: While using a simple hash is faster at save time, it is less
      reliable due to not taking into account all the details of the changes.


.. py:class:: Preprocessing(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The preprocessing part of the config

   .. py:class:: Config


      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: words_to_skip
      :type: set

      These words will be completely ignored from concepts and from the text (must be a Set)

   .. py:attribute:: keep_punct
      :type: set

      All punct will be skipped by default, here you can set what will be kept

   .. py:attribute:: do_not_normalize
      :type: set

      Should specific word types be normalized: e.g. running -> run
      Values are detailed part-of-speech tags. See:

      - https://spacy.io/usage/linguistic-features#pos-tagging
      - Label scheme section per model at https://spacy.io/models/en

   .. py:attribute:: skip_stopwords
      :type: bool
      :value: False

      Should stopwords be skipped/ignored when processing input

   .. py:attribute:: min_len_normalize
      :type: int
      :value: 5

      Nothing below this length will ever be normalized (input tokens or concept names), normalized means lemmatized in this case

   .. py:attribute:: stopwords
      :type: Optional[set]

      If None the default set of stopwords from spacy will be used. This must be a Set.

      NB! For these changes to take effect, the pipe would need to be recreated.

   .. py:attribute:: max_document_length
      :type: int
      :value: 1000000

      Documents longer than this will be trimmed.

      NB! For these changes to take effect, the pipe would need to be recreated.


.. py:class:: Ner(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The NER part of the config

   .. py:class:: Config


      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: min_name_len
      :type: int
      :value: 3

      Do not detect names below this limit, skip them

   .. py:attribute:: max_skip_tokens
      :type: int
      :value: 2

      When checking tokens for concepts you can have skipped tokens between
      used ones (usually spaces, new lines etc). This number sets how many skipped tokens are allowed.

   .. py:attribute:: check_upper_case_names
      :type: bool
      :value: False

      Check uppercase to distinguish uppercase and lowercase words that have a different meaning.

   .. py:attribute:: upper_case_limit_len
      :type: int
      :value: 4

      Any name shorter than this must be uppercase in the text to be considered. If it is not uppercase
      it will be skipped.
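
      The rule can be sketched as a simple predicate. ``passes_upper_case_check`` is a
      hypothetical helper, and measuring length in characters is an assumption of this sketch.

      .. code-block:: python

         def passes_upper_case_check(text_span: str, upper_case_limit_len: int = 4) -> bool:
             # Short names are only considered when written in uppercase
             if len(text_span) < upper_case_limit_len:
                 return text_span.isupper()
             return True


         short_upper = passes_upper_case_check("MS")      # short and uppercase
         short_lower = passes_upper_case_check("ms")      # short and lowercase
         long_lower = passes_upper_case_check("stroke")   # long enough, case ignored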

   .. py:attribute:: try_reverse_word_order
      :type: bool
      :value: False

      Try reverse word order for short concepts (2 words max), e.g. heart disease -> disease heart


.. py:class:: _DefPartial


   This is a helper class to make it possible to check equality of two default Linking instances

   .. py:method:: __init__()


   .. py:method:: __call__(*args, **kwargs)


   .. py:method:: __eq__(other)

      Return self==value.


.. py:data:: _DEFAULT_PARTIAL

   
.. py:class:: LinkingFilters(**data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   These describe the linking filters used alongside the model.

   When no CUIs nor excluded CUIs are specified (the sets are empty),
   all CUIs are accepted.
   If there are CUIs specified then only those will be accepted.
   If there are excluded CUIs specified, they are excluded.

   In some cases, there are extra filters as well as MedCATtrainer (MCT) export filters.
   These are expected to satisfy:
   extra_cui_filter ⊆ MCT filter ⊆ Model/config filter

   While any other CUIs can be included in the extra CUI filter or the MCT filter,
   they would not have any real effect.

   .. py:attribute:: cuis
      :type: Set[str]

      
   .. py:attribute:: cuis_exclude
      :type: Set[str]

      
   .. py:method:: __init__(**data)

      Create a new model by parsing and validating input data from keyword arguments.

      Raises ``ValidationError`` (``pydantic_core.ValidationError``) if the input data cannot be
      validated to form a valid model.

      `self` is explicitly positional-only to allow `self` as a field name.


   .. py:method:: check_filters(cui)

      Checks whether a CUI is in the filters

      :param cui: The CUI in question
      :type cui: str

      :Returns: **bool** -- True if the CUI is allowed
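
      The acceptance rules described for this class can be sketched as a standalone function.
      ``check_filters_sketch`` is a hypothetical helper that takes the two sets explicitly; it
      follows the class description above rather than the library's source.

      .. code-block:: python

         from typing import Set


         def check_filters_sketch(cui: str, cuis: Set[str], cuis_exclude: Set[str]) -> bool:
             # A non-empty inclusion set restricts acceptance to its members
             if cuis and cui not in cuis:
                 return False
             # Excluded CUIs are never accepted
             return cui not in cuis_exclude


         accept_all = check_filters_sketch("C1", set(), set())
         not_included = check_filters_sketch("C2", {"C1"}, set())
         excluded = check_filters_sketch("C1", set(), {"C1"})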


   .. py:method:: merge_with(other)

      Merge CUIs and excluded CUIs within two filters.
      The data will be kept within this filter (and not the other).

      :param other: The other filter to merge with
      :type other: LinkingFilters


   .. py:method:: copy_of()

      Create a copy of this LinkingFilters.
      This copy will describe an identical filter but will refer to
      different sets so they can be mutated separately.

      :Returns: **LinkingFilters** -- A copy of the original filters.
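
   Merging and copying can be sketched with a small dataclass. ``FiltersSketch`` is a
   hypothetical stand-in, and treating a merge as a set union is an assumption of this sketch.

   .. code-block:: python

      from dataclasses import dataclass, field
      from typing import Set


      @dataclass
      class FiltersSketch:
          cuis: Set[str] = field(default_factory=set)
          cuis_exclude: Set[str] = field(default_factory=set)

          def merge_with(self, other: "FiltersSketch") -> None:
              # Data is kept in this filter; the other filter is left untouched
              self.cuis.update(other.cuis)
              self.cuis_exclude.update(other.cuis_exclude)

          def copy_of(self) -> "FiltersSketch":
              # New set objects, so the copy can be mutated separately
              return FiltersSketch(set(self.cuis), set(self.cuis_exclude))


      base = FiltersSketch(cuis={"C1"})
      extra = FiltersSketch(cuis={"C2"}, cuis_exclude={"C3"})
      snapshot = base.copy_of()
      base.merge_with(extra)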


.. py:class:: Linking(/, **data)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The linking part of the config

   .. py:class:: Config


      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: optim
      :type: dict

      Linear anneal

   .. py:attribute:: context_vector_sizes
      :type: dict

      Context vector sizes that will be calculated and used for linking

   .. py:attribute:: context_vector_weights
      :type: dict

      Weight of each vector in the similarity score - make trainable at some point. Should add up to 1.
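
      A weighted similarity score of this kind can be sketched as follows. The context names,
      weights, and similarity values are made up for illustration, and ``weighted_similarity``
      is a hypothetical helper rather than the library's scoring function.

      .. code-block:: python

         import math

         # Illustrative values only; the weights add up to 1
         context_vector_weights = {"short": 0.1, "medium": 0.4, "long": 0.4, "xlong": 0.1}
         similarities = {"short": 0.9, "medium": 0.6, "long": 0.5, "xlong": 0.2}


         def weighted_similarity(similarities: dict, weights: dict) -> float:
             if not math.isclose(sum(weights.values()), 1.0):
                 raise ValueError("weights should add up to 1")
             return sum(weights[name] * similarities[name] for name in weights)


         score = weighted_similarity(similarities, context_vector_weights)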

   .. py:attribute:: filters
      :type: LinkingFilters

      Filters

   .. py:attribute:: train
      :type: bool
      :value: True

      Should it train or not. This is set automatically; ignore it in 99% of cases and do not set it manually.

   .. py:attribute:: random_replacement_unsupervised
      :type: float
      :value: 0.8

      If <1, during unsupervised training the detected term will be randomly replaced, with a probability of 1 - random_replacement_unsupervised,
      by a synonym used for that term
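
      The replacement probability can be sketched as below. ``maybe_replace`` is a hypothetical
      helper; choosing uniformly among the synonyms is an assumption of this sketch.

      .. code-block:: python

         import random


         def maybe_replace(term, synonyms, random_replacement_unsupervised=0.8, rng=random):
             # rng.random() is in [0, 1), so the branch is taken with
             # probability 1 - random_replacement_unsupervised
             if synonyms and rng.random() > random_replacement_unsupervised:
                 return rng.choice(synonyms)
             return term


         rng = random.Random(42)
         # With a setting of 1.0 the term is never replaced
         never_replaced = maybe_replace("MI", ["myocardial infarction"], 1.0, rng)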

   .. py:attribute:: disamb_length_limit
      :type: int
      :value: 3

      All concepts below this will always be disambiguated

   .. py:attribute:: filter_before_disamb
      :type: bool
      :value: False

      If True it will filter before doing disamb. Useful for the trainer.

   .. py:attribute:: train_count_threshold
      :type: int
      :value: 1

      Concepts that have seen fewer training examples than this will not be used for
      similarity calculation and will have a similarity of -1.

   .. py:attribute:: always_calculate_similarity
      :type: bool
      :value: False

      Do we want to calculate context similarity even for concepts that are not ambiguous.

   .. py:attribute:: calculate_dynamic_threshold
      :type: bool
      :value: False

      Concepts below this similarity will be ignored. Type can be static/dynamic - if dynamic each CUI has a different threshold
      and it is calculated as the average confidence for that CUI * similarity_threshold. Take care that dynamic works only
      if the cdb was trained with calculate_dynamic_threshold = True.
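
      The difference between the static and dynamic threshold types comes down to one
      multiplication, sketched here. ``effective_threshold`` is a hypothetical helper and the
      average confidence of 0.8 is a made-up value.

      .. code-block:: python

         def effective_threshold(similarity_threshold: float,
                                 similarity_threshold_type: str,
                                 average_confidence: float) -> float:
             if similarity_threshold_type == "dynamic":
                 # Per-CUI threshold: average confidence for that CUI * similarity_threshold
                 return average_confidence * similarity_threshold
             return similarity_threshold


         static_th = effective_threshold(0.25, "static", average_confidence=0.8)
         dynamic_th = effective_threshold(0.25, "dynamic", average_confidence=0.8)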

   .. py:attribute:: similarity_threshold_type
      :type: str
      :value: 'static'

      
   .. py:attribute:: similarity_threshold
      :type: float
      :value: 0.25

      
   .. py:attribute:: negative_probability
      :type: float
      :value: 0.5

      Probability for the negative context to be added for each positive addition

   .. py:attribute:: negative_ignore_punct_and_num
      :type: bool
      :value: True

      Do we ignore punct/num when negative sampling

   .. py:attribute:: prefer_primary_name
      :type: float
      :value: 0.35

      If >0 concepts for which a detection is its primary name will be preferred by that amount (0 to 1)

   .. py:attribute:: prefer_frequent_concepts
      :type: float
      :value: 0.35

      If >0 concepts that are more frequent will be preferred by a multiple of this amount

   .. py:attribute:: subsample_after
      :type: int
      :value: 30000

      DISABLED in code permanently: Subsample during unsupervised training if a concept has received more than

   .. py:attribute:: devalue_linked_concepts
      :type: bool
      :value: False

      When adding a positive example, should it also be treated as Negative for concepts
      which link to the positive one via names (ambiguous names).

   .. py:attribute:: context_ignore_center_tokens
      :type: bool
      :value: False

      If true, when the context of a concept is calculated (embedding), the words making up that concept are not taken into account


.. py:class:: Config(*args, **kwargs)


   Bases: :py:obj:`MixingConfig`, :py:obj:`pydantic.BaseModel`

   The MedCAT config

   .. py:class:: Config


      .. py:attribute:: arbitrary_types_allowed
         :value: True

         
      .. py:attribute:: extra
         :value: 'allow'

         
      .. py:attribute:: validate_assignment
         :value: True

         
   .. py:attribute:: version
      :type: VersionInfo

      
   .. py:attribute:: cdb_maker
      :type: CDBMaker

      
   .. py:attribute:: annotation_output
      :type: AnnotationOutput

      
   .. py:attribute:: general
      :type: General

      
   .. py:attribute:: preprocessing
      :type: Preprocessing

      
   .. py:attribute:: ner
      :type: Ner

      
   .. py:attribute:: linking
      :type: Linking

      
   .. py:attribute:: word_skipper
      :type: re.Pattern

      
   .. py:attribute:: punct_checker
      :type: re.Pattern

      
   .. py:attribute:: hash
      :type: Optional[str]

      
   .. py:method:: __init__(*args, **kwargs)

      Create a new model by parsing and validating input data from keyword arguments.

      Raises ``ValidationError`` (``pydantic_core.ValidationError``) if the input data cannot be
      validated to form a valid model.

      `self` is explicitly positional-only to allow `self` as a field name.


   .. py:method:: rebuild_re()


   .. py:method:: get_hash()


.. py:exception:: UseOfOldConfigOptionException(conf_type, arg_name, advice)


   Bases: :py:obj:`AttributeError`

   Attribute not found.

   .. py:method:: __init__(conf_type, arg_name, advice)

      Initialize self.  See help(type(self)) for accurate signature.


.. py:function:: _wrapper(func, check_type, advice, exp_type)


.. py:data:: _waf_advice
   :value: 'You can use `cat.cdb.weighted_average_function` to access it directly'