:py:mod:`medcat.stats.kfold`
============================

.. py:module:: medcat.stats.kfold


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.stats.kfold.CDBLike
   medcat.stats.kfold.CATLike
   medcat.stats.kfold.SplitType
   medcat.stats.kfold.FoldCreator
   medcat.stats.kfold.SimpleFoldCreator
   medcat.stats.kfold.PerDocsFoldCreator
   medcat.stats.kfold.PerAnnsFoldCreator
   medcat.stats.kfold.WeightedDocumentsCreator
   medcat.stats.kfold.PerCUIMetrics


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.stats.kfold.get_fold_creator
   medcat.stats.kfold.get_per_fold_metrics
   medcat.stats.kfold._merge_examples
   medcat.stats.kfold._add_helper
   medcat.stats.kfold._add_weighted_helper
   medcat.stats.kfold.get_metrics_mean
   medcat.stats.kfold.get_k_fold_stats


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.stats.kfold.IntValuedMetric
   medcat.stats.kfold.FloatValuedMetric


.. py:class:: CDBLike


   Bases: :py:obj:`Protocol`

   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...


.. py:class:: CATLike


   Bases: :py:obj:`Protocol`

   .. py:property:: cdb
      :type: CDBLike


   .. py:method:: train_supervised_raw(data, reset_cui_count = False, nepochs = 1, print_stats = 0, use_filters = False, terminate_last = False, use_overlaps = False, use_cui_doc_limit = False, test_size = 0, devalue_others = False, use_groups = False, never_terminate = False, train_from_false_positives = False, extra_cui_filter = None, retain_extra_cui_filter = False, checkpoint = None, retain_filters = False, is_resumed = False)


.. py:class:: SplitType


   Bases: :py:obj:`enum.Enum`

   The split type.

   .. py:attribute:: DOCUMENTS

      Split over number of documents.

   .. py:attribute:: ANNOTATIONS

      Split over number of annotations.

   .. py:attribute:: DOCUMENTS_WEIGHTED

      Split over documents, weighted by the number of annotations in each.
      This ensures that the same document never ends up in two folds, while
      distributing documents with different numbers of annotations more
      evenly.

      For example, suppose we have 6 documents that we want to split into
      3 folds, with the following number of annotations per document::

         [40, 40, 20, 10, 5, 5]

      Splitting trivially over documents would give 3 folds whose
      annotation counts are far from even::

         [80, 30, 10]

      Using the annotation counts as weights instead, we can create folds
      with more evenly distributed annotations, e.g.::

         [[D1], [D2], [D3, D4, D5, D6]]

      where ``D#`` denotes the document number. Each fold then holds the
      same number of annotations::

         [40, 40, 20 + 10 + 5 + 5 = 40]
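
      The weighting idea above can be sketched with a simple greedy
      heuristic: visit documents from heaviest to lightest and place each
      one into the fold with the smallest total so far. This is only an
      illustration under stated assumptions. ``split_weighted`` is a
      hypothetical helper, not part of ``medcat``, and the real fold creator
      may use a different strategy.

      .. code-block:: python

         from typing import List, Sequence


         def split_weighted(weights: Sequence[int], nr_of_folds: int) -> List[List[int]]:
             """Hypothetical helper: greedily balance total weight across folds."""
             folds: List[List[int]] = [[] for _ in range(nr_of_folds)]
             totals = [0] * nr_of_folds
             # Visit the heaviest documents first (stable for equal weights).
             order = sorted(range(len(weights)), key=lambda i: -weights[i])
             for doc_idx in order:
                 # Put the document into the currently lightest fold.
                 lightest = min(range(nr_of_folds), key=lambda f: totals[f])
                 folds[lightest].append(doc_idx)
                 totals[lightest] += weights[doc_idx]
             return folds


         # Annotation counts per document, taken from the example above.
         folds = split_weighted([40, 40, 20, 10, 5, 5], nr_of_folds=3)
         print(folds)  # [[0], [1], [2, 3, 4, 5]]

      Indices are zero-based here, so ``[2, 3, 4, 5]`` corresponds to
      documents D3 to D6 in the example.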


.. py:class:: FoldCreator(mct_export, nr_of_folds)


   Bases: :py:obj:`abc.ABC`

   The FoldCreator based on an MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int

   .. py:method:: __init__(mct_export, nr_of_folds)


   .. py:method:: _find_or_add_doc(project, orig_doc)


   .. py:method:: _create_new_project(proj_info)


   .. py:method:: _create_export_with_documents(relevant_docs)


   .. py:method:: create_folds()
      :abstractmethod:

      Create folds.

      :raises ValueError: If something went wrong.

      :Returns: **List[MedCATTrainerExport]** -- The created folds.


.. py:class:: SimpleFoldCreator(mct_export, nr_of_folds, counter)


   Bases: :py:obj:`FoldCreator`

   The FoldCreator based on an MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int

   .. py:method:: __init__(mct_export, nr_of_folds, counter)


   .. py:method:: _init_per_fold()


   .. py:method:: _create_fold(fold_nr)
      :abstractmethod:


   .. py:method:: create_folds()

      Create folds.

      :raises ValueError: If something went wrong.

      :Returns: **List[MedCATTrainerExport]** -- The created folds.


.. py:class:: PerDocsFoldCreator(mct_export, nr_of_folds)


   Bases: :py:obj:`FoldCreator`

   The FoldCreator based on an MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int

   .. py:method:: __init__(mct_export, nr_of_folds)


   .. py:method:: _create_fold(fold_nr)


   .. py:method:: create_folds()

      Create folds.

      :raises ValueError: If something went wrong.

      :Returns: **List[MedCATTrainerExport]** -- The created folds.


.. py:class:: PerAnnsFoldCreator(mct_export, nr_of_folds)


   Bases: :py:obj:`SimpleFoldCreator`

   The FoldCreator based on an MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int

   .. py:method:: __init__(mct_export, nr_of_folds)


   .. py:method:: _add_target_ann(project, orig_doc, ann)


   .. py:method:: _targets(start_at)


   .. py:method:: _create_fold(fold_nr)


.. py:class:: WeightedDocumentsCreator(mct_export, nr_of_folds, weight_calculator)


   Bases: :py:obj:`FoldCreator`

   The FoldCreator based on an MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int

   .. py:method:: __init__(mct_export, nr_of_folds, weight_calculator)


   .. py:method:: create_folds()

      Create folds.

      :raises ValueError: If something went wrong.

      :Returns: **List[MedCATTrainerExport]** -- The created folds.


.. py:function:: get_fold_creator(mct_export, nr_of_folds, split_type)

   Get the appropriate fold creator.

   :param mct_export: The MCT export.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to use.
   :type nr_of_folds: int
   :param split_type: The type of split to use.
   :type split_type: SplitType

   :raises ValueError: In case of an unknown split type.

   :Returns: **FoldCreator** -- The corresponding fold creator.


.. py:function:: get_per_fold_metrics(cat, folds, *args, **kwargs)


.. py:function:: _merge_examples(all_examples, cur_examples)


.. py:data:: IntValuedMetric

   
.. py:data:: FloatValuedMetric

   
.. py:class:: PerCUIMetrics(/, **data)


   Bases: :py:obj:`pydantic.BaseModel`

   .. py:attribute:: weights
      :type: List[Union[int, float]]
      :value: []

      
   .. py:attribute:: vals
      :type: List[Union[int, float]]
      :value: []

      
   .. py:method:: add(val, weight = 1)


   .. py:method:: get_mean()


   .. py:method:: get_std()
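
   As a rough illustration of what ``add``, ``get_mean`` and ``get_std``
   could compute from ``vals`` and ``weights``, the sketch below uses a
   weighted mean and a weighted population standard deviation. These
   formulas are an assumption, since this page does not state them, and
   ``WeightedValues`` is a hypothetical stand-in rather than the real class.

   .. code-block:: python

      import math
      from typing import List, Union

      Number = Union[int, float]


      class WeightedValues:
          """Hypothetical stand-in, not the real ``PerCUIMetrics`` class."""

          def __init__(self) -> None:
              self.vals: List[Number] = []
              self.weights: List[Number] = []

          def add(self, val: Number, weight: Number = 1) -> None:
              self.vals.append(val)
              self.weights.append(weight)

          def get_mean(self) -> float:
              # Assumed formula: weighted mean, sum(w * v) / sum(w).
              weighted_sum = sum(w * v for w, v in zip(self.weights, self.vals))
              return weighted_sum / sum(self.weights)

          def get_std(self) -> float:
              # Assumed formula: weighted population standard deviation.
              mean = self.get_mean()
              weighted_sq = sum(
                  w * (v - mean) ** 2 for w, v in zip(self.weights, self.vals)
              )
              return math.sqrt(weighted_sq / sum(self.weights))


      m = WeightedValues()
      m.add(0.5, weight=1)
      m.add(1.0, weight=3)
      print(m.get_mean())  # 0.875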


.. py:function:: _add_helper(joined, single)


.. py:function:: _add_weighted_helper(joined, single, cui2count)


.. py:function:: get_metrics_mean(metrics, include_std)

   Get the mean of the provided metrics.

   :param metrics: The metrics.
   :type metrics: List[Tuple[Dict, Dict, Dict, Dict, Dict, Dict, Dict, Dict]]
   :param include_std: Whether to include the standard deviation.
   :type include_std: bool

   :Returns: * **fps** (*dict*) -- False positives for each CUI.
             * **fns** (*dict*) -- False negatives for each CUI.
             * **tps** (*dict*) -- True positives for each CUI.
             * **cui_prec** (*dict*) -- Precision for each CUI.
             * **cui_rec** (*dict*) -- Recall for each CUI.
             * **cui_f1** (*dict*) -- F1 for each CUI.
             * **cui_counts** (*dict*) -- Number of occurrences for each CUI.
             * **examples** (*dict*) -- Examples for each of fp, fn, and tp, in the format ``examples['fp']['cui'][<list_of_examples>]``.
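
   The averaging step can be pictured on a single per-CUI metric, for
   instance precision collected over several folds. The sketch below is a
   simplified, unweighted illustration. ``mean_per_cui`` is a hypothetical
   helper, and returning ``(mean, std)`` pairs when ``include_std`` is set
   is an assumption about the output shape, not something this page
   specifies.

   .. code-block:: python

      from collections import defaultdict
      from statistics import mean, pstdev
      from typing import Dict, List


      def mean_per_cui(per_fold: List[Dict[str, float]], include_std: bool = False):
          """Hypothetical helper: average one per-CUI metric over folds."""
          collected = defaultdict(list)
          for fold_metric in per_fold:
              for cui, value in fold_metric.items():
                  # A CUI may be missing from some folds; use what is there.
                  collected[cui].append(value)
          if include_std:
              # Assumed output shape: (mean, population std) per CUI.
              return {cui: (mean(v), pstdev(v)) for cui, v in collected.items()}
          return {cui: mean(v) for cui, v in collected.items()}


      cui_prec_folds = [{"C1": 0.5, "C2": 1.0}, {"C1": 1.0}]
      print(mean_per_cui(cui_prec_folds))  # {'C1': 0.75, 'C2': 1.0}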


.. py:function:: get_k_fold_stats(cat, mct_export_data, k = 3, split_type = SplitType.DOCUMENTS_WEIGHTED, include_std = False, *args, **kwargs)

   Get the k-fold stats for the model with the specified data.

   First, this splits the MCT export into `k` folds. The split can be made
   per document, per annotation, or per document weighted by the number of
   annotations (see :py:class:`SplitType`).

   For each of the `k` folds, it starts from the base model,
   trains it with the other `k-1` folds, and records the metrics.
   The base model state is then restored before the next fold.
   Once all folds are done, the metrics are averaged.

   :param cat: The model pack.
   :type cat: CATLike
   :param mct_export_data: The MCT export.
   :type mct_export_data: MedCATTrainerExport
   :param k: The number of folds. Defaults to 3.
   :type k: int
   :param split_type: The type of split to use (documents, annotations, or weighted documents). Defaults to DOCUMENTS_WEIGHTED.
   :type split_type: SplitType
   :param include_std: Whether to include the standard deviation. Defaults to False.
   :type include_std: bool
   :param \*args: Arguments passed to the `CAT.train_supervised_raw` method.
   :param \*\*kwargs: Keyword arguments passed to the `CAT.train_supervised_raw` method.

   :Returns: **Tuple** -- The averaged metrics. Potentially with their corresponding standard deviations.
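
   The procedure described above follows the usual k-fold pattern, which
   can be sketched generically as below. This is a toy illustration, not
   the ``medcat`` implementation. ``k_fold_loop``, ``train`` and
   ``evaluate`` are hypothetical names, and the sketch trains a deep copy
   of the base model for each fold instead of restoring its state
   afterwards.

   .. code-block:: python

      import copy
      from typing import Callable, List, Sequence, TypeVar

      Model = TypeVar("Model")
      Fold = TypeVar("Fold")


      def k_fold_loop(
          base_model: Model,
          folds: Sequence[Fold],
          train: Callable[[Model, List[Fold]], None],
          evaluate: Callable[[Model, Fold], float],
      ) -> List[float]:
          """Train on k-1 folds, score on the held-out fold, for every fold."""
          scores: List[float] = []
          for i, held_out in enumerate(folds):
              # Work on a copy so the base model is untouched between folds.
              model = copy.deepcopy(base_model)
              train(model, [f for j, f in enumerate(folds) if j != i])
              scores.append(evaluate(model, held_out))
          return scores


      # Toy usage: the "model" memorises tokens; the score is the share of
      # held-out tokens it has seen during training.
      def train(model, train_folds):
          for fold in train_folds:
              model["vocab"].update(fold)


      def evaluate(model, fold):
          return sum(tok in model["vocab"] for tok in fold) / len(fold)


      base = {"vocab": set()}
      folds = [["a", "b"], ["c", "d"], ["a", "d"]]
      scores = k_fold_loop(base, folds, train, evaluate)
      print(scores)  # [0.5, 0.5, 1.0]

   The per-fold scores would then be averaged, which is the role
   ``get_metrics_mean`` plays in this module.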