:py:mod:`medcat.stats.kfold`
============================

.. py:module:: medcat.stats.kfold


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.stats.kfold.CDBLike
   medcat.stats.kfold.CATLike
   medcat.stats.kfold.SplitType
   medcat.stats.kfold.FoldCreator
   medcat.stats.kfold.SimpleFoldCreator
   medcat.stats.kfold.PerDocsFoldCreator
   medcat.stats.kfold.PerAnnsFoldCreator
   medcat.stats.kfold.WeightedDocumentsCreator
   medcat.stats.kfold.PerCUIMetrics


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.stats.kfold.get_fold_creator
   medcat.stats.kfold.get_per_fold_metrics
   medcat.stats.kfold._merge_examples
   medcat.stats.kfold._add_helper
   medcat.stats.kfold._add_weighted_helper
   medcat.stats.kfold.get_metrics_mean
   medcat.stats.kfold.get_k_fold_stats


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.stats.kfold.IntValuedMetric
   medcat.stats.kfold.FloatValuedMetric


.. py:class:: CDBLike

   Bases: :py:obj:`Protocol`

   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...


.. py:class:: CATLike

   Bases: :py:obj:`Protocol`

   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...

   .. py:property:: cdb
      :type: CDBLike


   .. py:method:: train_supervised_raw(data, reset_cui_count = False, nepochs = 1, print_stats = 0, use_filters = False, terminate_last = False, use_overlaps = False, use_cui_doc_limit = False, test_size = 0, devalue_others = False, use_groups = False, never_terminate = False, train_from_false_positives = False, extra_cui_filter = None, retain_extra_cui_filter = False, checkpoint = None, retain_filters = False, is_resumed = False)



.. py:class:: SplitType

   Bases: :py:obj:`enum.Enum`

   The split type.

   .. py:attribute:: DOCUMENTS

      Split over number of documents.


   .. py:attribute:: ANNOTATIONS

      Split over number of annotations.


   .. py:attribute:: DOCUMENTS_WEIGHTED

      Split over number of documents based on the number of annotations.
      So essentially this ensures that the same document isn't in 2 folds
      while trying to more equally distribute documents with different number
      of annotations.

      For example: If we have 6 documents that we want to split into 3 folds.
      The number of annotations per document are as follows::

          [40, 40, 20, 10, 5, 5]

      If we were to split this trivially over documents, we'd end up
      with the 3 folds with number of annotations that are far from even::

          [80, 30, 10]

      However, if we use the annotations as weights, we would be able to
      create folds that have more evenly distributed annotations, e.g::

          [[D1,], [D2], [D3, D4, D5, D6]]

      where D# denotes the number of the documents, with the number of
      annotations being equal::

          [ 40, 40, 20 + 10 + 5 + 5 = 40]



.. py:class:: FoldCreator(mct_export, nr_of_folds)

   Bases: :py:obj:`abc.ABC`

   The FoldCreator based on a MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int
   :param use_annotations: Whether to fold on number of annotations or documents.
   :type use_annotations: bool

   .. py:method:: __init__(mct_export, nr_of_folds)


   .. py:method:: _find_or_add_doc(project, orig_doc)


   .. py:method:: _create_new_project(proj_info)


   .. py:method:: _create_export_with_documents(relevant_docs)


   .. py:method:: create_folds()
      :abstractmethod:

      Create folds.

      :raises ValueError: If something went wrong.

      :Returns: **List[MedCATTrainerExport]** -- The created folds.



.. py:class:: SimpleFoldCreator(mct_export, nr_of_folds, counter)

   Bases: :py:obj:`FoldCreator`

   The FoldCreator based on a MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int
   :param use_annotations: Whether to fold on number of annotations or documents.
   :type use_annotations: bool

   .. py:method:: __init__(mct_export, nr_of_folds, counter)


   .. py:method:: _init_per_fold()


   .. py:method:: _create_fold(fold_nr)
      :abstractmethod:


   .. py:method:: create_folds()

      Create folds.

      :raises ValueError: If something went wrong.

      :Returns: **List[MedCATTrainerExport]** -- The created folds.



.. py:class:: PerDocsFoldCreator(mct_export, nr_of_folds)

   Bases: :py:obj:`FoldCreator`

   The FoldCreator based on a MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int
   :param use_annotations: Whether to fold on number of annotations or documents.
   :type use_annotations: bool

   .. py:method:: __init__(mct_export, nr_of_folds)


   .. py:method:: _create_fold(fold_nr)


   .. py:method:: create_folds()

      Create folds.

      :raises ValueError: If something went wrong.

      :Returns: **List[MedCATTrainerExport]** -- The created folds.



.. py:class:: PerAnnsFoldCreator(mct_export, nr_of_folds)

   Bases: :py:obj:`SimpleFoldCreator`

   The FoldCreator based on a MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int
   :param use_annotations: Whether to fold on number of annotations or documents.
   :type use_annotations: bool

   .. py:method:: __init__(mct_export, nr_of_folds)


   .. py:method:: _add_target_ann(project, orig_doc, ann)


   .. py:method:: _targets(start_at)


   .. py:method:: _create_fold(fold_nr)



.. py:class:: WeightedDocumentsCreator(mct_export, nr_of_folds, weight_calculator)

   Bases: :py:obj:`FoldCreator`

   The FoldCreator based on a MCT export.

   :param mct_export: The MCT export dict.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to create.
   :type nr_of_folds: int
   :param use_annotations: Whether to fold on number of annotations or documents.
   :type use_annotations: bool

   .. py:method:: __init__(mct_export, nr_of_folds, weight_calculator)


   .. py:method:: create_folds()

      Create folds.

      :raises ValueError: If something went wrong.

      :Returns: **List[MedCATTrainerExport]** -- The created folds.



.. py:function:: get_fold_creator(mct_export, nr_of_folds, split_type)

   Get the appropriate fold creator.

   :param mct_export: The MCT export.
   :type mct_export: MedCATTrainerExport
   :param nr_of_folds: Number of folds to use.
   :type nr_of_folds: int
   :param split_type: The type of split to use.
   :type split_type: SplitType

   :raises ValueError: In case of an unknown split type.

   :Returns: **FoldCreator** -- The corresponding fold creator.


.. py:function:: get_per_fold_metrics(cat, folds, *args, **kwargs)


.. py:function:: _merge_examples(all_examples, cur_examples)


.. py:data:: IntValuedMetric


.. py:data:: FloatValuedMetric


.. py:class:: PerCUIMetrics(/, **data)

   Bases: :py:obj:`pydantic.BaseModel`

   Usage docs: https://docs.pydantic.dev/2.10/concepts/models/

   A base class for creating Pydantic models.

   .. attribute:: __class_vars__

      The names of the class variables defined on the model.

   .. attribute:: __private_attributes__

      Metadata about the private attributes of the model.

   .. attribute:: __signature__

      The synthesized `__init__` [`Signature`][inspect.Signature] of the model.

   .. attribute:: __pydantic_complete__

      Whether model building is completed, or if there are still undefined fields.

   .. attribute:: __pydantic_core_schema__

      The core schema of the model.

   .. attribute:: __pydantic_custom_init__

      Whether the model has a custom `__init__` function.

   .. attribute:: __pydantic_decorators__

      Metadata containing the decorators defined on the model.
      This replaces `Model.__validators__` and `Model.__root_validators__` from Pydantic V1.

   .. attribute:: __pydantic_generic_metadata__

      Metadata for generic models; contains data used for a similar purpose to
      __args__, __origin__, __parameters__ in typing-module generics. May eventually be replaced by these.

   .. attribute:: __pydantic_parent_namespace__

      Parent namespace of the model, used for automatic rebuilding of models.

   .. attribute:: __pydantic_post_init__

      The name of the post-init method for the model, if defined.

   .. attribute:: __pydantic_root_model__

      Whether the model is a [`RootModel`][pydantic.root_model.RootModel].

   .. attribute:: __pydantic_serializer__

      The `pydantic-core` `SchemaSerializer` used to dump instances of the model.

   .. attribute:: __pydantic_validator__

      The `pydantic-core` `SchemaValidator` used to validate instances of the model.

   .. attribute:: __pydantic_fields__

      A dictionary of field names and their corresponding [`FieldInfo`][pydantic.fields.FieldInfo] objects.

   .. attribute:: __pydantic_computed_fields__

      A dictionary of computed field names and their corresponding [`ComputedFieldInfo`][pydantic.fields.ComputedFieldInfo] objects.

   .. attribute:: __pydantic_extra__

      A dictionary containing extra values, if [`extra`][pydantic.config.ConfigDict.extra]
      is set to `'allow'`.

   .. attribute:: __pydantic_fields_set__

      The names of fields explicitly set during instantiation.

   .. attribute:: __pydantic_private__

      Values of private attributes set on the model instance.

   .. py:attribute:: weights
      :type: List[Union[int, float]]
      :value: []


   .. py:attribute:: vals
      :type: List[Union[int, float]]
      :value: []


   .. py:method:: add(val, weight = 1)


   .. py:method:: get_mean()


   .. py:method:: get_std()



.. py:function:: _add_helper(joined, single)


.. py:function:: _add_weighted_helper(joined, single, cui2count)


.. py:function:: get_metrics_mean(metrics, include_std)

   Get the mean of the provided metrics.

   :param metrics: The metrics.
   :type metrics: List[Tuple[Dict, Dict, Dict, Dict, Dict, Dict, Dict, Dict]]
   :param include_std: Whether to include the standard deviation.
   :type include_std: bool

   :Returns: * **fps** (*dict*) -- False positives for each CUI.
             * **fns** (*dict*) -- False negatives for each CUI.
             * **tps** (*dict*) -- True positives for each CUI.
             * **cui_prec** (*dict*) -- Precision for each CUI.
             * **cui_rec** (*dict*) -- Recall for each CUI.
             * **cui_f1** (*dict*) -- F1 for each CUI.
             * **cui_counts** (*dict*) -- Number of occurrences for each CUI.
             * **examples** (*dict*) -- Examples for each of the fp, fn, tp.
               Format will be examples['fp']['cui'][].


.. py:function:: get_k_fold_stats(cat, mct_export_data, k = 3, split_type = SplitType.DOCUMENTS_WEIGHTED, include_std = False, *args, **kwargs)

   Get the k-fold stats for the model with the specified data.

   First this will split the MCT export into `k` folds. You can do
   this either per document or per annotation.

   For each of the `k` folds, it will start from the base model,
   train it with the other `k-1` folds and record the metrics.
   After that the base model state is restored before doing the next fold.
   After all the folds have been done, the metrics are averaged.

   :param cat: The model pack.
   :type cat: CATLike
   :param mct_export_data: The MCT export.
   :type mct_export_data: MedCATTrainerExport
   :param k: The number of folds. Defaults to 3.
   :type k: int
   :param split_type: Whether to split on annotations or documents. Defaults to DOCUMENTS_WEIGHTED.
   :type split_type: SplitType
   :param include_std: Whether to include the standard deviation. Defaults to False.
   :type include_std: bool
   :param \*args: Arguments passed to the `CAT.train_supervised_raw` method.
   :param \*\*kwargs: Keyword arguments passed to the `CAT.train_supervised_raw` method.

   :Returns: **Tuple** -- The averaged metrics. Potentially with their corresponding standard deviations.
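
The k-fold procedure described for ``get_k_fold_stats`` can be sketched in
plain Python. This is a minimal illustration under stated assumptions, not
MedCAT's implementation: ``split_weighted``, ``k_fold_mean``, ``toy_train``
and ``toy_evaluate`` are hypothetical names, the greedy heaviest-first
assignment is an assumed stand-in for the ``DOCUMENTS_WEIGHTED`` split, and
``copy.deepcopy`` stands in for restoring the base model state between folds.

.. code-block:: python

   import copy
   from statistics import mean
   from typing import Callable, List, Sequence


   def split_weighted(weights: Sequence[int], k: int) -> List[List[int]]:
       """Greedy heuristic (an assumption, not MedCAT's algorithm): give the
       heaviest remaining document to the currently lightest fold."""
       folds: List[List[int]] = [[] for _ in range(k)]
       totals = [0] * k
       for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
           target = totals.index(min(totals))  # lightest fold so far
           folds[target].append(idx)
           totals[target] += weights[idx]
       return folds


   def k_fold_mean(base_model: dict, docs: List[dict], k: int,
                   train: Callable[[dict, List[dict]], None],
                   evaluate: Callable[[dict, List[dict]], float]) -> float:
       """Train on k-1 folds, evaluate on the held-out fold, average."""
       folds = split_weighted([len(d["annotations"]) for d in docs], k)
       scores = []
       for held_out in folds:
           # Each fold starts from an untouched copy of the base model.
           model = copy.deepcopy(base_model)
           train_docs = [docs[i] for fold in folds
                         if fold is not held_out for i in fold]
           train(model, train_docs)
           scores.append(evaluate(model, [docs[i] for i in held_out]))
       return mean(scores)


   # Toy stand-ins for training and evaluation (hypothetical).
   def toy_train(model: dict, train_docs: List[dict]) -> None:
       model["seen"] += sum(len(d["annotations"]) for d in train_docs)


   def toy_evaluate(model: dict, test_docs: List[dict]) -> float:
       unseen = sum(len(d["annotations"]) for d in test_docs)
       return model["seen"] / (model["seen"] + unseen)


   # The six documents from the DOCUMENTS_WEIGHTED example.
   docs = [{"annotations": [0] * n} for n in (40, 40, 20, 10, 5, 5)]
   base = {"seen": 0}
   score = k_fold_mean(base, docs, 3, toy_train, toy_evaluate)

With these six documents the greedy split reproduces the folds from the
``DOCUMENTS_WEIGHTED`` example, each holding 40 annotations, and ``base`` is
left unmodified after the run.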