medcat.stats.kfold

Module Contents

Classes

CDBLike

Base class for protocol classes.

CATLike

Base class for protocol classes.

SplitType

The split type.

FoldCreator

The FoldCreator based on a MCT export.

SimpleFoldCreator

The FoldCreator based on a MCT export.

PerDocsFoldCreator

The FoldCreator based on a MCT export.

PerAnnsFoldCreator

The FoldCreator based on a MCT export.

WeightedDocumentsCreator

The FoldCreator based on a MCT export.

PerCUIMetrics

Functions

get_fold_creator(mct_export, nr_of_folds, split_type)

Get the appropriate fold creator.

get_per_fold_metrics(cat, folds, *args, **kwargs)

_merge_examples(all_examples, cur_examples)

_add_helper(joined, single)

_add_weighted_helper(joined, single, cui2count)

get_metrics_mean(metrics, include_std)

Get the mean of the provided metrics.

get_k_fold_stats(cat, mct_export_data[, k, ...])

Get the k-fold stats for the model with the specified data.

Attributes

IntValuedMetric

FloatValuedMetric

class medcat.stats.kfold.CDBLike

Bases: Protocol

Base class for protocol classes.

Protocol classes are defined as:

class Proto(Protocol):
    def meth(self) -> int:
        ...

Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:

class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check

See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic; they are defined as:

class GenProto(Protocol[T]):
    def meth(self) -> T:
        ...

class medcat.stats.kfold.CATLike

Bases: Protocol

Base class for protocol classes.

property cdb: CDBLike
Return type:

CDBLike

train_supervised_raw(data, reset_cui_count=False, nepochs=1, print_stats=0, use_filters=False, terminate_last=False, use_overlaps=False, use_cui_doc_limit=False, test_size=0, devalue_others=False, use_groups=False, never_terminate=False, train_from_false_positives=False, extra_cui_filter=None, retain_extra_cui_filter=False, checkpoint=None, retain_filters=False, is_resumed=False)
Parameters:
  • data (Dict[str, List[Dict[str, dict]]]) –

  • reset_cui_count (bool) –

  • nepochs (int) –

  • print_stats (int) –

  • use_filters (bool) –

  • terminate_last (bool) –

  • use_overlaps (bool) –

  • use_cui_doc_limit (bool) –

  • test_size (float) –

  • devalue_others (bool) –

  • use_groups (bool) –

  • never_terminate (bool) –

  • train_from_false_positives (bool) –

  • extra_cui_filter (Optional[Set]) –

  • retain_extra_cui_filter (bool) –

  • checkpoint (Optional[medcat.utils.checkpoint.Checkpoint]) –

  • retain_filters (bool) –

  • is_resumed (bool) –

Return type:

Tuple

class medcat.stats.kfold.SplitType

Bases: enum.Enum

The split type.

DOCUMENTS

Split over number of documents.

ANNOTATIONS

Split over number of annotations.

DOCUMENTS_WEIGHTED

Split over documents, weighted by each document's number of annotations. This ensures that no document ends up in two folds while distributing documents with differing numbers of annotations more evenly. For example:

Suppose we have 6 documents that we want to split into 3 folds. The number of annotations per document is as follows:

[40, 40, 20, 10, 5, 5]

If we were to split this trivially over documents (two consecutive documents per fold), we'd end up with 3 folds whose annotation counts are far from even:

[80, 30, 10]

However, if we use the annotation counts as weights, we can create folds with more evenly distributed annotations, e.g.:

[[D1], [D2], [D3, D4, D5, D6]]

where D# denotes the document number. The number of annotations in each fold is then equal:

[40, 40, 20 + 10 + 5 + 5 = 40]
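The arithmetic in the example above can be reproduced with a short, self-contained sketch. It is not the algorithm used by WeightedDocumentsCreator; it assumes a simple greedy strategy (largest document first, into the currently lightest fold), and the function names consecutive_fold_totals and greedy_weighted_folds are hypothetical.

```python
from typing import List, Tuple


def consecutive_fold_totals(ann_counts: List[int], nr_of_folds: int) -> List[int]:
    """Annotation totals when documents are cut into equal consecutive chunks."""
    size = len(ann_counts) // nr_of_folds  # assumes the documents divide evenly
    return [sum(ann_counts[i * size:(i + 1) * size]) for i in range(nr_of_folds)]


def greedy_weighted_folds(
    ann_counts: List[int], nr_of_folds: int
) -> Tuple[List[List[int]], List[int]]:
    """Assign each document (by index) to the fold with the fewest annotations so far."""
    folds: List[List[int]] = [[] for _ in range(nr_of_folds)]
    totals = [0] * nr_of_folds
    # Visit documents from most to fewest annotations (ties keep original order).
    order = sorted(range(len(ann_counts)), key=lambda i: ann_counts[i], reverse=True)
    for doc_idx in order:
        target = totals.index(min(totals))  # lightest fold; first one wins on ties
        folds[target].append(doc_idx)
        totals[target] += ann_counts[doc_idx]
    return folds, totals


counts = [40, 40, 20, 10, 5, 5]
print(consecutive_fold_totals(counts, 3))  # [80, 30, 10]
print(greedy_weighted_folds(counts, 3))    # ([[0], [1], [2, 3, 4, 5]], [40, 40, 40])
```

Document indices here are zero-based, so index 0 corresponds to D1 in the example.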

class medcat.stats.kfold.FoldCreator(mct_export, nr_of_folds)

Bases: abc.ABC

The FoldCreator based on a MCT export.

Parameters:
  • mct_export (MedCATTrainerExport) – The MCT export dict.

  • nr_of_folds (int) – Number of folds to create.

  • use_annotations (bool) – Whether to fold on number of annotations or documents.

__init__(mct_export, nr_of_folds)
Return type:

None

_find_or_add_doc(project, orig_doc)
Return type:

medcat.stats.mctexport.MedCATTrainerExportDocument

_create_new_project(proj_info)
Parameters:

proj_info (medcat.stats.mctexport.MedCATTrainerExportProjectInfo) –

Return type:

medcat.stats.mctexport.MedCATTrainerExportProject

_create_export_with_documents(relevant_docs)
Parameters:

relevant_docs (Iterable[Tuple[medcat.stats.mctexport.MedCATTrainerExportProjectInfo, medcat.stats.mctexport.MedCATTrainerExportDocument]]) –

Return type:

medcat.stats.mctexport.MedCATTrainerExport

abstract create_folds()

Create folds.

Raises:

ValueError – If something went wrong.

Returns:

List[MedCATTrainerExport] – The created folds.

Return type:

List[medcat.stats.mctexport.MedCATTrainerExport]

class medcat.stats.kfold.SimpleFoldCreator(mct_export, nr_of_folds, counter)

Bases: FoldCreator

The FoldCreator based on a MCT export.

__init__(mct_export, nr_of_folds, counter)
Return type:

None

_init_per_fold()
Return type:

List[int]

abstract _create_fold(fold_nr)
Parameters:

fold_nr (int) –

Return type:

medcat.stats.mctexport.MedCATTrainerExport

create_folds()

Create folds.

Raises:

ValueError – If something went wrong.

Returns:

List[MedCATTrainerExport] – The created folds.

Return type:

List[medcat.stats.mctexport.MedCATTrainerExport]

class medcat.stats.kfold.PerDocsFoldCreator(mct_export, nr_of_folds)

Bases: FoldCreator

The FoldCreator based on a MCT export.

Parameters:
  • mct_export (MedCATTrainerExport) – The MCT export dict.

  • nr_of_folds (int) – Number of folds to create.

  • use_annotations (bool) – Whether to fold on number of annotations or documents.

__init__(mct_export, nr_of_folds)
Return type:

None

_create_fold(fold_nr)
Parameters:

fold_nr (int) –

Return type:

medcat.stats.mctexport.MedCATTrainerExport

create_folds()

Create folds.

Raises:

ValueError – If something went wrong.

Returns:

List[MedCATTrainerExport] – The created folds.

Return type:

List[medcat.stats.mctexport.MedCATTrainerExport]

class medcat.stats.kfold.PerAnnsFoldCreator(mct_export, nr_of_folds)

Bases: SimpleFoldCreator

The FoldCreator based on a MCT export.

Parameters:
  • mct_export (MedCATTrainerExport) – The MCT export dict.

  • nr_of_folds (int) – Number of folds to create.

  • use_annotations (bool) – Whether to fold on number of annotations or documents.

__init__(mct_export, nr_of_folds)
Return type:

None

_add_target_ann(project, orig_doc, ann)
Return type:

None

_targets(start_at)
Parameters:

start_at (int) –

Return type:

Iterable[Tuple[medcat.stats.mctexport.MedCATTrainerExportProjectInfo, medcat.stats.mctexport.MedCATTrainerExportDocument, medcat.stats.mctexport.MedCATTrainerExportAnnotation]]

_create_fold(fold_nr)
Parameters:

fold_nr (int) –

Return type:

medcat.stats.mctexport.MedCATTrainerExport

class medcat.stats.kfold.WeightedDocumentsCreator(mct_export, nr_of_folds, weight_calculator)

Bases: FoldCreator

The FoldCreator based on a MCT export.

__init__(mct_export, nr_of_folds, weight_calculator)
Return type:

None

create_folds()

Create folds.

Raises:

ValueError – If something went wrong.

Returns:

List[MedCATTrainerExport] – The created folds.

Return type:

List[medcat.stats.mctexport.MedCATTrainerExport]

medcat.stats.kfold.get_fold_creator(mct_export, nr_of_folds, split_type)

Get the appropriate fold creator.

Parameters:
  • mct_export (MedCATTrainerExport) – The MCT export.

  • nr_of_folds (int) – Number of folds to use.

  • split_type (SplitType) – The type of split to use.

Raises:

ValueError – In case of an unknown split type.

Returns:

FoldCreator – The corresponding fold creator.

Return type:

FoldCreator
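A minimal usage sketch, based only on the signatures documented on this page. It assumes medcat is installed and that mct_export is an already loaded MCT export dict; the wrapper name make_folds is hypothetical.

```python
def make_folds(mct_export, nr_of_folds=3):
    """Split an MCT export into folds without placing a document in two folds."""
    # Imported lazily so the sketch can be defined even where medcat is absent.
    from medcat.stats.kfold import SplitType, get_fold_creator

    creator = get_fold_creator(mct_export, nr_of_folds, SplitType.DOCUMENTS_WEIGHTED)
    # create_folds() is documented to return a list of MCT exports, one per fold.
    return creator.create_folds()
```

Passing SplitType.DOCUMENTS or SplitType.ANNOTATIONS instead selects one of the other documented split types.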

medcat.stats.kfold.get_per_fold_metrics(cat, folds, *args, **kwargs)
Return type:

List[Tuple]

medcat.stats.kfold._merge_examples(all_examples, cur_examples)
Parameters:
  • all_examples (Dict) –

  • cur_examples (Dict) –

Return type:

None

medcat.stats.kfold.IntValuedMetric

medcat.stats.kfold.FloatValuedMetric

class medcat.stats.kfold.PerCUIMetrics

Bases: pydantic.BaseModel

weights: List[int | float] = []

vals: List[int | float] = []

add(val, weight=1)
Parameters:

weight (int) –

get_mean()

get_std()
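The page does not say how get_mean() and get_std() combine vals and weights. The sketch below shows one plausible reading: a weighted mean and a weighted population standard deviation. The class WeightedMetricSketch is a hypothetical stand-in written with a plain dataclass; it is not the library's pydantic model, and the formulas are assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import List, Union

Number = Union[int, float]


@dataclass
class WeightedMetricSketch:
    """Hypothetical stand-in mirroring the documented PerCUIMetrics interface."""

    weights: List[Number] = field(default_factory=list)
    vals: List[Number] = field(default_factory=list)

    def add(self, val: Number, weight: Number = 1) -> None:
        # Values and weights are stored in parallel lists.
        self.vals.append(val)
        self.weights.append(weight)

    def get_mean(self) -> float:
        # Assumed formula: sum(w * v) / sum(w).
        total = sum(self.weights)
        return sum(w * v for w, v in zip(self.weights, self.vals)) / total

    def get_std(self) -> float:
        # Assumed formula: sqrt(sum(w * (v - mean)^2) / sum(w)).
        mean = self.get_mean()
        total = sum(self.weights)
        var = sum(w * (v - mean) ** 2 for w, v in zip(self.weights, self.vals)) / total
        return math.sqrt(var)


metric = WeightedMetricSketch()
metric.add(0.5, weight=1)  # e.g. a fold in which the CUI occurred once
metric.add(1.0, weight=3)  # e.g. a fold in which the CUI occurred three times
print(metric.get_mean(), metric.get_std())
```

With these two entries the weighted mean is (0.5 + 3.0) / 4 = 0.875, so the fold with more occurrences dominates the result.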

medcat.stats.kfold._add_helper(joined, single)
Parameters:
  • joined (List[Dict[str, PerCUIMetrics]]) –

  • single (List[Dict[str, int]]) –

Return type:

None

medcat.stats.kfold._add_weighted_helper(joined, single, cui2count)
Parameters:
  • joined (List[Dict[str, PerCUIMetrics]]) –

  • single (List[Dict[str, float]]) –

  • cui2count (Dict[str, int]) –

Return type:

None

medcat.stats.kfold.get_metrics_mean(metrics, include_std)

Get the mean of the provided metrics.

Parameters:
  • metrics (List[Tuple[Dict, Dict, Dict, Dict, Dict, Dict, Dict, Dict]]) – The metrics.

  • include_std (bool) – Whether to include the standard deviation.

Returns:
  • fps (dict) – False positives for each CUI.

  • fns (dict) – False negatives for each CUI.

  • tps (dict) – True positives for each CUI.

  • cui_prec (dict) – Precision for each CUI.

  • cui_rec (dict) – Recall for each CUI.

  • cui_f1 (dict) – F1 for each CUI.

  • cui_counts (dict) – Number of occurrences for each CUI.

  • examples (dict) – Examples for each of the fp, fn, tp. Format will be examples['fp']['cui'][<list_of_examples>].

Return type:

Tuple[Dict, Dict, Dict, Dict, Dict, Dict, Dict, Dict]
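To illustrate what averaging per-CUI metrics across folds involves, here is a simplified, self-contained sketch. It takes an unweighted mean over the folds in which a CUI appears, whereas the library may weight folds (for example by CUI counts, as _add_weighted_helper suggests). The function name mean_per_cui and the CUI identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def mean_per_cui(per_fold: List[Dict[str, float]]) -> Dict[str, float]:
    """Average a per-CUI metric over the folds in which each CUI appears."""
    collected: Dict[str, List[float]] = defaultdict(list)
    for fold_metric in per_fold:
        for cui, value in fold_metric.items():
            collected[cui].append(value)
    # Unweighted mean: every fold counts equally for a given CUI.
    return {cui: sum(vals) / len(vals) for cui, vals in collected.items()}


# Precision per CUI in three folds; "C2" is missing from the second fold.
fold_precisions = [{"C1": 0.8, "C2": 0.5}, {"C1": 0.6}, {"C1": 1.0, "C2": 1.0}]
averaged = mean_per_cui(fold_precisions)
print(averaged)
```

A CUI that is absent from a fold is simply left out of that CUI's average here; this is one possible design choice, not a documented behaviour.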

medcat.stats.kfold.get_k_fold_stats(cat, mct_export_data, k=3, split_type=SplitType.DOCUMENTS_WEIGHTED, include_std=False, *args, **kwargs)

Get the k-fold stats for the model with the specified data.

First, this will split the MCT export into k folds. You can do this per document, per annotation, or per document weighted by the number of annotations (see SplitType).

For each of the k folds, it will start from the base model, train it with the other k-1 folds, and record the metrics. After that, the base model state is restored before doing the next fold. After all the folds have been done, the metrics are averaged.

Parameters:
  • cat (CATLike) – The model pack.

  • mct_export_data (MedCATTrainerExport) – The MCT export.

  • k (int) – The number of folds. Defaults to 3.

  • split_type (SplitType) – Whether to split over annotations or documents. Defaults to DOCUMENTS_WEIGHTED.

  • include_std (bool) – Whether to include standard deviation. Defaults to False.

  • *args – Arguments passed to the CAT.train_supervised_raw method.

  • **kwargs – Keyword arguments passed to the CAT.train_supervised_raw method.

Returns:

Tuple – The averaged metrics, potentially with their corresponding standard deviations.

Return type:

Tuple
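The procedure described for get_k_fold_stats (hold one fold out, train on the other k-1, record metrics, return to the base model) can be sketched generically. This is not the library's implementation: it uses a toy model, and it works on a deep copy per fold instead of restoring state on the original object. The names k_fold_loop, train and evaluate are hypothetical.

```python
import copy
from typing import Any, Callable, List, Sequence


def k_fold_loop(
    model: Any,
    folds: Sequence[Any],
    train: Callable[[Any, Any], None],
    evaluate: Callable[[Any, Any], float],
) -> List[float]:
    """Train on k-1 folds, evaluate on the held-out fold, once per fold."""
    results = []
    for held_out_idx, held_out in enumerate(folds):
        # Copy the base model so that training never leaks between folds.
        fold_model = copy.deepcopy(model)
        for idx, fold in enumerate(folds):
            if idx != held_out_idx:
                train(fold_model, fold)
        results.append(evaluate(fold_model, held_out))
    return results


# Toy setup: the "model" remembers items it has seen; the score is the
# fraction of held-out items that it already knows.
def train(model: dict, fold: list) -> None:
    model["seen"].update(fold)


def evaluate(model: dict, fold: list) -> float:
    return sum(item in model["seen"] for item in fold) / len(fold)


base_model = {"seen": set()}
scores = k_fold_loop(base_model, [[1, 2], [2, 3], [4]], train, evaluate)
print(scores)                     # [0.5, 0.5, 0.0]
print(sum(scores) / len(scores))  # mean over folds
```

The base model is left untouched after the loop, which plays the same role as the state restoration described above.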