`medcat.stats.kfold`

Module Contents

Classes

`CDBLike`	Base class for protocol classes.
`CATLike`	Base class for protocol classes.
`SplitType`	The split type.
`FoldCreator`	The FoldCreator based on a MCT export.
`SimpleFoldCreator`	The FoldCreator based on a MCT export.
`PerDocsFoldCreator`	The FoldCreator based on a MCT export.
`PerAnnsFoldCreator`	The FoldCreator based on a MCT export.
`WeightedDocumentsCreator`	The FoldCreator based on a MCT export.
`PerCUIMetrics`

Functions

`get_fold_creator`(mct_export, nr_of_folds, split_type)	Get the appropriate fold creator.
`get_per_fold_metrics`(cat, folds, *args, **kwargs)
`_merge_examples`(all_examples, cur_examples)
`_add_helper`(joined, single)
`_add_weighted_helper`(joined, single, cui2count)
`get_metrics_mean`(metrics, include_std)	Get the mean of the provided metrics.
`get_k_fold_stats`(cat, mct_export_data[, k, ...])	Get the k-fold stats for the model with the specified data.

Attributes

`IntValuedMetric`
`FloatValuedMetric`

class medcat.stats.kfold.CDBLike

Bases: Protocol

Base class for protocol classes.

Protocol classes are defined as:

class Proto(Protocol):
    def meth(self) -> int:
        ...

Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:

class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check

See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can also be generic; they are defined as:

class GenProto(Protocol[T]):
    def meth(self) -> T:
        ...
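The runtime behaviour described above can be illustrated with a minimal sketch. The names `HasMeth`, `Good`, `WrongSignature` and `Missing` are hypothetical and exist only for this example; they are not part of medcat.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasMeth(Protocol):
    # Hypothetical protocol, used only for this illustration.
    def meth(self) -> int:
        ...


class Good:
    def meth(self) -> int:
        return 0


class WrongSignature:
    # The return type differs from the protocol, but the runtime check
    # only looks at whether an attribute called `meth` is present.
    def meth(self) -> str:
        return "zero"


class Missing:
    pass


results = [isinstance(obj, HasMeth) for obj in (Good(), WrongSignature(), Missing())]
print(results)  # [True, True, False]
```

A static type checker would reject `WrongSignature` where `HasMeth` is expected, which is why the runtime check is best treated as a coarse sanity check only.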

class medcat.stats.kfold.CATLike

Bases: Protocol

Base class for protocol classes.

property cdb: CDBLike

Return type: CDBLike

train_supervised_raw(data, reset_cui_count=False, nepochs=1, print_stats=0, use_filters=False, terminate_last=False, use_overlaps=False, use_cui_doc_limit=False, test_size=0, devalue_others=False, use_groups=False, never_terminate=False, train_from_false_positives=False, extra_cui_filter=None, retain_extra_cui_filter=False, checkpoint=None, retain_filters=False, is_resumed=False)

Parameters:

data (Dict[str, List[Dict[str, dict]]]) –
reset_cui_count (bool) –
nepochs (int) –
print_stats (int) –
use_filters (bool) –
terminate_last (bool) –
use_overlaps (bool) –
use_cui_doc_limit (bool) –
test_size (float) –
devalue_others (bool) –
use_groups (bool) –
never_terminate (bool) –
train_from_false_positives (bool) –
extra_cui_filter (Optional[Set]) –
retain_extra_cui_filter (bool) –
checkpoint (Optional[medcat.utils.checkpoint.Checkpoint]) –
retain_filters (bool) –
is_resumed (bool) –

Return type:

Tuple

class medcat.stats.kfold.SplitType

Bases: enum.Enum

The split type.

DOCUMENTS: Split over number of documents.

ANNOTATIONS: Split over number of annotations.

DOCUMENTS_WEIGHTED: Split over number of documents based on the number of annotations. So essentially this ensures that the same document isn’t in 2 folds while trying to more equally distribute documents with different number of annotations. For example:

If we have 6 documents that we want to split into 3 folds. The number of annotations per document are as follows:

[40, 40, 20, 10, 5, 5]

If we were to split this trivially over documents (two consecutive documents per fold), we would end up with 3 folds whose annotation counts are far from even:

[80, 30, 10]

However, if we use the annotation counts as weights, we can create folds with more evenly distributed annotations, e.g.:

[[D1], [D2], [D3, D4, D5, D6]]

where D# denotes the document number. Each fold then contains the same number of annotations:

[40, 40, 20 + 10 + 5 + 5 = 40]
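The example above can be reproduced with a short sketch. It compares a trivial contiguous split with a greedy weighted assignment (heaviest document first, always into the currently lightest fold). The greedy strategy and the function names are assumptions made for illustration; the source does not state which algorithm `WeightedDocumentsCreator` actually uses.

```python
from typing import List


def contiguous_split(weights: List[int], nr_of_folds: int) -> List[List[int]]:
    """Split document indices into equally sized contiguous chunks."""
    per_fold = len(weights) // nr_of_folds
    folds = [list(range(i * per_fold, (i + 1) * per_fold)) for i in range(nr_of_folds)]
    # Any leftover documents go to the last fold.
    folds[-1].extend(range(nr_of_folds * per_fold, len(weights)))
    return folds


def greedy_weighted_split(weights: List[int], nr_of_folds: int) -> List[List[int]]:
    """Assign the heaviest documents first, each to the currently lightest fold."""
    folds: List[List[int]] = [[] for _ in range(nr_of_folds)]
    totals = [0] * nr_of_folds
    for doc in sorted(range(len(weights)), key=lambda i: -weights[i]):
        lightest = totals.index(min(totals))  # first fold with the smallest total
        folds[lightest].append(doc)
        totals[lightest] += weights[doc]
    return folds


def fold_totals(weights: List[int], folds: List[List[int]]) -> List[int]:
    """Sum the annotation counts of the documents in each fold."""
    return [sum(weights[i] for i in fold) for fold in folds]


annotations_per_doc = [40, 40, 20, 10, 5, 5]
naive = contiguous_split(annotations_per_doc, 3)
weighted = greedy_weighted_split(annotations_per_doc, 3)
print(fold_totals(annotations_per_doc, naive))     # [80, 30, 10]
print(fold_totals(annotations_per_doc, weighted))  # [40, 40, 40]
```

In both cases each document index appears in exactly one fold; only the balance of annotation counts differs.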

class medcat.stats.kfold.FoldCreator(mct_export, nr_of_folds)

Bases: abc.ABC

The FoldCreator based on a MCT export.

Parameters:

mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.

__init__(mct_export, nr_of_folds)

Parameters:

mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –

Return type:

None

_find_or_add_doc(project, orig_doc)

Parameters:

project (medcat.stats.mctexport.MedCATTrainerExportProject) –
orig_doc (medcat.stats.mctexport.MedCATTrainerExportDocument) –

Return type:

medcat.stats.mctexport.MedCATTrainerExportDocument

_create_new_project(proj_info)

Parameters: proj_info (medcat.stats.mctexport.MedCATTrainerExportProjectInfo) –
Return type: medcat.stats.mctexport.MedCATTrainerExportProject

_create_export_with_documents(relevant_docs)

Parameters: relevant_docs (Iterable[Tuple[medcat.stats.mctexport.MedCATTrainerExportProjectInfo, medcat.stats.mctexport.MedCATTrainerExportDocument]]) –
Return type: medcat.stats.mctexport.MedCATTrainerExport

abstract create_folds()

Create folds.

Raises: ValueError – If something went wrong.
Returns: List[MedCATTrainerExport] – The created folds.
Return type: List[medcat.stats.mctexport.MedCATTrainerExport]

class medcat.stats.kfold.SimpleFoldCreator(mct_export, nr_of_folds, counter)

Bases: FoldCreator

The FoldCreator based on a MCT export.

Parameters:

mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.
counter (Callable[[medcat.stats.mctexport.MedCATTrainerExport], int]) –

__init__(mct_export, nr_of_folds, counter)

Parameters:

mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –
counter (Callable[[medcat.stats.mctexport.MedCATTrainerExport], int]) –

Return type:

None

_init_per_fold()

Return type: List[int]

abstract _create_fold(fold_nr)

Parameters: fold_nr (int) –
Return type: medcat.stats.mctexport.MedCATTrainerExport

create_folds()

Create folds.

Raises: ValueError – If something went wrong.
Returns: List[MedCATTrainerExport] – The created folds.
Return type: List[medcat.stats.mctexport.MedCATTrainerExport]
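The `counter` argument is a callable that maps an export to an integer. As a rough illustration, the sketch below counts documents and annotations in a toy export. The nested "projects" / "documents" / "annotations" layout is assumed from the MedCATtrainer export format, and `count_documents` / `count_annotations` are hypothetical helper names, not functions documented here.

```python
from typing import Any, Dict, List

# Toy export; the key names are an assumption about the MCT export layout.
toy_export: Dict[str, List[Dict[str, Any]]] = {
    "projects": [
        {"name": "P1", "documents": [
            {"name": "D1", "annotations": [{"cui": "C001"}, {"cui": "C002"}]},
            {"name": "D2", "annotations": [{"cui": "C001"}, {"cui": "C003"}]},
        ]},
        {"name": "P2", "documents": [
            {"name": "D3", "annotations": []},
        ]},
    ]
}


def count_documents(export: Dict[str, List[Dict[str, Any]]]) -> int:
    """Count documents across all projects (hypothetical counter)."""
    return sum(len(project["documents"]) for project in export["projects"])


def count_annotations(export: Dict[str, List[Dict[str, Any]]]) -> int:
    """Count annotations across all documents (hypothetical counter)."""
    return sum(
        len(doc["annotations"])
        for project in export["projects"]
        for doc in project["documents"]
    )


print(count_documents(toy_export), count_annotations(toy_export))  # 3 4
```

Either function has the `Callable[[MedCATTrainerExport], int]` shape that the `counter` parameter expects.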

class medcat.stats.kfold.PerDocsFoldCreator(mct_export, nr_of_folds)

Bases: FoldCreator

The FoldCreator based on a MCT export.

Parameters:

mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.

__init__(mct_export, nr_of_folds)

Parameters:

mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –

Return type:

None

_create_fold(fold_nr)

Parameters: fold_nr (int) –
Return type: medcat.stats.mctexport.MedCATTrainerExport

create_folds()

Create folds.

Raises: ValueError – If something went wrong.
Returns: List[MedCATTrainerExport] – The created folds.
Return type: List[medcat.stats.mctexport.MedCATTrainerExport]

class medcat.stats.kfold.PerAnnsFoldCreator(mct_export, nr_of_folds)

Bases: SimpleFoldCreator

The FoldCreator based on a MCT export.

Parameters:

mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.

__init__(mct_export, nr_of_folds)

Parameters:

mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –

Return type:

None

_add_target_ann(project, orig_doc, ann)

Parameters:

project (medcat.stats.mctexport.MedCATTrainerExportProject) –
orig_doc (medcat.stats.mctexport.MedCATTrainerExportDocument) –
ann (medcat.stats.mctexport.MedCATTrainerExportAnnotation) –

Return type:

None

_targets(start_at)

Parameters: start_at (int) –
Return type: Iterable[Tuple[medcat.stats.mctexport.MedCATTrainerExportProjectInfo, medcat.stats.mctexport.MedCATTrainerExportDocument, medcat.stats.mctexport.MedCATTrainerExportAnnotation]]

_create_fold(fold_nr)

Parameters: fold_nr (int) –
Return type: medcat.stats.mctexport.MedCATTrainerExport

class medcat.stats.kfold.WeightedDocumentsCreator(mct_export, nr_of_folds, weight_calculator)

Bases: FoldCreator

The FoldCreator based on a MCT export.

Parameters:

mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.
weight_calculator (Callable[[medcat.stats.mctexport.MedCATTrainerExportDocument], int]) –

__init__(mct_export, nr_of_folds, weight_calculator)

Parameters:

mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –
weight_calculator (Callable[[medcat.stats.mctexport.MedCATTrainerExportDocument], int]) –

Return type:

None

create_folds()

Create folds.

Raises: ValueError – If something went wrong.
Returns: List[MedCATTrainerExport] – The created folds.
Return type: List[medcat.stats.mctexport.MedCATTrainerExport]

medcat.stats.kfold.get_fold_creator(mct_export, nr_of_folds, split_type)

Get the appropriate fold creator.

Parameters:

mct_export (MedCATTrainerExport) – The MCT export.
nr_of_folds (int) – Number of folds to use.
split_type (SplitType) – The type of split to use.

Raises:

ValueError – In case of an unknown split type.

Returns:

FoldCreator – The corresponding fold creator.

Return type:

FoldCreator

medcat.stats.kfold.get_per_fold_metrics(cat, folds, *args, **kwargs)

Parameters:

cat (CATLike) –
folds (List[medcat.stats.mctexport.MedCATTrainerExport]) –

Return type:

List[Tuple]

medcat.stats.kfold._merge_examples(all_examples, cur_examples)

Parameters:

all_examples (Dict) –
cur_examples (Dict) –

Return type:

None

medcat.stats.kfold.IntValuedMetric

medcat.stats.kfold.FloatValuedMetric

class medcat.stats.kfold.PerCUIMetrics

Bases: pydantic.BaseModel

weights: List[int | float] = []

vals: List[int | float] = []

add(val, weight=1)

Parameters: weight (int) –

get_mean()

get_std()
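
The page lists only the fields and method names of `PerCUIMetrics`, so the following is a sketch of one plausible behaviour rather than the actual implementation: a weighted mean and a weighted population standard deviation. `WeightedValues` is a hypothetical stand-in written as a plain dataclass (no pydantic), and the formulas are assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import List, Union

Number = Union[int, float]


@dataclass
class WeightedValues:
    """Hypothetical stand-in for PerCUIMetrics; formulas are assumed."""
    weights: List[Number] = field(default_factory=list)
    vals: List[Number] = field(default_factory=list)

    def add(self, val: Number, weight: int = 1) -> None:
        self.vals.append(val)
        self.weights.append(weight)

    def get_mean(self) -> float:
        # Weighted arithmetic mean: sum(w * v) / sum(w).
        return sum(w * v for w, v in zip(self.weights, self.vals)) / sum(self.weights)

    def get_std(self) -> float:
        # Weighted population standard deviation around the weighted mean.
        mean = self.get_mean()
        var = sum(w * (v - mean) ** 2 for w, v in zip(self.weights, self.vals)) / sum(self.weights)
        return math.sqrt(var)


recall = WeightedValues()
recall.add(0.5, weight=10)  # e.g. a fold with 10 occurrences of the CUI
recall.add(1.0, weight=30)  # e.g. a fold with 30 occurrences of the CUI
print(recall.get_mean())  # 0.875
print(recall.get_std())
```

Weighting by occurrence count means that a fold in which a CUI is rare contributes less to the averaged value than a fold in which it is common.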

medcat.stats.kfold._add_helper(joined, single)

Parameters:

joined (List[Dict[str, PerCUIMetrics]]) –
single (List[Dict[str, int]]) –

Return type:

None

medcat.stats.kfold._add_weighted_helper(joined, single, cui2count)

Parameters:

joined (List[Dict[str, PerCUIMetrics]]) –
single (List[Dict[str, float]]) –
cui2count (Dict[str, int]) –

Return type:

None

medcat.stats.kfold.get_metrics_mean(metrics, include_std)

Get the mean of the provided metrics.

Parameters:

metrics (List[Tuple[Dict, Dict, Dict, Dict, Dict, Dict, Dict, Dict]]) – The per-fold metrics.
include_std (bool) – Whether to include the standard deviation.

Returns:

fps (dict) – False positives for each CUI.
fns (dict) – False negatives for each CUI.
tps (dict) – True positives for each CUI.
cui_prec (dict) – Precision for each CUI.
cui_rec (dict) – Recall for each CUI.
cui_f1 (dict) – F1 for each CUI.
cui_counts (dict) – Number of occurrences for each CUI.
examples (dict) – Examples for each of the fp, fn, tp. Format will be examples['fp']['cui'][<list_of_examples>].

Return type:

Tuple[Dict, Dict, Dict, Dict, Dict, Dict, Dict, Dict]
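
As a rough illustration of averaging one per-CUI metric across folds, the sketch below collects the values for each CUI and takes their plain mean, optionally paired with the population standard deviation. The function name `average_per_cui`, the CUI identifiers, and the choice to skip folds in which a CUI does not occur are all assumptions; the real function may weight or handle missing values differently.

```python
from statistics import mean, pstdev
from typing import Dict, List


def average_per_cui(per_fold: List[Dict[str, float]], include_std: bool = False) -> dict:
    """Average one per-CUI metric over folds (hypothetical helper).

    A CUI that is absent from a fold simply contributes no value for that fold.
    """
    collected: Dict[str, List[float]] = {}
    for fold_metrics in per_fold:
        for cui, value in fold_metrics.items():
            collected.setdefault(cui, []).append(value)
    if include_std:
        return {cui: (mean(vals), pstdev(vals)) for cui, vals in collected.items()}
    return {cui: mean(vals) for cui, vals in collected.items()}


# False-positive counts per CUI for three folds (made-up numbers).
fps_per_fold = [{"C001": 2, "C002": 0}, {"C001": 4}, {"C001": 3, "C002": 2}]
print(average_per_cui(fps_per_fold))
with_std = average_per_cui(fps_per_fold, include_std=True)
print(with_std["C002"])
```

With `include_std=True` each value becomes a `(mean, std)` pair, which mirrors the idea of optionally returning the standard deviation alongside the mean.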

medcat.stats.kfold.get_k_fold_stats(cat, mct_export_data, k=3, split_type=SplitType.DOCUMENTS_WEIGHTED, include_std=False, *args, **kwargs)

Get the k-fold stats for the model with the specified data.

First, the MCT export is split into k folds. The split can be done per document, per annotation, or per document weighted by annotation count (see SplitType).

For each of the k folds, it starts from the base model, trains it with the other k-1 folds, and records the metrics. The base model state is then restored before the next fold. Once all folds are done, the metrics are averaged.

Parameters:

cat (CATLike) – The model pack.
mct_export_data (MedCATTrainerExport) – The MCT export.
k (int) – The number of folds. Defaults to 3.
split_type (SplitType) – Whether to split over annotations or documents. Defaults to DOCUMENTS_WEIGHTED.
include_std (bool) – Whether to include the standard deviation. Defaults to False.
*args – Arguments passed to the CAT.train_supervised_raw method.
**kwargs – Keyword arguments passed to the CAT.train_supervised_raw method.

Returns:

Tuple – The averaged metrics, potentially with their corresponding standard deviations.

Return type:

Tuple
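
The train, record, restore loop described above can be sketched with a toy model. `ToyModel` and `k_fold_scores` are hypothetical names, the "metric" is a simple accuracy, and evaluating on the held-out fold is an assumption based on standard k-fold practice; none of this is the medcat implementation.

```python
import copy
from typing import List


class ToyModel:
    """Hypothetical model: 'training' just memorises the items it has seen."""

    def __init__(self) -> None:
        self.known: set = set()

    def train(self, items: List[str]) -> None:
        self.known.update(items)

    def accuracy(self, items: List[str]) -> float:
        return sum(item in self.known for item in items) / len(items)


def k_fold_scores(model: ToyModel, folds: List[List[str]]) -> List[float]:
    """Train on k-1 folds, score on the held-out fold, then restore the base state."""
    base_state = copy.deepcopy(model.__dict__)  # snapshot of the base model
    scores = []
    for held_out_idx, held_out in enumerate(folds):
        try:
            for idx, fold in enumerate(folds):
                if idx != held_out_idx:
                    model.train(fold)
            scores.append(model.accuracy(held_out))
        finally:
            # Restore the base model before the next fold, even on error.
            model.__dict__ = copy.deepcopy(base_state)
    return scores


folds = [["a", "b"], ["b", "c"], ["c", "d"]]
model = ToyModel()
scores = k_fold_scores(model, folds)
print(scores)                     # [0.5, 1.0, 0.5]
print(sum(scores) / len(scores))  # averaged metric over the folds
print(model.known)                # set(): the base model is unchanged
```

Restoring inside `finally` keeps the base model untouched even if training or scoring raises part-way through a fold.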
