medcat.stats.kfold
Module Contents
Classes
CDBLike – Base class for protocol classes.
CATLike – Base class for protocol classes.
SplitType – The split type.
FoldCreator – The FoldCreator based on a MCT export.
SimpleFoldCreator – The FoldCreator based on a MCT export.
PerDocsFoldCreator – The FoldCreator based on a MCT export.
PerAnnsFoldCreator – The FoldCreator based on a MCT export.
WeightedDocumentsCreator – The FoldCreator based on a MCT export.
PerCUIMetrics
Functions
get_fold_creator – Get the appropriate fold creator.
get_per_fold_metrics
_merge_examples
_add_helper
_add_weighted_helper
get_metrics_mean – Get the mean of the provided metrics.
get_k_fold_stats – Get the k-fold stats for the model with the specified data.
Attributes
IntValuedMetric
FloatValuedMetric
- class medcat.stats.kfold.CDBLike
Bases:
Protocol
Base class for protocol classes.
Protocol classes are defined as:
    class Proto(Protocol):
        def meth(self) -> int:
            ...
Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:
    class C:
        def meth(self) -> int:
            return 0

    def func(x: Proto) -> int:
        return x.meth()

    func(C())  # Passes static type check
See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:
    class GenProto(Protocol[T]):
        def meth(self) -> T:
            ...
- class medcat.stats.kfold.CATLike
Bases:
Protocol
Base class for protocol classes.
Protocol classes are defined as:
    class Proto(Protocol):
        def meth(self) -> int:
            ...
Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:
    class C:
        def meth(self) -> int:
            return 0

    def func(x: Proto) -> int:
        return x.meth()

    func(C())  # Passes static type check
See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:
    class GenProto(Protocol[T]):
        def meth(self) -> T:
            ...
- train_supervised_raw(data, reset_cui_count=False, nepochs=1, print_stats=0, use_filters=False, terminate_last=False, use_overlaps=False, use_cui_doc_limit=False, test_size=0, devalue_others=False, use_groups=False, never_terminate=False, train_from_false_positives=False, extra_cui_filter=None, retain_extra_cui_filter=False, checkpoint=None, retain_filters=False, is_resumed=False)
- Parameters:
data (Dict[str, List[Dict[str, dict]]]) –
reset_cui_count (bool) –
nepochs (int) –
print_stats (int) –
use_filters (bool) –
terminate_last (bool) –
use_overlaps (bool) –
use_cui_doc_limit (bool) –
test_size (float) –
devalue_others (bool) –
use_groups (bool) –
never_terminate (bool) –
train_from_false_positives (bool) –
extra_cui_filter (Optional[Set]) –
retain_extra_cui_filter (bool) –
checkpoint (Optional[medcat.utils.checkpoint.Checkpoint]) –
retain_filters (bool) –
is_resumed (bool) –
- Return type:
Tuple
- class medcat.stats.kfold.SplitType
Bases:
enum.Enum
The split type.
- DOCUMENTS
Split over number of documents.
- ANNOTATIONS
Split over number of annotations.
- DOCUMENTS_WEIGHTED
Split over documents, weighted by the number of annotations in each. This keeps every document in exactly one fold while distributing documents with different numbers of annotations as evenly as possible. For example:
Suppose we have 6 documents that we want to split into 3 folds. The number of annotations per document is as follows:
[40, 40, 20, 10, 5, 5]
If we were to split this trivially over documents (two consecutive documents per fold), we’d end up with 3 folds whose annotation counts are far from even:
[80, 30, 10]
However, if we use the annotation counts as weights, we can create folds with more evenly distributed annotations, e.g.:
[[D1], [D2], [D3, D4, D5, D6]]
where D# denotes the document number, giving each fold the same number of annotations:
[40, 40, 20 + 10 + 5 + 5 = 40]
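The weighted split in the example above can be reproduced with a simple greedy heuristic. The following is a minimal sketch, not necessarily the algorithm medcat uses: the function name split_weighted is hypothetical, and documents are represented only by their annotation counts.

```python
from typing import List


def split_weighted(weights: List[int], nr_of_folds: int) -> List[List[int]]:
    """Greedily assign document indices to folds, heaviest document first.

    Hypothetical helper for illustration; not part of medcat.
    """
    folds: List[List[int]] = [[] for _ in range(nr_of_folds)]
    totals = [0] * nr_of_folds
    # Visit documents from most to fewest annotations (ties keep input order).
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for doc_idx in order:
        # Put the document into the fold that currently has the fewest annotations.
        target = totals.index(min(totals))
        folds[target].append(doc_idx)
        totals[target] += weights[doc_idx]
    return folds


annotations_per_doc = [40, 40, 20, 10, 5, 5]
folds = split_weighted(annotations_per_doc, 3)
fold_totals = [sum(annotations_per_doc[i] for i in fold) for fold in folds]
print(folds)        # document indices per fold (0-based)
print(fold_totals)  # annotations per fold
```

Each document index appears in exactly one fold, which is the property DOCUMENTS_WEIGHTED is described as preserving.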
- class medcat.stats.kfold.FoldCreator(mct_export, nr_of_folds)
Bases:
abc.ABC
The FoldCreator based on a MCT export.
- Parameters:
mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.
- __init__(mct_export, nr_of_folds)
- Parameters:
mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –
- Return type:
None
- _find_or_add_doc(project, orig_doc)
- Parameters:
project (medcat.stats.mctexport.MedCATTrainerExportProject) –
orig_doc (medcat.stats.mctexport.MedCATTrainerExportDocument) –
- Return type:
- _create_new_project(proj_info)
- Parameters:
proj_info (medcat.stats.mctexport.MedCATTrainerExportProjectInfo) –
- Return type:
- _create_export_with_documents(relevant_docs)
- Parameters:
relevant_docs (Iterable[Tuple[medcat.stats.mctexport.MedCATTrainerExportProjectInfo, medcat.stats.mctexport.MedCATTrainerExportDocument]]) –
- Return type:
- abstract create_folds()
Create folds.
- Raises:
ValueError – If something went wrong.
- Returns:
List[MedCATTrainerExport] – The created folds.
- Return type:
- class medcat.stats.kfold.SimpleFoldCreator(mct_export, nr_of_folds, counter)
Bases:
FoldCreator
The FoldCreator based on a MCT export.
- Parameters:
mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.
counter (Callable[[medcat.stats.mctexport.MedCATTrainerExport], int]) –
- __init__(mct_export, nr_of_folds, counter)
- Parameters:
mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –
counter (Callable[[medcat.stats.mctexport.MedCATTrainerExport], int]) –
- Return type:
None
- _init_per_fold()
- Return type:
List[int]
- abstract _create_fold(fold_nr)
- Parameters:
fold_nr (int) –
- Return type:
- create_folds()
Create folds.
- Raises:
ValueError – If something went wrong.
- Returns:
List[MedCATTrainerExport] – The created folds.
- Return type:
- class medcat.stats.kfold.PerDocsFoldCreator(mct_export, nr_of_folds)
Bases:
FoldCreator
The FoldCreator based on a MCT export.
- Parameters:
mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.
- __init__(mct_export, nr_of_folds)
- Parameters:
mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –
- Return type:
None
- _create_fold(fold_nr)
- Parameters:
fold_nr (int) –
- Return type:
- create_folds()
Create folds.
- Raises:
ValueError – If something went wrong.
- Returns:
List[MedCATTrainerExport] – The created folds.
- Return type:
- class medcat.stats.kfold.PerAnnsFoldCreator(mct_export, nr_of_folds)
Bases:
SimpleFoldCreator
The FoldCreator based on a MCT export.
- Parameters:
mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.
- __init__(mct_export, nr_of_folds)
- Parameters:
mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –
- Return type:
None
- _add_target_ann(project, orig_doc, ann)
- Parameters:
- Return type:
None
- _targets(start_at)
- Parameters:
start_at (int) –
- Return type:
Iterable[Tuple[medcat.stats.mctexport.MedCATTrainerExportProjectInfo, medcat.stats.mctexport.MedCATTrainerExportDocument, medcat.stats.mctexport.MedCATTrainerExportAnnotation]]
- _create_fold(fold_nr)
- Parameters:
fold_nr (int) –
- Return type:
- class medcat.stats.kfold.WeightedDocumentsCreator(mct_export, nr_of_folds, weight_calculator)
Bases:
FoldCreator
The FoldCreator based on a MCT export.
- Parameters:
mct_export (MedCATTrainerExport) – The MCT export dict.
nr_of_folds (int) – Number of folds to create.
use_annotations (bool) – Whether to fold on number of annotations or documents.
weight_calculator (Callable[[medcat.stats.mctexport.MedCATTrainerExportDocument], int]) –
- __init__(mct_export, nr_of_folds, weight_calculator)
- Parameters:
mct_export (medcat.stats.mctexport.MedCATTrainerExport) –
nr_of_folds (int) –
weight_calculator (Callable[[medcat.stats.mctexport.MedCATTrainerExportDocument], int]) –
- Return type:
None
- create_folds()
Create folds.
- Raises:
ValueError – If something went wrong.
- Returns:
List[MedCATTrainerExport] – The created folds.
- Return type:
- medcat.stats.kfold.get_fold_creator(mct_export, nr_of_folds, split_type)
Get the appropriate fold creator.
- Parameters:
mct_export (MedCATTrainerExport) – The MCT export.
nr_of_folds (int) – Number of folds to use.
split_type (SplitType) – The type of split to use.
- Raises:
ValueError – In case of an unknown split type.
- Returns:
FoldCreator – The corresponding fold creator.
- Return type:
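To illustrate the idea of selecting a splitting strategy from an enum value and raising ValueError for anything unknown, here is a small self-contained sketch. It does not use medcat: ToySplitType, split_by_documents, split_by_annotations and get_toy_splitter are hypothetical stand-ins, and the toy document dicts are a simplification, not the MCT export format.

```python
from enum import Enum, auto
from typing import Callable, Dict, List


class ToySplitType(Enum):
    """Hypothetical stand-in for a split-type enum (not the medcat class)."""
    DOCUMENTS = auto()
    ANNOTATIONS = auto()


def split_by_documents(docs: List[Dict], nr_of_folds: int) -> List[List[Dict]]:
    # Round-robin over whole documents: a document lands in exactly one fold.
    folds: List[List[Dict]] = [[] for _ in range(nr_of_folds)]
    for i, doc in enumerate(docs):
        folds[i % nr_of_folds].append(doc)
    return folds


def split_by_annotations(docs: List[Dict], nr_of_folds: int) -> List[List[Dict]]:
    # Round-robin over annotations: one document may be spread over several folds.
    folds: List[List[Dict]] = [[] for _ in range(nr_of_folds)]
    ann_nr = 0
    for doc in docs:
        per_fold: List[list] = [[] for _ in range(nr_of_folds)]
        for ann in doc["annotations"]:
            per_fold[ann_nr % nr_of_folds].append(ann)
            ann_nr += 1
        for fold, anns in zip(folds, per_fold):
            if anns:
                fold.append({"name": doc["name"], "annotations": anns})
    return folds


def get_toy_splitter(split_type: ToySplitType) -> Callable:
    if split_type is ToySplitType.DOCUMENTS:
        return split_by_documents
    if split_type is ToySplitType.ANNOTATIONS:
        return split_by_annotations
    raise ValueError(f"Unknown split type: {split_type}")


toy_docs = [
    {"name": "d1", "annotations": ["a1", "a2", "a3"]},
    {"name": "d2", "annotations": ["a4"]},
]
doc_folds = get_toy_splitter(ToySplitType.DOCUMENTS)(toy_docs, 2)
ann_folds = get_toy_splitter(ToySplitType.ANNOTATIONS)(toy_docs, 2)
```

In this toy version, splitting by annotations places d1 in both folds with disjoint subsets of its annotations, whereas splitting by documents keeps each document whole.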
- medcat.stats.kfold.get_per_fold_metrics(cat, folds, *args, **kwargs)
- Parameters:
cat (CATLike) –
folds (List[medcat.stats.mctexport.MedCATTrainerExport]) –
- Return type:
List[Tuple]
- medcat.stats.kfold._merge_examples(all_examples, cur_examples)
- Parameters:
all_examples (Dict) –
cur_examples (Dict) –
- Return type:
None
- medcat.stats.kfold.IntValuedMetric
- medcat.stats.kfold.FloatValuedMetric
- class medcat.stats.kfold.PerCUIMetrics
Bases:
pydantic.BaseModel
- weights: List[int | float] = []
- vals: List[int | float] = []
- add(val, weight=1)
- Parameters:
weight (int) –
- get_mean()
- get_std()
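The members listed above (weights, vals, add, get_mean, get_std) suggest a weighted accumulator. The sketch below shows one plausible reading, a weighted mean and a weighted population standard deviation. It is an assumption about the behaviour rather than the medcat implementation, and WeightedValues is a hypothetical plain dataclass used in place of the pydantic model.

```python
from dataclasses import dataclass, field
from typing import List, Union

Number = Union[int, float]


@dataclass
class WeightedValues:
    """Hypothetical stand-in for a per-CUI accumulator (not the medcat class)."""
    weights: List[Number] = field(default_factory=list)
    vals: List[Number] = field(default_factory=list)

    def add(self, val: Number, weight: Number = 1) -> None:
        # Store the value together with its weight.
        self.vals.append(val)
        self.weights.append(weight)

    def get_mean(self) -> float:
        # Weighted mean: sum(w * v) / sum(w).
        total = sum(w * v for w, v in zip(self.weights, self.vals))
        return total / sum(self.weights)

    def get_std(self) -> float:
        # Weighted population standard deviation around the weighted mean.
        mean = self.get_mean()
        var = sum(w * (v - mean) ** 2 for w, v in zip(self.weights, self.vals))
        return (var / sum(self.weights)) ** 0.5


acc = WeightedValues()
acc.add(1.0, weight=1)
acc.add(0.5, weight=3)
print(acc.get_mean(), acc.get_std())
```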
- medcat.stats.kfold._add_helper(joined, single)
- Parameters:
joined (List[Dict[str, PerCUIMetrics]]) –
single (List[Dict[str, int]]) –
- Return type:
None
- medcat.stats.kfold._add_weighted_helper(joined, single, cui2count)
- Parameters:
joined (List[Dict[str, PerCUIMetrics]]) –
single (List[Dict[str, float]]) –
cui2count (Dict[str, int]) –
- Return type:
None
- medcat.stats.kfold.get_metrics_mean(metrics, include_std)
Get the mean of the provided metrics.
- Parameters:
metrics (List[Tuple[Dict, Dict, Dict, Dict, Dict, Dict, Dict, Dict]]) – The metrics.
include_std (bool) – Whether to include the standard deviation.
- Returns:
fps (dict) – False positives for each CUI.
fns (dict) – False negatives for each CUI.
tps (dict) – True positives for each CUI.
cui_prec (dict) – Precision for each CUI.
cui_rec (dict) – Recall for each CUI.
cui_f1 (dict) – F1 for each CUI.
cui_counts (dict) – Number of occurrences for each CUI.
examples (dict) – Examples for each of the fp, fn, tp. Format will be examples['fp']['cui'][<list_of_examples>].
- Return type:
Tuple[Dict, Dict, Dict, Dict, Dict, Dict, Dict, Dict]
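As a rough illustration of averaging per-CUI metrics across folds, the sketch below computes a count-weighted mean of a float-valued metric such as precision. Weighting by per-fold CUI counts is an assumption suggested by the cui2count parameter of _add_weighted_helper; the function weighted_mean_per_cui and the sample data are hypothetical and not part of medcat.

```python
from typing import Dict, List


def weighted_mean_per_cui(
    per_fold_values: List[Dict[str, float]],
    per_fold_counts: List[Dict[str, int]],
) -> Dict[str, float]:
    """Average a per-CUI metric over folds, weighting each fold by the CUI count."""
    sums: Dict[str, float] = {}
    weights: Dict[str, int] = {}
    for values, counts in zip(per_fold_values, per_fold_counts):
        for cui, value in values.items():
            w = counts.get(cui, 0)
            sums[cui] = sums.get(cui, 0.0) + w * value
            weights[cui] = weights.get(cui, 0) + w
    # Skip CUIs that never occurred, to avoid dividing by zero.
    return {cui: sums[cui] / weights[cui] for cui in sums if weights[cui] > 0}


# Made-up per-fold precision and occurrence counts for two CUIs.
fold_prec = [{"C1": 1.0, "C2": 0.5}, {"C1": 0.5}]
fold_counts = [{"C1": 1, "C2": 4}, {"C1": 3}]
mean_prec = weighted_mean_per_cui(fold_prec, fold_counts)
print(mean_prec)
```

A CUI that is missing from a fold simply contributes nothing for that fold, so its mean is taken over the folds where it appears.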
- medcat.stats.kfold.get_k_fold_stats(cat, mct_export_data, k=3, split_type=SplitType.DOCUMENTS_WEIGHTED, include_std=False, *args, **kwargs)
Get the k-fold stats for the model with the specified data.
First this will split the MCT export into k folds. You can do this per document, per annotation, or per document weighted by annotation count (see SplitType).
For each of the k folds, it will start from the base model, train it with the other k-1 folds and record the metrics. After that, the base model state is restored before doing the next fold. After all the folds have been done, the metrics are averaged.
- Parameters:
cat (CATLike) – The model pack.
mct_export_data (MedCATTrainerExport) – The MCT export.
k (int) – The number of folds. Defaults to 3.
split_type (SplitType) – Whether to use annotations or docs. Defaults to DOCUMENTS_WEIGHTED.
include_std (bool) – Whether to include standard deviation. Defaults to False.
*args – Arguments passed to the CAT.train_supervised_raw method.
**kwargs – Keyword arguments passed to the CAT.train_supervised_raw method.
- Returns:
Tuple – The averaged metrics. Potentially with their corresponding standard deviations.
- Return type:
Tuple
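The procedure described above (train on k-1 folds, record metrics, reset, average) can be sketched generically. This is a simplified illustration with a toy model, not the medcat implementation. It assumes the metrics are recorded on the held-out fold, and it trains a deep copy of the base model for each fold instead of restoring state afterwards. The names k_fold_scores, MeanModel, train and evaluate are hypothetical.

```python
import copy
from typing import Callable, List


def k_fold_scores(model, folds: List[list], train: Callable, evaluate: Callable) -> List[float]:
    """Train on all folds but one, score on the held-out fold, repeat for each fold."""
    scores = []
    for held_out_idx, held_out in enumerate(folds):
        # Work on a copy so the base model stays untouched for the next fold.
        fold_model = copy.deepcopy(model)
        training_data = [
            item
            for i, fold in enumerate(folds)
            if i != held_out_idx
            for item in fold
        ]
        train(fold_model, training_data)
        scores.append(evaluate(fold_model, held_out))
    return scores


class MeanModel:
    """Toy model that predicts the mean of its training data."""
    def __init__(self) -> None:
        self.mean = 0.0


def train(model: MeanModel, data: List[float]) -> None:
    model.mean = sum(data) / len(data)


def evaluate(model: MeanModel, data: List[float]) -> float:
    # Mean absolute error of the constant prediction.
    return sum(abs(x - model.mean) for x in data) / len(data)


base = MeanModel()
toy_folds = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = k_fold_scores(base, toy_folds, train, evaluate)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

Copying the model is the simplest way to guarantee that every fold starts from the same state; restoring a saved state, as the description above states, achieves the same effect without keeping several copies of the model in memory.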