medcat.utils.data_utils

Module Contents

Classes

MetaAnnotationDS

An abstract class representing a Dataset.

Functions

set_all_seeds(seed)

count_annotations_project(project[, cnt_per_cui])

load_data(data_path[, require_annotations, ...])

Load data.

count_annotations(data_path)

get_doc_from_project(project, doc_id)

get_ann_from_doc(document, start, end)

meta_ann_from_ann(ann, meta_name)

are_anns_same(ann, ann2[, meta_names, ...])

get_same_anns(document, document2[, ...])

print_consolid_stats([ann_stats, meta_names])

check_differences(data_path, cat[, cntx_size, ...])

consolidate_double_annotations(data_path, out_path[, ...])

Consolidate a dataset that was multi-annotated (the same documents annotated twice).

validate_ner_data(data_path, cdb[, cntx_size, ...])

Validate NER data in the exported dataset.

prepare_from_json_hf(data_path, cntx_left, cntx_right, ...)

prepare_from_json_chars(data, cntx_left, cntx_right, ...)

Convert the data from a json format into a CSV-like format for training.

make_mc_train_test(data, cdb[, test_size])

Split the data into train and test sets.

get_false_positives(doc, spacy_doc)

Attributes

logger

medcat.utils.data_utils.logger
medcat.utils.data_utils.set_all_seeds(seed)
Parameters:

seed (int) –

Return type:

None
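
Example (a minimal usage sketch; 42 is an arbitrary seed value):

    from medcat.utils.data_utils import set_all_seeds

    # Fix the random seeds so that subsequent training runs are reproducible.
    set_all_seeds(42)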

medcat.utils.data_utils.count_annotations_project(project, cnt_per_cui=None)
Parameters:

project (Dict) –

Return type:

Tuple[int, Any]

medcat.utils.data_utils.load_data(data_path, require_annotations=True, order_by_num_ann=True)

Load data.

Parameters:
  • data_path (str) – The path to the data to load.

  • require_annotations (bool) – If True, require annotations at the project level: every document in a project needs annotations.

  • order_by_num_ann (bool) – Whether to order by number of annotations. Defaults to True.

Returns:

Dict – The loaded data.

Return type:

Dict
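
Example (a minimal usage sketch; "trainer_export.json" is a hypothetical path to a MedCATtrainer export, and the "projects" key reflects the usual export layout):

    from medcat.utils.data_utils import load_data

    data = load_data("trainer_export.json", require_annotations=True)

    # Assumption: the MedCATtrainer export keeps its documents under a top-level "projects" list.
    print(len(data["projects"]))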

medcat.utils.data_utils.count_annotations(data_path)
Parameters:

data_path (str) –

Return type:

Dict

medcat.utils.data_utils.get_doc_from_project(project, doc_id)
Parameters:
  • project (Dict) –

  • doc_id (str) –

Return type:

Optional[Dict]

medcat.utils.data_utils.get_ann_from_doc(document, start, end)
Parameters:
  • document (Dict) –

  • start (int) –

  • end (int) –

Return type:

Optional[Dict]

medcat.utils.data_utils.meta_ann_from_ann(ann, meta_name)
Parameters:
  • ann (Dict) –

  • meta_name (Union[Dict, List]) –

Return type:

Optional[Dict]

medcat.utils.data_utils.are_anns_same(ann, ann2, meta_names=[], require_double_inner=True)
Parameters:
  • ann (Dict) –

  • ann2 (Dict) –

  • meta_names (List) –

  • require_double_inner (bool) –

Return type:

bool

medcat.utils.data_utils.get_same_anns(document, document2, require_double_inner=True, ann_stats=[], meta_names=[])
Parameters:
  • document (Dict) –

  • document2 (Dict) –

  • require_double_inner (bool) –

  • ann_stats (List) –

  • meta_names (List) –

Return type:

Dict

medcat.utils.data_utils.print_consolid_stats(ann_stats=[], meta_names=[])
Parameters:
  • ann_stats (List) –

  • meta_names (List) –

Return type:

None

medcat.utils.data_utils.check_differences(data_path, cat, cntx_size=30, min_acc=0.2, ignore_already_done=False, only_start=False, only_saved=False)
Parameters:
  • data_path (str) –

  • cat (Any) –

Return type:

None

medcat.utils.data_utils.consolidate_double_annotations(data_path, out_path, require_double=True, require_double_inner=False, meta_anns_to_match=[])

Consolidated a dataset that was multi-annotated (same documents two times).

Parameters:
  • data_path (str) – Output from MedCATtrainer - projects containing the same documents must have the same name.

  • out_path (str) – The consolidated data will be saved here; usually only annotations on which both annotators agree are kept.

  • require_double (bool) –

    If True, everything must be double annotated, meaning there have to be two projects with the same name. If False, projects without double annotations are also included as is, while projects that do have double annotations are still checked.

  • require_double_inner (bool) –

    If False, some entities may be annotated by only one annotator and not the other, while annotations present in both still have to be the same.

  • meta_anns_to_match (List) –

    List of meta annotations that must match for two annotations to be considered the same. If empty, only the mention level is checked.

Returns:

Dict – The consolidated annotations.

Return type:

Dict
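
Example (a minimal usage sketch; the file paths and the "Status" meta annotation name are hypothetical):

    from medcat.utils.data_utils import consolidate_double_annotations

    consolidated = consolidate_double_annotations(
        data_path="double_annotated_export.json",  # MedCATtrainer export with duplicated projects
        out_path="consolidated.json",              # where the agreed-upon annotations are written
        require_double=True,
        meta_anns_to_match=["Status"],             # meta annotations that must also agree
    )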

medcat.utils.data_utils.validate_ner_data(data_path, cdb, cntx_size=70, status_only=False, ignore_if_already_done=False)

Validate NER data in the exported dataset.

Parameters:
  • data_path (str) – The data path.

  • cdb (CDB) – The concept database.

  • cntx_size (int) – The context size. Defaults to 70.

  • status_only (bool) – Whether to only consider status. Defaults to False.

  • ignore_if_already_done (bool) – Whether to ignore if already done. Defaults to False.

Return type:

None
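
Example (a minimal usage sketch; the file paths are hypothetical, and CDB.load is assumed to be the standard way of loading a saved concept database):

    from medcat.cdb import CDB
    from medcat.utils.data_utils import validate_ner_data

    cdb = CDB.load("cdb.dat")
    validate_ner_data("trainer_export.json", cdb, cntx_size=70)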

class medcat.utils.data_utils.MetaAnnotationDS(data, category_map)

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up batched sample loading. This method accepts a list of indices of the samples in a batch and returns the list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

Parameters:
  • data (Dict) –

  • category_map (Dict) –

__init__(data, category_map)

Create MetaAnnotationDS.

Parameters:
  • data (Dict) – Dictionary of data values.

  • category_map (Dict) – Map from category name to id.

__getitem__(idx)
Parameters:

idx (int) –

Return type:

Dict

__len__()
Return type:

int
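
Example (a minimal sketch; the column names and the exact layout of data are assumptions, not verified against the implementation, and the category names are illustrative):

    from medcat.utils.data_utils import MetaAnnotationDS

    # Assumed layout: parallel lists keyed by column name.
    data = {
        "input_ids": [[101, 7592, 102], [101, 2088, 102]],
        "category": ["Affirmed", "Other"],
    }
    category_map = {"Affirmed": 0, "Other": 1}

    ds = MetaAnnotationDS(data, category_map)
    print(len(ds))   # number of samples, via __len__
    sample = ds[0]   # a Dict, via __getitem__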

medcat.utils.data_utils.prepare_from_json_hf(data_path, cntx_left, cntx_right, tokenizer, cui_filter=None, replace_center=None)
Parameters:
  • data_path (str) –

  • cntx_left (int) –

  • cntx_right (int) –

  • tokenizer (Any) –

  • cui_filter (Optional[Dict]) –

  • replace_center (Optional[Dict]) –

Return type:

Dict

medcat.utils.data_utils.prepare_from_json_chars(data, cntx_left, cntx_right, tokenizer, cui_filter=None, replace_center=None)

Convert the data from a json format into a CSV-like format for training.

Parameters:
  • data (Dict) – The json file from MedCAT.

  • cntx_left (int) – The size of the left context.

  • cntx_right (int) – The size of the right context.

  • tokenizer (Any) – An instance of a Hugging Face fast tokenizer (<FastTokenizer>).

  • cui_filter (Optional[Dict], optional) – The CUI filter. Defaults to None.

  • replace_center (Optional[Dict], optional) – If not None the center word (concept) will be replaced with whatever is set. Defaults to None.

Returns:

Dict – {'category_name': [('category_value', 'tokens', 'center_token'), ...], ...}

Return type:

Dict
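
Example (a minimal sketch; the export path is hypothetical and the tokenizer choice is illustrative, assuming any Hugging Face fast tokenizer satisfies the tokenizer parameter described above):

    import json

    from transformers import AutoTokenizer
    from medcat.utils.data_utils import prepare_from_json_chars

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    with open("trainer_export.json") as f:
        data = json.load(f)

    prepared = prepare_from_json_chars(data, cntx_left=70, cntx_right=70, tokenizer=tokenizer)

    # prepared maps each meta annotation category to (category_value, tokens, center_token) tuples.
    for category_name, samples in prepared.items():
        print(category_name, len(samples))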

medcat.utils.data_utils.make_mc_train_test(data, cdb, test_size=0.2)

Split the data into train and test sets.

Parameters:
  • data (Dict) – The data.

  • cdb (CDB) – The concept database.

  • test_size (float) – The test size. Defaults to 0.2.

Returns:

Tuple – The train set, the test set, the test annotations, and the total annotations.

Return type:

Tuple
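
Example (a minimal usage sketch; the file paths are hypothetical):

    from medcat.cdb import CDB
    from medcat.utils.data_utils import load_data, make_mc_train_test

    cdb = CDB.load("cdb.dat")
    data = load_data("trainer_export.json")

    # Returns the train set, the test set, the test annotations and the total annotations.
    train_set, test_set, test_anns, total_anns = make_mc_train_test(data, cdb, test_size=0.2)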

medcat.utils.data_utils.get_false_positives(doc, spacy_doc)
Parameters:
  • doc (Dict) –

  • spacy_doc (spacy.tokens.doc.Doc) –

Return type:

List[spacy.tokens.span.Span]
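
Example (a minimal sketch; the model pack and export paths are hypothetical, and the "projects"/"documents"/"text" keys reflect the usual MedCATtrainer export layout):

    from medcat.cat import CAT
    from medcat.utils.data_utils import load_data, get_false_positives

    cat = CAT.load_model_pack("model_pack.zip")
    data = load_data("trainer_export.json")
    gold_doc = data["projects"][0]["documents"][0]

    # Annotate the raw text with the model, then compare against the gold annotations.
    spacy_doc = cat(gold_doc["text"])
    false_positives = get_false_positives(gold_doc, spacy_doc)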