medcat.utils.data_utils

Module Contents

Classes

MetaAnnotationDS

An abstract class representing a Dataset.

Functions

set_all_seeds(seed)

count_annotations_project(project[, cnt_per_cui])

load_data(data_path[, require_annotations, ...])

Load data, optionally requiring annotations at the project level.

count_annotations(data_path)

get_doc_from_project(project, doc_id)

get_ann_from_doc(document, start, end)

meta_ann_from_ann(ann, meta_name)

are_anns_same(ann, ann2[, meta_names, ...])

get_same_anns(document, document2[, ...])

print_consolid_stats([ann_stats, meta_names])

check_differences(data_path, cat[, cntx_size, ...])

consolidate_double_annotations(data_path, out_path[, ...])

Consolidate a dataset that was multi-annotated (the same documents annotated twice).

validate_ner_data(data_path, cdb[, cntx_size, ...])

Please just ignore this function, I'm afraid to even look at it.

prepare_from_json_hf(data_path, cntx_left, cntx_right, ...)

prepare_from_json_chars(data, cntx_left, cntx_right, ...)

Convert the data from a json format into a CSV-like format for training.

make_mc_train_test(data, cdb[, test_size])

This is a disaster.

get_false_positives(doc, spacy_doc)

Attributes

logger

medcat.utils.data_utils.logger
medcat.utils.data_utils.set_all_seeds(seed)
Parameters:

seed (int) –

Return type:

None
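The reference gives no description for this function. As a rough illustration of what a seeding helper of this kind typically does, the sketch below seeds every random number generator it can find. It is an assumption about the general technique, not the body of set_all_seeds: the name seed_everything is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the random number generators that are available."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency, seeded only if present
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency, seeded only if present
        torch.manual_seed(seed)
    except ImportError:
        pass


# Seeding twice with the same value reproduces the same draw.
seed_everything(13)
first = random.random()
seed_everything(13)
second = random.random()
```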

medcat.utils.data_utils.count_annotations_project(project, cnt_per_cui=None)
Parameters:

project (Dict) –

Return type:

Tuple[int, Any]
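The signature suggests the function returns a total count together with a per-CUI tally. A minimal sketch of that idea follows, assuming a project is a dict with a "documents" list whose items carry an "annotations" list with a "cui" key. These field names and the helper name count_project_annotations are assumptions, and any filtering the real function applies (for example by validation status) is not modelled here.

```python
from typing import Dict, Optional, Tuple


def count_project_annotations(
    project: Dict, cnt_per_cui: Optional[Dict] = None
) -> Tuple[int, Dict]:
    """Hypothetical helper: count annotations in one project, in total and per CUI."""
    if cnt_per_cui is None:
        cnt_per_cui = {}
    total = 0
    for document in project.get("documents", []):
        for ann in document.get("annotations", []):
            total += 1
            cnt_per_cui[ann["cui"]] = cnt_per_cui.get(ann["cui"], 0) + 1
    return total, cnt_per_cui


# Toy project in the assumed layout.
project = {
    "documents": [
        {"annotations": [{"cui": "C1"}, {"cui": "C2"}]},
        {"annotations": [{"cui": "C1"}]},
    ]
}
total, per_cui = count_project_annotations(project)
```

Passing an existing cnt_per_cui dict lets the tally accumulate across several projects.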

medcat.utils.data_utils.load_data(data_path, require_annotations=True, order_by_num_ann=True)

Args:

require_annotations:

If True, annotations are required at the project level: a project is kept only if at least one of its documents has annotations.

Parameters:
  • data_path (str) –

  • require_annotations (bool) –

  • order_by_num_ann (bool) –

Return type:

Dict
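The project-level filter and the ordering option can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the data is assumed to be a dict with a "projects" list, each project with a "documents" list, each document with an "annotations" list. The helper names are hypothetical, and sorting the most-annotated project first is an assumed reading of order_by_num_ann.

```python
from typing import Dict


def keep_annotated_projects(data: Dict) -> Dict:
    """Keep projects in which at least one document has annotations."""
    kept = [
        project
        for project in data["projects"]
        if any(doc.get("annotations") for doc in project["documents"])
    ]
    return {"projects": kept}


def order_by_annotation_count(data: Dict) -> Dict:
    """Sort projects by their total number of annotations, largest first (assumed)."""

    def n_anns(project: Dict) -> int:
        return sum(len(doc.get("annotations", [])) for doc in project["documents"])

    return {"projects": sorted(data["projects"], key=n_anns, reverse=True)}


# Toy data in the assumed layout.
data = {
    "projects": [
        {"name": "a", "documents": [{"annotations": []}]},
        {"name": "b", "documents": [{"annotations": [{"cui": "C1"}]}]},
        {
            "name": "c",
            "documents": [
                {"annotations": [{"cui": "C1"}, {"cui": "C2"}]},
                {"annotations": []},
            ],
        },
    ]
}
result = order_by_annotation_count(keep_annotated_projects(data))
names = [project["name"] for project in result["projects"]]
```

Project "c" survives the filter even though one of its documents is empty, because the check is made per project rather than per document.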

medcat.utils.data_utils.count_annotations(data_path)
Parameters:

data_path (str) –

Return type:

Dict

medcat.utils.data_utils.get_doc_from_project(project, doc_id)
Parameters:
  • project (Dict) –

  • doc_id (str) –

Return type:

Optional[Dict]

medcat.utils.data_utils.get_ann_from_doc(document, start, end)
Parameters:
  • document (Dict) –

  • start (int) –

  • end (int) –

Return type:

Optional[Dict]
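Both lookups above return Optional[Dict], which suggests a linear search that yields None when nothing matches. A minimal sketch of that pattern follows. The helper names find_document and find_annotation are hypothetical, as are the keys "id", "documents", "annotations", "start" and "end" that the search relies on.

```python
from typing import Dict, Optional


def find_document(project: Dict, doc_id: str) -> Optional[Dict]:
    """Return the document whose assumed "id" field equals doc_id, else None."""
    for document in project["documents"]:
        if document["id"] == doc_id:
            return document
    return None


def find_annotation(document: Dict, start: int, end: int) -> Optional[Dict]:
    """Return the annotation covering exactly [start, end), else None."""
    for ann in document["annotations"]:
        if ann["start"] == start and ann["end"] == end:
            return ann
    return None


# Toy project in the assumed layout.
project = {
    "documents": [
        {"id": "12", "annotations": [{"start": 0, "end": 5, "cui": "C1"}]},
        {"id": "13", "annotations": []},
    ]
}
doc = find_document(project, "12")
ann = find_annotation(doc, 0, 5)
missing = find_annotation(doc, 0, 4)
```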

medcat.utils.data_utils.meta_ann_from_ann(ann, meta_name)
Parameters:
  • ann (Dict) –

  • meta_name (Union[Dict, List]) –

Return type:

Optional[Dict]

medcat.utils.data_utils.are_anns_same(ann, ann2, meta_names=[], require_double_inner=True)
Parameters:
  • ann (Dict) –

  • ann2 (Dict) –

  • meta_names (List) –

  • require_double_inner (bool) –

Return type:

bool

medcat.utils.data_utils.get_same_anns(document, document2, require_double_inner=True, ann_stats=[], meta_names=[])
Parameters:
  • document (Dict) –

  • document2 (Dict) –

  • require_double_inner (bool) –

  • ann_stats (List) –

  • meta_names (List) –

Return type:

Dict

medcat.utils.data_utils.print_consolid_stats(ann_stats=[], meta_names=[])
Parameters:
  • ann_stats (List) –

  • meta_names (List) –

Return type:

None

medcat.utils.data_utils.check_differences(data_path, cat, cntx_size=30, min_acc=0.2, ignore_already_done=False, only_start=False, only_saved=False)
Parameters:
  • data_path (str) –

  • cat (Any) –

Return type:

None

medcat.utils.data_utils.consolidate_double_annotations(data_path, out_path, require_double=True, require_double_inner=False, meta_anns_to_match=[])

Consolidate a dataset that was multi-annotated (the same documents annotated twice).

data_path:

Output from MedCATtrainer. Projects containing the same documents must have the same name.

out_path:

The consolidated data will be saved here. Usually this contains only annotations on which both annotators agree.

require_double (boolean):

If True, everything must be double annotated, meaning there must be two projects of the same name for each name. If False, projects that do not have double annotations are included as is, and projects that do are still checked.

require_double_inner (boolean):

If False, some entities may be annotated by only one annotator and not the other, while annotations that exist in both must still be the same.

meta_anns_to_match (list):

List of meta annotations that must match for two annotations to be considered the same. If empty, only the mention level is checked.

Parameters:
  • data_path (str) –

  • out_path (str) –

  • require_double (bool) –

  • require_double_inner (bool) –

  • meta_anns_to_match (List) –

Return type:

Dict
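The agreement rule described for require_double_inner and meta_anns_to_match can be sketched for a single pair of documents. This is a simplified illustration, not the library's implementation. It assumes that annotations are dicts with "start", "end" and "cui" keys, that meta annotations are stored as a plain name-to-value dict under "meta_anns", and that "mention level" means same span and same CUI. The helper names are hypothetical.

```python
from typing import Dict, List, Sequence


def same_annotation(ann: Dict, ann2: Dict, meta_names: Sequence[str]) -> bool:
    """Two annotations agree if the CUI and every listed meta annotation match."""
    if ann["cui"] != ann2["cui"]:
        return False
    meta, meta2 = ann.get("meta_anns", {}), ann2.get("meta_anns", {})
    return all(meta.get(name) == meta2.get(name) for name in meta_names)


def consolidate_documents(
    anns_a: List[Dict],
    anns_b: List[Dict],
    require_double_inner: bool = False,
    meta_names: Sequence[str] = (),
) -> List[Dict]:
    """Merge two annotators' annotations for one document."""
    by_span_a = {(a["start"], a["end"]): a for a in anns_a}
    by_span_b = {(b["start"], b["end"]): b for b in anns_b}
    kept = []
    for span in sorted(set(by_span_a) | set(by_span_b)):
        a, b = by_span_a.get(span), by_span_b.get(span)
        if a is not None and b is not None:
            # Annotated by both: keep only on agreement.
            if same_annotation(a, b, meta_names):
                kept.append(a)
        elif not require_double_inner:
            # Annotated by one annotator only: allowed in the lenient mode.
            kept.append(a if a is not None else b)
    return kept


anns_a = [
    {"start": 0, "end": 5, "cui": "C1", "meta_anns": {"Status": "Affirmed"}},
    {"start": 10, "end": 15, "cui": "C2"},
    {"start": 20, "end": 25, "cui": "C3"},
]
anns_b = [
    {"start": 0, "end": 5, "cui": "C1", "meta_anns": {"Status": "Affirmed"}},
    {"start": 10, "end": 15, "cui": "C9"},
]
lenient = [a["cui"] for a in consolidate_documents(anns_a, anns_b, False, ["Status"])]
strict = [a["cui"] for a in consolidate_documents(anns_a, anns_b, True, ["Status"])]
```

In the lenient mode the span annotated by only one annotator (C3) is kept, while the span on which the annotators disagree (C2 versus C9) is dropped in both modes.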

medcat.utils.data_utils.validate_ner_data(data_path, cdb, cntx_size=70, status_only=False, ignore_if_already_done=False)

Please just ignore this function, I’m afraid to even look at it.

Parameters:
  • data_path (str) –

  • cdb (medcat.cdb.CDB) –

  • cntx_size (int) –

  • status_only (bool) –

  • ignore_if_already_done (bool) –

Return type:

None

class medcat.utils.data_utils.MetaAnnotationDS(data, category_map)

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should override __getitem__(), which fetches a data sample for a given key. Subclasses may also override __len__(), which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset. Subclasses may also implement __getitems__() to speed up loading of batched samples; this method accepts a list of sample indices for a batch and returns a list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

Parameters:
  • data (Dict) –

  • category_map (Dict) –

__init__(data, category_map)

Args: data:

Dictionary of data values.

category_map:

Map from category name to id.

Parameters:
  • data (Dict) –

  • category_map (Dict) –

__getitem__(idx)
Parameters:

idx (int) –

Return type:

Dict

__len__()
Return type:

int
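The map-style protocol that this class follows (__getitem__ plus __len__) can be shown without PyTorch. The class below is a hypothetical stand-in, not MetaAnnotationDS itself. It assumes the data dict holds parallel lists under the made-up keys "tokens" and "label", and that category_map turns a label name into an integer id. A real dataset would subclass torch.utils.data.Dataset and usually return tensors.

```python
from typing import Dict


class ToyMetaDataset:
    """Hypothetical plain-Python dataset following the map-style protocol."""

    def __init__(self, data: Dict, category_map: Dict) -> None:
        self.data = data
        self.category_map = category_map

    def __getitem__(self, idx: int) -> Dict:
        # Look up one sample and map its category name to an id.
        return {
            "tokens": self.data["tokens"][idx],
            "label": self.category_map[self.data["label"][idx]],
        }

    def __len__(self) -> int:
        return len(self.data["label"])


data = {
    "tokens": [[101, 7, 8, 102], [101, 9, 102]],
    "label": ["Affirmed", "Other"],
}
category_map = {"Affirmed": 0, "Other": 1}
dataset = ToyMetaDataset(data, category_map)
sample = dataset[1]
```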

medcat.utils.data_utils.prepare_from_json_hf(data_path, cntx_left, cntx_right, tokenizer, cui_filter=None, replace_center=None)
Parameters:
  • data_path (str) –

  • cntx_left (int) –

  • cntx_right (int) –

  • tokenizer (Any) –

  • cui_filter (Optional[Dict]) –

  • replace_center (Optional[Dict]) –

Return type:

Dict

medcat.utils.data_utils.prepare_from_json_chars(data, cntx_left, cntx_right, tokenizer, cui_filter=None, replace_center=None)

Convert the data from a json format into a CSV-like format for training.

Parameters:
  • data (Dict) – The json file from MedCAT.

  • cntx_left (int) – The size of the context to the left of the concept.

  • cntx_right (int) – The size of the context to the right of the concept.

  • tokenizer (Any) – An instance of a <FastTokenizer> class from Hugging Face.

  • cui_filter (Optional[Dict], optional) – The CUI filter. Defaults to None.

  • replace_center (Optional[Dict], optional) – If not None the center word (concept) will be replaced with whatever is set. Defaults to None.

Returns:

Dict – {'category_name': [('category_value', 'tokens', 'center_token'), ...], ...}

Return type:

Dict
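The role of cntx_left and cntx_right can be illustrated with a toy context window. The sketch assumes the context is counted in tokens around the token that contains the annotation start, and it uses a whitespace tokenizer with character offsets in place of a Hugging Face fast tokenizer. Both helper names are hypothetical, and the real function may differ in detail.

```python
import re
from typing import List, Tuple


def whitespace_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Toy tokenizer: return (token, start, end) for each whitespace-separated token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def context_window(
    text: str, start: int, cntx_left: int, cntx_right: int
) -> Tuple[List[str], int]:
    """Return the tokens around the annotation and the center token's index in them."""
    tokens = whitespace_offsets(text)
    # Index of the token that contains the annotation's start offset.
    # Raises StopIteration if start does not fall inside any token.
    center = next(i for i, (_, s, e) in enumerate(tokens) if s <= start < e)
    lo = max(0, center - cntx_left)
    hi = min(len(tokens), center + 1 + cntx_right)
    window = [token for token, _, _ in tokens[lo:hi]]
    return window, center - lo


text = "patient denies chest pain on exertion today"
window, center = context_window(text, text.index("chest"), cntx_left=1, cntx_right=2)
```

The window is clipped at the document boundaries, so an annotation near the start of the text simply gets fewer tokens on its left.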

medcat.utils.data_utils.make_mc_train_test(data, cdb, test_size=0.2)

This is a disaster.

Return type:

Tuple

medcat.utils.data_utils.get_false_positives(doc, spacy_doc)
Parameters:
  • doc (Dict) –

  • spacy_doc (spacy.tokens.doc.Doc) –

Return type:

List[spacy.tokens.span.Span]