medcat.utils.data_utils
Module Contents
Attributes
- medcat.utils.data_utils.logger
- medcat.utils.data_utils.set_all_seeds(seed)
- Parameters:
seed (int) –
- Return type:
None
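No description accompanies set_all_seeds. As a rough illustration of what a helper with this signature commonly does, the sketch below seeds Python's random module and, only when they happen to be installed, NumPy and PyTorch. The name seed_everything and the choice of libraries are assumptions made for illustration, not MedCAT's actual implementation.

```python
import random


def seed_everything(seed: int) -> None:
    """Hypothetical stand-in: seed every random number generator available."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency in this sketch
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency in this sketch
        torch.manual_seed(seed)
    except ImportError:
        pass


# Seeding twice with the same value reproduces the same random draws.
seed_everything(13)
first = [random.random() for _ in range(3)]
seed_everything(13)
second = [random.random() for _ in range(3)]
```

Calling such a helper once at the start of a training or evaluation script is the usual way to make runs repeatable.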
- medcat.utils.data_utils.count_annotations_project(project, cnt_per_cui=None)
- Parameters:
project (Dict) –
- Return type:
Tuple[int, Any]
- medcat.utils.data_utils.load_data(data_path, require_annotations=True, order_by_num_ann=True)
Load data.
- Parameters:
data_path (str) – The path to the data to load.
require_annotations (bool) – Whether annotations are required. The check is made at the project level: a project qualifies if any of its documents has annotations. Defaults to True.
order_by_num_ann (bool) – Whether to order by number of annotations. Defaults to True.
- Returns:
Dict – The loaded data.
- Return type:
Dict
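The two flags can be pictured with a small sketch. It assumes the export is a JSON file of the form {"projects": [{"name": ..., "documents": [{"annotations": [...]}]}]}, that projects without any annotated document are dropped, and that the remaining projects are sorted by annotation count in descending order. The function name load_export_sketch, the layout, and the sort direction are assumptions for illustration, not the library's code.

```python
import json
import os
import tempfile
from typing import Dict


def load_export_sketch(data_path: str, require_annotations: bool = True,
                       order_by_num_ann: bool = True) -> Dict:
    """Hypothetical re-creation of the behaviour described for load_data."""
    with open(data_path) as f:
        projects = json.load(f)["projects"]
    if require_annotations:
        # Project-level check: keep a project if any document has annotations.
        projects = [p for p in projects
                    if any(d.get("annotations") for d in p["documents"])]
    if order_by_num_ann:
        # Assumed direction: most annotated projects first.
        projects = sorted(
            projects,
            key=lambda p: sum(len(d.get("annotations", [])) for d in p["documents"]),
            reverse=True,
        )
    return {"projects": projects}


# Tiny made-up export: project B has no annotations at all.
export = {"projects": [
    {"name": "A", "documents": [{"annotations": [{"cui": "C1"}]}]},
    {"name": "B", "documents": [{"annotations": []}]},
    {"name": "C", "documents": [{"annotations": []},
                                {"annotations": [{"cui": "C1"}, {"cui": "C2"}]}]},
]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(export, tmp)
    path = tmp.name

strict = [p["name"] for p in load_export_sketch(path)["projects"]]
lenient = [p["name"] for p in
           load_export_sketch(path, require_annotations=False)["projects"]]
os.remove(path)
```

Note that project C is kept even though one of its documents is empty, because the requirement applies to the project as a whole.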
- medcat.utils.data_utils.count_annotations(data_path)
- Parameters:
data_path (str) –
- Return type:
Dict
- medcat.utils.data_utils.get_doc_from_project(project, doc_id)
- Parameters:
project (Dict) –
doc_id (str) –
- Return type:
Optional[Dict]
- medcat.utils.data_utils.get_ann_from_doc(document, start, end)
- Parameters:
document (Dict) –
start (int) –
end (int) –
- Return type:
Optional[Dict]
- medcat.utils.data_utils.meta_ann_from_ann(ann, meta_name)
- Parameters:
ann (Dict) –
meta_name (Union[Dict, List]) –
- Return type:
Optional[Dict]
- medcat.utils.data_utils.are_anns_same(ann, ann2, meta_names=[], require_double_inner=True)
- Parameters:
ann (Dict) –
ann2 (Dict) –
meta_names (List) –
require_double_inner (bool) –
- Return type:
bool
- medcat.utils.data_utils.get_same_anns(document, document2, require_double_inner=True, ann_stats=[], meta_names=[])
- Parameters:
document (Dict) –
document2 (Dict) –
require_double_inner (bool) –
ann_stats (List) –
meta_names (List) –
- Return type:
Dict
- medcat.utils.data_utils.print_consolid_stats(ann_stats=[], meta_names=[])
- Parameters:
ann_stats (List) –
meta_names (List) –
- Return type:
None
- medcat.utils.data_utils.check_differences(data_path, cat, cntx_size=30, min_acc=0.2, ignore_already_done=False, only_start=False, only_saved=False)
- Parameters:
data_path (str) –
cat (Any) –
- Return type:
None
- medcat.utils.data_utils.consolidate_double_annotations(data_path, out_path, require_double=True, require_double_inner=False, meta_anns_to_match=[])
Consolidate a dataset that was multi-annotated (the same documents annotated twice).
- Parameters:
data_path (str) – Output from MedCATtrainer. Projects containing the same documents must have the same name.
out_path (str) – The consolidated data will be saved here, usually only the annotations on which both annotators agree.
require_double (bool) – If True, everything must be double annotated, meaning there have to be two projects of the same name for each name. If False, projects that do not have double annotations are included as is, and projects that do are still checked. Defaults to True.
require_double_inner (bool) – If False, some entities may be annotated by only one annotator and not the other, while annotations that exist in both must still be the same. Defaults to False.
meta_anns_to_match (List) – List of meta annotations that must match for two annotations to be the same. If empty, only the mention level is checked.
- Returns:
Dict – The consolidated annotations.
- Return type:
Dict
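The matching rule behind meta_anns_to_match can be sketched as follows. The sketch assumes that "mention level" means equal start, end and cui, and that each annotation stores its meta annotations as a dict of the form {"meta_anns": {name: {"value": ...}}}. The helpers same_annotation and agreed_annotations are hypothetical and only illustrate the idea; the real comparison may check more fields.

```python
from typing import Dict, List, Sequence


def same_annotation(a: Dict, b: Dict, meta_names: Sequence[str] = ()) -> bool:
    """Hypothetical check: same mention, plus equal values for listed meta annotations."""
    if (a["start"], a["end"], a["cui"]) != (b["start"], b["end"], b["cui"]):
        return False
    for name in meta_names:
        value_a = a.get("meta_anns", {}).get(name, {}).get("value")
        value_b = b.get("meta_anns", {}).get(name, {}).get("value")
        if value_a != value_b:
            return False
    return True


def agreed_annotations(doc_a: Dict, doc_b: Dict,
                       meta_names: Sequence[str] = ()) -> List[Dict]:
    """Keep the annotations of doc_a that some annotation in doc_b agrees with."""
    return [a for a in doc_a["annotations"]
            if any(same_annotation(a, b, meta_names) for b in doc_b["annotations"])]


# Two annotators marked the same two mentions but disagree on one Status value.
doc_a = {"annotations": [
    {"start": 0, "end": 5, "cui": "C1", "meta_anns": {"Status": {"value": "Confirmed"}}},
    {"start": 10, "end": 15, "cui": "C2", "meta_anns": {"Status": {"value": "Other"}}},
]}
doc_b = {"annotations": [
    {"start": 0, "end": 5, "cui": "C1", "meta_anns": {"Status": {"value": "Confirmed"}}},
    {"start": 10, "end": 15, "cui": "C2", "meta_anns": {"Status": {"value": "Confirmed"}}},
]}

mention_level = agreed_annotations(doc_a, doc_b)
with_status = agreed_annotations(doc_a, doc_b, meta_names=["Status"])
```

With an empty list both mentions count as agreed; requiring "Status" to match leaves only the first.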
- medcat.utils.data_utils.validate_ner_data(data_path, cdb, cntx_size=70, status_only=False, ignore_if_already_done=False)
Please just ignore this function, I’m afraid to even look at it.
- Parameters:
data_path (str) – The data path.
cdb (CDB) – The concept database.
cntx_size (int) – The context size. Defaults to 70.
status_only (bool) – Whether to only consider status. Defaults to False.
ignore_if_already_done (bool) – Whether to ignore if already done. Defaults to False.
- Return type:
None
- class medcat.utils.data_utils.MetaAnnotationDS(data, category_map)
Bases: torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__(), for speedup batched samples loading. This method accepts list of indices of samples of batch and returns list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- Parameters:
data (Dict) –
category_map (Dict) –
- __init__(data, category_map)
Create MetaAnnotationDS.
- Parameters:
data (Dict) – Dictionary of data values.
category_map (Dict) – Map from category name to id.
- __getitem__(idx)
- Parameters:
idx (int) –
- Return type:
Dict
- __len__()
- Return type:
int
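The map-style contract comes down to two methods, __getitem__ and __len__. The sketch below shows that contract with a plain Python class so that it runs without PyTorch. The class name ColumnDS, the column-oriented layout of data, and the decision to translate the "category_name" column through category_map are all assumptions for illustration; they are not taken from MetaAnnotationDS itself.

```python
from typing import Dict, List


class ColumnDS:
    """Hypothetical map-style dataset over column-oriented data."""

    def __init__(self, data: Dict[str, List], category_map: Dict[str, int]):
        self.data = data
        self.category_map = category_map

    def __len__(self) -> int:
        # All columns are assumed to have the same length.
        return len(next(iter(self.data.values()))) if self.data else 0

    def __getitem__(self, idx: int) -> Dict:
        item = {}
        for key, column in self.data.items():
            value = column[idx]
            # Assumed convention: category names are stored as text and
            # converted to integer ids when a sample is fetched.
            item[key] = self.category_map[value] if key == "category_name" else value
        return item


ds = ColumnDS(
    data={"input_ids": [[1, 2], [3, 4]], "category_name": ["Yes", "No"]},
    category_map={"Yes": 1, "No": 0},
)
sample = ds[1]
```

An object with these two methods and integer indices is what a DataLoader with default settings expects from a map-style dataset.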
- medcat.utils.data_utils.prepare_from_json_hf(data_path, cntx_left, cntx_right, tokenizer, cui_filter=None, replace_center=None)
- Parameters:
data_path (str) –
cntx_left (int) –
cntx_right (int) –
tokenizer (Any) –
cui_filter (Optional[Dict]) –
replace_center (Optional[Dict]) –
- Return type:
Dict
- medcat.utils.data_utils.prepare_from_json_chars(data, cntx_left, cntx_right, tokenizer, cui_filter=None, replace_center=None)
Convert the data from a json format into a CSV-like format for training.
- Parameters:
data (Dict) – The json file from MedCAT.
cntx_left (int) – The size of the context to the left of the concept.
cntx_right (int) – The size of the context to the right of the concept.
tokenizer (Any) – The instance of the <FastTokenizer> class from huggingface.
cui_filter (Optional[Dict], optional) – The CUI filter. Defaults to None.
replace_center (Optional[Dict], optional) – If not None the center word (concept) will be replaced with whatever is set. Defaults to None.
- Returns:
Dict – {‘category_name’: [(‘category_value’, ‘tokens’, ‘center_token’), …], …}
- Return type:
Dict
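The roles of cntx_left, cntx_right and replace_center can be illustrated with a character-window sketch. It assumes the context sizes are counted in characters and that the replacement is a plain string, and it leaves out tokenization entirely. The helper char_window is hypothetical and much simpler than the documented function.

```python
from typing import Optional, Tuple


def char_window(text: str, start: int, end: int, cntx_left: int, cntx_right: int,
                replace_center: Optional[str] = None) -> Tuple[str, str, str]:
    """Hypothetical helper: return (left context, center, right context)."""
    left = text[max(0, start - cntx_left):start]
    center = text[start:end] if replace_center is None else replace_center
    right = text[end:end + cntx_right]
    return left, center, right


text = "Patient has diabetes mellitus today"
mention = "diabetes mellitus"
start = text.index(mention)
end = start + len(mention)

plain = char_window(text, start, end, cntx_left=4, cntx_right=6)
masked = char_window(text, start, end, cntx_left=4, cntx_right=6,
                     replace_center="[CONCEPT]")
```

Replacing the center keeps a downstream classifier from memorising the concept's surface form and makes it rely on the surrounding context instead.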
- medcat.utils.data_utils.make_mc_train_test(data, cdb, test_size=0.2)
Make the train and test sets.
This is a disaster.
- Parameters:
data (Dict) – The data.
cdb (CDB) – The concept database.
test_size (float) – The test size. Defaults to 0.2.
- Returns:
Tuple – The train set, the test set, the test annotations, and the total annotations
- Return type:
Tuple
- medcat.utils.data_utils.get_false_positives(doc, spacy_doc)
- Parameters:
doc (Dict) –
spacy_doc (spacy.tokens.doc.Doc) –
- Return type:
List[spacy.tokens.span.Span]