`medcat.utils.meta_cat.data_utils`

Module Contents

Classes

`Empty`
`Span`
`Doc`

Functions

`prepare_from_json`(data, cntx_left, cntx_right, tokenizer)	Convert the data from a json format into a CSV-like format for training.
`prepare_for_oversampled_data`(data, tokenizer)	Convert the data from a json format into a CSV-like format for training.
`encode_category_values`(data[, ...])	Converts the category values in the data output by prepare_from_json into integer values.
`json_to_fake_spacy`(data, id2text)	Creates a generator of fake spacy documents, used for running meta_cat pipe separately from main cat pipeline.

Attributes

logger

medcat.utils.meta_cat.data_utils.logger

medcat.utils.meta_cat.data_utils.prepare_from_json(data, cntx_left, cntx_right, tokenizer, cui_filter=None, replace_center=None, prerequisites={}, lowercase=True)

Convert the data from a json format into a CSV-like format for training. This function is not very efficient; the one working with spacy documents as part of the meta_cat.pipe method is much better. If your dataset has more than 1M documents, consider rewriting this function, although it would be unusual to have more than 1M manually annotated documents.

Parameters:

data (Dict) – Loaded output of MedCATtrainer. If we have a my_export.json from MedCATtrainer, then data = json.load(<my_export>).
cntx_left (int) – Size of context to get from the left of the concept
cntx_right (int) – Size of context to get from the right of the concept
tokenizer (TokenizerWrapperBase) – Tokenizer used to split text into tokens for the meta models (LSTM, BERT, etc.).
replace_center (Optional[str]) – If not None, the center word (concept) will be replaced with this value.
prerequisites (Dict) –
A map of prerequisites. For example, suppose the data has two meta-annotations (experiencer, negation) and we want to create a dataset for negation, but only for those cases where experiencer=patient. The prerequisites would then be:

{'Experiencer': 'Patient'} - Note that the case has to match whatever is in the data. Defaults to {}.
lowercase (bool) – Should the text be lowercased before tokenization. Defaults to True.
cui_filter (Optional[set]) – If set, only concepts whose CUI is in this set are used. Defaults to None.

Returns:

out_data (dict) – Example: {'category_name': [('<category_value>', '<[tokens]>', '<center_token>'), …], …}

Return type:

Dict
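
The prerequisites check described above can be illustrated with a small, self-contained sketch. It does not call MedCAT; the helper `meets_prerequisites` and the layout of the `meta_anns` dictionaries are assumptions made for illustration, and the real export from MedCATtrainer may be structured differently.

```python
from typing import Dict, List


def meets_prerequisites(meta_anns: Dict[str, Dict[str, str]],
                        prerequisites: Dict[str, str]) -> bool:
    """Return True if every prerequisite task has exactly the required value.

    Hypothetical helper: the comparison is case-sensitive, mirroring the
    note that the case has to match whatever is in the data.
    """
    for task, required_value in prerequisites.items():
        ann = meta_anns.get(task)
        if ann is None or ann.get("value") != required_value:
            return False
    return True


# Made-up annotations; the `meta_anns` layout is an assumption.
annotations: List[Dict] = [
    {"cui": "C001", "meta_anns": {"Experiencer": {"value": "Patient"},
                                  "Negation": {"value": "Yes"}}},
    {"cui": "C002", "meta_anns": {"Experiencer": {"value": "Family"},
                                  "Negation": {"value": "No"}}},
    # Lowercase "patient" does not match the prerequisite "Patient".
    {"cui": "C003", "meta_anns": {"Experiencer": {"value": "patient"},
                                  "Negation": {"value": "No"}}},
]

prerequisites = {"Experiencer": "Patient"}
kept = [a["cui"] for a in annotations
        if meets_prerequisites(a["meta_anns"], prerequisites)]
print(kept)
```

With an empty prerequisites map (the default), every annotation passes the check.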

medcat.utils.meta_cat.data_utils.prepare_for_oversampled_data(data, tokenizer)

Convert the data from a json format into a CSV-like format for training. This function is not very efficient; the one working with spacy documents as part of the meta_cat.pipe method is much better. If your dataset has more than 1M documents, consider rewriting this function, although it would be unusual to have more than 1M manually annotated documents.

Parameters:

data (List) –
Oversampled data, expected in the following format: [[['text', 'of', 'the', 'document'], [index of medical entity], "label"], [['text', 'of', 'the', 'document'], [index of medical entity], "label"]]
tokenizer (TokenizerWrapperBase) – Tokenizer used to split text into tokens for the meta models (LSTM, BERT, etc.).

Returns:

data_sampled (list) – The processed data, in a format that can be merged with the output from prepare_from_json: [[<[tokens]>, [index of medical entity], "label"], [<[tokens]>, [index of medical entity], "label"]]

Return type:

List
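
As a rough illustration of the input and output shapes listed above, the sketch below encodes word tokens with a toy vocabulary while leaving the entity index and label untouched. The function `encode_samples`, the vocabulary, and the labels are all hypothetical; a real `TokenizerWrapperBase` may split words into sub-tokens and behaves differently.

```python
from typing import Dict, List


def encode_samples(samples: List[list],
                   vocab: Dict[str, int],
                   unk_id: int = 0) -> List[list]:
    """Replace word tokens with integer ids, keeping index and label as-is.

    Hypothetical stand-in for a tokenizer: one id per word, lowercased,
    with unknown words mapped to `unk_id`.
    """
    encoded = []
    for tokens, entity_index, label in samples:
        ids = [vocab.get(tok.lower(), unk_id) for tok in tokens]
        encoded.append([ids, entity_index, label])
    return encoded


# Toy vocabulary and made-up labels, for illustration only.
toy_vocab = {"patient": 1, "denies": 2, "chest": 3, "pain": 4}
oversampled = [
    [["Patient", "denies", "chest", "pain"], [3], "Negated"],
    [["Patient", "reports", "pain"], [2], "Affirmed"],
]

encoded = encode_samples(oversampled, toy_vocab)
print(encoded)
```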

medcat.utils.meta_cat.data_utils.encode_category_values(data, existing_category_value2id=None, category_undersample=None)

Converts the category values in the data output by prepare_from_json into integer values.

Parameters:

data (Dict) – Output of prepare_from_json.
existing_category_value2id (Optional[Dict]) – Map from category_value to id (old/existing).
category_undersample – Name of the class that should be used to undersample the data (for 2 phase learning).

Returns:

dict – New undersampled data (for 2 phase learning) with integers in place of strings for category values.
dict – New data with integers in place of strings for category values.
dict – Map from category value to ID for all categories in the data.

Return type:

Tuple
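
The core idea, replacing string category values with integer IDs while reusing an existing mapping, can be sketched as follows. This is not the library's implementation: `encode_values` is a hypothetical helper, it works on a single category's list of `(category_value, tokens, center_token)` tuples, and it leaves out the undersampling step.

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, List[int], int]


def encode_values(samples: List[Sample],
                  existing_value2id: Optional[Dict[str, int]] = None
                  ) -> Tuple[List[Tuple[int, List[int], int]], Dict[str, int]]:
    """Map each category value to an integer id, extending any existing map."""
    value2id = dict(existing_value2id or {})
    encoded = []
    for value, tokens, center in samples:
        if value not in value2id:
            # Assign the next unused id after the largest existing one.
            value2id[value] = max(value2id.values(), default=-1) + 1
        encoded.append((value2id[value], tokens, center))
    return encoded, value2id


# Made-up samples for one category; token ids are arbitrary.
samples = [("Yes", [1, 2], 1), ("No", [3], 0), ("Yes", [4], 0)]
encoded, value2id = encode_values(samples, existing_value2id={"No": 0})
print(encoded)
print(value2id)
```

Passing the mapping from an earlier run keeps IDs stable, so that previously seen category values keep the same integer.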

medcat.utils.meta_cat.data_utils.json_to_fake_spacy(data, id2text)

Creates a generator of fake spacy documents, used for running meta_cat pipe separately from main cat pipeline.

Parameters:

data (Dict) – Output from cat formatted as: {<id>: <output of get_entities>, …}.
id2text (Dict) – Map from document id to text of that document.

Yields:

Generator – Generator of spacy-like documents that can be fed into meta_cat.pipe.

Return type:

Iterable
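
A minimal sketch of the idea, lightweight objects that carry only text, an ID and entity offsets, is shown below. The classes `FakeSpan` and `FakeDoc`, the generator `to_fake_docs`, and the assumed `{"entities": {<ent_id>: {"start": ..., "end": ...}}}` layout are illustrative stand-ins, not the module's actual `Span` and `Doc` classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class FakeSpan:
    """Hypothetical stand-in for an entity span (character offsets)."""
    start_char: int
    end_char: int
    id_: str


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a spacy-like document."""
    text: str
    id_: str
    ents: List[FakeSpan] = field(default_factory=list)


def to_fake_docs(data: Dict[str, dict],
                 id2text: Dict[str, str]) -> Iterator[FakeDoc]:
    """Yield one lightweight document per id, with its entity spans attached."""
    for doc_id, output in data.items():
        doc = FakeDoc(text=id2text[doc_id], id_=doc_id)
        # Assumed layout: output["entities"] maps entity id to offsets.
        for ent_id, ent in output["entities"].items():
            doc.ents.append(FakeSpan(ent["start"], ent["end"], ent_id))
        yield doc


id2text = {"d1": "Patient denies chest pain."}
data = {"d1": {"entities": {"0": {"start": 15, "end": 25}}}}

docs = list(to_fake_docs(data, id2text))
first = docs[0]
span = first.ents[0]
print(first.text[span.start_char:span.end_char])  # prints "chest pain"
```

Because the documents are produced lazily, a large collection can be streamed without holding every document in memory at once.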

class medcat.utils.meta_cat.data_utils.Empty

Bases: object

__init__()

Return type:

None

class medcat.utils.meta_cat.data_utils.Span(start_char, end_char, id_)

Bases: object

Parameters:

start_char (str) –
end_char (str) –
id_ (str) –

__init__(start_char, end_char, id_)

Parameters:

start_char (str) –
end_char (str) –
id_ (str) –

Return type:

None

class medcat.utils.meta_cat.data_utils.Doc(text, id_)

Bases: object

Parameters:

text (str) –
id_ (str) –

__init__(text, id_)

Parameters:

text (str) –
id_ (str) –

Return type:

None
