`medcat.meta_cat`

Module Contents

Classes

MetaCAT

The MetaCAT class used for training 'Meta-Annotation' models, i.e. annotations of clinical concept annotations.

Attributes

logger

medcat.meta_cat.logger

class medcat.meta_cat.MetaCAT(tokenizer=None, embeddings=None, config=None)

Bases: medcat.pipeline.pipe_runner.PipeRunner

The MetaCAT class used for training ‘Meta-Annotation’ models, i.e. annotations of clinical concept annotations. These are also known as properties or attributes of recognised entities in similar tools such as MetaMap and cTAKES.

This is a flexible, model-agnostic class that can learn any meta-annotation task, i.e. any multi-class classification task for recognised terms.

Parameters:

tokenizer (TokenizerWrapperBase) – The Huggingface tokenizer instance. This can be a pre-trained tokenizer instance from a BERT-style model, or one trained from scratch for the Bi-LSTM (with attention) model that is currently used in most deployments.
embeddings (Tensor, numpy.ndarray) – Embedding matrix mapping each (sub)word input id to an n-dimensional (sub)word embedding.
config (ConfigMetaCAT) – the configuration for MetaCAT. Param descriptions available in ConfigMetaCAT docs.

name = 'meta_cat'

_component_lock

__init__(tokenizer=None, embeddings=None, config=None)

Parameters:

tokenizer (Optional[medcat.tokenizers.meta_cat_tokenizers.TokenizerWrapperBase]) –
embeddings (Optional[Union[torch.Tensor, numpy.ndarray]]) –
config (Optional[medcat.config_meta_cat.ConfigMetaCAT]) –

Return type:

None

get_model(embeddings)

Get the model

Parameters:: embeddings (Optional[Tensor]) – The embedding tensor
Raises:: ValueError – If the meta model is not LSTM or BERT
Returns:: nn.Module – The module
Return type:: torch.nn.Module

get_hash()

A partial hash trying to catch differences between models.

Returns:: str – The hex hash.
Return type:: str
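
As an illustration of what a partial hash over a model's identifying settings might look like, the sketch below hashes a JSON-serialised dictionary with hashlib. It is not MedCAT's implementation: the helper name `partial_hash` and the example setting keys are hypothetical.

```python
import hashlib
import json


def partial_hash(settings: dict) -> str:
    """Return a hex digest over a dict of settings (hypothetical helper)."""
    # Sort keys so that logically equal dicts produce the same digest.
    payload = json.dumps(settings, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical settings; a real MetaCAT config has many more fields.
hash_a = partial_hash({"model_name": "lstm", "nclasses": 2})
hash_b = partial_hash({"nclasses": 2, "model_name": "lstm"})
hash_c = partial_hash({"model_name": "bert", "nclasses": 2})
```

Key order does not affect the digest, while a changed value does, which is the property a partial hash needs in order to flag differing models.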

train(json_path, save_dir_path=None, data_oversampled=None)

Train or continue training a model given a json_path containing a MedCATtrainer export. It will continue training if an existing model is loaded or start new training if the model is blank/new.

Parameters:

json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.
save_dir_path (Optional[str]) – Directory to save the model to. Required when auto_save_model is enabled, in which case the best model found during training is saved there. Defaults to None.
data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in through this parameter. Defaults to None.

Returns:

Dict – The resulting report.

Return type:

Dict

train_from_json(json_path, save_dir_path=None, data_oversampled=None)

Train or continue training a model given a json_path containing a MedCATtrainer export. It will continue training if an existing model is loaded or start new training if the model is blank/new.

Parameters:

json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.
save_dir_path (Optional[str]) – Directory to save the model to. Required when auto_save_model is enabled, in which case the best model found during training is saved there. Defaults to None.
data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in through this parameter. Defaults to None.

Returns:

Dict – The resulting report.

Return type:

Dict

train_raw(data_loaded, save_dir_path=None, data_oversampled=None)

Train or continue training a model given raw data. It will continue training if an existing model is loaded or start new training if the model is blank/new.

The raw data is expected in the following format:

    {'projects': [                          # list of projects
        {                                   # project 1
            'name': '<some name>',
            'documents': [                  # list of documents
                {                           # document 1
                    'name': '<some name>',
                    'text': '<text of the document>',
                    'annotations': [        # list of annotations
                        {                   # annotation 1
                            'start': -1,
                            'end': 1,
                            'cui': 'cui',
                            'value': '<text value>',
                        }, ...
                    ],
                }, ...
            ],
        }, ...
    ]}
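
The nesting described above can be sketched as a small Python dictionary together with a shape check. This is a minimal illustration only: the helper `check_raw_data`, the sample text, and the placeholder CUI are hypothetical, and the check that `value` equals `text[start:end]` is an assumption about how the offsets relate to the text rather than something this page states.

```python
from typing import Any, Dict, List

# A minimal raw-data dictionary with one project, one document, one annotation.
raw_data: Dict[str, Any] = {
    "projects": [
        {
            "name": "example project",
            "documents": [
                {
                    "name": "doc 1",
                    "text": "Patient denies chest pain.",
                    "annotations": [
                        # "C0000000" is a placeholder, not a real concept id.
                        {"start": 15, "end": 25, "cui": "C0000000", "value": "chest pain"},
                    ],
                },
            ],
        },
    ],
}

REQUIRED_KEYS = {"start", "end", "cui", "value"}


def check_raw_data(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    for project in data.get("projects", []):
        for doc in project.get("documents", []):
            text = doc.get("text", "")
            for ann in doc.get("annotations", []):
                missing = REQUIRED_KEYS - ann.keys()
                if missing:
                    problems.append(f"{doc.get('name')}: missing keys {sorted(missing)}")
                    continue
                # Assumption: 'value' is the text covered by [start, end).
                if text[ann["start"]:ann["end"]] != ann["value"]:
                    problems.append(f"{doc.get('name')}: span does not match value")
    return problems


problems = check_raw_data(raw_data)
```

Running a check like this before calling train_raw can surface malformed exports early, before any training starts.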

Parameters:

data_loaded (Dict) – The raw data we want to train for.
save_dir_path (Optional[str]) – Directory to save the model to. Required when auto_save_model is enabled, in which case the best model found during training is saved there. Defaults to None.
data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in through this parameter. The expected format is a list of samples: [[['text','of','the','document'], [index of medical entity], "label"], [['text','of','the','document'], [index of medical entity], "label"]]

Returns:

Dict – The resulting report.

Raises:

Exception – If no save path is specified, or category name not in data.
AssertionError – If no tokeniser is set
FileNotFoundError – If phase_number is set to 2 and model.dat file is not found
KeyError – If phase_number is set to 2 and model.dat file contains mismatched architecture

Return type:

Dict
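
The data_oversampled format described for train_raw (a list of samples, each holding the tokens, the index of the medical entity, and a label) can be sketched as follows. The helper `make_sample`, the example tokens, and the label names are hypothetical and only illustrate the shape.

```python
from typing import List, Sequence, Union

# One sample: [tokens, entity token indices, label]
Sample = List[Union[List[str], List[int], str]]


def make_sample(tokens: Sequence[str], entity_indices: Sequence[int], label: str) -> Sample:
    """Build one oversampled sample, checking the indices point into the tokens."""
    if not all(0 <= i < len(tokens) for i in entity_indices):
        raise ValueError("entity index out of range for the given tokens")
    return [list(tokens), list(entity_indices), label]


# Hypothetical labels for a negation-style meta-annotation task.
data_oversampled: List[Sample] = [
    make_sample(["no", "evidence", "of", "pneumonia"], [3], "Negated"),
    make_sample(["history", "of", "asthma"], [2], "Affirmed"),
]
```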

eval(json_path)

Evaluate from json.

Parameters:

json_path (str) – The json file path

Returns:

Dict – The resulting model dict

Raises:

AssertionError – If self.tokenizer is None
Exception – If the category name does not exist

Return type:

Dict

save(save_dir_path)

Save all components of this class to a file

Parameters:: save_dir_path (str) – Path to the directory where everything will be saved.
Raises:: AssertionError – If self.tokenizer is None
Return type:: None

classmethod load(save_dir_path, config_dict=None)

Load a meta_cat object.

Parameters:

save_dir_path (str) – The directory where all was saved.
config_dict (Optional[Dict], optional) – Values used to overwrite the saved parameters for this meta_cat instance. This is needed in certain cases, for example when models are deployed automatically. (Default value = None)

Returns:

MetaCAT – The MetaCAT instance

Return type:

MetaCAT
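
A possible way to use load with config_dict is sketched below. The call `MetaCAT.load(save_dir_path, config_dict=...)` follows the signature documented here; the nested override layout (`{"general": {"device": ...}}`) and the helper names `build_overrides` and `load_meta_cat` are assumptions made for illustration and should be checked against the ConfigMetaCAT docs.

```python
from typing import Any, Dict, Optional


def build_overrides(device: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Build a config_dict for MetaCAT.load, or None if nothing is overridden.

    Assumption: overrides are nested by config section, e.g. 'general'.
    """
    if device is None:
        return None
    return {"general": {"device": device}}


def load_meta_cat(save_dir_path: str, device: Optional[str] = None):
    """Load a saved MetaCAT, optionally overriding the device (sketch)."""
    # Imported lazily so this sketch can be imported without MedCAT installed.
    from medcat.meta_cat import MetaCAT

    return MetaCAT.load(save_dir_path, config_dict=build_overrides(device))
```

Calling `load_meta_cat("<model dir>", device="cpu")` would then load a model saved on one machine while forcing CPU inference on another, assuming the override layout above is correct.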

get_ents(doc)

Parameters:: doc (spacy.tokens.Doc) –
Return type:: Iterable[spacy.tokens.Span]

prepare_document(doc, input_ids, offset_mapping, lowercase)

Prepares document.

Parameters:

doc (Doc) – The document
input_ids (List) – Input ids
offset_mapping (List) – Offset mappings
lowercase (bool) – Whether to lowercase the replace_center text

Returns:

Dict – Entity id to index mapping
List – Samples

Return type:

Tuple

static batch_generator(stream, batch_size_chars)

Generator for batch of documents.

Parameters:

stream (Iterable[Doc]) – The document stream
batch_size_chars (int) – Number of characters per batch

Yields:

List[Doc] – The batch of documents.

Return type:

Iterable[List[spacy.tokens.Doc]]
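
Batching by a character budget, as batch_generator is described above, can be sketched like this. The code is a stand-alone illustration and not MedCAT's implementation: `TextDoc` is a hypothetical stand-in for a spaCy Doc (only its `text` attribute is used), and the choice to emit a batch once the running character count reaches the budget is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class TextDoc:
    """Hypothetical stand-in for a spaCy Doc; only .text is used."""
    text: str


def batch_by_chars(stream: Iterable[TextDoc], batch_size_chars: int) -> Iterator[List[TextDoc]]:
    """Yield lists of documents whose combined text length reaches the budget."""
    batch: List[TextDoc] = []
    char_count = 0
    for doc in stream:
        batch.append(doc)
        char_count += len(doc.text)
        if char_count >= batch_size_chars:
            yield batch
            batch, char_count = [], 0
    # Emit whatever is left once the stream is exhausted.
    if batch:
        yield batch


docs = [TextDoc("aaaa"), TextDoc("bbbb"), TextDoc("cc"), TextDoc("dddddddddd"), TextDoc("e")]
batches = [[d.text for d in b] for b in batch_by_chars(docs, batch_size_chars=8)]
# batches == [["aaaa", "bbbb"], ["cc", "dddddddddd"], ["e"]]
```

Budgeting by characters rather than by document count keeps batches of roughly comparable size when document lengths vary widely.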

pipe(stream, *args, **kwargs)

Process many documents at once.

Parameters:

stream (Iterable[Union[Doc, FakeDoc]]) – List of spacy documents.
*args – Unused arguments (due to override)
**kwargs – Unused keyword arguments (due to override)

Yields:

Doc – The document.

Returns:

Iterator[Doc] – Returned early, without processing, if stream is None or empty.

Return type:

Iterator[spacy.tokens.Doc]

_set_meta_anns(stream, batch_size_chars, config, id2category_value)

Parameters:

stream (Iterable[Union[spacy.tokens.Doc, medcat.utils.meta_cat.data_utils.Doc]]) –
batch_size_chars (int) –
config (medcat.config_meta_cat.ConfigMetaCAT) –
id2category_value (Dict) –

Return type:

Iterator[Optional[spacy.tokens.Doc]]

__call__(doc)

Process one document, used in the spacy pipeline for sequential document processing.

Parameters:: doc (Doc) – A spacy document
Returns:: Doc – The same spacy document.
Return type:: spacy.tokens.Doc

get_model_card(as_dict=False)

A minimal model card.

Parameters:: as_dict (bool) – Return the model card as a dictionary instead of a str. (Default value = False)
Returns:: Union[str, dict] – An indented JSON string, or the same JSON object in dict form if as_dict is True.
Return type:: Union[str, dict]
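
The two return forms of get_model_card can be illustrated with a small stand-alone sketch. The function `render_model_card` and the card fields are hypothetical; only the behaviour (a dict when as_dict is True, otherwise an indented JSON string) mirrors the description above.

```python
import json
from typing import Union


def render_model_card(card: dict, as_dict: bool = False) -> Union[str, dict]:
    """Return the card as a dict, or as an indented JSON string (sketch)."""
    if as_dict:
        return card
    return json.dumps(card, indent=2, sort_keys=False)


# Hypothetical card contents, for illustration only.
card = {"Category Name": "Status", "Classes": {"Affirmed": 0, "Other": 1}}
as_text = render_model_card(card)
as_mapping = render_model_card(card, as_dict=True)
```

The string form suits logging or printing, while the dict form is easier to inspect programmatically.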

__repr__()

Prints the model_card for this MetaCAT instance.

Returns:: the ‘Model Card’ for this MetaCAT instance. This includes NER+L config and any MetaCATs
