medcat.meta_cat

Module Contents

Classes

MetaCAT

The MetaCAT class is used for training 'Meta-Annotation' models, i.e. annotations of clinical concept annotations.

Attributes

logger

medcat.meta_cat.logger
class medcat.meta_cat.MetaCAT(tokenizer=None, embeddings=None, config=None)

Bases: medcat.pipeline.pipe_runner.PipeRunner

The MetaCAT class is used for training ‘Meta-Annotation’ models, i.e. annotations of clinical concept annotations. These are also known as properties or attributes of recognised entities in similar tools such as MetaMap and cTAKES.

This is a flexible, model-agnostic class that can learn any meta-annotation task, i.e. any multi-class classification task for recognised terms.

Parameters:
  • tokenizer (TokenizerWrapperBase) – The Huggingface tokenizer instance. This can be a pre-trained tokenizer instance from a BERT-style model, or one trained from scratch for the Bi-LSTM (with attention) model that is currently used in most deployments.

  • embeddings (Tensor, numpy.ndarray) – Embedding matrix mapping each (sub)word input id to an n-dimensional (sub)word embedding.

  • config (ConfigMetaCAT) – the configuration for MetaCAT. Param descriptions available in ConfigMetaCAT docs.

name = 'meta_cat'
_component_lock
__init__(tokenizer=None, embeddings=None, config=None)
Parameters:
Return type:

None

get_model(embeddings)

Get the model

Parameters:

embeddings (Optional[Tensor]) – The embedding tensor

Raises:

ValueError – If the meta model is not LSTM or BERT

Returns:

nn.Module – The module

Return type:

torch.nn.Module

get_hash()

A partial hash trying to catch differences between models.

Returns:

str – The hex hash.

Return type:

str

train(json_path, save_dir_path=None, data_oversampled=None)

Train or continue training a model given a json_path containing a MedCATtrainer export. Training continues if an existing model is loaded, and starts from scratch if the model is blank/new.

Parameters:
  • json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.

  • save_dir_path (Optional[str]) – A save path is required when auto_save_model is set (meaning the best model is saved during training). Defaults to None.

  • data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in through this parameter.

Returns:

Dict – The resulting report.

Return type:

Dict

train_from_json(json_path, save_dir_path=None, data_oversampled=None)

Train or continue training a model given a json_path containing a MedCATtrainer export. Training continues if an existing model is loaded, and starts from scratch if the model is blank/new.

Parameters:
  • json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.

  • save_dir_path (Optional[str]) – A save path is required when auto_save_model is set (meaning the best model is saved during training). Defaults to None.

  • data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in through this parameter.

Returns:

Dict – The resulting report.

Return type:

Dict
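As a rough sketch of how the training call might be wrapped, the helper below checks that each export file exists and that the save directory is present before delegating to train_from_json. The helper name train_from_exports is hypothetical and not part of MedCAT; it only relies on the train_from_json signature documented above and accepts any object exposing that method.

```python
import os
from typing import Any, Dict, List, Union


def train_from_exports(meta_cat: Any,
                       json_path: Union[str, List[str]],
                       save_dir_path: str) -> Dict:
    """Hypothetical wrapper: validate inputs, then call train_from_json."""
    # json_path may be a single path or a list of paths.
    paths = [json_path] if isinstance(json_path, str) else list(json_path)
    for path in paths:
        if not os.path.isfile(path):
            raise FileNotFoundError(f"MedCATtrainer export not found: {path}")
    # A save directory is needed when the best model is saved during training.
    os.makedirs(save_dir_path, exist_ok=True)
    # Delegate to the documented method and hand back its report dict.
    return meta_cat.train_from_json(json_path, save_dir_path=save_dir_path)
```

With a loaded or freshly constructed MetaCAT instance, this would be called as train_from_exports(meta_cat, "export.json", "meta_cat_model"), where both paths are placeholders.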

train_raw(data_loaded, save_dir_path=None, data_oversampled=None)

Train or continue training a model given raw data. It will continue training if an existing model is loaded or start new training if the model is blank/new.

The raw data is expected in the following format:

    {'projects': [                        # list of projects
        {                                 # project 1
            'name': '<some name>',
            'documents': [                # list of documents
                {                         # document 1
                    'name': '<some name>',
                    'text': '<text of the document>',
                    'annotations': [      # list of annotations
                        {                 # annotation 1
                            'start': -1,
                            'end': 1,
                            'cui': 'cui',
                            'value': '<text value>',
                        }, ...
                    ],
                }, ...
            ],
        }, ...
    ]}
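The nesting of the raw data format above can be illustrated with a short sketch that builds one project containing one document and then walks the structure. The function names, the project name, and the CUI value are placeholders invented for this example, and treating start/end as character offsets into the document text is an assumption; none of this is MedCAT's own validation code.

```python
from typing import Any, Dict, List


def make_raw_data(doc_name: str, text: str,
                  annotations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap one document and its annotations in the projects/documents layout."""
    return {
        "projects": [
            {
                "name": "example-project",  # placeholder project name
                "documents": [
                    {"name": doc_name, "text": text, "annotations": annotations}
                ],
            }
        ]
    }


def check_raw_data(data: Dict[str, Any]) -> int:
    """Walk the nesting and return the number of annotations found."""
    required = {"start", "end", "cui", "value"}
    count = 0
    for project in data["projects"]:
        for document in project["documents"]:
            text = document["text"]
            for ann in document["annotations"]:
                missing = required - set(ann)
                if missing:
                    raise KeyError(f"annotation is missing keys: {sorted(missing)}")
                # Assumption: start/end are character offsets into the text.
                if not 0 <= ann["start"] < ann["end"] <= len(text):
                    raise ValueError("annotation offsets fall outside the text")
                count += 1
    return count


text = "Patient denies chest pain."
ann = {"start": 15, "end": 25, "cui": "C0000001", "value": text[15:25]}  # placeholder CUI
raw = make_raw_data("doc-1", text, [ann])
n_annotations = check_raw_data(raw)
```

A dictionary built this way has the shape that train_raw expects for its data_loaded argument.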

Parameters:
  • data_loaded (Dict) – The raw data we want to train for.

  • save_dir_path (Optional[str]) – A save path is required when auto_save_model is set (meaning the best model is saved during training). Defaults to None.

  • data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in through this parameter. The expected format is: [[['text','of','the','document'], [index of medical entity], "label"], [['text','of','the','document'], [index of medical entity], "label"]]
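To make the data_oversampled layout concrete, the sketch below builds samples of the form [tokens, entity indices, label] and repeats the samples of one class. Both helper names are hypothetical, the labels are invented, and plain duplication is only a stand-in for whatever oversampling strategy is actually used; the sketch also assumes the second element is a list of token positions.

```python
from typing import List


def make_sample(tokens: List[str], entity_indices: List[int], label: str) -> list:
    """Build one sample: [tokens, [index of medical entity], label]."""
    for i in entity_indices:
        if not 0 <= i < len(tokens):
            raise IndexError(f"entity index {i} is outside the token list")
    return [tokens, entity_indices, label]


def oversample_label(samples: List[list], label: str, factor: int) -> List[list]:
    """Naive oversampling: repeat every sample carrying `label` `factor` times."""
    out: List[list] = []
    for sample in samples:
        copies = factor if sample[2] == label else 1
        out.extend([sample] * copies)
    return out


samples = [
    make_sample(["no", "chest", "pain"], [1, 2], "Negated"),   # invented label
    make_sample(["has", "fever"], [1], "Affirmed"),            # invented label
]
data_oversampled = oversample_label(samples, "Negated", 3)
```

The resulting list is a list of samples in the bracketed format described for data_oversampled.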

Returns:

Dict – The resulting report.

Raises:
  • Exception – If no save path is specified, or category name not in data.

  • AssertionError – If no tokeniser is set

  • FileNotFoundError – If phase_number is set to 2 and model.dat file is not found

  • KeyError – If phase_number is set to 2 and model.dat file contains mismatched architecture

Return type:

Dict

eval(json_path)

Evaluate from json.

Parameters:

json_path (str) – The json file path

Returns:

Dict – The resulting model dict

Raises:
  • AssertionError – If self.tokenizer is None

  • Exception – If the category name does not exist

Return type:

Dict

save(save_dir_path)

Save all components of this class to a file

Parameters:

save_dir_path (str) – Path to the directory where everything will be saved.

Raises:

AssertionError – If self.tokenizer is None

Return type:

None

classmethod load(save_dir_path, config_dict=None)

Load a meta_cat object.

Parameters:
  • save_dir_path (str) – The directory where all was saved.

  • config_dict (Optional[Dict], optional) – Overrides saved parameters for this meta_cat instance. This is needed in certain cases, such as automated deployment. (Default value = None)

Returns:

MetaCAT – The MetaCAT instance

Return type:

MetaCAT
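A minimal usage sketch for loading a saved model and inspecting it follows. It uses only the load and get_model_card signatures listed on this page; the wrapper name load_model_card is hypothetical, and the import is placed inside the function so the snippet can be defined without MedCAT installed.

```python
from typing import Any, Dict, Optional


def load_model_card(save_dir_path: str,
                    config_dict: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: load a saved MetaCAT and return its model card."""
    # Imported lazily; calling this function requires MedCAT to be installed.
    from medcat.meta_cat import MetaCAT

    # config_dict, if given, overrides parameters stored with the saved model.
    meta_cat = MetaCAT.load(save_dir_path, config_dict=config_dict)
    # as_dict=True asks for a dict rather than a JSON string.
    return meta_cat.get_model_card(as_dict=True)
```

The save_dir_path passed here should be the same directory that was previously given to save.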

get_ents(doc)
Parameters:

doc (spacy.tokens.Doc) –

Return type:

Iterable[spacy.tokens.Span]

prepare_document(doc, input_ids, offset_mapping, lowercase)

Prepares document.

Parameters:
  • doc (Doc) – The document

  • input_ids (List) – Input ids

  • offset_mapping (List) – Offset mappings

  • lowercase (bool) – Whether to lowercase the replace_center text

Returns:
  • Dict – Entity id to index mapping

  • List – Samples

Return type:

Tuple

static batch_generator(stream, batch_size_chars)

Generator for batch of documents.

Parameters:
  • stream (Iterable[Doc]) – The document stream

  • batch_size_chars (int) – Number of characters per batch

Yields:

List[Doc] – The batch of documents.

Return type:

Iterable[List[spacy.tokens.Doc]]
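The idea of batching by character count can be sketched as follows. This is an illustration rather than the library's code: it works on plain strings instead of spaCy Doc objects, the function name batch_by_chars is hypothetical, and it assumes a batch is emitted as soon as its accumulated length reaches batch_size_chars.

```python
from typing import Iterable, Iterator, List


def batch_by_chars(texts: Iterable[str], batch_size_chars: int) -> Iterator[List[str]]:
    """Group texts into batches of roughly batch_size_chars characters."""
    batch: List[str] = []
    n_chars = 0
    for text in texts:
        batch.append(text)
        n_chars += len(text)
        # Emit the batch once the character budget is reached or exceeded.
        if n_chars >= batch_size_chars:
            yield batch
            batch, n_chars = [], 0
    # Emit whatever is left over as a final, smaller batch.
    if batch:
        yield batch


batches = list(batch_by_chars(["aaaa", "bb", "cccccc", "d"], 5))
# batches == [["aaaa", "bb"], ["cccccc"], ["d"]]
```

Batching by characters rather than by document count keeps batches of similar size even when document lengths vary widely.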

pipe(stream, *args, **kwargs)

Process many documents at once.

Parameters:
  • stream (Iterable[Union[Doc, FakeDoc]]) – List of spacy documents.

  • *args – Unused arguments (due to override)

  • **kwargs – Unused keyword arguments (due to override)

Yields:

Doc – The document.

Returns:

Iterator[Doc] – Returned when stream is None or empty.

Return type:

Iterator[spacy.tokens.Doc]

_set_meta_anns(stream, batch_size_chars, config, id2category_value)
Parameters:
Return type:

Iterator[Optional[spacy.tokens.Doc]]

__call__(doc)

Process one document, used in the spacy pipeline for sequential document processing.

Parameters:

doc (Doc) – A spacy document

Returns:

Doc – The same spacy document.

Return type:

spacy.tokens.Doc

get_model_card(as_dict=False)

A minimal model card.

Parameters:

as_dict (bool) – Return the model card as a dictionary instead of a str. (Default value = False)

Returns:

Union[str, dict] – An indented JSON string, or the same content as a dict when as_dict is True.

Return type:

Union[str, dict]

__repr__()

Prints the model_card for this MetaCAT instance.

Returns:

the ‘Model Card’ for this MetaCAT instance. This includes NER+L config and any MetaCATs