medcat.meta_cat
Module Contents
Classes
The MetaCAT class used for training 'Meta-Annotation' models, i.e. annotations of clinical concept annotations.
Attributes
- medcat.meta_cat.logger
- class medcat.meta_cat.MetaCAT(tokenizer=None, embeddings=None, config=None)
Bases:
medcat.pipeline.pipe_runner.PipeRunner
The MetaCAT class used for training ‘Meta-Annotation’ models, i.e. annotations of clinical concept annotations. These are also known as properties or attributes of recognised entities in similar tools such as MetaMap and cTAKES.
This is a flexible, model-agnostic class that can learn any meta-annotation task, i.e. any multi-class classification task for recognised terms.
- Parameters:
tokenizer (TokenizerWrapperBase) – The Hugging Face tokenizer instance. This can be a pre-trained tokenizer instance from a BERT-style model, or one trained from scratch for the Bi-LSTM (with attention) model that is currently used in most deployments.
embeddings (Tensor, numpy.ndarray) – Embedding matrix mapping each (sub)word input id to an n-dimensional (sub)word embedding.
config (ConfigMetaCAT) – the configuration for MetaCAT. Param descriptions available in ConfigMetaCAT docs.
- name = 'meta_cat'
- _component_lock
- __init__(tokenizer=None, embeddings=None, config=None)
- Parameters:
tokenizer (Optional[medcat.tokenizers.meta_cat_tokenizers.TokenizerWrapperBase]) –
embeddings (Optional[Union[torch.Tensor, numpy.ndarray]]) –
config (Optional[medcat.config_meta_cat.ConfigMetaCAT]) –
- Return type:
None
- get_model(embeddings)
Get the model
- Parameters:
embeddings (Optional[Tensor]) – The embedding tensor
- Raises:
ValueError – If the meta model is not LSTM or BERT
- Returns:
nn.Module – The module
- Return type:
torch.nn.Module
- get_hash()
A partial hash trying to catch differences between models.
- Returns:
str – The hex hash.
- Return type:
str
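To illustrate what a "partial hash" can mean in practice, here is a minimal standard-library sketch. It is not MedCAT's implementation: the helper name `partial_hash`, the example config keys, and the choice to hash only a JSON-serialised config dictionary are all assumptions made for illustration.

```python
import hashlib
import json


def partial_hash(config: dict) -> str:
    """Return a hex digest over a JSON-serialisable config dictionary.

    Hypothetical helper: it only covers the config, so two models with
    identical configs but different weights would produce the same hash.
    """
    # sort_keys makes the serialisation independent of key order.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical config values, used only to show the behaviour.
a = partial_hash({"model_name": "lstm", "nclasses": 2})
b = partial_hash({"nclasses": 2, "model_name": "lstm"})
c = partial_hash({"model_name": "bert", "nclasses": 2})
```

Because only part of the state is hashed, equal hashes suggest but do not prove that two models are the same.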
- train(json_path, save_dir_path=None, data_oversampled=None)
Train or continue training a model given a json_path containing a MedCATtrainer export. It will continue training if an existing model is loaded or start new training if the model is blank/new.
- Parameters:
json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.
save_dir_path (Optional[str]) – Directory where the best model is saved during training. A save path must be set when auto_save_model is enabled. Defaults to None.
data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in this parameter.
- Returns:
Dict – The resulting report.
- Return type:
Dict
- train_from_json(json_path, save_dir_path=None, data_oversampled=None)
Train or continue training a model given a json_path containing a MedCATtrainer export. It will continue training if an existing model is loaded or start new training if the model is blank/new.
- Parameters:
json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.
save_dir_path (Optional[str]) – Directory where the best model is saved during training. A save path must be set when auto_save_model is enabled. Defaults to None.
data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in this parameter.
- Returns:
Dict – The resulting report.
- Return type:
Dict
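Since json_path accepts either a single path or a list of paths, the following sketch shows one way several exports could be read and combined before training. It is an illustration only: `load_exports` is a hypothetical helper, and it assumes each export has a top-level 'projects' list like the raw data format described for train_raw.

```python
import json
import os
import tempfile
from typing import List, Union


def load_exports(json_path: Union[str, List[str]]) -> dict:
    """Load one or more export files and concatenate their projects."""
    # Accept a single path or a list of paths, mirroring json_path.
    paths = [json_path] if isinstance(json_path, str) else list(json_path)
    merged = {"projects": []}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            merged["projects"].extend(json.load(f)["projects"])
    return merged


# Two tiny hypothetical exports written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        path = os.path.join(tmp, f"export_{i}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"projects": [{"name": f"project {i}", "documents": []}]}, f)
        paths.append(path)
    single = load_exports(paths[0])
    combined = load_exports(paths)
```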
- train_raw(data_loaded, save_dir_path=None, data_oversampled=None)
Train or continue training a model given raw data. It will continue training if an existing model is loaded or start new training if the model is blank/new.
The raw data is expected in the following format:
    {'projects': [  # list of projects
        {  # project 1
            'name': '<some name>',
            'documents': [  # list of documents
                {  # document 1
                    'name': '<some name>',
                    'text': '<text of the document>',
                    'annotations': [  # list of annotations
                        {'start': -1, 'end': 1, 'cui': 'cui', 'value': '<text value>'},  # annotation 1
                        ...
                    ],
                },
                ...
            ],
        },
        ...
    ]}
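As a concrete instance of this format, the sketch below builds a one-document dictionary and walks it. The document text, the names, and the placeholder CUI are invented for illustration, and `count_annotations` is a hypothetical helper. The check that `text[start:end]` equals `value` is an assumption about how the offsets relate to the text, not something stated above.

```python
raw_data = {
    "projects": [
        {
            "name": "example project",
            "documents": [
                {
                    "name": "doc 1",
                    "text": "Patient has a history of diabetes.",
                    "annotations": [
                        # start/end are character offsets into 'text'.
                        {"start": 25, "end": 33, "cui": "CUI-PLACEHOLDER", "value": "diabetes"},
                    ],
                }
            ],
        }
    ]
}


def count_annotations(data: dict) -> int:
    """Walk projects -> documents -> annotations and count the annotations."""
    count = 0
    for project in data["projects"]:
        for document in project["documents"]:
            for ann in document["annotations"]:
                span = document["text"][ann["start"]:ann["end"]]
                # Assumed consistency check between offsets and value.
                if span != ann["value"]:
                    raise ValueError(f"offsets select {span!r}, expected {ann['value']!r}")
                count += 1
    return count


n_annotations = count_annotations(raw_data)
```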
- Parameters:
data_loaded (Dict) – The raw data we want to train for.
save_dir_path (Optional[str]) – Directory where the best model is saved during training. A save path must be set when auto_save_model is enabled. Defaults to None.
data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in this parameter. The expected format is: [[['text', 'of', 'the', 'document'], [index of medical entity], "label"], [['text', 'of', 'the', 'document'], [index of medical entity], "label"]]
- Returns:
Dict – The resulting report.
- Raises:
Exception – If no save path is specified, or category name not in data.
AssertionError – If no tokeniser is set
FileNotFoundError – If phase_number is set to 2 and model.dat file is not found
KeyError – If phase_number is set to 2 and model.dat file contains mismatched architecture
- Return type:
Dict
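The data_oversampled format is easier to read as code. The sketch below builds two samples in the documented shape and checks them. The tokens and labels are made up, and `make_sample` and `is_valid_sample` are hypothetical helpers rather than part of MedCAT.

```python
from typing import List


def make_sample(tokens: List[str], entity_indices: List[int], label: str) -> list:
    """Build one sample: [tokens, indices of the entity tokens, label]."""
    return [list(tokens), list(entity_indices), label]


def is_valid_sample(sample: list) -> bool:
    """Check the three-part shape and that indices point inside the tokens."""
    if not (isinstance(sample, list) and len(sample) == 3):
        return False
    tokens, entity_indices, label = sample
    if not (isinstance(tokens, list) and all(isinstance(t, str) for t in tokens)):
        return False
    if not isinstance(entity_indices, list) or not isinstance(label, str):
        return False
    return all(isinstance(i, int) and 0 <= i < len(tokens) for i in entity_indices)


# Hypothetical samples; the labels depend on the task being trained.
data_oversampled = [
    make_sample(["no", "evidence", "of", "pneumonia"], [3], "Negated"),
    make_sample(["history", "of", "heart", "failure"], [2, 3], "Affirmed"),
]
all_valid = all(is_valid_sample(s) for s in data_oversampled)
```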
- eval(json_path)
Evaluate from json.
- Parameters:
json_path (str) – The json file path
- Returns:
Dict – The resulting model dict
- Raises:
AssertionError – If self.tokenizer is None
Exception – If the category name does not exist
- Return type:
Dict
- save(save_dir_path)
Save all components of this class to a file
- Parameters:
save_dir_path (str) – Path to the directory where everything will be saved.
- Raises:
AssertionError – If self.tokenizer is None
- Return type:
None
- classmethod load(save_dir_path, config_dict=None)
Load a meta_cat object.
- Parameters:
save_dir_path (str) – The directory where all was saved.
config_dict (Optional[Dict], optional) – Values used to overwrite the saved parameters for this meta_cat instance. This is needed in certain cases, for example automated deployment. (Default value = None)
- Returns:
MetaCAT – The MetaCAT instance
- Return type:
MetaCAT
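The effect of config_dict can be pictured as an override merged on top of the saved configuration. The sketch below shows that idea with plain dictionaries. It does not reproduce how MetaCAT merges configs internally: `apply_overrides` is a hypothetical helper and the config keys are placeholders.

```python
import copy
from typing import Optional


def apply_overrides(saved: dict, overrides: Optional[dict]) -> dict:
    """Return a copy of `saved` with values from `overrides` merged in."""
    merged = copy.deepcopy(saved)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder keys: e.g. a model saved on a GPU machine, deployed on CPU.
saved_config = {"general": {"device": "cuda", "seed": 13}, "train": {"nepochs": 50}}
effective = apply_overrides(saved_config, {"general": {"device": "cpu"}})
```

Only the overridden value changes; all other saved parameters are kept.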
- get_ents(doc)
- Parameters:
doc (spacy.tokens.Doc) –
- Return type:
Iterable[spacy.tokens.Span]
- prepare_document(doc, input_ids, offset_mapping, lowercase)
Prepares document.
- Parameters:
doc (Doc) – The document
input_ids (List) – Input ids
offset_mapping (List) – Offset mappings
lowercase (bool) – Whether to lowercase the replace-center value
- Returns:
Dict – Entity id to index mapping
List – Samples
- Return type:
Tuple
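The role of offset_mapping is to link character offsets of entities to token positions. The sketch below shows that lookup in isolation. It is an assumption-laden illustration, not the body of prepare_document: `tokens_for_span` is a hypothetical helper, and whitespace splitting stands in for a real (sub)word tokenizer.

```python
from typing import List, Tuple


def tokens_for_span(offset_mapping: List[Tuple[int, int]], start: int, end: int) -> List[int]:
    """Indices of tokens whose character range overlaps [start, end)."""
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offset_mapping)
        if tok_start < end and tok_end > start
    ]


text = "no evidence of pneumonia"

# Build (start, end) character offsets for each whitespace-separated token.
offset_mapping = []
pos = 0
for word in text.split(" "):
    offset_mapping.append((pos, pos + len(word)))
    pos += len(word) + 1  # skip the single space after the word

entity_start = text.index("pneumonia")
entity_end = entity_start + len("pneumonia")
entity_token_ids = tokens_for_span(offset_mapping, entity_start, entity_end)
```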
- static batch_generator(stream, batch_size_chars)
Generator for batch of documents.
- Parameters:
stream (Iterable[Doc]) – The document stream
batch_size_chars (int) – Number of characters per batch
- Yields:
List[Doc] – The batch of documents.
- Return type:
Iterable[List[spacy.tokens.Doc]]
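Batching by character count rather than by document count keeps batches of similar size when document lengths vary. The sketch below shows one plausible way to do this on plain strings. `batch_by_chars` is a hypothetical stand-in, and the exact threshold behaviour of the real method (for example whether a batch may exceed the budget) is an assumption.

```python
from typing import Iterable, Iterator, List


def batch_by_chars(texts: Iterable[str], batch_size_chars: int) -> Iterator[List[str]]:
    """Group texts into batches of roughly batch_size_chars characters."""
    batch: List[str] = []
    char_count = 0
    for text in texts:
        batch.append(text)
        char_count += len(text)
        # Emit the batch once the character budget is reached or exceeded.
        if char_count >= batch_size_chars:
            yield batch
            batch, char_count = [], 0
    # Emit whatever is left over at the end of the stream.
    if batch:
        yield batch


batches = list(batch_by_chars(["aaaa", "bb", "cccccc", "d"], batch_size_chars=6))
```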
- pipe(stream, *args, **kwargs)
Process many documents at once.
- Parameters:
stream (Iterable[Union[Doc, FakeDoc]]) – List of spacy documents.
*args – Unused arguments (due to override)
**kwargs – Unused keyword arguments (due to override)
- Yields:
Doc – The document.
- Returns:
Iterator[Doc] – Returned when stream is None or empty.
- Return type:
Iterator[spacy.tokens.Doc]
- _set_meta_anns(stream, batch_size_chars, config, id2category_value)
- Parameters:
stream (Iterable[Union[spacy.tokens.Doc, medcat.utils.meta_cat.data_utils.Doc]]) –
batch_size_chars (int) –
config (medcat.config_meta_cat.ConfigMetaCAT) –
id2category_value (Dict) –
- Return type:
Iterator[Optional[spacy.tokens.Doc]]
- __call__(doc)
Process one document, used in the spacy pipeline for sequential document processing.
- Parameters:
doc (Doc) – A spacy document
- Returns:
Doc – The same spacy document.
- Return type:
spacy.tokens.Doc
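A minimal usage sketch that combines load and __call__ as documented on this page. It assumes MedCAT is installed, that save_dir_path points to a directory written by save, and that `doc` already carries the entities produced by earlier pipeline components. The wrapper name `annotate_with_saved_model` is hypothetical.

```python
def annotate_with_saved_model(save_dir_path, doc):
    """Load a saved MetaCAT and apply it to one spaCy document.

    Assumes `doc` was produced by a pipeline that has already set the
    entities MetaCAT should annotate.
    """
    # Imported lazily so this sketch can be defined without MedCAT installed.
    from medcat.meta_cat import MetaCAT

    meta_cat = MetaCAT.load(save_dir_path)
    # __call__ processes a single document and returns the same document.
    return meta_cat(doc)
```

Loading on every call is wasteful. In real use, load the model once and reuse the instance, or use pipe for many documents.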
- get_model_card(as_dict=False)
A minimal model card.
- Parameters:
as_dict (bool) – Return the model card as a dictionary instead of a str. (Default value = False)
- Returns:
Union[str, dict] – An indented JSON object. OR A JSON object in dict form.
- Return type:
Union[str, dict]
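The as_dict switch follows a common pattern: the same data is returned either as a dictionary or as an indented JSON string. The sketch below illustrates that pattern only. `render_model_card` is a hypothetical function and the card fields are invented, so they may not match what MetaCAT reports.

```python
import json
from typing import Union


def render_model_card(card: dict, as_dict: bool = False) -> Union[str, dict]:
    """Return the card as a dict, or as an indented JSON string by default."""
    if as_dict:
        return card
    return json.dumps(card, indent=2, sort_keys=False)


# Invented fields for illustration.
card = {"Category Name": "Status", "Classes": {"Affirmed": 0, "Other": 1}}
as_text = render_model_card(card)
as_mapping = render_model_card(card, as_dict=True)
```

The string form suits logging and display, while the dict form is easier to inspect programmatically.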
- __repr__()
Prints the model_card for this MetaCAT instance.
- Returns:
The ‘Model Card’ for this MetaCAT instance. This includes NER+L config and any MetaCATs.