:py:mod:`medcat.meta_cat`
=========================

.. py:module:: medcat.meta_cat


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.meta_cat.MetaCAT


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.meta_cat.logger


.. py:data:: logger

.. py:class:: MetaCAT(tokenizer = None, embeddings = None, config = None)

   Bases: :py:obj:`medcat.pipeline.pipe_runner.PipeRunner`

   The MetaCAT class used for training 'Meta-Annotation' models,
   i.e. annotations of clinical concept annotations. These are also known as properties or attributes
   of recognised entities in similar tools such as MetaMap and cTAKES.

   This is a flexible, model-agnostic class that can learn any meta-annotation task, i.e. any multi-class
   classification task for recognised terms.

   :param tokenizer: The Huggingface tokenizer instance. This can be a pre-trained tokenizer instance from a BERT-style model,
                     or trained from scratch for the Bi-LSTM (w. attention) model that is currently used in most deployments.
   :type tokenizer: TokenizerWrapperBase
   :param embeddings: Embedding mapping from (sub)word input id to n-dim (sub)word embedding.
   :type embeddings: Tensor, numpy.ndarray
   :param config: The configuration for MetaCAT. Param descriptions are available in the ConfigMetaCAT docs.
   :type config: ConfigMetaCAT

   .. py:attribute:: name
      :value: 'meta_cat'

   .. py:attribute:: _component_lock

   .. py:method:: __init__(tokenizer = None, embeddings = None, config = None)

   .. py:method:: get_model(embeddings)

      Get the model.

      :param embeddings: The embedding tensor.
      :type embeddings: Optional[Tensor]

      :raises ValueError: If the meta model is not LSTM or BERT.

      :Returns: **nn.Module** -- The module.

   .. py:method:: get_hash()

      A partial hash trying to catch differences between models.

      :Returns: **str** -- The hex hash.

   .. py:method:: train_from_json(json_path, save_dir_path = None, data_oversampled = None)

      Train or continue training a model given a json_path containing a MedCATtrainer export.
      It will continue training if an existing model is loaded, or start new training if the model is blank/new.

      :param json_path: Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.
      :type json_path: Union[str, list]
      :param save_dir_path: Required if ``auto_save_model`` is enabled (meaning the best model is saved during training),
                            since a save path is then needed. Defaults to `None`.
      :type save_dir_path: Optional[str]
      :param data_oversampled: If oversampling is performed, the oversampled data is passed in this parameter.
      :type data_oversampled: Optional[list]

      :Returns: **Dict** -- The resulting report.

   .. py:method:: train_raw(data_loaded, save_dir_path = None, data_oversampled = None)

      Train or continue training a model given raw data. It will continue training if an existing model is loaded,
      or start new training if the model is blank/new.

      The raw data is expected in the following format::

         {'projects':
             [ # list of projects
                 { # project 1
                     'name': '',
                     # list of documents
                     'documents': [{'name': '',  # document 1
                                    'text': '',
                                    # list of annotations
                                    'annotations': [{'start': -1,  # annotation 1
                                                     'end': 1,
                                                     'cui': 'cui',
                                                     'value': ''}, ...],
                                    }, ...]
                 }, ...
             ]
         }

      :param data_loaded: The raw data we want to train for.
      :type data_loaded: Dict
      :param save_dir_path: Required if ``auto_save_model`` is enabled (meaning the best model is saved during training),
                            since a save path is then needed. Defaults to `None`.
      :type save_dir_path: Optional[str]
      :param data_oversampled: If oversampling is performed, the oversampled data is passed in this parameter.
                               The expected format is
                               ``[[['text','of','the','document'], [index of medical entity], "label"], [['text','of','the','document'], [index of medical entity], "label"]]``.
      :type data_oversampled: Optional[list]

      :Returns: **Dict** -- The resulting report.

      :raises Exception: If no save path is specified, or the category name is not in the data.
      :raises AssertionError: If no tokenizer is set.
      :raises FileNotFoundError: If phase_number is set to 2 and the model.dat file is not found.
      :raises KeyError: If phase_number is set to 2 and the model.dat file contains a mismatched architecture.

   .. py:method:: eval(json_path)

      Evaluate from json.

      :param json_path: The json file path.
      :type json_path: str

      :Returns: **Dict** -- The resulting model dict.

      :raises AssertionError: If self.tokenizer is None.
      :raises Exception: If the category name does not exist.

   .. py:method:: save(save_dir_path)

      Save all components of this class to a file.

      :param save_dir_path: Path to the directory where everything will be saved.
      :type save_dir_path: str

      :raises AssertionError: If self.tokenizer is None.

   .. py:method:: load(save_dir_path, config_dict = None)
      :classmethod:

      Load a meta_cat object.

      :param save_dir_path: The directory where everything was saved.
      :type save_dir_path: str
      :param config_dict: This can be used to overwrite saved parameters for this meta_cat instance.
                          It is needed in certain cases where models are deployed automatically. (Default value = None)
      :type config_dict: Optional[Dict], optional

      :Returns: **MetaCAT** -- The MetaCAT instance.

   .. py:method:: get_ents(doc)

   .. py:method:: prepare_document(doc, input_ids, offset_mapping, lowercase)

      Prepares document.

      :param doc: The document.
      :type doc: Doc
      :param input_ids: Input ids.
      :type input_ids: List
      :param offset_mapping: Offset mappings.
      :type offset_mapping: List
      :param lowercase: Whether to use lower case replace center.
      :type lowercase: bool

      :Returns: * **Dict** -- Entity id to index mapping.
                * **List** -- Samples.

   .. py:method:: batch_generator(stream, batch_size_chars)
      :staticmethod:

      Generator for batches of documents.

      :param stream: The document stream.
      :type stream: Iterable[Doc]
      :param batch_size_chars: Number of characters per batch.
      :type batch_size_chars: int

      :Yields: *List[Doc]* -- The batch of documents.

   .. py:method:: pipe(stream, *args, **kwargs)

      Process many documents at once.

      :param stream: List of spacy documents.
      :type stream: Iterable[Union[Doc, FakeDoc]]
      :param \*args: Unused arguments (due to override).
      :param \*\*kwargs: Unused keyword arguments (due to override).

      :Yields: *Doc* -- The document.

      :Returns: **Iterator[Doc]** -- Returned when stream is None or empty.

   .. py:method:: _set_meta_anns(stream, batch_size_chars, config, id2category_value)

   .. py:method:: __call__(doc)

      Process one document. Used in the spacy pipeline for sequential document processing.

      :param doc: A spacy document.
      :type doc: Doc

      :Returns: **Doc** -- The same spacy document.

   .. py:method:: get_model_card(as_dict = False)

      A minimal model card.

      :param as_dict: Return the model card as a dictionary instead of a str. (Default value = False)
      :type as_dict: bool

      :Returns: **Union[str, dict]** -- An indented JSON string, or the same JSON object in dict form.

   .. py:method:: __repr__()

      Prints the model_card for this MetaCAT instance.

      :Returns: **str** -- The 'Model Card' for this MetaCAT instance. This includes NER+L config and any MetaCATs.
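
The raw-data layout that ``train_raw`` documents can be assembled with plain dictionaries. The snippet below is a minimal sketch under stated assumptions, not part of MedCAT: the helpers ``make_document`` and ``make_raw_export`` are hypothetical names introduced here for illustration, the CUI is a placeholder, and only the keys listed in the documented format are used (a real MedCATtrainer export may carry additional fields).

```python
from typing import Dict, List

# Keys each annotation carries in the documented train_raw layout.
REQUIRED_ANNOTATION_KEYS = {"start", "end", "cui", "value"}


def make_document(name: str, text: str, annotations: List[Dict]) -> Dict:
    """Hypothetical helper: build one document entry and check annotation keys."""
    for ann in annotations:
        missing = REQUIRED_ANNOTATION_KEYS - ann.keys()
        if missing:
            raise ValueError(f"annotation is missing keys: {sorted(missing)}")
    return {"name": name, "text": text, "annotations": annotations}


def make_raw_export(project_name: str, documents: List[Dict]) -> Dict:
    """Hypothetical helper: wrap documents in the {'projects': [...]} layout."""
    return {"projects": [{"name": project_name, "documents": documents}]}


text = "Patient denies chest pain."
mention = "chest pain"
start = text.index(mention)

document = make_document(
    "doc-1",
    text,
    [{
        "start": start,
        "end": start + len(mention),
        "cui": "CUI-PLACEHOLDER",  # placeholder, not a real concept identifier
        "value": mention,
    }],
)
data = make_raw_export("demo-project", [document])
```

A dictionary built this way has the shape described for the ``data_loaded`` argument of ``train_raw``. Actual training additionally needs a ``MetaCAT`` instance with a tokenizer set and, per the parameter description, a ``save_dir_path`` when the best model is saved during training.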