:py:mod:`medcat.cat`
====================

.. py:module:: medcat.cat


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.cat.CAT


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.cat.logger
   medcat.cat.HAS_NEW_SPACY
   medcat.cat.MIN_GEN_LEN_FOR_WARN


.. py:data:: logger

.. py:data:: HAS_NEW_SPACY

.. py:data:: MIN_GEN_LEN_FOR_WARN
   :value: 10000

.. py:class:: CAT(cdb, vocab = None, config = None, meta_cats = [], rel_cats = [], addl_ner = [])

   Bases: :py:obj:`object`

   The main MedCAT class used to annotate documents. It is built on top of spaCy and works as a spaCy pipeline. Creating an instance builds a spaCy pipeline that can be used as a spaCy nlp model.

   :param cdb: The concept database that will be used for NER+L.
   :type cdb: medcat.cdb.CDB
   :param config: Global configuration for MedCAT.
   :type config: medcat.config.Config
   :param vocab: Vocabulary used for vector embeddings and spelling. Default: None
   :type vocab: medcat.vocab.Vocab, optional
   :param meta_cats: A list of models that will be applied sequentially on each detected annotation.
   :type meta_cats: list of medcat.meta_cat.MetaCAT, optional
   :param rel_cats: List of models applied sequentially on all detected annotations.
   :type rel_cats: list of medcat.rel_cat.RelCAT, optional

   Attributes (limited):
       cdb (medcat.cdb.CDB):
           Concept database used with this CAT instance; please do not assign this value directly.
       config (medcat.config.Config):
           The global configuration for MedCAT. Usually cdb.config will be used for this field. WILL BE REMOVED - TEMPORARY PLACEHOLDER
       vocab (medcat.vocab.Vocab):
           The vocabulary object used with this instance; please do not assign this value directly.

   .. rubric:: Examples

   >>> cat = CAT(cdb, vocab)
   >>> spacy_doc = cat("Put some text here")
   >>> print(spacy_doc.ents) # Detected entities

   .. py:attribute:: DEFAULT_MODEL_PACK_NAME
      :value: 'medcat_model_pack'

   .. py:method:: __init__(cdb, vocab = None, config = None, meta_cats = [], rel_cats = [], addl_ner = [])

   .. py:method:: _create_pipeline(config)

   .. py:method:: get_hash(force_recalc = False)

      Compute a hash of this CAT instance. This is not a deep hash, but it tries to capture all the parts that change during training.

      Recalculation of the hash can be forced. This is relevant for the CDB, whose hash is otherwise only recalculated if it has changed.

      :param force_recalc: Whether to force recalculation. Defaults to False.
      :type force_recalc: bool

      :Returns: **str** -- The resulting hash

   .. py:method:: get_model_card(as_dict = False)

      A minimal model card for MedCAT model packs.

      :param as_dict: Whether to return the model card as a dictionary instead of a str (Default value False).
      :type as_dict: bool

      :Returns:

          * **str** -- The string representation of the JSON object (when as_dict is False).
          * **dict** -- The JSON object as a dict (when as_dict is True).

   .. py:method:: _versioning(force_rehash = False)

   .. py:method:: create_model_pack(save_dir_path, model_pack_name = DEFAULT_MODEL_PACK_NAME, force_rehash = False, cdb_format = 'dill')

      Will create a .zip file containing all the models in the current running instance of MedCAT. This is not the most efficient approach, but it is good enough for now.

      :param save_dir_path: The directory in which to save the model pack.
      :type save_dir_path: str
      :param model_pack_name: The model pack name; an id will be appended to this name. Defaults to DEFAULT_MODEL_PACK_NAME.
      :type model_pack_name: str
      :param force_rehash: Force recalculation of hash. Defaults to `False`.
      :type force_rehash: bool
      :param cdb_format: The format of the saved CDB in the model pack. The available formats are 'dill' and 'json'. Defaults to 'dill'.
      :type cdb_format: str

      :Returns: **str** -- Model pack name

   .. py:method:: attempt_unpack(zip_path)
      :classmethod:

      Attempt to unpack the zip to a folder and get the model pack path. If the folder already exists, no unpacking is done.

      :param zip_path: The ZIP path
      :type zip_path: str

      :Returns: **str** -- The model pack path

   .. py:method:: load_model_pack(zip_path, meta_cat_config_dict = None, ner_config_dict = None, medcat_config_dict = None, load_meta_models = True, load_addl_ner = True, load_rel_models = True)
      :classmethod:

      Load everything within the 'model pack', i.e. the CDB, config, vocab and any MetaCAT models (if present).

      :param zip_path: The path to the model pack zip.
      :type zip_path: str
      :param meta_cat_config_dict: A config dict that will overwrite existing configs in meta_cat, e.g. meta_cat_config_dict = {'general': {'device': 'cpu'}}. Defaults to None.
      :type meta_cat_config_dict: Optional[Dict]
      :param ner_config_dict: A config dict that will overwrite existing configs in transformers ner, e.g. ner_config_dict = {'general': {'chunking_overlap_window': 6}}. Defaults to None.
      :type ner_config_dict: Optional[Dict]
      :param medcat_config_dict: A config dict that will overwrite existing configs in the main medcat config before pipe initialisation. This can be useful when changing something that only takes effect at init time (e.g. the spacy model). Defaults to None.
      :type medcat_config_dict: Optional[Dict]
      :param load_meta_models: Whether to load MetaCAT models if present (Default value True).
      :type load_meta_models: bool
      :param load_addl_ner: Whether to load additional NER models if present (Default value True).
      :type load_addl_ner: bool
      :param load_rel_models: Whether to load RelCAT models if present (Default value True).
      :type load_rel_models: bool

      :Returns: **CAT** -- The resulting CAT object.

   .. py:method:: load_cdb(model_pack_path)
      :classmethod:

      Loads the concept database from the provided model pack path.

      :param model_pack_path: Path to the model pack, zip or dir.
      :type model_pack_path: str

      :Returns: **CDB** -- The loaded concept database

   .. py:method:: load_meta_cats(model_pack_path, meta_cat_config_dict = None)
      :classmethod:

      :param model_pack_path: Path to the model pack, zip or dir.
      :type model_pack_path: str
      :param meta_cat_config_dict: A config dict that will overwrite existing configs in meta_cat, e.g. meta_cat_config_dict = {'general': {'device': 'cpu'}}. Defaults to None.
      :type meta_cat_config_dict: Optional[Dict]

      :Returns: **List[Tuple[str, MetaCAT]]** -- List of pairs of meta cat model names (i.e. the task name) and the MetaCAT models.

   .. py:method:: __call__(text, do_train = False)

      Push the text through the pipeline.

      :param text: The text to be annotated. If the text is longer than self.config.preprocessing['max_document_length'], it will be trimmed to that length.
      :type text: Optional[str]
      :param do_train: Whether to train while annotating. Leaving this unset has caused many problems, so training is forced to False by default. To run training it is much better to use the self.train() function; the option is kept here for some special cases. Defaults to `False`.
      :type do_train: bool

      :Returns: **Optional[Doc]** -- A single spacy document or multiple spacy documents with the extracted entities

   .. py:method:: __repr__()

      Prints the model_card for this CAT instance.

      :Returns: **str** -- The 'Model Card' for this CAT instance. This includes NER+L config and any MetaCATs.

   .. py:method:: _print_stats(data, epoch = 0, use_project_filters = False, use_overlaps = False, use_cui_doc_limit = False, use_groups = False, extra_cui_filter = None, do_print = True)

      TODO: Refactor and make nice

      Print metrics on a dataset (F1, P, R). It will also print the concepts that have the most FP, FN, TP.

      :param data: The json object that we get from MedCATtrainer on export.
      :type data: Dict
      :param epoch: Used during training, so we know which epoch it is.
      :type epoch: int
      :param use_project_filters: Each project in MedCATtrainer can have filters; this controls whether those filters are respected when calculating metrics.
      :type use_project_filters: bool
      :param use_overlaps: Allow overlapping entities. Nearly always False, as it is very difficult to annotate overlapping entities.
      :type use_overlaps: bool
      :param use_cui_doc_limit: If True, the metrics for a CUI will only be calculated if that CUI appears in a document, in other words if the document was annotated for that CUI. Useful in very specific situations when the set of CUIs changed during the annotation process.
      :type use_cui_doc_limit: bool
      :param use_groups: If True, concepts that have groups will be combined and stats will be reported on groups.
      :type use_groups: bool
      :param extra_cui_filter: This filter will be intersected with all other filters, or if all others are not set then only this one will be used.
      :type extra_cui_filter: Optional[Set]
      :param do_print: Whether to print stats out. Defaults to True.
      :type do_print: bool

      :Returns:

          * **fps** (*dict*) -- False positives for each CUI.
          * **fns** (*dict*) -- False negatives for each CUI.
          * **tps** (*dict*) -- True positives for each CUI.
          * **cui_prec** (*dict*) -- Precision for each CUI.
          * **cui_rec** (*dict*) -- Recall for each CUI.
          * **cui_f1** (*dict*) -- F1 for each CUI.
          * **cui_counts** (*dict*) -- Number of occurrences for each CUI.
          * **examples** (*dict*) -- Examples for each of the fp, fn, tp. Format will be examples['fp']['cui'][].

   .. py:method:: _init_ckpts(is_resumed, checkpoint)

   .. py:method:: train(data_iterator, nepochs = 1, fine_tune = True, progress_print = 1000, checkpoint = None, is_resumed = False)

      Runs training on the data. Note that the maximum length of a line or document is 1M characters; anything longer will be trimmed.

      :param data_iterator: Simple iterator over sentences/documents, e.g. an open file, an array, or anything that can be used in a for loop.
      :type data_iterator: Iterable
      :param nepochs: Number of epochs for which to run the training.
      :type nepochs: int
      :param fine_tune: If False, old training will be removed.
      :type fine_tune: bool
      :param progress_print: Print progress after N lines.
      :type progress_print: int
      :param checkpoint: The MedCAT checkpoint object
      :type checkpoint: Optional[medcat.utils.checkpoint.CheckpointUT]
      :param is_resumed: If True resume the previous training; If False, start a fresh new training.
      :type is_resumed: bool

   .. py:method:: add_cui_to_group(cui, group_name)

      Adds a CUI to a group, will appear in cdb.addl_info['cui2group']

      :param cui: The concept to be added.
      :type cui: str
      :param group_name: The group to which the concept will be added.
      :type group_name: str

      .. rubric:: Examples

      >>> cat.add_cui_to_group("S-17", 'pain')

   .. py:method:: unlink_concept_name(cui, name, preprocessed_name = False)

      Unlink a concept name from the CUI (or all CUIs if full_unlink), removes the link from the Concept Database (CDB). As a consequence medcat will never again link the `name` to this CUI - meaning the name will not be detected as a concept in the future.

      :param cui: The CUI from which the `name` will be removed.
      :type cui: str
      :param name: The span of text to be removed from the linking dictionary.
      :type name: str
      :param preprocessed_name: Whether the name being used is preprocessed.
      :type preprocessed_name: bool

      .. rubric:: Examples

      >>> # To never again link C0020538 to HTN
      >>> cat.unlink_concept_name('C0020538', 'htn', False)

   .. py:method:: add_and_train_concept(cui, name, spacy_doc = None, spacy_entity = None, ontologies = set(), name_status = 'A', type_ids = set(), description = '', full_build = True, negative = False, devalue_others = False, do_add_concept = True)

      Add a name to an existing concept, or add a new concept, or do not do anything if the name or concept already exists. Perform training if spacy_entity and spacy_doc are set.

      :param cui: CUI of the concept.
      :type cui: str
      :param name: Name to be linked to the concept (in the case of MedCATtrainer this is simply the selected value in text, no preprocessing or anything needed).
      :type name: str
      :param spacy_doc: Spacy representation of the document that was manually annotated.
      :type spacy_doc: spacy.tokens.Doc
      :param spacy_entity: Given the spacy document, this is the annotated span of text, i.e. the list of annotated tokens that are marked with this CUI.
      :type spacy_entity: Optional[Union[List[Token], Span]]
      :param ontologies: Ontologies in which the concept exists (e.g. SNOMEDCT, HPO).
      :type ontologies: Set[str]
      :param name_status: One of `P`, `N`, `A`
      :type name_status: str
      :param type_ids: Semantic type identifier (have a look at TUIs in UMLS or SNOMED-CT).
      :type type_ids: Set[str]
      :param description: Description of this concept.
      :type description: str
      :param full_build: If True, the dictionary self.addl_info will also be populated. It contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default value `True`).
      :type full_build: bool
      :param negative: Whether this is a negative (True) or positive (False) example.
      :type negative: bool
      :param devalue_others: If set, CUIs to which this name is assigned, other than `cui`, will receive negative training, provided that negative=False.
      :type devalue_others: bool
      :param do_add_concept: Whether to add concept to CDB.
      :type do_add_concept: bool

   .. py:method:: train_supervised_from_json(data_path, reset_cui_count = False, nepochs = 1, print_stats = 0, use_filters = False, terminate_last = False, use_overlaps = False, use_cui_doc_limit = False, test_size = 0, devalue_others = False, use_groups = False, never_terminate = False, train_from_false_positives = False, extra_cui_filter = None, retain_extra_cui_filter = False, checkpoint = None, retain_filters = False, is_resumed = False)

      Run supervised training on a dataset from MedCATtrainer in JSON format.

      Refer to `train_supervised_raw` for more details.

   .. py:method:: train_supervised_raw(data, reset_cui_count = False, nepochs = 1, print_stats = 0, use_filters = False, terminate_last = False, use_overlaps = False, use_cui_doc_limit = False, test_size = 0, devalue_others = False, use_groups = False, never_terminate = False, train_from_false_positives = False, extra_cui_filter = None, retain_extra_cui_filter = False, checkpoint = None, retain_filters = False, is_resumed = False)

      Train supervised based on the raw data provided.

      The raw data is expected in the following format::

          {'projects':
              [ # list of projects
                  { # project 1
                      'name': '',
                      # list of documents
                      'documents': [{'name': '',  # document 1
                                     'text': '',
                                     # list of annotations
                                     'annotations': [{'start': -1,  # annotation 1
                                                      'end': 1,
                                                      'cui': 'cui',
                                                      'value': ''}, ...],
                                     }, ...]
                  }, ...
              ]
          }

      Please note that this is closer to simulated online training than to supervised training.

      When filtering, the filters within the CAT model are used first, then the ones from MedCATtrainer (MCT) export filters, and finally the extra_cui_filter (if set). That is to say, the expectation is: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter.

      :param data: The raw data, e.g. from MedCATtrainer on export.
      :type data: Dict[str, List[Dict[str, dict]]]
      :param reset_cui_count: Used for training with weight_decay (annealing). Each concept has a count that is there from the beginning of the CDB; that count is used for annealing. Resetting the count will significantly increase the training impact. This will reset the count only for concepts that exist in the training data.
      :type reset_cui_count: bool
      :param nepochs: Number of epochs for which to run the training.
      :type nepochs: int
      :param print_stats: If > 0 it will print stats every print_stats epochs.
      :type print_stats: int
      :param use_filters: Each project in MedCATtrainer can have filters; this controls whether those filters are respected when calculating metrics.
      :type use_filters: bool
      :param terminate_last: If True, concept termination will be done after all training.
      :type terminate_last: bool
      :param use_overlaps: Allow overlapping entities. Nearly always False, as it is very difficult to annotate overlapping entities.
      :type use_overlaps: bool
      :param use_cui_doc_limit: If True, the metrics for a CUI will only be calculated if that CUI appears in a document, in other words if the document was annotated for that CUI. Useful in very specific situations when the set of CUIs changed during the annotation process.
      :type use_cui_doc_limit: bool
      :param test_size: If > 0 the data set will be split into train and test sets based on this ratio. Should be between 0 and 1. Usually 0.1 is fine.
      :type test_size: float
      :param devalue_others: Check add_name for more details.
      :type devalue_others: bool
      :param use_groups: If True, concepts that have groups will be combined and stats will be reported on groups.
      :type use_groups: bool
      :param never_terminate: If True, no termination will be applied.
      :type never_terminate: bool
      :param train_from_false_positives: If True, it will use false positive examples detected by medcat and train from them as negative examples.
      :type train_from_false_positives: bool
      :param extra_cui_filter: This filter will be intersected with all other filters, or if all others are not set then only this one will be used.
      :type extra_cui_filter: Optional[Set]
      :param retain_extra_cui_filter: Whether to retain the extra filters instead of the MedCATtrainer export filters. This will only have an effect if/when retain_filters is set to True. Defaults to False.
      :type retain_extra_cui_filter: bool
      :param checkpoint: The MedCAT CheckpointST object
      :type checkpoint: Optional[medcat.utils.checkpoint.CheckpointST]
      :param retain_filters: If True, retain the filters in the MedCATtrainer export within this CAT instance. In other words, the filters defined in the input file will henceforth be saved within config.linking.filters.
         This only makes sense if there is only one project in the input data. If that is not the case, a ValueError is raised. The merging is done in the first epoch.
      :type retain_filters: bool
      :param is_resumed: If True, resume the previous training; if False, start a fresh new training.
      :type is_resumed: bool

      :raises ValueError: If attempting to retain filters while training over multiple projects.

      :Returns: **Tuple** -- Consisting of the following parts:

          * fp (dict): False positives for each CUI.
          * fn (dict): False negatives for each CUI.
          * tp (dict): True positives for each CUI.
          * p (dict): Precision for each CUI.
          * r (dict): Recall for each CUI.
          * f1 (dict): F1 for each CUI.
          * cui_counts (dict): Number of occurrences for each CUI.
          * examples (dict): FP/FN examples of sentences for each CUI.

   .. py:method:: get_entities(text, only_cui = False, addl_info = ['cui2icd10', 'cui2ontologies', 'cui2snomed'])

   .. py:method:: get_entities_multi_texts(texts, only_cui = False, addl_info = ['cui2icd10', 'cui2ontologies', 'cui2snomed'], n_process = None, batch_size = None)

      Get entities for multiple texts.

      :param texts: Texts to be annotated
      :type texts: Union[Iterable[str], Iterable[Tuple]]
      :param only_cui: Whether to only return CUIs. Defaults to False.
      :type only_cui: bool
      :param addl_info: Additional info. Defaults to ['cui2icd10', 'cui2ontologies', 'cui2snomed'].
      :type addl_info: List[str]
      :param n_process: Number of processes. Defaults to None.
      :type n_process: Optional[int]
      :param batch_size: The size of a batch. Defaults to None.
      :type batch_size: Optional[int]

      :raises ValueError: If there's a known issue with multiprocessing.
      :raises RuntimeError: If there's an unknown issue with multiprocessing.

      :Returns: **List[Dict]** -- List of entity documents.

   .. py:method:: get_json(text, only_cui = False, addl_info = ['cui2icd10', 'cui2ontologies'])

      Get output in json format.

      :param text: Text to be annotated
      :type text: str
      :param only_cui: Whether to only get CUIs. Defaults to False.
      :type only_cui: bool
      :param addl_info: Additional info. Defaults to ['cui2icd10', 'cui2ontologies'].
      :type addl_info: List[str]

      :Returns: **str** -- Json with fields {'entities': <>, 'text': text}.

   .. py:method:: _get_training_start(train_set, latest_trained_step)
      :staticmethod:

   .. py:method:: _separate_nn_components()

   .. py:method:: _run_nn_components(docs, nn_components, id2text)

      This will add meta_anns in-place to the docs dict.

   .. py:method:: _batch_generator(data, batch_size_chars, skip_ids = set())

   .. py:method:: _save_docs_to_file(docs, annotated_ids, save_dir_path, annotated_ids_path, part_counter = 0)

   .. py:method:: multiprocessing_batch_char_size(data, nproc = 2, batch_size_chars = 5000 * 1000, only_cui = False, addl_info = [], separate_nn_components = True, out_split_size_chars = None, save_dir_path = os.path.abspath(os.getcwd()), min_free_memory = 0.1, min_free_memory_size = None, enabled_progress_bar = True)

      Run multiprocessing for inference. If save_dir_path and out_split_size_chars are used, this will also continue annotating documents if something is already saved in that directory.

      This method batches the data based on the number of characters as specified by the user.

      PS: This method is unlikely to work on a Windows machine.

      :param data: Iterator or array with format: [(id, text), (id, text), ...]
      :param nproc: Number of processors. Defaults to 2.
      :type nproc: int
      :param batch_size_chars: Size of a batch in number of characters. This should be around: NPROC * average_document_length * 200. Defaults to 5000 * 1000.
      :type batch_size_chars: int
      :param only_cui: Whether to only return the CUIs rather than the full annotations. Defaults to False.
      :type only_cui: bool
      :param addl_info: The additional information. Defaults to [].
      :type addl_info: List[str]
      :param separate_nn_components: If set, the medcat pipe will be broken up into NN and non-NN components and they will be run sequentially.
         This is useful as the NN components have batching and like to process many docs at once, while the rest of the pipeline runs the documents one by one. Defaults to True.
      :type separate_nn_components: bool
      :param out_split_size_chars: If set, once more than out_split_size_chars characters are annotated they will be saved to a file (in save_dir_path) and the memory cleared. Recommended value is 20*batch_size_chars.
      :type out_split_size_chars: Optional[int]
      :param save_dir_path: Where to save the annotated documents if splitting. Defaults to the current working directory.
      :type save_dir_path: str
      :param min_free_memory: If set, a process will not start unless at least this fraction of RAM is free. Should be a value in the range [0, 1] indicating how much of the memory has to be free. Helps when annotating very large datasets because spacy is not the best with memory management and multiprocessing. If both `min_free_memory` and `min_free_memory_size` are set, a ValueError is raised. Defaults to 0.1.
      :type min_free_memory: float
      :param min_free_memory_size: If set, the process will not start unless the specified amount of memory is available. For reference, we would recommend at least 5GB of memory for a full SNOMED model. You can use human readable sizes (e.g. 2GB, 2000MB and so on). If both `min_free_memory` and `min_free_memory_size` are set, a ValueError is raised. Defaults to None.
      :type min_free_memory_size: Optional[str]
      :param enabled_progress_bar: Whether to enable the progress bar. Defaults to True.
      :type enabled_progress_bar: bool

      :raises Exception: If multiprocessing cannot be done.
      :raises ValueError: If both free memory specifiers are provided.

      :Returns: **Dict** -- {id: doc_json, id2: doc_json2, ...}. If out_split_size_chars is used, only the last batch is returned, while that batch and all previous batches are written to disk (save_dir_path).

   .. py:method:: _multiprocessing_batch(data, nproc = 8, batch_size_chars = 1000000, only_cui = False, addl_info = [], nn_components = [], min_free_memory = 0.1, min_free_memory_size = None)

      Run multiprocessing on one batch.

      :param data: Iterator or array with format: [(id, text), (id, text), ...].
      :param nproc: Number of processors. Defaults to 8.
      :type nproc: int
      :param batch_size_chars: Size of a batch in number of characters. Defaults to 1000000.
      :type batch_size_chars: int
      :param only_cui: Whether to get only CUIs. Defaults to False.
      :type only_cui: bool
      :param addl_info: Additional info. Defaults to [].
      :type addl_info: List[str]
      :param nn_components: NN components in case there's a separation. Defaults to [].
      :type nn_components: List
      :param min_free_memory: If set, a process will not start unless at least this fraction of RAM is free. Should be a value in the range [0, 1] indicating how much of the memory has to be free. Helps when annotating very large datasets because spacy is not the best with memory management and multiprocessing. Defaults to 0.1.
      :type min_free_memory: float
      :param min_free_memory_size: The minimum human readable memory size required. Defaults to None.
      :type min_free_memory_size: Optional[str]

      :Returns: **Dict** -- {id: doc_json, id2: doc_json2, ...}

   .. py:method:: multiprocessing_batch_docs_size(in_data, nproc = None, batch_size = None, only_cui = False, addl_info = ['cui2icd10', 'cui2ontologies', 'cui2snomed'], return_dict = True, batch_factor = 2)

      Run multiprocessing. NOT FOR TRAINING.

      This method batches the data based on the number of documents as specified by the user.

      NOTE: When providing a generator for `in_data`, the generator is evaluated (`list(in_data)`) and thus all the data is kept in memory and (potentially) duplicated for use in multiple threads. So if you're using a lot of data, it may be better to use `CAT.multiprocessing_batch_char_size` instead.

      PS: This method supports Windows.

      :param in_data: List with format: [(id, text), (id, text), ...]
      :type in_data: Union[List[Tuple], Iterable[Tuple]]
      :param nproc: The number of processors. Defaults to None.
      :type nproc: Optional[int]
      :param batch_size: The number of texts to buffer. Defaults to None.
      :type batch_size: Optional[int]
      :param only_cui: Whether to get only CUIs. Defaults to False.
      :type only_cui: bool
      :param addl_info: Additional info. Defaults to ['cui2icd10', 'cui2ontologies', 'cui2snomed'].
      :type addl_info: List[str]
      :param return_dict: Flag for returning either a dict or a list of tuples. Defaults to True.
      :type return_dict: bool
      :param batch_factor: Batch factor. Defaults to 2.
      :type batch_factor: int

      :raises ValueError: When number of processes is 0.

      :Returns: **Union[List[Tuple], Dict]** -- {id: doc_json, id: doc_json, ...} or, if return_dict is False, a list of tuples: [(id, doc_json), (id, doc_json), ...]

   .. py:method:: _mp_cons(in_q, out_list, min_free_memory, lock, min_free_memory_size = None, pid = 0, only_cui = False, addl_info = [])

   .. py:method:: _add_nested_ent(doc, _ents, _ent)

   .. py:method:: _doc_to_out(doc, only_cui, addl_info, out_with_text = False)

   .. py:method:: _get_trimmed_text(text)

   .. py:method:: _generate_trimmed_texts(texts)

   .. py:method:: _get_trimmed_texts(texts)

   .. py:method:: _pipe_error_handler(proc_name, proc, docs, e)
      :staticmethod:

   .. py:method:: _get_doc_annotations(doc)
      :staticmethod:

   .. py:method:: destroy_pipe()
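
As a rough illustration of the raw data layout that ``train_supervised_raw`` documents (projects containing documents containing annotations), the sketch below builds a minimal dictionary of that shape and walks it with the standard library only. It does not call MedCAT. The helper names ``iter_annotations`` and ``find_offset_mismatches``, the project and document names and the sample text are all made up for illustration. It also assumes that ``start`` and ``end`` are character offsets into ``text`` with an exclusive end, which this page does not state explicitly.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple

# Hypothetical sample in the layout documented for train_supervised_raw.
# Names and text are invented; the CUIs are borrowed from the examples above.
raw_data: Dict[str, List[dict]] = {
    "projects": [
        {
            "name": "demo-project",
            "documents": [
                {
                    "name": "doc-1",
                    "text": "Patient has htn and reports pain.",
                    "annotations": [
                        {"start": 12, "end": 15, "cui": "C0020538", "value": "htn"},
                        {"start": 28, "end": 32, "cui": "S-17", "value": "pain"},
                    ],
                }
            ],
        }
    ]
}


def iter_annotations(data: dict) -> Iterator[Tuple[str, str, dict]]:
    """Yield (document name, document text, annotation) for each annotation."""
    for project in data["projects"]:
        for document in project["documents"]:
            for annotation in document["annotations"]:
                yield document["name"], document["text"], annotation


def find_offset_mismatches(data: dict) -> List[str]:
    """List annotations whose text[start:end] differs from their 'value'.

    Assumption: 'start' and 'end' are character offsets with an exclusive end.
    """
    problems = []
    for doc_name, text, ann in iter_annotations(data):
        if text[ann["start"]:ann["end"]] != ann["value"]:
            problems.append(f"{doc_name}: {ann['cui']} at {ann['start']}-{ann['end']}")
    return problems


# Count how often each CUI is annotated across all projects.
cui_counts = Counter(ann["cui"] for _, _, ann in iter_annotations(raw_data))

print(find_offset_mismatches(raw_data))
print(dict(cui_counts))
```

A check like this can be run on an export before training to catch malformed annotations early. Passing the dictionary to a real ``CAT`` instance is deliberately left out here.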