:py:mod:`medcat.cdb` ==================== .. py:module:: medcat.cdb .. autoapi-nested-parse:: Representation class for CDB data Module Contents --------------- Classes ~~~~~~~ .. autoapisummary:: medcat.cdb.CDB Attributes ~~~~~~~~~~ .. autoapisummary:: medcat.cdb.logger .. py:data:: logger .. py:class:: CDB(config = None) Bases: :py:obj:`object` Concept DataBase - holds all information necessary for NER+L. Properties: name2cuis (Dict[str, List[str]]): Map from concept name to CUIs - one name can map to multiple CUIs. name2cuis2status (Dict[str, Dict[str, str]]): The status for a given name and CUI pair - each name can be: P - Preferred, A - Automatic (e.g. let medcat decide), N - Not common. snames (Set[str]): All possible subnames for all concepts. cui2names (Dict[str, Set[str]]): From CUI to all names assigned to it. Mainly (perhaps only) used for subsetting. cui2snames (Dict[str, Set[str]]): From CUI to all sub-names assigned to it. Only used for subsetting. cui2context_vectors (Dict[str, Dict[str, np.array]]): From CUI to a dictionary of different kinds of context vectors. Normally you would have here a short and a long context vector - they are calculated separately. cui2count_train (Dict[str, int]): From CUI to the number of training examples seen. cui2tags (Dict[str, List[str]]): From CUI to a list of tags. This can be used to tag concepts for grouping or other purposes. cui2type_ids (Dict[str, Set[str]]): From CUI to type id (e.g. TUI in UMLS). cui2preferred_name (Dict[str, str]): From CUI to the preferred name for this concept. cui2average_confidence (Dict[str, str]): Used for dynamic thresholding. Holds the average confidence for this CUI given the training examples. name2count_train (Dict[str, str]): Counts how often a name appeared during training. addl_info (Dict[str, Dict[]]): Any additional maps that are not part of the core CDB. These are usually not needed for the base NER+L use-case, but can be useful for debugging or special use cases. 
vocab (Dict[str, int]): Stores all the words that appear in this CDB and the count for each one. is_dirty (bool): Whether or not the CDB has been changed since it was loaded or created. .. py:method:: __init__(config = None) .. py:method:: _init_waf_from_config() .. py:method:: get_name(cui) Returns preferred name if it exists, otherwise it will return the longest name assigned to the concept. :param cui: Concept ID or unique identifier in this database. :type cui: str :Returns: **str** -- The name of the concept. .. py:method:: update_cui2average_confidence(cui, new_sim) .. py:method:: remove_names(cui, names) .. py:method:: _remove_names(cui, names) Remove names from an existing concept. The effect is that these names will never again be used to link to this concept. This will only remove the names from the linker (namely name2cuis and name2cuis2status), the names will still be present everywhere else. Why? Because it is bothersome to remove them from everywhere, but it could also be useful to keep the removed names in e.g. cui2names. :param cui: Concept ID or unique identifier in this database. :type cui: str :param names: Names to be removed (e.g. list, set, or even a dict (in which case keys will be used)). :type names: Iterable[str] .. py:method:: remove_cui(cui) This function takes a `CUI` as an argument and removes it from all the internal objects that reference it. :param cui: Concept ID or unique identifier in this database. :type cui: str .. py:method:: add_names(cui, names, name_status = 'A', full_build = False) Adds a name to an existing concept. :param cui: Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally. :type cui: str :param names: Names for this concept, or the value that if found in free text can be linked to this concept. Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, ...}` :type names: Dict[str, Dict] :param name_status: One of `P`, `N`, `A`. 
:type name_status: str :param full_build: If True the dictionary self.addl_info will also be populated, contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default value `False`). :type full_build: bool .. py:method:: _add_concept(cui, names, ontologies, name_status, type_ids, description, full_build = False) Add a concept to internal Concept Database (CDB). Depending on what you are providing this will add a large number of properties for each concept. :param cui: Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally. :type cui: str :param names: Names for this concept, or the value that if found in free text can be linked to this concept. Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, ...}` Names should be generated by helper function 'medcat.preprocessing.cleaners.prepare_name' :type names: Dict[str, Dict] :param ontologies: ontologies in which the concept exists (e.g. SNOMEDCT, HPO) :type ontologies: Set[str] :param name_status: One of `P`, `N`, `A` :type name_status: str :param type_ids: Semantic type identifier (have a look at TUIs in UMLS or SNOMED-CT) :type type_ids: Set[str] :param description: Description of this concept. :type description: str :param full_build: If True the dictionary self.addl_info will also be populated, contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default Value `False`). :type full_build: bool :raises ValueError: If there is no name info yet `names` dict is not empty. .. py:method:: add_addl_info(name, data, reset_existing = False) Add data to the addl_info dictionary. This is done in a function to not directly access the addl_info dictionary. :param name: What key should be used in the `addl_info` dictionary. 
:type name: str :param data: What will be added as the value for the key `name` :type data: Dict :param reset_existing: Should old data be removed if it exists :type reset_existing: bool .. py:method:: update_context_vector(cui, vectors, negative = False, lr = None, cui_count = 0) Add the vector representation of a context for this CUI. :param cui: The concept in question. :type cui: str :param vectors: Vector representation of the context, must have the format: {'context_type': np.array(), ...} context_type - is usually one of: ['long', 'medium', 'short'] :type vectors: Dict[str, np.ndarray] :param negative: Whether this is a negative or a positive context (Default Value `False`). :type negative: bool :param lr: If set it will override the base value from the config file. :type lr: Optional[float] :param cui_count: The learning rate will be calculated based on the count for the provided CUI + cui_count. Defaults to 0. :type cui_count: int .. py:method:: save(path, json_path = None, overwrite = True, calc_hash_if_missing = False) Saves model to file (in fact it saves variables of this class). If a `json_path` is specified, the JSON serialization is used for some of the data. :param path: Path to a file where the model will be saved :type path: str :param json_path: If specified, json serialisation is used. Defaults to None. :type json_path: Optional[str] :param overwrite: Whether or not to overwrite existing file(s). :type overwrite: bool :param calc_hash_if_missing: Calculate the hash if it's missing. Defaults to `False` :type calc_hash_if_missing: bool .. py:method:: save_async(path) :async: Async version of saving model to file (in fact it saves variables of this class). This method does not (currently) support the new JSON serialization. :param path: Path to a file where the model will be saved :type path: str .. py:method:: load_config(config_path) .. py:method:: load(path, json_path = None, config_dict = None) :classmethod: Load and return a CDB. 
This allows partial loads in probably not the right way at all. If `json_path` is specified, the JSON serialization is assumed to be present. Otherwise, it is assumed not to be present. :param path: Path to a `cdb.dat` from which to load data. :type path: str :param json_path: Path to the JSON serialized folder :type json_path: str :param config_dict: A dictionary that will be used to overwrite existing fields in the config of this CDB :Returns: **CDB** -- The resulting concept database. .. py:method:: import_training(cdb, overwrite = True) This will import vector embeddings from another CDB. No new concepts will be added. IMPORTANT: it will not import name maps (cui2names, name2cuis or anything else), only vectors. :param cdb: Concept database from which to import training vectors :type cdb: CDB :param overwrite: If True all training data in the existing CDB will be overwritten, else the average between the two training vectors will be taken (Default value `True`). :type overwrite: bool .. rubric:: Examples >>> new_cdb.import_training(cdb=old_cdb, overwrite=True) .. py:method:: reset_cui_count(n = 10) Reset the CUI count for all concepts that received training, used when starting new unsupervised training or for supervised training with annealing. :param n: This will be set as the CUI count for all CUIs in this CDB (Default value 10). :type n: int .. rubric:: Examples >>> cdb.reset_cui_count() .. py:method:: reset_training() Will remove all training efforts - in other words all embeddings that are learnt for concepts in the current CDB. Please note that this does not remove synonyms (names) that were potentially added during supervised/online learning. .. py:method:: populate_cui2snames(force = True) Populate the cui2snames dict if it's empty. If the dict is not empty and the population is not forced, nothing will happen. For now, this method simply populates all the names from cui2names into cui2snames. :param force: Whether to force the (re-)population. Defaults to True. 
:type force: bool .. py:method:: filter_by_cui(cuis_to_keep) Subset the core CDB fields (dictionaries/maps). Note that this will potentially keep a few more CUIs than are in cuis_to_keep. It will first find all names that link to the cuis_to_keep and then find all CUIs that link to those names and keep all of them. This also will not remove any data from cdb.addl_info - as this field can contain data of unknown structure. As a side note, if the CDB has been memory-optimised, filtering will undo this memory optimisation. This is because the dicts involved will be rewritten. However, the memory optimisation can be performed again afterwards. :param cuis_to_keep: CUIs that will be kept, the rest will be removed (not completely, look above). :type cuis_to_keep: Union[List[str], Set[str]] :raises Exception: If no snames and subsetting is not possible. .. py:method:: make_stats() .. py:method:: print_stats() Print basic statistics for the CDB. .. py:method:: reset_concept_similarity() Reset concept similarity matrix. .. py:method:: most_similar(cui, context_type, type_id_filter = [], min_cnt = 0, topn = 50, force_build = False) Given a concept it will calculate what other concepts in this CDB have the most similar embedding. :param cui: The concept ID for the base concept for which you want to get the most similar concepts. :type cui: str :param context_type: On what vector type from the cui2context_vectors map will the similarity be calculated. :type context_type: str :param type_id_filter: A list of type_ids that will be used to filter out the returned results. Using this it is possible to limit the similarity calculation to only disorders/symptoms/drugs/... :type type_id_filter: List[str] :param min_cnt: Minimum training examples (unsupervised+supervised) that a concept must have to be considered for the similarity calculation. 
:type min_cnt: int :param topn: How many results to return :type topn: int :param force_build: Do not use cached sim matrix (Default value False) :type force_build: bool :Returns: **Dict** -- A dictionary with top results like: {<cui>: {'name': <name>, 'sim': <sim>, 'type_name': <type_name>, 'type_id': <type_id>, 'cnt': <cnt>}, ...} .. py:method:: _check_medcat_version(config_data) :classmethod: .. py:method:: _should_recalc_hash(force_recalc) .. py:method:: get_hash(force_recalc = False) .. py:method:: calculate_hash()
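
The note on ``filter_by_cui`` says that subsetting can keep more CUIs than were requested, because names shared between concepts pull in the other concepts they link to. The following is a minimal sketch of that two-step expansion using plain dictionaries shaped like ``cui2names`` and ``name2cuis``. It is not the library's implementation: ``expand_cuis_to_keep`` is a hypothetical helper, the toy data is made up, and walking ``cui2names`` (rather than ``cui2snames``) in the first step is an assumption.

```python
from typing import Dict, List, Set


def expand_cuis_to_keep(
    cuis_to_keep: Set[str],
    cui2names: Dict[str, Set[str]],
    name2cuis: Dict[str, List[str]],
) -> Set[str]:
    """Hypothetical helper (not part of medcat) mimicking the documented expansion."""
    # Step 1: collect every name linked to one of the requested CUIs.
    names_to_keep: Set[str] = set()
    for cui in cuis_to_keep:
        names_to_keep.update(cui2names.get(cui, set()))

    # Step 2: keep every CUI that any of those names links to.
    # Seeding with the requested CUIs is an assumption of this sketch.
    kept: Set[str] = set(cuis_to_keep)
    for name in names_to_keep:
        kept.update(name2cuis.get(name, []))
    return kept


# Toy data, invented for illustration: the name "cold" is shared by C1 and C2.
cui2names = {"C1": {"cold", "common~cold"}, "C2": {"cold"}, "C3": {"fever"}}
name2cuis = {"cold": ["C1", "C2"], "common~cold": ["C1"], "fever": ["C3"]}

kept = expand_cuis_to_keep({"C1"}, cui2names, name2cuis)
print(sorted(kept))  # ['C1', 'C2']
```

Only ``C1`` was requested, but ``C2`` is kept as well because it shares the name ``cold``. ``C3`` has no name in common with ``C1`` and is dropped.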