medcat.cdb

Representation class for CDB data

Module Contents

Classes

CDB

Concept DataBase - holds all information necessary for NER+L.

Attributes

logger

medcat.cdb.logger
class medcat.cdb.CDB(config=None)

Bases: object

Concept DataBase - holds all information necessary for NER+L.

Properties:
name2cuis (Dict[str, List[str]]):

Map from concept name to CUIs - one name can map to multiple CUIs.

name2cuis2status (Dict[str, Dict[str, str]]):
What is the status for a given name and cui pair - each name can be:

P - Preferred, A - Automatic (e.g. let medcat decide), N - Not common.

snames (Set[str]):

All possible subnames for all concepts

cui2names (Dict[str, Set[str]]):

From CUI to all names assigned to it. Used mainly (possibly only) for subsetting.

cui2snames (Dict[str, Set[str]]):

From cui to all sub-names assigned to it. Only used for subsetting.

cui2context_vectors (Dict[str, Dict[str, np.array]]):

From cui to a dictionary of different kinds of context vectors. Normally you would have here a short and a long context vector - they are calculated separately.

cui2count_train (Dict[str, int]):

From CUI to the number of training examples seen.

cui2tags (Dict[str, List[str]]):

From CUI to a list of tags. This can be used to tag concepts for grouping or any other purpose.

cui2type_ids (Dict[str, Set[str]]):

From CUI to type id (e.g. TUI in UMLS).

cui2preferred_name (Dict[str, str]):

From CUI to the preferred name for this concept.

cui2average_confidence (Dict[str, float]):

Used for dynamic thresholding. Holds the average confidence for this CUI given the training examples.

name2count_train (Dict[str, int]):

Counts how often a name appeared during training.

addl_info (Dict[str, Dict[]]):

Any additional maps that are not part of the core CDB. These are usually not needed for the base NER+L use-case, but can be useful for debugging or other special cases.

vocab (Dict[str, int]):

Stores all the words that appear in this CDB and the count for each one.

is_dirty (bool):

Whether or not the CDB has been changed since it was loaded or created
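To make the shape of these maps concrete, the following is a minimal sketch built from plain Python dicts. The CUIs and names are invented for illustration, and no MedCAT code is involved; it only mirrors the documented property types.

```python
# Hypothetical toy data mirroring the documented property types.
# The CUIs ("C001", "C002") and the names are made up for illustration.
name2cuis = {
    "cold": ["C001", "C002"],   # one name can map to several CUIs
    "common cold": ["C001"],
}
name2cuis2status = {
    "cold": {"C001": "A", "C002": "N"},   # A = Automatic, N = Not common
    "common cold": {"C001": "P"},         # P = Preferred
}
cui2names = {
    "C001": {"cold", "common cold"},
    "C002": {"cold"},
}

# Every (name, cui) pair in name2cuis has a status entry.
for name, cuis in name2cuis.items():
    for cui in cuis:
        assert name2cuis2status[name][cui] in {"P", "A", "N"}

# In this toy data, cui2names is simply the reverse view of name2cuis.
reverse = {}
for name, cuis in name2cuis.items():
    for cui in cuis:
        reverse.setdefault(cui, set()).add(name)
```

In a real CDB the two views can diverge, for example after remove_names, which only touches the name-keyed maps.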

Parameters:

config (Union[medcat.config.Config, None]) –

__init__(config=None)
Parameters:

config (Union[medcat.config.Config, None]) –

Return type:

None

get_name(cui)

Returns preferred name if it exists, otherwise it will return the longest name assigned to the concept.

Parameters:

cui (str) – Concept ID or unique identifier in this database.

Returns:

str – The name of the concept.

Return type:

str
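The fallback described above can be sketched with plain dicts. The function name get_name_sketch is hypothetical, and returning the CUI itself when no name is known is an assumption of this sketch, not something stated in the description.

```python
from typing import Dict, Set


def get_name_sketch(cui: str,
                    cui2preferred_name: Dict[str, str],
                    cui2names: Dict[str, Set[str]]) -> str:
    """Preferred name if present, otherwise the longest known name."""
    preferred = cui2preferred_name.get(cui)
    if preferred:
        return preferred
    names = cui2names.get(cui)
    if names:
        # Sorting first makes ties deterministic (alphabetically first wins).
        return max(sorted(names), key=len)
    # Assumption of this sketch: fall back to the CUI when nothing is known.
    return cui
```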

update_cui2average_confidence(cui, new_sim)
Parameters:
  • cui (str) –

  • new_sim (float) –

Return type:

None

remove_names(cui, names)

Remove names from an existing concept, so that these names are never again used to link to it. This only removes the names from the linker (namely name2cuis and name2cuis2status); they remain present everywhere else. Removing them everywhere is cumbersome, and it can also be useful to keep the removed names in e.g. cui2names.

Parameters:
  • cui (str) – Concept ID or unique identifier in this database.

  • names (Dict[str, Dict]) – Names to be removed, should look like: {‘name’: {‘tokens’: tokens, ‘snames’: snames, ‘raw_name’: raw_name}, …}

Return type:

None
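A minimal sketch of the behaviour described for remove_names, using plain dicts: the name disappears from name2cuis and name2cuis2status but stays in cui2names. The helper remove_names_sketch and the toy data are hypothetical, and dropping keys that become empty is a choice made here for tidiness.

```python
from typing import Dict, Iterable, List


def remove_names_sketch(cui: str,
                        names: Iterable[str],
                        name2cuis: Dict[str, List[str]],
                        name2cuis2status: Dict[str, Dict[str, str]]) -> None:
    """Unlink the given names from one CUI in the linker maps only."""
    for name in names:
        cuis = name2cuis.get(name)
        if cuis and cui in cuis:
            cuis.remove(cui)
            if not cuis:
                del name2cuis[name]       # no concept uses this name any more
        statuses = name2cuis2status.get(name)
        if statuses and cui in statuses:
            del statuses[cui]
            if not statuses:
                del name2cuis2status[name]


# Toy data; cui2names is deliberately left untouched.
n2c = {"cold": ["C1", "C2"], "chill": ["C2"]}
n2s = {"cold": {"C1": "A", "C2": "N"}, "chill": {"C2": "A"}}
c2n = {"C1": {"cold"}, "C2": {"cold", "chill"}}
remove_names_sketch("C2", {"cold": {}, "chill": {}}, n2c, n2s)
```

Passing a dict for names works because iterating a dict yields its keys, which matches the documented {name: {...}} argument shape.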

remove_cui(cui)

This function takes a CUI as an argument and removes it from all the internal objects that reference it.

Parameters:

cui (str) – Concept ID or unique identifier in this database.

Return type:

None

add_names(cui, names, name_status='A', full_build=False)

Adds a name to an existing concept.

Parameters:
  • cui (str) – Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.

  • names (Dict[str, Dict]) – Names for this concept, or the value that if found in free text can be linked to this concept. Names is a dict like: {name: {‘tokens’: tokens, ‘snames’: snames, ‘raw_name’: raw_name}, …}

  • name_status (str) – One of P, N, A.

  • full_build (bool) – If True, the dictionary self.addl_info will also be populated. It contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default value False).

Return type:

None
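The shape of the names argument can be illustrated with a hand-written dict and a small pre-check. Both example_names and check_names_arg are hypothetical helpers for this sketch; MedCAT itself is not stated to perform this validation, and the token and sname values shown are invented.

```python
from typing import Dict

REQUIRED_KEYS = {"tokens", "snames", "raw_name"}   # keys named in the docs
VALID_STATUSES = {"P", "N", "A"}                   # Preferred, Not common, Automatic

# Hand-written example in the documented {name: {...}} layout.
example_names = {
    "common cold": {
        "tokens": ["common", "cold"],
        "snames": {"common", "common cold"},
        "raw_name": "Common Cold",
    },
}


def check_names_arg(names: Dict[str, Dict], name_status: str = "A") -> None:
    """Raise ValueError if the argument does not match the documented shape."""
    if name_status not in VALID_STATUSES:
        raise ValueError(f"name_status must be one of {sorted(VALID_STATUSES)}")
    for name, info in names.items():
        missing = REQUIRED_KEYS - info.keys()
        if missing:
            raise ValueError(f"{name!r} is missing keys: {sorted(missing)}")
```

Running such a check before calling add_names is optional; it only catches malformed input early.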

add_concept(cui, names, ontologies, name_status, type_ids, description, full_build=False)

Deprecated: Use cdb._add_concept as this will be removed in a future release.

Add a concept to internal Concept Database (CDB). Depending on what you are providing this will add a large number of properties for each concept.

Parameters:
  • cui (str) – Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.

  • names (Dict[str, Dict]) – Names for this concept, or the value that if found in free text can be linked to this concept. Names is a dict like: {name: {‘tokens’: tokens, ‘snames’: snames, ‘raw_name’: raw_name}, …} Names should be generated by helper function ‘medcat.preprocessing.cleaners.prepare_name’

  • ontologies (Set[str]) – ontologies in which the concept exists (e.g. SNOMEDCT, HPO)

  • name_status (str) – One of P, N, A

  • type_ids (Set[str]) – Semantic type identifier (have a look at TUIs in UMLS or SNOMED-CT)

  • description (str) – Description of this concept.

  • full_build (bool) – If True, the dictionary self.addl_info will also be populated. It contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default Value False).

Return type:

None

_add_concept(cui, names, ontologies, name_status, type_ids, description, full_build=False)

Add a concept to internal Concept Database (CDB). Depending on what you are providing this will add a large number of properties for each concept.

Parameters:
  • cui (str) – Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.

  • names (Dict[str, Dict]) – Names for this concept, or the value that if found in free text can be linked to this concept. Names is a dict like: {name: {‘tokens’: tokens, ‘snames’: snames, ‘raw_name’: raw_name}, …} Names should be generated by helper function ‘medcat.preprocessing.cleaners.prepare_name’

  • ontologies (Set[str]) – ontologies in which the concept exists (e.g. SNOMEDCT, HPO)

  • name_status (str) – One of P, N, A

  • type_ids (Set[str]) – Semantic type identifier (have a look at TUIs in UMLS or SNOMED-CT)

  • description (str) – Description of this concept.

  • full_build (bool) – If True, the dictionary self.addl_info will also be populated. It contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default Value False).

Raises:

ValueError – If there is no name info yet names dict is not empty.

Return type:

None

add_addl_info(name, data, reset_existing=False)

Add data to the addl_info dictionary. This is done in a function to not directly access the addl_info dictionary.

Parameters:
  • name (str) – What key should be used in the addl_info dictionary.

  • data (Dict) – What will be added as the value for the key name

  • reset_existing (bool) – Should old data be removed if it exists

Return type:

None
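One plausible reading of reset_existing is sketched below with a plain dict standing in for addl_info. The function add_addl_info_sketch is hypothetical, and merging new data into existing data when reset_existing is False is an assumption of this sketch.

```python
from typing import Dict


def add_addl_info_sketch(addl_info: Dict[str, Dict],
                         name: str,
                         data: Dict,
                         reset_existing: bool = False) -> None:
    """Store data under addl_info[name], optionally discarding old data."""
    if reset_existing or name not in addl_info:
        addl_info[name] = {}          # start from an empty map
    addl_info[name].update(data)      # assumed: merge into what is there


info: Dict[str, Dict] = {}
add_addl_info_sketch(info, "cui2icd10", {"C1": ["J00"]})
add_addl_info_sketch(info, "cui2icd10", {"C2": ["J11"]})
merged = dict(info["cui2icd10"])
add_addl_info_sketch(info, "cui2icd10", {"C3": ["J45"]}, reset_existing=True)
```

The key "cui2icd10" and the code values are illustrative only.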

update_context_vector(cui, vectors, negative=False, lr=None, cui_count=0)

Add the vector representation of a context for this CUI.

Parameters:
  • cui (str) – The concept in question.

  • vectors (Dict[str, np.ndarray]) – Vector representation of the context, must have the format: {‘context_type’: np.array(<vector>), …} context_type - is usually one of: [‘long’, ‘medium’, ‘short’]

  • negative (bool) – Is this a negative or a positive context (Default Value False).

  • lr (Optional[float]) – If set it will override the base value from the config file.

  • cui_count (int) – The learning rate will be calculated based on the count for the provided CUI + cui_count. Defaults to 0.

Return type:

None

save(path, json_path=None, overwrite=True, calc_hash_if_missing=False)

Saves model to file (in fact it saves variables of this class).

If a json_path is specified, the JSON serialization is used for some of the data.

Parameters:
  • path (str) – Path to a file where the model will be saved

  • json_path (Optional[str]) – If specified, json serialisation is used. Defaults to None.

  • overwrite (bool) – Whether or not to overwrite existing file(s).

  • calc_hash_if_missing (bool) – Calculate the hash if it’s missing. Defaults to False

Return type:

None

async save_async(path)

Async version of saving model to file (in fact it saves variables of this class).

This method does not (currently) support the new JSON serialization.

Parameters:

path (str) – Path to a file where the model will be saved

Return type:

None

load_config(config_path)
Parameters:

config_path (str) –

Return type:

None

classmethod load(path, json_path=None, config_dict=None)

Load and return a CDB. This allows partial loads in probably not the right way at all.

If json_path is specified, the JSON serialization is assumed to be present. Otherwise, it is assumed not to be present.

Parameters:
  • path (str) – Path to a cdb.dat from which to load data.

  • json_path (str) – Path to the JSON serialized folder

  • config_dict (Optional[Dict]) – A dictionary that will be used to overwrite existing fields in the config of this CDB

Returns:

CDB – The resulting concept database.

Return type:

CDB

import_training(cdb, overwrite=True)

This will import vector embeddings from another CDB. No new concepts will be added. IMPORTANT it will not import name maps (cui2names, name2cuis or anything else) only vectors.

Parameters:
  • cdb (CDB) – Concept database from which to import training vectors

  • overwrite (bool) – If True all training data in the existing CDB will be overwritten, else the average between the two training vectors will be taken (Default value True).

Return type:

None

Examples

>>> new_cdb.import_training(cdb=old_cdb, overwrite=True)
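The difference between overwrite=True and overwrite=False for import_training can be sketched with plain lists in place of NumPy arrays. The function import_training_sketch and its known_cuis parameter are hypothetical; the sketch only follows the description above (vectors only, no new concepts, element-wise average when not overwriting).

```python
from typing import Dict, List, Set

Vectors = Dict[str, Dict[str, List[float]]]   # {cui: {context_type: vector}}


def import_training_sketch(own: Vectors,
                           other: Vectors,
                           known_cuis: Set[str],
                           overwrite: bool = True) -> None:
    """Copy or average context vectors from `other` into `own`."""
    for cui, other_ctx in other.items():
        if cui not in known_cuis:
            continue                              # no new concepts are added
        own_ctx = own.setdefault(cui, {})
        for ctx_type, vec in other_ctx.items():
            if overwrite or ctx_type not in own_ctx:
                own_ctx[ctx_type] = list(vec)
            else:
                # Element-wise average of the two training vectors.
                own_ctx[ctx_type] = [(a + b) / 2
                                     for a, b in zip(own_ctx[ctx_type], vec)]


own = {"C1": {"long": [1.0, 3.0]}}
other = {"C1": {"long": [3.0, 5.0]}, "C9": {"long": [1.0, 1.0]}}
import_training_sketch(own, other, known_cuis={"C1"}, overwrite=False)
averaged = list(own["C1"]["long"])
import_training_sketch(own, other, known_cuis={"C1"}, overwrite=True)
```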
reset_cui_count(n=10)

Reset the CUI count for all concepts that received training, used when starting new unsupervised training or for supervised training with annealing.

Parameters:

n (int) – This will be set as the CUI count for all cuis in this CDB (Default value 10).

Return type:

None

Examples

>>> cdb.reset_cui_count()
reset_training()

Will remove all training efforts - in other words all embeddings that are learnt for concepts in the current CDB. Please note that this does not remove synonyms (names) that were potentially added during supervised/online learning.

Return type:

None

populate_cui2snames(force=True)

Populate the cui2snames dict if it’s empty.

If the dict is not empty and the population is not forced, nothing will happen.

For now, this method simply populates all the names from cui2names into cui2snames.

Parameters:

force (bool) – Whether to force the (re-)population. Defaults to True.

Return type:

None
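The force semantics can be sketched with plain dicts. The function populate_cui2snames_sketch is hypothetical; clearing the existing map before a forced re-population is an assumption of this sketch.

```python
from typing import Dict, Set


def populate_cui2snames_sketch(cui2names: Dict[str, Set[str]],
                               cui2snames: Dict[str, Set[str]],
                               force: bool = True) -> None:
    """Fill cui2snames from cui2names unless it is already populated."""
    if cui2snames and not force:
        return                                   # already populated, not forced
    cui2snames.clear()                           # assumed: start from scratch
    for cui, names in cui2names.items():
        cui2snames[cui] = set(names)             # copy, do not share the set


c2n = {"C1": {"cold", "common cold"}}
c2s = {"C1": {"stale"}}
populate_cui2snames_sketch(c2n, c2s, force=False)
unforced = {cui: set(names) for cui, names in c2s.items()}
populate_cui2snames_sketch(c2n, c2s, force=True)
```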

filter_by_cui(cuis_to_keep)

Subset the core CDB fields (dictionaries/maps). Note that this may keep a few more CUIs than are in cuis_to_keep. It will first find all names that link to the cuis_to_keep and then find all CUIs that link to those names and keep all of them. This also will not remove any data from cdb.addl_info - as this field can contain data of unknown structure.

As a side note, if the CDB has been memory-optimised, filtering will undo this memory optimisation. This is because the dicts being involved will be rewritten. However, the memory optimisation can be performed again afterwards.

Parameters:

cuis_to_keep (Union[List[str], Set[str]]) – CUIs that will be kept, the rest will be removed (not completely, look above).

Raises:

Exception – If no snames and subsetting is not possible.

Return type:

None
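Why more CUIs than requested can survive filtering is easiest to see in a small sketch of the two-step expansion described above. The function expand_cuis_to_keep is hypothetical and works on plain dicts; it does not reproduce the rest of the filtering.

```python
from typing import Dict, Iterable, List, Set, Tuple


def expand_cuis_to_keep(cuis_to_keep: Iterable[str],
                        cui2names: Dict[str, Set[str]],
                        name2cuis: Dict[str, List[str]]
                        ) -> Tuple[Set[str], Set[str]]:
    """Return (CUIs kept, names kept) after the two-step expansion."""
    # Step 1: all names that link to the requested CUIs.
    names: Set[str] = set()
    for cui in cuis_to_keep:
        names.update(cui2names.get(cui, ()))
    # Step 2: all CUIs that any of those names link to.
    kept = {cui for cui in cuis_to_keep if cui in cui2names}
    for name in names:
        kept.update(name2cuis.get(name, ()))
    return kept, names


c2n = {"C1": {"cold"}, "C2": {"cold", "chill"}, "C3": {"flu"}}
n2c = {"cold": ["C1", "C2"], "chill": ["C2"], "flu": ["C3"]}
kept_cuis, kept_names = expand_cuis_to_keep({"C1"}, c2n, n2c)
```

Here C2 is kept even though only C1 was requested, because both share the name "cold".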

make_stats()
print_stats()

Print basic statistics for the CDB.

Return type:

None

reset_concept_similarity()

Reset concept similarity matrix.

Return type:

None

most_similar(cui, context_type, type_id_filter=[], min_cnt=0, topn=50, force_build=False)

Given a concept it will calculate what other concepts in this CDB have the most similar embedding.

Parameters:
  • cui (str) – The concept ID for the base concept for which you want to get the most similar concepts.

  • context_type (str) – On what vector type from the cui2context_vectors map will the similarity be calculated.

  • type_id_filter (List[str]) – A list of type_ids that will be used to filter the returned results. Using this it is possible to limit the similarity calculation to only disorders/symptoms/drugs/…

  • min_cnt (int) – Minimum training examples (unsupervised+supervised) that a concept must have to be considered for the similarity calculation.

  • topn (int) – How many results to return

  • force_build (bool) – Do not use cached sim matrix (Default value False)

Returns:

Dict – A dictionary with top results like: {<cui>: {‘name’: <name>, ‘sim’: <similarity>, ‘type_name’: <type_name>, ‘type_id’: <type_id>, ‘cnt’: <number of training examples the concept has seen>}, …}

Return type:

Dict
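A rough sketch of how such a ranking could be computed is given below, using cosine similarity over plain lists. Everything here is an assumption for illustration: the function most_similar_sketch, the choice of cosine similarity, excluding the base concept from its own results, and the reduced result dict (only ‘sim’ and ‘cnt’). It does not use the cached similarity matrix mentioned for force_build.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_sketch(cui: str,
                        context_type: str,
                        cui2context_vectors: Dict[str, Dict[str, List[float]]],
                        cui2count_train: Dict[str, int],
                        cui2type_ids: Dict[str, Set[str]],
                        type_id_filter: Sequence[str] = (),
                        min_cnt: int = 0,
                        topn: int = 50) -> Dict[str, Dict]:
    base = cui2context_vectors[cui][context_type]
    wanted = set(type_id_filter)
    scored = []
    for other, ctx in cui2context_vectors.items():
        if other == cui or context_type not in ctx:
            continue                  # sketch choice: skip the base concept
        cnt = cui2count_train.get(other, 0)
        if cnt < min_cnt:
            continue                  # too few training examples
        if wanted and not (cui2type_ids.get(other, set()) & wanted):
            continue                  # type id not in the filter
        scored.append((cosine(base, ctx[context_type]), other, cnt))
    scored.sort(key=lambda item: (-item[0], item[1]))   # best first
    return {other: {"sim": sim, "cnt": cnt} for sim, other, cnt in scored[:topn]}


vecs = {"C1": {"long": [1.0, 0.0]}, "C2": {"long": [1.0, 0.0]},
        "C3": {"long": [0.0, 1.0]}, "C4": {"long": [1.0, 1.0]}}
counts = {"C1": 9, "C2": 5, "C3": 5, "C4": 1}
types = {"C2": {"T2"}, "C3": {"T1"}, "C4": {"T2"}}
ranked = most_similar_sketch("C1", "long", vecs, counts, types)
```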

static _ensure_backward_compatibility(config)
Parameters:

config (medcat.config.Config) –

Return type:

None

classmethod _check_medcat_version(config_data)
Parameters:

config_data (Dict) –

Return type:

None

_should_recalc_hash(force_recalc)
Parameters:

force_recalc (bool) –

Return type:

bool

get_hash(force_recalc=False)
Parameters:

force_recalc (bool) –

calculate_hash()