medcat.config

Module Contents

Classes

FakeDict

FakeDict that allows the use of the __getitem__ and __setitem__ methods for legacy access.

ValueExtractor

The current example config specifies an empty set as '{}'.

MixingConfig

Config that can be saved to and loaded from disk as well as mixed with other configs.

VersionInfo

The version info part of the config

CDBMaker

The Context Database (CDB) making part of the config

AnnotationOutput

The annotation output part of the config

CheckPoint

The checkpoint part of the config

General

The general part of the config

Preprocessing

The preprocessing part of the config

Ner

The NER part of the config

_DefPartial

This is a helper class to make it possible to check equality of two default Linking instances

LinkingFilters

These describe the linking filters used alongside the model.

Linking

The linking part of the config

Config

The MedCAT config

Functions

workers([workers_override])

_set_value_or_alt(conf, key, value, alt_values[, err])

Attributes

logger

_EMPTY_DICT_2_EMPTY_SET

_DEFAULT_EXTRACTOR

_DEFAULT_PARTIAL

medcat.config.logger
medcat.config.workers(workers_override=None)
Parameters:

workers_override (Optional[int]) –

Return type:

int

class medcat.config.FakeDict

FakeDict that allows the use of the __getitem__ and __setitem__ methods for legacy access.

__getitem__(arg)
Parameters:

arg (str) –

Return type:

Any

__setattr__(arg, val)

Implement setattr(self, name, value).

Parameters:

arg (str) –

Return type:

None

__setitem__(arg, val)
Parameters:

arg (str) –

Return type:

None

get(key, default=None)
Return type:

Any
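
A minimal sketch of the legacy item-style access this class enables (the default Config instance and the spell_check field are just illustrative):

    from medcat.config import Config

    config = Config()
    # Attribute access and legacy item access refer to the same field
    config.general.spell_check = False
    assert config.general['spell_check'] is False
    assert config.general.get('spell_check') is False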

medcat.config._EMPTY_DICT_2_EMPTY_SET: Callable[[str, Any], set | None]
class medcat.config.ValueExtractor(alt_generators=[_EMPTY_DICT_2_EMPTY_SET])

The current example config specifies an empty set as '{}'. However, that evaluates to an empty dictionary instead. In case there are other such examples, this allows adding other alternatives as well.

Parameters:

alt_generators (List[Callable[[str, Any], Optional[Any]]]) –

__init__(alt_generators=[_EMPTY_DICT_2_EMPTY_SET])
Parameters:

alt_generators (List[Callable[[str, Any], Optional[Any]]]) –

Return type:

None

extract(rhs)

Extracts value and its alternatives based on the alternative generators defined.

Parameters:

rhs (str) – The parsable right hand side

Returns:

Tuple[str, List[str]] – The main value and the (potentially many) alternatives

Return type:

Tuple[str, List[str]]
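
An illustrative sketch (the '{}' input mirrors the empty-set example mentioned above; the exact values returned depend on the alternative generators configured):

    from medcat.config import ValueExtractor

    extractor = ValueExtractor()
    # '{}' on its own evaluates to an empty dict; the default alternative
    # generator offers an empty set as a fallback value
    value, alternatives = extractor.extract("{}")
    print(value, alternatives)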

medcat.config._DEFAULT_EXTRACTOR
medcat.config._set_value_or_alt(conf, key, value, alt_values, err=None)
Parameters:
  • conf (MixingConfig) –

  • key (str) –

  • value (Any) –

  • alt_values (List[Any]) –

  • err (Optional[pydantic.ValidationError]) –

Return type:

None

class medcat.config.MixingConfig

Bases: FakeDict

Config that can be saved to and loaded from disk as well as mixed with other configs. It is not intended to be initialised directly, and it is assumed that instances also inherit from pydantic's BaseModel.

save(save_path)

Save the config into a .json file

Parameters:

save_path (str) – Where to save the created json file

Return type:

None
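
For example (a minimal sketch; the file name is illustrative):

    from medcat.config import Config

    config = Config()
    config.save("my_config.json")  # writes this config to disk as JSON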

merge_config(config_dict)

Merge a config_dict with the existing config object.

Parameters:

config_dict (Dict) – A dictionary whose keys/values should be added to this config.

Return type:

None
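
A minimal sketch, assuming a nested dictionary that mirrors the config structure (the values are illustrative):

    from medcat.config import Config

    config = Config()
    # Nested keys are merged into the corresponding sub-configs
    config.merge_config({'general': {'spell_check': False},
                         'ner': {'min_name_len': 4}})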

parse_config_file(path, extractor=_DEFAULT_EXTRACTOR)
Parses a configuration file in text format. Must be like:

cat.<variable>.<key> = <value> …

  • variable: linking, general, ner, …

  • key: a key in the config dict e.g. subsample_after for linking

  • value: the value for the key, will be parsed with eval

Parameters:
  • path (str) – the path to the config file

  • extractor (ValueExtractor) – (Default value = _DEFAULT_EXTRACTOR)

Raises:

ValueError – In case of unknown attribute.

Return type:

None
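
An illustrative sketch (the file name and values are made up; keys follow the cat.<variable>.<key> format described above):

    # contents of medcat_settings.txt:
    #     cat.general.spell_check = False
    #     cat.ner.min_name_len = 4

    from medcat.config import Config

    config = Config()
    config.parse_config_file("medcat_settings.txt")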

rebuild_re()
Return type:

None

_calc_hash(hasher=None)
Parameters:

hasher (Optional[medcat.utils.hasher.Hasher]) –

Return type:

medcat.utils.hasher.Hasher

get_hash(hasher=None)
Parameters:

hasher (Optional[medcat.utils.hasher.Hasher]) –

__str__()

Return str(self).

Return type:

str

classmethod load(save_path)

Load config from a JSON file. Note that fields that did not exist in the old config but do exist in the current version of the config class will be kept.

Parameters:

save_path (str) – Path to the json file to load

Returns:

MixingConfig – The loaded config

Return type:

MixingConfig
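
For example (the path is illustrative and assumed to point at a file previously produced by save()):

    from medcat.config import Config

    config = Config.load("my_config.json")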

classmethod from_dict(config_dict)

Generate a MixingConfig (of an extending type) from a dictionary.

Parameters:

config_dict (Dict) – The dictionary to create the config from

Returns:

MixingConfig – The resulting config

Return type:

MixingConfig
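
A minimal sketch (the values are illustrative):

    from medcat.config import Config

    config = Config.from_dict({'general': {'spell_check': False}})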

asdict()

Get the config as a dictionary.

Returns:

Dict[str, Any] – The dictionary associated with this config

Return type:

Dict[str, Any]

fields()

Get the fields associated with this config.

Returns:

Dict[str, ModelField] – The dictionary of the field names and fields

Return type:

Dict[str, pydantic.fields.ModelField]

class medcat.config.VersionInfo

Bases: MixingConfig, pydantic.BaseModel

The version info part of the config

class Config
extra
validate_assignment = True
history: list = []

Populated automatically

meta_cats: Any

Populated automatically

cdb_info: dict

Populated automatically, output from cdb.print_stats

performance: dict

NER general performance; meta should be of the form: {'meta': {'model_name': {'f1': <>, 'p': <>, …}, …}}

description: str = 'No description'

General description and what it was trained on

id: Any

Will be a hash of most things

last_modified: int | datetime.datetime | str | None
location: str | None

Path/URL/whatever pointing to where this CDB is located

ontology: str | List[str] | None

What was used to build the CDB, e.g. SNOMED_202009

medcat_version: str | None

Which version of medcat was used to build the CDB

class medcat.config.CDBMaker

Bases: MixingConfig, pydantic.BaseModel

The Context Database (CDB) making part of the config

class Config
extra
validate_assignment = True
name_versions: list = ['LOWER', 'CLEAN']

Name versions to be generated.

multi_separator: str = '|'

If multiple names or type_ids for a concept are present in one row of a CSV, they are separated by this character.

remove_parenthesis: int = 5

Should preferred names with parentheses be cleaned: 0 means no; otherwise, names longer than or equal to this value have the parenthesised part removed, e.g. Head (Body part) -> Head

min_letters_required: int = 2

Minimum number of letters required in a name to be accepted for a concept

class medcat.config.AnnotationOutput

Bases: MixingConfig, pydantic.BaseModel

The annotation output part of the config

class Config
extra
validate_assignment = True
doc_extended_info: bool = False
context_left: int
context_right: int
lowercase_context: bool = True
include_text_in_output: bool = False
class medcat.config.CheckPoint

Bases: MixingConfig, pydantic.BaseModel

The checkpoint part of the config

class Config
extra
validate_assignment = True
output_dir: str = 'checkpoints'

When training, this is the name of the directory where checkpoints will be saved

steps: int | None

When training, how often to save a checkpoint (one step represents one document); if None, no checkpoints will be created

max_to_keep: int = 1

The maximum number of checkpoints that will be kept on disk during training

class medcat.config.General

Bases: MixingConfig, pydantic.BaseModel

The general part of the config

class Config
extra
validate_assignment = True
spacy_disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler',...
checkpoint: CheckPoint

Checkpointing config

log_level: int

Logging config for everything | ‘tagger’ can be disabled, but will cause a drop in performance

log_format: str = '%(levelname)s:%(name)s: %(message)s'
log_path: str = './medcat.log'
spacy_model: str = 'en_core_web_md'

What model will be used for tokenization

separator: str = '~'

Separator that will be used to merge tokens of a name. Once a CDB is built this should always stay the same.

spell_check: bool = True

Should we check spelling - note that this makes things much slower; use only if necessary. The only things necessary for the spell checker to work are vocab.dat and cdb.dat built with concepts in the respective language.

diacritics: bool = False

Should we process diacritics - for languages other than English, symbols such as ‘é, ë, ö’ can be relevant. Note that this makes spell_check slower.

spell_check_deep: bool = False

If True the spell checker will try harder to find mistakes, this can slow down things drastically.

spell_check_len_limit: int = 7

Spelling will not be checked for words with length less than this

show_nested_entities: bool = False

If set to True functions like get_entities and get_json will return nested_entities and overlaps

full_unlink: bool

When unlinking a name from a concept, should we do a full unlink (meaning: unlink the name from all concepts, not just the one in question)

workers: int

Number of workers used by a parallelizable pipeline component

make_pretty_labels: str | None

Should the labels of entities (shown in displacy) be pretty or just 'concept'. This slows down the annotation pipeline and should not be used when annotating millions of documents. If None, the label will be the string 'concept'; if 'short', it will be the CUI; if 'long', it will be CUI | Name | Confidence

map_cui_to_group: bool = False

If cdb.addl_info['cui2group'] is provided and this option is enabled, each CUI will be mapped to its group

class medcat.config.Preprocessing

Bases: MixingConfig, pydantic.BaseModel

The preprocessing part of the config

class Config
extra
validate_assignment = True
words_to_skip: set

These words will be completely ignored in concepts and in the text (must be a set)

keep_punct: set

All punctuation will be skipped by default; here you can set what will be kept

do_not_normalize: set

Should specific word types be normalized: e.g. running -> run. Values are detailed part-of-speech tags. See https://spacy.io/usage/linguistic-features#pos-tagging and the label scheme section per model at https://spacy.io/models/en

skip_stopwords: bool = False

Should stopwords be skipped/ignored when processing input

min_len_normalize: int = 5

Nothing below this length will ever be normalized (input tokens or concept names); normalized here means lemmatized.

stopwords: set | None

If None, the default set of stopwords from spacy will be used. This must be a set.

max_document_length: int = 1000000

Documents longer than this will be trimmed

class medcat.config.Ner

Bases: MixingConfig, pydantic.BaseModel

The NER part of the config

class Config
extra
validate_assignment = True
min_name_len: int = 3

Do not detect names below this limit; skip them

max_skip_tokens: int = 2

When checking tokens for concepts you can have skipped tokens in between used ones (usually spaces, new lines etc.). This number tells you how many such skipped tokens are allowed.

check_upper_case_names: bool = False

Check uppercase to distinguish uppercase and lowercase words that have a different meaning.

upper_case_limit_len: int = 4

Any name shorter than this must be uppercase in the text to be considered. If it is not uppercase it will be skipped.

try_reverse_word_order: bool = False

Try reverse word order for short concepts (2 words max), e.g. heart disease -> disease heart

class medcat.config._DefPartial

This is a helper class to make it possible to check equality of two default Linking instances

__init__()
__call__(*args, **kwargs)
__eq__(other)

Return self==value.

medcat.config._DEFAULT_PARTIAL
class medcat.config.LinkingFilters(**data)

Bases: MixingConfig, pydantic.BaseModel

These describe the linking filters used alongside the model.

When no CUIs nor excluded CUIs are specified (the sets are empty), all CUIs are accepted. If there are CUIs specified, then only those will be accepted. If there are excluded CUIs specified, they are excluded.

In some cases, there are extra filters as well as MedCATtrainer (MCT) export filters. These are expected to satisfy the following: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter

While any other CUIs can be included in the extra CUI filter or the MCT filter, they would not have any real effect.

cuis: Set[str]
cuis_exclude: Set[str]
__init__(**data)
check_filters(cui)

Checks if a CUI is allowed by the filters

Parameters:

cui (str) – The CUI in question

Returns:

bool – True if the CUI is allowed

Return type:

bool
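
An illustrative sketch of the documented semantics (the CUIs are made up):

    from medcat.config import LinkingFilters

    filters = LinkingFilters(cuis={'C0001', 'C0002'}, cuis_exclude={'C0002'})
    # Once `cuis` is non-empty, only those CUIs are accepted,
    # minus anything explicitly excluded
    assert filters.check_filters('C0001')
    assert not filters.check_filters('C0002')   # explicitly excluded
    assert not filters.check_filters('C0003')   # not in `cuis`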

merge_with(other)

Merge CUIs and excluded CUIs within two filters. The data will be kept within this filter (and not the other).

Parameters:

other (LinkingFilters) – The other filter to merge with

Return type:

None

copy_of()

Create a copy of this LinkingFilters. This copy will describe an identical filter but will refer to different sets so they can be mutated separately.

Returns:

LinkingFilters – A copy of the original filters.

Return type:

LinkingFilters
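
A short sketch of combining and copying filters (the CUIs are illustrative):

    from medcat.config import LinkingFilters

    base = LinkingFilters(cuis={'C0001'}, cuis_exclude=set())
    extra = LinkingFilters(cuis={'C0002'}, cuis_exclude={'C0003'})

    snapshot = base.copy_of()   # independent sets, safe to mutate separately
    base.merge_with(extra)      # `base` now also includes C0002 and excludes C0003; `extra` is unchanged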

class medcat.config.Linking

Bases: MixingConfig, pydantic.BaseModel

The linking part of the config

class Config
extra
validate_assignment = True
optim: dict

Linear anneal

context_vector_sizes: dict

Context vector sizes that will be calculated and used for linking

context_vector_weights: dict

Weight of each vector in the similarity score - make trainable at some point. Should add up to 1.

filters: LinkingFilters

Filters

train: bool = True

Should it train or not. This is set automatically; ignore it in 99% of cases and do not set it manually.

random_replacement_unsupervised: float = 0.8

If <1, during unsupervised training the detected term will be randomly replaced with a probability of 1 - random_replacement_unsupervised. It is replaced with a synonym used for that term.

disamb_length_limit: int = 3

All concepts below this will always be disambiguated

filter_before_disamb: bool = False

If True it will filter before doing disamb. Useful for the trainer.

train_count_threshold: int = 1

Concepts that have seen less training examples than this will not be used for similarity calculation and will have a similarity of -1.

always_calculate_similarity: bool = False

Do we want to calculate context similarity even for concepts that are not ambiguous.

weighted_average_function: Callable[Ellipsis, Any]

Weights for a weighted average, e.g. 'weighted_average_function': partial(weighted_average, factor=0.02).

calculate_dynamic_threshold: bool = False

Concepts below this similarity will be ignored. Type can be static/dynamic - if dynamic, each CUI has a different TH and it is calculated as the average confidence for that CUI * similarity_threshold. Take care that dynamic works only if the cdb was trained with calculate_dynamic_threshold = True.

similarity_threshold_type: str = 'static'
similarity_threshold: float = 0.25
negative_probability: float = 0.5

Probability for the negative context to be added for each positive addition

negative_ignore_punct_and_num: bool = True

Do we ignore punct/num when negative sampling

prefer_primary_name: float = 0.35

If >0, concepts for which the detected name is their primary name will be preferred by that amount (0 to 1)

prefer_frequent_concepts: float = 0.35

If >0, concepts that are more frequent will be preferred by a multiple of this amount

subsample_after: int = 30000

DISABLED in code permanently: subsample during unsupervised training if a concept has received more than this many training examples.

devalue_linked_concepts: bool = False

When adding a positive example, should it also be treated as negative for concepts which link to the positive one via names (ambiguous names).

context_ignore_center_tokens: bool = False

If True, when the context of a concept is calculated (embedding) the words making up that concept are not taken into account

class medcat.config.Config(*args, **kwargs)

Bases: MixingConfig, pydantic.BaseModel

The MedCAT config

class Config
arbitrary_types_allowed = True
extra
validate_assignment = True
version: VersionInfo
cdb_maker: CDBMaker
annotation_output: AnnotationOutput
general: General
preprocessing: Preprocessing
ner: Ner
linking: Linking
word_skipper: re.Pattern
punct_checker: re.Pattern
hash: str | None
__init__(*args, **kwargs)
rebuild_re()
Return type:

None

get_hash()
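
A minimal end-to-end sketch of working with the config (the modified values are illustrative):

    from medcat.config import Config

    config = Config()
    config.general.spell_check = False              # validate_assignment checks new values
    config.preprocessing.words_to_skip.add('nos')   # illustrative entry
    config.rebuild_re()                             # refresh word_skipper / punct_checker after preprocessing changes
    print(config.get_hash())                        # hash reflecting the current settings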