medcat.config
Module Contents
Classes
FakeDict | FakeDict that allows the use of the __getitem__ and __setitem__ method for legacy access.
ValueExtractor | The current example config has a value for an empty set as '{}'.
MixingConfig | Config that is able to be saved and loaded from disk as well as mixed with other configs.
VersionInfo | The version info part of the config
CDBMaker | The Context Database (CDB) making part of the config
AnnotationOutput | The annotation output part of the config
CheckPoint | The checkpoint part of the config
General | The general part of the config
Preprocessing | The preprocessing part of the config
Ner | The NER part of the config
_DefPartial | This is a helper class to make it possible to check equality of two default Linking instances
LinkingFilters | These describe the linking filters used alongside the model.
Linking | The linking part of the config
Config | The MedCAT config
Functions
Attributes
- medcat.config.logger
- medcat.config.workers(workers_override=None)
- Parameters:
workers_override (Optional[int]) –
- Return type:
int
- class medcat.config.FakeDict
FakeDict that allows the use of the __getitem__ and __setitem__ method for legacy access.
- __getitem__(arg)
- Parameters:
arg (str) –
- Return type:
Any
- __setattr__(arg, val)
Implement setattr(self, name, value).
- Parameters:
arg (str) –
- Return type:
None
- __setitem__(arg, val)
- Parameters:
arg (str) –
- Return type:
None
- get(key, default=None)
- Return type:
Any
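The legacy access pattern described for FakeDict can be illustrated with a minimal sketch. This is not the MedCAT implementation: `LegacyAccess` and `DemoSection` are hypothetical names, and the sketch simply assumes that item access is forwarded to attribute access.

```python
from typing import Any


class LegacyAccess:
    """Hypothetical stand-in: forwards dict-style access to attributes."""

    def __getitem__(self, arg: str) -> Any:
        # section["key"] behaves like section.key
        return getattr(self, arg)

    def __setitem__(self, arg: str, val: Any) -> None:
        # section["key"] = value behaves like section.key = value
        setattr(self, arg, val)

    def get(self, key: str, default: Any = None) -> Any:
        # Mirrors dict.get: fall back to the default for unknown keys
        return getattr(self, key, default)


class DemoSection(LegacyAccess):
    """Hypothetical config section with a single field."""

    def __init__(self) -> None:
        self.spell_check = True


section = DemoSection()
section["spell_check"] = False  # legacy, dict-style write
```

After this runs, `section.spell_check` and `section["spell_check"]` refer to the same value, which is the point of keeping both access styles available.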
- medcat.config._EMPTY_DICT_2_EMPTY_SET: Callable[[str, Any], set | None]
- class medcat.config.ValueExtractor(alt_generators=[_EMPTY_DICT_2_EMPTY_SET])
The current example config has a value for an empty set as ‘{}’. However, that evaluates to an empty dictionary instead. In case there are other such examples, this allows adding other alternatives as well.
- Parameters:
alt_generators (List[Callable[[str, Any], Optional[Any]]]) –
- __init__(alt_generators=[_EMPTY_DICT_2_EMPTY_SET])
- Parameters:
alt_generators (List[Callable[[str, Any], Optional[Any]]]) –
- Return type:
None
- extract(rhs)
Extracts value and its alternatives based on the alternative generators defined.
- Parameters:
rhs (str) – The parsable right hand side
- Returns:
Tuple[str, List[str]] – The main value and the (potentially many) alternatives
- Return type:
Tuple[str, List[str]]
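The '{}' problem that ValueExtractor addresses can be sketched as follows. This is an illustration under assumptions, not the library code: the function names are hypothetical, and `ast.literal_eval` is used here in place of whatever evaluation the library performs.

```python
import ast
from typing import Any, Callable, List, Optional, Tuple

AltGenerator = Callable[[str, Any], Optional[Any]]


def empty_dict_to_empty_set(rhs: str, value: Any) -> Optional[set]:
    """Offer an empty set as an alternative when '{}' parsed to an empty dict."""
    if rhs.strip() == "{}" and value == {}:
        return set()
    return None


def extract_with_alternatives(
    rhs: str, generators: List[AltGenerator]
) -> Tuple[Any, List[Any]]:
    """Parse the right-hand side and collect any alternative values."""
    value = ast.literal_eval(rhs)  # '{}' evaluates to a dict, not a set
    alternatives = []
    for generator in generators:
        alt = generator(rhs, value)
        if alt is not None:
            alternatives.append(alt)
    return value, alternatives


main_value, alt_values = extract_with_alternatives("{}", [empty_dict_to_empty_set])
```

A caller could then try the main value first and fall back to an alternative if validation of the main value fails, which appears to be the role of `_set_value_or_alt`.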
- medcat.config._DEFAULT_EXTRACTOR
- medcat.config._set_value_or_alt(conf, key, value, alt_values, err=None)
- Parameters:
conf (MixingConfig) –
key (str) –
value (Any) –
alt_values (List[Any]) –
err (Optional[pydantic.ValidationError]) –
- Return type:
None
- class medcat.config.MixingConfig
Bases:
FakeDict
Config that is able to be saved and loaded from disk as well as mixed with other configs. It is not intended to be initialised directly and it is assumed that instances also inherit from pydantic’s BaseModel.
- save(save_path)
Save the config into a .json file
- Parameters:
save_path (str) – Where to save the created json file
- Return type:
None
- merge_config(config_dict)
Merge a config_dict with the existing config object.
- Parameters:
config_dict (Dict) – A dictionary which key/values should be added to this class.
- Return type:
None
- parse_config_file(path, extractor=_DEFAULT_EXTRACTOR)
- Parses a configuration file in text format. Must be like:
cat.<variable>.<key> = <value> …
variable: linking, general, ner, …
key: a key in the config dict e.g. subsample_after for linking
value: the value for the key, will be parsed with eval
- Parameters:
path (str) – the path to the config file
extractor (ValueExtractor) – (Default value = _DEFAULT_EXTRACTOR)
- Raises:
ValueError – In case of unknown attribute.
- Return type:
None
- rebuild_re()
- Return type:
None
- _calc_hash(hasher=None)
- Parameters:
hasher (Optional[medcat.utils.hasher.Hasher]) –
- Return type:
- get_hash(hasher=None)
- Parameters:
hasher (Optional[medcat.utils.hasher.Hasher]) –
- __str__()
Return str(self).
- Return type:
str
- classmethod load(save_path)
Load config from a json file. Note that fields that did not exist in the old config but do exist in the current version of the ConfigMetaCAT class will be kept.
- Parameters:
save_path (str) – Path to the json file to load
- Returns:
MixingConfig – The loaded config
- Return type:
MixingConfig
- classmethod from_dict(config_dict)
Generate a MixingConfig (of an extending type) from a dictionary.
- Parameters:
config_dict (Dict) – The dictionary to create the config from
- Returns:
MixingConfig – The resulting config
- Return type:
MixingConfig
- asdict()
Get the config as a dictionary.
- Returns:
Dict[str, Any] – The dictionary associated with this config
- Return type:
Dict[str, Any]
- fields()
Get the fields associated with this config.
- Returns:
Dict[str, ModelField] – The dictionary of the field names and fields
- Return type:
Dict[str, pydantic.fields.ModelField]
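The text format accepted by parse_config_file (`cat.<variable>.<key> = <value>`) can be sketched with a small line parser. This is an illustration only: `parse_config_line` is a hypothetical helper, and it uses `ast.literal_eval` rather than the `eval` mentioned in the method description, so it accepts only Python literals.

```python
import ast
from typing import Any, Tuple


def parse_config_line(line: str) -> Tuple[str, str, Any]:
    """Split 'cat.<variable>.<key> = <value>' into (variable, key, value)."""
    lhs, rhs = line.split("=", 1)  # split once so values may contain '='
    parts = lhs.strip().split(".")
    if len(parts) != 3 or parts[0] != "cat":
        raise ValueError(f"Unexpected left hand side: {lhs.strip()!r}")
    _, variable, key = parts
    return variable, key, ast.literal_eval(rhs.strip())


variable, key, value = parse_config_line("cat.linking.subsample_after = 100")
```

The resulting triple would then map to setting `subsample_after` on the `linking` part of the config.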
- class medcat.config.VersionInfo
Bases: MixingConfig, pydantic.BaseModel
The version info part of the config
- history: list = []
Populated automatically
- meta_cats: Any
Populated automatically
- cdb_info: dict
Populated automatically, output from cdb.print_stats
- performance: dict
NER general performance, meta should be: {‘meta’: {‘model_name’: {‘f1’: <>, ‘p’: <>, …}, …}}
- description: str = 'No description'
General description and what it was trained on
- id: Any
Will be: hash of most things
- last_modified: int | datetime.datetime | str | None
- location: str | None
Path, URL, or other reference to where this CDB is located
- ontology: str | List[str] | None
What was used to build the CDB, e.g. SNOMED_202009
- medcat_version: str | None
Which version of medcat was used to build the CDB
- class medcat.config.CDBMaker
Bases: MixingConfig, pydantic.BaseModel
The Context Database (CDB) making part of the config
- name_versions: list = ['LOWER', 'CLEAN']
Name versions to be generated.
- multi_separator: str = '|'
If multiple names or type_ids for a concept are present in one row of a CSV, they are separated by this character.
- remove_parenthesis: int = 5
Should preferred names with parentheses be cleaned. 0 means no; any other value means clean only names whose length is greater than or equal to it, e.g. Head (Body part) -> Head
- min_letters_required: int = 2
Minimum number of letters required in a name to be accepted for a concept
- class medcat.config.AnnotationOutput
Bases: MixingConfig, pydantic.BaseModel
The annotation output part of the config
- doc_extended_info: bool = False
- context_left: int
- context_right: int
- lowercase_context: bool = True
- include_text_in_output: bool = False
- class medcat.config.CheckPoint
Bases: MixingConfig, pydantic.BaseModel
The checkpoint part of the config
- output_dir: str = 'checkpoints'
When doing training this is the name of the directory where checkpoints will be saved
- steps: int | None
When training, how often to save a checkpoint (one step represents one document). If None, no checkpoints will be created.
- max_to_keep: int = 1
When training, the maximum number of checkpoints that will be kept on disk
- class medcat.config.General
Bases: MixingConfig, pydantic.BaseModel
The general part of the config
- spacy_disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler',...
- checkpoint: CheckPoint
Checkpointing config
- log_level: int
Logging config for everything. ‘tagger’ can be disabled, but this will cause a drop in performance.
- log_format: str = '%(levelname)s:%(name)s: %(message)s'
- log_path: str = './medcat.log'
- spacy_model: str = 'en_core_web_md'
What model will be used for tokenization
- separator: str = '~'
Separator that will be used to merge tokens of a name. Once a CDB is built this should always stay the same.
- spell_check: bool = True
Should we check spelling - note that this makes things much slower, use only if necessary. The only thing necessary for the spell checker to work is vocab.dat and cdb.dat built with concepts in the respective language.
- diacritics: bool = False
Should we process diacritics - for languages other than English, symbols such as ‘é, ë, ö’ can be relevant. Note that this makes spell_check slower.
- spell_check_deep: bool = False
If True the spell checker will try harder to find mistakes, this can slow down things drastically.
- spell_check_len_limit: int = 7
Spelling will not be checked for words with length less than this
- show_nested_entities: bool = False
If set to True functions like get_entities and get_json will return nested_entities and overlaps
- full_unlink: bool = False
When unlinking a name from a concept should we do full_unlink (means unlink a name from all concepts, not just the one in question)
- workers: int
Number of workers used by a parallelizable pipeline component
- make_pretty_labels: str | None
Should the labels of entities (shown in displacy) be pretty or just ‘concept’. This slows down the annotation pipeline and should not be used when annotating millions of documents. If None, the label will be the string “concept”; if short, it will be the CUI; if long, it will be CUI | Name | Confidence.
- map_cui_to_group: bool = False
If cdb.addl_info[‘cui2group’] is provided and this option is enabled, each CUI will be mapped to its group
- class medcat.config.Preprocessing
Bases: MixingConfig, pydantic.BaseModel
The preprocessing part of the config
- words_to_skip: set
These words will be completely ignored in concepts and in the text (must be a Set)
- keep_punct: set
All punct will be skipped by default, here you can set what will be kept
- do_not_normalize: set
Should specific word types be normalized: e.g. running -> run. Values are detailed part-of-speech tags. See:
- https://spacy.io/usage/linguistic-features#pos-tagging
- Label scheme section per model at https://spacy.io/models/en
- skip_stopwords: bool = False
Should stopwords be skipped/ignored when processing input
- min_len_normalize: int = 5
Nothing below this length will ever be normalized (input tokens or concept names), normalized means lemmatized in this case
- stopwords: set | None
If None, the default set of stopwords from spacy will be used. This must be a Set.
- max_document_length: int = 1000000
Documents longer than this will be trimmed
- class medcat.config.Ner
Bases: MixingConfig, pydantic.BaseModel
The NER part of the config
- min_name_len: int = 3
Do not detect names below this limit, skip them
- max_skip_tokens: int = 2
When checking tokens for concepts you can have skipped tokens in between used ones (usually spaces, new lines etc). This number sets how many skipped tokens are allowed.
- check_upper_case_names: bool = False
Check uppercase to distinguish uppercase and lowercase words that have a different meaning.
- upper_case_limit_len: int = 4
Any name shorter than this must be uppercase in the text to be considered. If it is not uppercase it will be skipped.
- try_reverse_word_order: bool = False
Try reverse word order for short concepts (2 words max), e.g. heart disease -> disease heart
- class medcat.config._DefPartial
This is a helper class to make it possible to check equality of two default Linking instances
- __init__()
- __call__(*args, **kwargs)
- __eq__(other)
Return self==value.
- medcat.config._DEFAULT_PARTIAL
- class medcat.config.LinkingFilters(**data)
Bases: MixingConfig, pydantic.BaseModel
These describe the linking filters used alongside the model.
When neither CUIs nor excluded CUIs are specified (the sets are empty), all CUIs are accepted. If there are CUIs specified then only those will be accepted. If there are excluded CUIs specified, they are excluded.
In some cases, there are extra filters as well as MedCATtrainer (MCT) export filters. These are expected to satisfy the following: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter
While any other CUIs can be included in the extra CUI filter or the MCT filter, they would not have any real effect.
- cuis: Set[str]
- cuis_exclude: Set[str]
- __init__(**data)
- check_filters(cui)
Checks whether a CUI is allowed by the filters
- Parameters:
cui (str) – The CUI in question
- Returns:
bool – True if the CUI is allowed
- Return type:
bool
- merge_with(other)
Merge CUIs and excluded CUIs within two filters. The data will be kept within this filter (and not the other).
- Parameters:
other (LinkingFilters) – The other filter to merge with
- Return type:
None
- copy_of()
Create a copy of this LinkingFilters. This copy will describe an identical filter but will refer to different sets so they can be mutated separately.
- Returns:
LinkingFilters – A copy of the original filters.
- Return type:
LinkingFilters
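The acceptance rules described for LinkingFilters can be written out as a small standalone function. This is a sketch of the stated rules, not the library's check_filters method; passing the two sets as arguments is an assumption made to keep the example self-contained.

```python
from typing import Set


def check_filters(cui: str, cuis: Set[str], cuis_exclude: Set[str]) -> bool:
    """Return True if the CUI passes the include and exclude filters."""
    # A non-empty include set restricts acceptance to its members
    if cuis and cui not in cuis:
        return False
    # Excluded CUIs are rejected regardless of the include set
    return cui not in cuis_exclude


accepts_everything = check_filters("C0000001", set(), set())
```

With both sets empty every CUI is accepted, which matches the default behaviour described above.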
- class medcat.config.Linking
Bases: MixingConfig, pydantic.BaseModel
The linking part of the config
- optim: dict
Linear anneal
- context_vector_sizes: dict
Context vector sizes that will be calculated and used for linking
- context_vector_weights: dict
Weight of each vector in the similarity score - make trainable at some point. Should add up to 1.
- filters: LinkingFilters
Filters
- train: bool = True
Should it train or not. This is set automatically; ignore it in 99% of cases and do not set it manually.
- random_replacement_unsupervised: float = 0.8
If <1, during unsupervised training the detected term will be randomly replaced, with a probability of 1 - random_replacement_unsupervised, by a synonym used for that term.
- disamb_length_limit: int = 3
All concepts below this will always be disambiguated
- filter_before_disamb: bool = False
If True it will filter before doing disamb. Useful for the trainer.
- train_count_threshold: int = 1
Concepts that have seen fewer training examples than this will not be used for similarity calculation and will have a similarity of -1.
- always_calculate_similarity: bool = False
Do we want to calculate context similarity even for concepts that are not ambiguous.
- weighted_average_function: Callable[Ellipsis, Any]
Weights for a weighted average, e.g. partial(weighted_average, factor=0.02)
- calculate_dynamic_threshold: bool = False
Concepts below this similarity will be ignored. The type can be static or dynamic. If dynamic, each CUI has a different threshold, calculated as the average confidence for that CUI * similarity_threshold. Note that dynamic works only if the cdb was trained with calculate_dynamic_threshold = True.
- similarity_threshold_type: str = 'static'
- similarity_threshold: float = 0.25
- negative_probability: float = 0.5
Probability for the negative context to be added for each positive addition
- negative_ignore_punct_and_num: bool = True
Do we ignore punct/num when negative sampling
- prefer_primary_name: float = 0.35
If >0 concepts for which a detection is its primary name will be preferred by that amount (0 to 1)
- prefer_frequent_concepts: float = 0.35
If >0, concepts that are more frequent will be preferred by a multiple of this amount
- subsample_after: int = 30000
DISABLED in code permanently: Subsample during unsupervised training if a concept has received more than this many training examples
- devalue_linked_concepts: bool = False
When adding a positive example, should it also be treated as negative for concepts which link to the positive one via names (ambiguous names).
- context_ignore_center_tokens: bool = False
If True, when the context of a concept is calculated (embedding) the words making up that concept are not taken into account
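The static and dynamic threshold types described under calculate_dynamic_threshold can be sketched as a single function. This only restates the formula given there (average confidence for the CUI multiplied by similarity_threshold); the function name and its parameters are hypothetical, and how MedCAT tracks the average confidence is not shown.

```python
def effective_threshold(
    threshold_type: str, similarity_threshold: float, average_confidence: float
) -> float:
    """Return the similarity cut-off for one CUI under the given threshold type."""
    if threshold_type == "static":
        # The same cut-off applies to every CUI
        return similarity_threshold
    if threshold_type == "dynamic":
        # Per-CUI cut-off scaled by that CUI's average confidence
        return average_confidence * similarity_threshold
    raise ValueError(f"Unknown threshold type: {threshold_type!r}")


static_cutoff = effective_threshold("static", 0.25, 0.8)
dynamic_cutoff = effective_threshold("dynamic", 0.25, 0.8)
```

Under these assumptions a CUI with a lower average confidence gets a lower cut-off in dynamic mode, while static mode ignores the confidence entirely.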
- class medcat.config.Config(*args, **kwargs)
Bases: MixingConfig, pydantic.BaseModel
The MedCAT config
- version: VersionInfo
- annotation_output: AnnotationOutput
- preprocessing: Preprocessing
- word_skipper: re.Pattern
- punct_checker: re.Pattern
- hash: str | None
- __init__(*args, **kwargs)
- rebuild_re()
- Return type:
None
- get_hash()