medcat.cat

Module Contents

Classes

CAT

The main MedCAT class used to annotate documents; it is built on top of spaCy and works as a spaCy pipeline.

Attributes

logger

HAS_NEW_SPACY

medcat.cat.logger
medcat.cat.HAS_NEW_SPACY
class medcat.cat.CAT(cdb, vocab=None, config=None, meta_cats=[], rel_cats=[], addl_ner=[])

Bases: object

The main MedCAT class used to annotate documents; it is built on top of spaCy and works as a spaCy pipeline. Creates an instance of a spaCy pipeline that can be used as a spaCy nlp model.

Parameters:
Attributes (limited):
cdb (medcat.cdb.CDB):

Concept database used with this CAT instance, please do not assign this value directly.

config (medcat.config.Config):

The global configuration for medcat. Usually cdb.config will be used for this field. WILL BE REMOVED - TEMPORARY PLACEHOLDER

vocab (medcat.utils.vocab.Vocab):

The vocabulary object used with this instance, please do not assign this value directly.

Examples

>>> cat = CAT(cdb, vocab)
>>> spacy_doc = cat("Put some text here")
>>> print(spacy_doc.ents) # Detected entities
DEFAULT_MODEL_PACK_NAME = 'medcat_model_pack'
__init__(cdb, vocab=None, config=None, meta_cats=[], rel_cats=[], addl_ner=[])
Parameters:
Return type:

None

_create_pipeline(config)
Parameters:

config (medcat.config.Config) –

get_spacy_nlp()

Returns the spacy pipeline with MedCAT

Returns:

Language – The spacy Language being used.

Return type:

spacy.language.Language

get_hash(force_recalc=False)

Will not be a deep hash but will try to catch all the changing parts during training.

Able to force recalculation of the hash. This is relevant for the CDB, whose hash is otherwise only recalculated if it has changed.

Parameters:

force_recalc (bool) – Whether to force recalculation. Defaults to False.

Returns:

str – The resulting hash

Return type:

str

get_model_card(as_dict=False)

A minimal model card for MedCAT model packs.

Parameters:

as_dict (bool) – Whether to return the model card as a dictionary instead of a str (Default value False).

Returns:
  • str – The string representation of the JSON object.

  • OR

  • dict – The dict JSON object.
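
Examples

A minimal usage sketch (illustrative; assumes cat is an already loaded CAT instance):

>>> print(cat.get_model_card())              # JSON string representation
>>> card = cat.get_model_card(as_dict=True)  # the same content as a dict
>>> sorted(card.keys())                      # inspect the available fields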

_versioning(force_rehash=False)
Parameters:

force_rehash (bool) –

create_model_pack(save_dir_path, model_pack_name=DEFAULT_MODEL_PACK_NAME, force_rehash=False, cdb_format='dill')

Will create a .zip file containing all the models in the currently running instance of MedCAT. This is not the most efficient approach, but it is good enough for now.

Parameters:
  • save_dir_path (str) – An id will be appended to this name

  • model_pack_name (str) – The model pack name. Defaults to DEFAULT_MODEL_PACK_NAME.

  • force_rehash (bool) – Force recalculation of hash. Defaults to False.

  • cdb_format (str) – The format of the saved CDB in the model pack. The available formats are ‘dill’ and ‘json’. Defaults to ‘dill’.

Returns:

str – Model pack name

Return type:

str
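
Examples

A minimal sketch (illustrative; assumes cat is an existing CAT instance and the output directory is hypothetical):

>>> pack_name = cat.create_model_pack("./models", model_pack_name="my_pack")
>>> pack_name  # the returned pack name has an id appended to it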

classmethod attempt_unpack(zip_path)

Attempt to unpack the zip to a folder and get the model pack path.

If the folder already exists, no unpacking is done.

Parameters:

zip_path (str) – The ZIP path

Returns:

str – The model pack path

Return type:

str

classmethod load_model_pack(zip_path, meta_cat_config_dict=None, ner_config_dict=None, load_meta_models=True, load_addl_ner=True, load_rel_models=True)

Load everything within the ‘model pack’, i.e. the CDB, config, vocab and any MetaCAT models (if present)

Parameters:
  • zip_path (str) – The path to model pack zip.

  • meta_cat_config_dict (Optional[Dict]) – A config dict that will overwrite existing configs in meta_cat. e.g. meta_cat_config_dict = {‘general’: {‘device’: ‘cpu’}}. Defaults to None.

  • ner_config_dict (Optional[Dict]) – A config dict that will overwrite existing configs in transformers NER. e.g. ner_config_dict = {‘general’: {‘chunking_overlap_window’: 6}}. Defaults to None.

  • load_meta_models (bool) – Whether to load MetaCAT models if present (Default value True).

  • load_addl_ner (bool) – Whether to load additional NER models if present (Default value True).

  • load_rel_models (bool) – Whether to load RelCAT models if present (Default value True).

Returns:

CAT – The resulting CAT object.

Return type:

CAT
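
Examples

A minimal sketch (the zip path and text are hypothetical; the config override mirrors the documented example above):

>>> cat = CAT.load_model_pack(
...     "medcat_model_pack.zip",
...     meta_cat_config_dict={'general': {'device': 'cpu'}})  # e.g. run MetaCAT models on CPU
>>> doc = cat("Patient presents with hypertension.")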

__call__(text, do_train=False)

Push the text through the pipeline.

Parameters:
  • text (Optional[str]) – The text to be annotated; if the text is longer than self.config.preprocessing[‘max_document_length’] it will be trimmed to that length.

  • do_train (bool) – If True, training is run while processing the text. To run training it is much better to use the self.train() function, but this option is kept for some special cases. Defaults to False.

Returns:

Optional[Doc] – A single spacy document or multiple spacy documents with the extracted entities

Return type:

Optional[spacy.tokens.Doc]

__repr__()

Returns the model card for this CAT instance.

Returns:

str – the ‘Model Card’ for this CAT instance. This includes NER+L config and any MetaCATs

Return type:

str

_print_stats(data, epoch=0, use_project_filters=False, use_overlaps=False, use_cui_doc_limit=False, use_groups=False, extra_cui_filter=None, do_print=True)

TODO: Refactor and make nice. Prints metrics on a dataset (F1, P, R); it will also print the concepts that have the most FP, FN, TP.

Parameters:
  • data (Dict) – The json object that we get from MedCATtrainer on export.

  • epoch (int) – Used during training, so we know which epoch it is.

  • use_project_filters (bool) – Each project in MedCATtrainer can have filters; whether to respect those filters when calculating metrics.

  • use_overlaps (bool) – Allow overlapping entities, nearly always False as it is very difficult to annotate overlapping entities.

  • use_cui_doc_limit (bool) – If True the metrics for a CUI will be only calculated if that CUI appears in a document, in other words if the document was annotated for that CUI. Useful in very specific situations when during the annotation process the set of CUIs changed.

  • use_groups (bool) – If True concepts that have groups will be combined and stats will be reported on groups.

  • extra_cui_filter (Optional[Set]) – This filter will be intersected with all other filters, or if all others are not set then only this one will be used.

  • do_print (bool) – Whether to print stats out. Defaults to True.

Returns:
  • fps (dict) – False positives for each CUI.

  • fns (dict) – False negatives for each CUI.

  • tps (dict) – True positives for each CUI.

  • cui_prec (dict) – Precision for each CUI.

  • cui_rec (dict) – Recall for each CUI.

  • cui_f1 (dict) – F1 for each CUI.

  • cui_counts (dict) – Number of occurrences for each CUI.

  • examples (dict) – Examples for each of the fp, fn, tp. Format will be examples[‘fp’][‘cui’][<list_of_examples>].

Return type:

Tuple

_init_ckpts(is_resumed, checkpoint)
train(data_iterator, nepochs=1, fine_tune=True, progress_print=1000, checkpoint=None, is_resumed=False)

Runs training on the data, note that the maximum length of a line or document is 1M characters. Anything longer will be trimmed.

Parameters:
  • data_iterator (Iterable) – Simple iterator over sentences/documents, e.g. an open file, an array, or anything that we can use in a for loop.

  • nepochs (int) – Number of epochs for which to run the training.

  • fine_tune (bool) – If False old training will be removed.

  • progress_print (int) – Print progress after N lines.

  • checkpoint (Optional[medcat.utils.checkpoint.CheckpointUT]) – The MedCAT checkpoint object

  • is_resumed (bool) – If True resume the previous training; If False, start a fresh new training.

Return type:

None
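
Examples

A minimal sketch (illustrative; documents.txt is a hypothetical file with one document per line):

>>> with open("documents.txt") as f:
...     cat.train(f, nepochs=1, progress_print=1000)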

add_cui_to_group(cui, group_name)

Adds a CUI to a group; it will appear in cdb.addl_info[‘cui2group’].

Parameters:
  • cui (str) – The concept to be added.

  • group_name (str) – The group to which the concept will be added.

Return type:

None

Examples

>>> cat.add_cui_to_group("S-17", 'pain')

unlink_concept_name(cui, name, preprocessed_name=False)

Unlink a concept name from the CUI (or all CUIs if full_unlink); this removes the link from the Concept Database (CDB). As a consequence, MedCAT will never again link the name to this CUI, meaning the name will not be detected as a concept in the future.

Parameters:
  • cui (str) – The CUI from which the name will be removed.

  • name (str) – The span of text to be removed from the linking dictionary.

  • preprocessed_name (bool) – Whether the name being used is preprocessed.

Return type:

None

Examples

>>> # To never again link C0020538 to HTN
>>> cat.unlink_concept_name('C0020538', 'htn', False)
add_and_train_concept(cui, name, spacy_doc=None, spacy_entity=None, ontologies=set(), name_status='A', type_ids=set(), description='', full_build=True, negative=False, devalue_others=False, do_add_concept=True)

Add a name to an existing concept, or add a new concept, or do not do anything if the name or concept already exists. Perform training if spacy_entity and spacy_doc are set.

Parameters:
  • cui (str) – CUI of the concept.

  • name (str) – Name to be linked to the concept (in the case of MedCATtrainer this is simply the selected value in text, no preprocessing or anything needed).

  • spacy_doc (spacy.tokens.Doc) – Spacy representation of the document that was manually annotated.

  • spacy_entity (Optional[Union[List[Token], Span]]) – Given the spacy document, this is the annotated span of text - list of annotated tokens that are marked with this CUI.

  • ontologies (Set[str]) – ontologies in which the concept exists (e.g. SNOMEDCT, HPO)

  • name_status (str) – One of P, N, A

  • type_ids (Set[str]) – Semantic type identifier (have a look at TUIs in UMLS or SNOMED-CT)

  • description (str) – Description of this concept.

  • full_build (bool) – If True the dictionary self.addl_info will also be populated. It contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for the normal functioning of MedCAT (Default value True).

  • negative (bool) – Is this a negative or positive example.

  • devalue_others (bool) – If set, CUIs to which this name is assigned other than cui will receive negative training, provided that negative=False.

  • do_add_concept (bool) – Whether to add concept to CDB.

Return type:

None
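
Examples

A minimal sketch (the CUI, name, ontology, type id and description are illustrative values):

>>> cat.add_and_train_concept(
...     cui="C0020538",
...     name="hypertension",
...     ontologies={"SNOMEDCT"},
...     name_status="P",
...     type_ids={"T047"},
...     description="High blood pressure")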

train_supervised(data_path, reset_cui_count=False, nepochs=1, print_stats=0, use_filters=False, terminate_last=False, use_overlaps=False, use_cui_doc_limit=False, test_size=0, devalue_others=False, use_groups=False, never_terminate=False, train_from_false_positives=False, extra_cui_filter=None, retain_extra_cui_filter=False, checkpoint=None, retain_filters=False, is_resumed=False)

Train supervised by reading data from a json file.

Refer to train_supervised_from_json and/or train_supervised_raw for further details.

Parameters:
  • data_path (str) –

  • reset_cui_count (bool) –

  • nepochs (int) –

  • print_stats (int) –

  • use_filters (bool) –

  • terminate_last (bool) –

  • use_overlaps (bool) –

  • use_cui_doc_limit (bool) –

  • test_size (int) –

  • devalue_others (bool) –

  • use_groups (bool) –

  • never_terminate (bool) –

  • train_from_false_positives (bool) –

  • extra_cui_filter (Optional[Set]) –

  • retain_extra_cui_filter (bool) –

  • checkpoint (Optional[medcat.utils.checkpoint.Checkpoint]) –

  • retain_filters (bool) –

  • is_resumed (bool) –

Return type:

Tuple

train_supervised_from_json(data_path, reset_cui_count=False, nepochs=1, print_stats=0, use_filters=False, terminate_last=False, use_overlaps=False, use_cui_doc_limit=False, test_size=0, devalue_others=False, use_groups=False, never_terminate=False, train_from_false_positives=False, extra_cui_filter=None, retain_extra_cui_filter=False, checkpoint=None, retain_filters=False, is_resumed=False)

Run supervised training on a dataset from MedCATtrainer in JSON format.

Refer to train_supervised_raw for more details.

Parameters:
  • data_path (str) –

  • reset_cui_count (bool) –

  • nepochs (int) –

  • print_stats (int) –

  • use_filters (bool) –

  • terminate_last (bool) –

  • use_overlaps (bool) –

  • use_cui_doc_limit (bool) –

  • test_size (int) –

  • devalue_others (bool) –

  • use_groups (bool) –

  • never_terminate (bool) –

  • train_from_false_positives (bool) –

  • extra_cui_filter (Optional[Set]) –

  • retain_extra_cui_filter (bool) –

  • checkpoint (Optional[medcat.utils.checkpoint.Checkpoint]) –

  • retain_filters (bool) –

  • is_resumed (bool) –

Return type:

Tuple
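
Examples

A minimal sketch (mct_export.json stands in for a MedCATtrainer export file; the returned tuple mirrors the one documented for train_supervised_raw below):

>>> fp, fn, tp, p, r, f1, cui_counts, examples = cat.train_supervised_from_json(
...     "mct_export.json", nepochs=1, print_stats=1, use_filters=True)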

train_supervised_raw(data, reset_cui_count=False, nepochs=1, print_stats=0, use_filters=False, terminate_last=False, use_overlaps=False, use_cui_doc_limit=False, test_size=0, devalue_others=False, use_groups=False, never_terminate=False, train_from_false_positives=False, extra_cui_filter=None, retain_extra_cui_filter=False, checkpoint=None, retain_filters=False, is_resumed=False)

Train supervised based on the raw data provided.

The raw data is expected in the following format:

{'projects': [  # list of projects
    {  # project 1
        'name': '<some name>',
        'documents': [  # list of documents
            {  # document 1
                'name': '<some name>',
                'text': '<text of the document>',
                'annotations': [  # list of annotations
                    {  # annotation 1
                        'start': -1,
                        'end': 1,
                        'cui': 'cui',
                        'value': '<text value>'}, ...]
            }, ...]
    }, ...]
}

Please take care that this is more a simulated online training than supervised training.

When filtering, the filters within the CAT model are used first, then the ones from MedCATtrainer (MCT) export filters, and finally the extra_cui_filter (if set). That is to say, the expectation is: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter.

Parameters:
  • data (Dict[str, List[Dict[str, dict]]]) – The raw data, e.g from MedCATtrainer on export.

  • reset_cui_count (bool) – Used for training with weight_decay (annealing). Each concept has a count that is there from the beginning of the CDB; that count is used for annealing. Resetting the count will significantly increase the training impact. This will reset the count only for concepts that exist in the training data.

  • nepochs (int) – Number of epochs for which to run the training.

  • print_stats (int) – If > 0 it will print stats every print_stats epochs.

  • use_filters (bool) – Each project in MedCATtrainer can have filters; whether to respect those filters when calculating metrics.

  • terminate_last (bool) – If true, concept termination will be done after all training.

  • use_overlaps (bool) – Allow overlapping entities, nearly always False as it is very difficult to annotate overlapping entities.

  • use_cui_doc_limit (bool) – If True the metrics for a CUI will be only calculated if that CUI appears in a document, in other words if the document was annotated for that CUI. Useful in very specific situations when during the annotation process the set of CUIs changed.

  • test_size (float) – If > 0 the dataset will be split into train/test based on this ratio. Should be between 0 and 1. Usually 0.1 is fine.

  • devalue_others (bool) – Check add_name for more details.

  • use_groups (bool) – If True concepts that have groups will be combined and stats will be reported on groups.

  • never_terminate (bool) – If True no termination will be applied

  • train_from_false_positives (bool) – If True it will use false positive examples detected by medcat and train from them as negative examples.

  • extra_cui_filter (Optional[Set]) – This filter will be intersected with all other filters, or if all others are not set then only this one will be used.

  • retain_extra_cui_filter (bool) – Whether to retain the extra filters instead of the MedCATtrainer export filters. This will only have an effect if/when retain_filters is set to True. Defaults to False.

  • checkpoint (Optional[medcat.utils.checkpoint.CheckpointST]) – The MedCAT CheckpointST object.

  • retain_filters (bool) – If True, retain the filters in the MedCATtrainer export within this CAT instance. In other words, the filters defined in the input file will henceforth be saved within config.linking.filters. This only makes sense if there is only one project in the input data. If that is not the case, a ValueError is raised. The merging is done in the first epoch.

  • is_resumed (bool) – If True resume the previous training; If False, start a fresh new training.

Raises:

ValueError – If attempting to retain filters while training over multiple projects.

Returns:

Tuple – Consisting of the following parts:
  • fp (dict) – False positives for each CUI.

  • fn (dict) – False negatives for each CUI.

  • tp (dict) – True positives for each CUI.

  • p (dict) – Precision for each CUI.

  • r (dict) – Recall for each CUI.

  • f1 (dict) – F1 for each CUI.

  • cui_counts (dict) – Number of occurrences for each CUI.

  • examples (dict) – FP/FN examples of sentences for each CUI.

Return type:

Tuple
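
Examples

A minimal sketch mirroring the raw format described above (all values are illustrative):

>>> data = {'projects': [{
...     'name': 'example project',
...     'documents': [{
...         'name': 'doc 1',
...         'text': 'Patient has hypertension.',
...         'annotations': [{'start': 12, 'end': 24,
...                          'cui': 'C0020538', 'value': 'hypertension'}]}]}]}
>>> fp, fn, tp, p, r, f1, cui_counts, examples = cat.train_supervised_raw(data, nepochs=1)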

get_entities(text, only_cui=False, addl_info=['cui2icd10', 'cui2ontologies', 'cui2snomed'])
Parameters:
  • text (str) –

  • only_cui (bool) –

  • addl_info (List[str]) –

Return type:

Dict
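
Examples

A minimal sketch (the text is illustrative):

>>> out = cat.get_entities("Patient has hypertension.")
>>> out['entities']  # detected annotations keyed by entity id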

get_entities_multi_texts(texts, only_cui=False, addl_info=['cui2icd10', 'cui2ontologies', 'cui2snomed'], n_process=None, batch_size=None)

Get entities

Parameters:
  • texts (Union[Iterable[str], Iterable[Tuple]]) – Texts to be annotated.

  • only_cui (bool) – Whether to only return CUIs. Defaults to False.

  • addl_info (List[str]) – Additional info. Defaults to [‘cui2icd10’, ‘cui2ontologies’, ‘cui2snomed’].

  • n_process (Optional[int]) – Number of processes. Defaults to None.

  • batch_size (Optional[int]) – The size of a batch. Defaults to None.

Raises:
  • ValueError – If there’s a known issue with multiprocessing.

  • RuntimeError – If there’s an unknown issue with multiprocessing.

Returns:

List[Dict] – List of entity documents.

Return type:

List[Dict]
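
Examples

A minimal sketch (texts and the process/batch settings are illustrative):

>>> texts = ["Patient has hypertension.", "No history of diabetes."]
>>> results = cat.get_entities_multi_texts(texts, n_process=2, batch_size=8)
>>> len(results)  # one entity document per input text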

get_json(text, only_cui=False, addl_info=['cui2icd10', 'cui2ontologies'])

Get output in JSON format.

Parameters:
  • text (str) – Text to be annotated

  • only_cui (bool) – Whether to only get CUIs. Defaults to False.

  • addl_info (List[str]) – Additional info. Defaults to [‘cui2icd10’, ‘cui2ontologies’].

Returns:

str – JSON with fields {‘entities’: <>, ‘text’: text}.

Return type:

str
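
Examples

A minimal sketch (the text is illustrative):

>>> import json
>>> out = json.loads(cat.get_json("Patient has hypertension."))
>>> out.keys()  # 'entities' and 'text', as described above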

static _get_training_start(train_set, latest_trained_step)
_separate_nn_components()
_run_nn_components(docs, nn_components, id2text)

This will add meta_anns in-place to the docs dict.

Parameters:
  • docs (Dict) –

  • nn_components (List) –

  • id2text (Dict) –

Return type:

None

_batch_generator(data, batch_size_chars, skip_ids=set())
Parameters:
  • data (Iterable) –

  • batch_size_chars (int) –

  • skip_ids (Set) –

_save_docs_to_file(docs, annotated_ids, save_dir_path, annotated_ids_path, part_counter=0)
Parameters:
  • docs (Iterable) –

  • annotated_ids (List[str]) –

  • save_dir_path (str) –

  • annotated_ids_path (Optional[str]) –

  • part_counter (int) –

Return type:

int

multiprocessing(data, nproc=2, batch_size_chars=5000 * 1000, only_cui=False, addl_info=['cui2icd10', 'cui2ontologies', 'cui2snomed'], separate_nn_components=True, out_split_size_chars=None, save_dir_path=os.path.abspath(os.getcwd()), min_free_memory=0.1)
Parameters:
  • data (Union[List[Tuple], Iterable[Tuple]]) –

  • nproc (int) –

  • batch_size_chars (int) –

  • only_cui (bool) –

  • addl_info (List[str]) –

  • separate_nn_components (bool) –

  • out_split_size_chars (Optional[int]) –

  • save_dir_path (str) –

  • min_free_memory (float) –

Return type:

Dict

multiprocessing_batch_char_size(data, nproc=2, batch_size_chars=5000 * 1000, only_cui=False, addl_info=[], separate_nn_components=True, out_split_size_chars=None, save_dir_path=os.path.abspath(os.getcwd()), min_free_memory=0.1, min_free_memory_size=None, enabled_progress_bar=True)

Run multiprocessing for inference. If save_dir_path and out_split_size_chars are used, this will also continue annotating documents if something is already saved in that directory.

This method batches the data based on the number of characters as specified by user.

PS: This method is unlikely to work on a Windows machine.

Parameters:
  • data (Union[List[Tuple], Iterable[Tuple]]) – Iterator or array with format: [(id, text), (id, text), …]

  • nproc (int) – Number of processors. Defaults to 2.

  • batch_size_chars (int) – Size of a batch in number of characters; this should be around NPROC * average_document_length * 200. Defaults to 5,000,000 (5000 * 1000).

  • only_cui (bool) – Whether to only return the CUIs rather than the full annotations. Defaults to False.

  • addl_info (List[str]) – The additional information. Defaults to [].

  • separate_nn_components (bool) – If set the medcat pipe will be broken up into NN and not-NN components and they will be run sequentially. This is useful as the NN components have batching and like to process many docs at once, while the rest of the pipeline runs the documents one by one. Defaults to True.

  • out_split_size_chars (Optional[int]) – If set, once more than out_split_size_chars characters have been annotated they will be saved to a file (save_dir_path) and the memory cleared. Recommended value is 20*batch_size_chars.

  • save_dir_path (str) – Where to save the annotated documents if splitting. Defaults to the current working directory.

  • min_free_memory (float) – If set, a process will not start unless there is at least this much RAM left; should be a value in the range [0, 1] meaning how much of the memory has to be free. Helps when annotating very large datasets because spaCy is not the best with memory management and multiprocessing. If both min_free_memory and min_free_memory_size are set, a ValueError is raised. Defaults to 0.1.

  • min_free_memory_size (Optional[str]) – If set, the process will not start unless the specified amount of memory is available. For reference, we would recommend at least 5GB of memory for a full SNOMED model. You can use human-readable sizes (e.g. 2GB, 2000MB and so on). If both min_free_memory and min_free_memory_size are set, a ValueError is raised. Defaults to None.

  • enabled_progress_bar (bool) – Whether to enable the progress bar. Defaults to True.

Raises:
  • Exception – If multiprocessing cannot be done.

  • ValueError – If both free memory specifiers are provided.

Returns:

Dict – {id: doc_json, id2: doc_json2, …}. If out_split_size_chars is used, the last batch will be returned while it and all previous batches will be written to disk (save_dir_path).

Return type:

Dict
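
Examples

A minimal sketch (ids, texts and the keyword values are illustrative):

>>> data = [("doc1", "Patient has hypertension."),
...         ("doc2", "No history of diabetes.")]
>>> results = cat.multiprocessing_batch_char_size(data, nproc=2, batch_size_chars=100000)
>>> results["doc1"]  # the annotated document as a JSON-like dict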

_multiprocessing_batch(data, nproc=8, batch_size_chars=1000000, only_cui=False, addl_info=[], nn_components=[], min_free_memory=0.1, min_free_memory_size=None)

Run multiprocessing on one batch.

Parameters:
  • data (Union[List[Tuple], Iterable[Tuple]]) – Iterator or array with format: [(id, text), (id, text), …].

  • nproc (int) – Number of processors. Defaults to 8.

  • batch_size_chars (int) – Size of a batch in number of characters. Defaults to 1,000,000.

  • only_cui (bool) – Whether to get only CUIs. Defaults to False.

  • addl_info (List[str]) – Additional info. Defaults to [].

  • nn_components (List) – NN components in case there’s a separation. Defaults to [].

  • min_free_memory (float) – If set, a process will not start unless there is at least this much RAM left; should be a value in the range [0, 1] meaning how much of the memory has to be free. Helps when annotating very large datasets because spaCy is not the best with memory management and multiprocessing. Defaults to 0.1.

  • min_free_memory_size (Optional[int]) – The minimum human readable memory size required.

Returns:

Dict – {id: doc_json, id2: doc_json2, …}

Return type:

Dict

multiprocessing_pipe(in_data, nproc=None, batch_size=None, only_cui=False, addl_info=[], return_dict=True, batch_factor=2)
Parameters:
  • in_data (Union[List[Tuple], Iterable[Tuple]]) –

  • nproc (Optional[int]) –

  • batch_size (Optional[int]) –

  • only_cui (bool) –

  • addl_info (List[str]) –

  • return_dict (bool) –

  • batch_factor (int) –

Return type:

Union[List[Tuple], Dict]

multiprocessing_batch_docs_size(in_data, nproc=None, batch_size=None, only_cui=False, addl_info=['cui2icd10', 'cui2ontologies', 'cui2snomed'], return_dict=True, batch_factor=2)

Run multiprocessing NOT FOR TRAINING.

This method batches the data based on the number of documents as specified by the user.

PS: This method supports Windows.

Parameters:
  • in_data (Union[List[Tuple], Iterable[Tuple]]) – List with format: [(id, text), (id, text), …]

  • nproc (Optional[int]) – The number of processors. Defaults to None.

  • batch_size (Optional[int]) – The number of texts to buffer. Defaults to None.

  • only_cui (bool) – Whether to get only CUIs. Defaults to False.

  • addl_info (List[str]) – Additional info. Defaults to [‘cui2icd10’, ‘cui2ontologies’, ‘cui2snomed’].

  • return_dict (bool) – Flag for returning either a dict or a list of tuples. Defaults to True.

  • batch_factor (int) – Batch factor. Defaults to 2.

Raises:

ValueError – When number of processes is 0.

Returns:

Union[List[Tuple], Dict] – {id: doc_json, id: doc_json, …} or if return_dict is False, a list of tuples: [(id, doc_json), (id, doc_json), …]

Return type:

Union[List[Tuple], Dict]
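
Examples

A minimal sketch (ids, texts and the keyword values are illustrative):

>>> data = [("doc1", "Patient has hypertension."),
...         ("doc2", "No history of diabetes.")]
>>> results = cat.multiprocessing_batch_docs_size(data, nproc=2, batch_size=2)
>>> sorted(results.keys())  # doc ids map to their annotated documents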

_mp_cons(in_q, out_list, min_free_memory, lock, min_free_memory_size=None, pid=0, only_cui=False, addl_info=[])
Parameters:
  • in_q (multiprocess.queues.Queue) –

  • out_list (List) –

  • min_free_memory (float) –

  • lock (multiprocess.synchronize.Lock) –

  • min_free_memory_size (Optional[int]) –

  • pid (int) –

  • only_cui (bool) –

  • addl_info (List) –

Return type:

None

_add_nested_ent(doc, _ents, _ent)
Parameters:
  • doc (spacy.tokens.Doc) –

  • _ents (List[spacy.tokens.Span]) –

  • _ent (Union[Dict, spacy.tokens.Span]) –

Return type:

None

_doc_to_out(doc, only_cui, addl_info, out_with_text=False)
Parameters:
  • doc (spacy.tokens.Doc) –

  • only_cui (bool) –

  • addl_info (List[str]) –

  • out_with_text (bool) –

Return type:

Dict

_get_trimmed_text(text)
Parameters:

text (Optional[str]) –

Return type:

str

_generate_trimmed_texts(texts)
Parameters:

texts (Union[Iterable[str], Iterable[Tuple]]) –

Return type:

Iterable[str]

_get_trimmed_texts(texts)
Parameters:

texts (Union[Iterable[str], Iterable[Tuple]]) –

Return type:

List[str]

static _pipe_error_handler(proc_name, proc, docs, e)
Parameters:
  • proc_name (str) –

  • proc (medcat.pipe.Pipe) –

  • docs (List[spacy.tokens.Doc]) –

  • e (Exception) –

Return type:

None

static _get_doc_annotations(doc)
Parameters:

doc (spacy.tokens.Doc) –

destroy_pipe()