medcat.utils.ner.deid
De-identification model.
This describes a wrapper on the regular CAT model. The idea is to simplify the use of a DeId-specific model.
It covers two use cases: 1) creating a DeId model and 2) loading and using a DeId model.
For use case 1, instead of:
    cat = CAT(cdb=ner.cdb, addl_ner=ner)
you can use:
    deid = DeIdModel.create(ner)
For use case 2, instead of:
    cat = CAT.load_model_pack(model_pack_path)
    anon_text = deid_text(cat, text)
you can use:
    deid = DeIdModel.load_model_pack(model_pack_path)
    anon_text = deid.deid_text(text)
Or, if structured output is desired:
    deid = DeIdModel.load_model_pack(model_pack_path)
    anon_doc = deid(text)  # the spacy document
The wrapper also exposes some CAT parts directly:
- config
- cdb
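Module Contents
Classes
DeIdModel | The DeID model.
Functions
match_rules(rules, texts, cui2preferred_name) | Match a set of rules - pat / cui combos as post processing labels.
merge_all_preds(model_preds_by_text, rule_matches_per_text, accept_preds=True) | Convenience method to merge predictions from rule based and deID model predictions.
merge_preds(model_preds, rule_matches, accept_preds=True) | Merge predictions from rule based and deID model predictions.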
Attributes
- medcat.utils.ner.deid.logger
- class medcat.utils.ner.deid.DeIdModel(cat)
Bases: medcat.utils.ner.model.NerModel
The DeID model.
This wraps a CAT instance and simplifies its use as a de-identification model.
It provides methods for creating one from a TransformersNER as well as loading from a model pack (along with some validation).
It also exposes some useful parts of the CAT it wraps such as the config and the concept database.
- Parameters:
cat (medcat.cat.CAT) –
- __init__(cat)
- Parameters:
cat (medcat.cat.CAT) –
- Return type:
None
- train(json_path=None, *args, **kwargs)
Train the underlying transformers NER model.
All the extra arguments are passed to the TransformersNER train method.
- Parameters:
json_path (Union[str, list, None]) – The JSON file path to read the training data from.
train_nr (int) – The number of the NER object in cat._addl_train to train. Defaults to 0.
*args – Additional arguments for TransformersNER.train .
**kwargs – Additional keyword arguments for TransformersNER.train .
- Returns:
Tuple[Any, Any, Any] – df, examples, dataset
- Return type:
Tuple[Any, Any, Any]
- eval(json_path, *args, **kwargs)
Evaluate the underlying transformers NER model.
All the extra arguments are passed to the TransformersNER eval method.
- Parameters:
json_path (Union[str, list, None]) – The JSON file path to read the evaluation data from.
train_nr (int) – The number of the NER object in cat._addl_train to train. Defaults to 0.
*args – Additional arguments for TransformersNER.eval .
**kwargs – Additional keyword arguments for TransformersNER.eval .
- Returns:
Tuple[Any, Any, Any] – df, examples, dataset
- Return type:
Tuple[Any, Any, Any]
- deid_text(text, redact=False)
Deidentify text and optionally redact information.
Returns the de-identified text. If redaction is enabled, identifiable entities are replaced with stars (e.g. *****). Otherwise, each entity is replaced with its CUI, in other words the type of information that was hidden (e.g. [PATIENT]).
- Parameters:
text (str) – The text to deidentify.
redact (bool) – Whether to redact the information.
- Returns:
str – The deidentified text.
- Return type:
str
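The two replacement modes described for deid_text can be illustrated without MedCAT. The following is a minimal sketch, not the library's implementation: `replace_spans` is a hypothetical helper, the entity dictionaries with `start`, `end`, and `label` keys are an assumed shape, and using one star per replaced character is also an assumption.

```python
from typing import Dict, List


def replace_spans(text: str, entities: List[Dict], redact: bool = False) -> str:
    """Replace each entity span with stars (redact) or a bracketed label.

    Hypothetical helper for illustration only; not part of MedCAT.
    """
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if redact:
            # Assumption: one star per replaced character.
            replacement = "*" * (ent["end"] - ent["start"])
        else:
            replacement = f"[{ent['label']}]"
        text = text[:ent["start"]] + replacement + text[ent["end"]:]
    return text


text = "John Smith was seen on 01/02/2020."
entities = [
    {"start": 0, "end": 10, "label": "PATIENT"},
    {"start": 23, "end": 33, "label": "DATE"},
]
labelled = replace_spans(text, entities)
redacted = replace_spans(text, entities, redact=True)
```

Replacing spans from the end of the text backwards avoids having to recompute offsets after each substitution changes the string length.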
- deid_multi_texts(texts, redact=False, addl_info=['cui2icd10', 'cui2ontologies', 'cui2snomed'], n_process=None, batch_size=None)
Deidentify multiple texts, optionally using multiple processes.
- Parameters:
texts (Union[Iterable[str], Iterable[Tuple]]) – Text to be annotated
redact (bool) – Whether to redact the information.
addl_info (List[str], optional) – Additional info. Defaults to ['cui2icd10', 'cui2ontologies', 'cui2snomed'].
n_process (Optional[int], optional) – Number of processes. Defaults to None.
batch_size (Optional[int], optional) – The size of a batch. Defaults to None.
- Raises:
ValueError – In case of unsupported input.
- Returns:
List[str] – List of deidentified documents.
- Return type:
List[str]
- classmethod load_model_pack(model_pack_path, config=None)
Load DeId model from model pack.
The method first loads the CAT instance.
It then makes sure that the model pack corresponds to a valid DeId model.
- Parameters:
config (Optional[Dict]) – Config for DeId model pack (primarily for stride of overlap window)
model_pack_path (str) – The model pack path.
- Raises:
ValueError – If the model pack does not correspond to a DeId model.
- Returns:
DeIdModel – The resulting DeId model.
- Return type:
DeIdModel
- classmethod _is_deid_model(cat)
- Parameters:
cat (medcat.cat.CAT) –
- Return type:
bool
- classmethod _get_reason_not_deid(cat)
- Parameters:
cat (medcat.cat.CAT) –
- Return type:
str
- medcat.utils.ner.deid.match_rules(rules, texts, cui2preferred_name)
Match a set of rules - pat / cui combos as post processing labels.
Uses a cat DeID model for pretty name mapping.
- Parameters:
rules (List[Tuple[str, str]]) – List of tuples of pattern and cui
texts (List[str]) – List of texts to match rules on
cui2preferred_name (Dict[str, str]) – Dictionary of CUI to preferred name, likely to be cat.cdb.cui2preferred_name.
- Return type:
List[List[Dict]]
Examples
>>> cat = CAT.load_model_pack(model_pack_path)
...
>>> rules = [
...     ('(123) 456-7890', '134'),
...     ('1234567890', '134'),
...     ('123.456.7890', '134'),
...     ('1234567890', '134'),
...     ('1234567890', '134'),
... ]
>>> texts = [
...     'My phone number is (123) 456-7890',
...     'My phone number is 1234567890',
...     'My phone number is 123.456.7890',
...     'My phone number is 1234567890',
... ]
>>> matches = match_rules(rules, texts, cat.cdb.cui2preferred_name)
- Returns:
List[List[Dict]] – List of lists of predictions from match_rules
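To make the shape of the return value concrete, here is a minimal sketch of rule matching over a list of texts. It is not the MedCAT implementation: `match_rules_sketch` is a hypothetical name, treating each pattern as a regular expression is an assumption, and the output keys simply mirror the prediction dictionaries shown elsewhere on this page. Under the regular-expression assumption, literal parentheses in a pattern need escaping.

```python
import re
from typing import Dict, List, Tuple


def match_rules_sketch(
    rules: List[Tuple[str, str]],
    texts: List[str],
    cui2preferred_name: Dict[str, str],
) -> List[List[Dict]]:
    """Return one list of match dictionaries per input text.

    Hypothetical illustration only; not part of MedCAT.
    """
    results = []
    for text in texts:
        matches = []
        for pattern, cui in rules:
            # Assumption: each pattern is a regular expression.
            for m in re.finditer(pattern, text):
                matches.append({
                    "cui": cui,
                    "start": m.start(),
                    "end": m.end(),
                    "acc": 1.0,
                    # Fall back to the CUI itself when no name is known.
                    "pretty_name": cui2preferred_name.get(cui, cui),
                })
        results.append(matches)
    return results


rules = [(r"\(\d{3}\) \d{3}-\d{4}", "134"), (r"\b\d{10}\b", "134")]
texts = ["My phone number is (123) 456-7890", "My phone number is 1234567890"]
matches = match_rules_sketch(rules, texts, {"134": "Phone Number"})
```

The result has one inner list per text, in the same order as `texts`, which is the layout that merge_all_preds expects for its rule_matches_per_text argument.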
- medcat.utils.ner.deid.merge_all_preds(model_preds_by_text, rule_matches_per_text, accept_preds=True)
Convenience method to merge predictions from rule based and deID model predictions.
- Parameters:
model_preds_by_text (List[Dict]) – list of predictions from cat.get_entities(), converted with [list(m['entities'].values()) for m in model_preds]
rule_matches_per_text (List[Dict]) – list of predictions output by match_rules
accept_preds (bool) – If True, use the label predicted by the model (model_preds_by_text) over the rule matches where they overlap. Defaults to True, so model predictions take precedence over rules.
- Returns:
List[List[Dict]] – List of lists of predictions from merge_all_preds
- Return type:
List[List[Dict]]
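The conversion quoted in the model_preds_by_text description can be tried on plain dictionaries. This is a sketch under stated assumptions: the `model_preds` values below are hand-written stand-ins for the output of cat.get_entities(), and the assumption is that each result carries an 'entities' mapping from entity id to entity dictionary.

```python
# Hand-written stand-ins for cat.get_entities() output on two texts.
# Assumption: each result has an 'entities' mapping of id -> entity dict.
model_preds = [
    {"entities": {0: {"cui": "134", "start": 19, "end": 33, "acc": 1.0,
                      "pretty_name": "Phone Number"}}},
    {"entities": {}},
]

# The conversion quoted in the parameter description.
model_preds_by_text = [list(m["entities"].values()) for m in model_preds]

# Stand-in for the output of match_rules on the same two texts.
rule_matches_per_text = [
    [],
    [{"cui": "134", "start": 19, "end": 29, "acc": 1.0,
      "pretty_name": "Phone Number"}],
]

# Both lists are indexed by text, so they are expected to line up one-to-one.
pairs = list(zip(model_preds_by_text, rule_matches_per_text))
```

Each pair holds the model predictions and the rule matches for one text, which is the per-text input that merge_preds takes.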
- medcat.utils.ner.deid.merge_preds(model_preds, rule_matches, accept_preds=True)
Merge predictions from rule based and deID model predictions.
- Parameters:
model_preds (List[Dict]) – predictions from cat.get_entities() for a single text
rule_matches (List[Dict]) – predictions output by match_rules for a single text
accept_preds (bool) – If True, use the label predicted by the model (model_preds) over the rule matches where they overlap. Defaults to True, so model predictions take precedence over rules.
- Return type:
List[Dict]
Examples
>>> # a list of predictions from `cat.get_entities()` for one text
>>> model_preds = [
...     {'cui': '134', 'start': 10, 'end': 20, 'acc': 1.0, 'pretty_name': 'Phone Number'},
...     {'cui': '134', 'start': 25, 'end': 35, 'acc': 1.0, 'pretty_name': 'Phone Number'},
... ]
>>> # a list of predictions from `match_rules` for the same text
>>> rule_matches = [
...     {'cui': '134', 'start': 10, 'end': 20, 'acc': 1.0, 'pretty_name': 'Phone Number'},
...     {'cui': '134', 'start': 25, 'end': 35, 'acc': 1.0, 'pretty_name': 'Phone Number'},
... ]
>>> merged_preds = merge_preds(model_preds, rule_matches)
- Returns:
List[Dict] – List of predictions from merge_preds
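The effect of accept_preds on overlapping spans can be shown with a small stand-alone sketch. This is an illustration of overlap-based merging under assumptions, not MedCAT's code: `spans_overlap` and `merge_preds_sketch` are hypothetical names, spans are treated as half-open [start, end) ranges, and the CUIs and names in the sample data are made up.

```python
from typing import Dict, List


def spans_overlap(a: Dict, b: Dict) -> bool:
    """Return True when two half-open [start, end) spans share a character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_preds_sketch(
    model_preds: List[Dict],
    rule_matches: List[Dict],
    accept_preds: bool = True,
) -> List[Dict]:
    """Hypothetical illustration of overlap-based merging; not MedCAT code."""
    if accept_preds:
        preferred, other = model_preds, rule_matches
    else:
        preferred, other = rule_matches, model_preds
    # Keep the preferred list whole; add only non-overlapping spans from the other.
    kept = [o for o in other if not any(spans_overlap(p, o) for p in preferred)]
    return sorted(preferred + kept, key=lambda e: e["start"])


# Illustrative values only; the CUIs and names are made up.
model_preds = [
    {"cui": "100", "start": 0, "end": 10, "acc": 0.9, "pretty_name": "Patient"},
]
rule_matches = [
    {"cui": "134", "start": 5, "end": 15, "acc": 1.0, "pretty_name": "Phone Number"},
    {"cui": "134", "start": 25, "end": 35, "acc": 1.0, "pretty_name": "Phone Number"},
]

merged = merge_preds_sketch(model_preds, rule_matches)
merged_rules_first = merge_preds_sketch(model_preds, rule_matches, accept_preds=False)
```

With the default, the model span at 0 to 10 wins over the overlapping rule match at 5 to 15; with accept_preds=False the rule match wins and the model span is dropped. The non-overlapping rule match at 25 to 35 is kept in both cases.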