medcat.utils.ner.deid

De-identification model.

This describes a wrapper on the regular CAT model. The idea is to simplify the use of a DeId-specific model.

It tackles two use cases:
  1) Creation of a deid model
  2) Loading and use of a deid model

For use case 1:

Instead of:
    cat = CAT(cdb=ner.cdb, addl_ner=ner)

You can use:
    deid = DeIdModel.create(ner)

And for use case 2:

Instead of:
    cat = CAT.load_model_pack(model_pack_path)
    anon_text = deid_text(cat, text)

You can use:
    deid = DeIdModel.load_model_pack(model_pack_path)
    anon_text = deid.deid_text(text)

Or if/when structured output is desired:
    deid = DeIdModel.load_model_pack(model_pack_path)
    anon_doc = deid(text)  # the spaCy document

The wrapper also exposes some CAT parts directly:
  • config
  • cdb

Module Contents

Classes

DeIdModel

The DeID model.

Functions

match_rules(rules, texts, cui2preferred_name)

Match a set of rules - pat / cui combos as post processing labels.

merge_all_preds(model_preds_by_text, rule_matches_per_text[, accept_preds])

Convenience method to merge predictions from rule based and deID model predictions.

merge_preds(model_preds, rule_matches[, accept_preds])

Merge predictions from rule based and deID model predictions.

Attributes

logger

medcat.utils.ner.deid.logger

class medcat.utils.ner.deid.DeIdModel(cat)

Bases: medcat.utils.ner.model.NerModel

The DeID model.

This wraps a CAT instance and simplifies its use as a de-identification model.

It provides methods for creating one from a TransformersNER as well as loading from a model pack (along with some validation).

It also exposes some useful parts of the CAT it wraps such as the config and the concept database.

Parameters:

cat (medcat.cat.CAT) –

__init__(cat)
Parameters:

cat (medcat.cat.CAT) –

Return type:

None

train(json_path=None, *args, **kwargs)

Train the underlying transformers NER model.

All the extra arguments are passed to the TransformersNER train method.

Parameters:
  • json_path (Union[str, list, None]) – The JSON file path to read the training data from.

  • train_nr (int) – The number of the NER object in cat._addl_train to train. Defaults to 0.

  • *args – Additional arguments for TransformersNER.train.

  • **kwargs – Additional keyword arguments for TransformersNER.train.

Returns:

Tuple[Any, Any, Any] – df, examples, dataset

Return type:

Tuple[Any, Any, Any]

eval(json_path, *args, **kwargs)

Evaluate the underlying transformers NER model.

All the extra arguments are passed to the TransformersNER eval method.

Parameters:
  • json_path (Union[str, list, None]) – The JSON file path to read the evaluation data from.

  • train_nr (int) – The number of the NER object in cat._addl_train to evaluate. Defaults to 0.

  • *args – Additional arguments for TransformersNER.eval.

  • **kwargs – Additional keyword arguments for TransformersNER.eval.

Returns:

Tuple[Any, Any, Any] – df, examples, dataset

Return type:

Tuple[Any, Any, Any]

deid_text(text, redact=False)

Deidentify text and optionally redact information.

Returns the de-identified text. If redaction is enabled, identifiable entities are replaced with asterisks (e.g. *****). Otherwise, each entity is replaced with its CUI, in other words the type of information that was hidden (e.g. [PATIENT]).

Parameters:
  • text (str) – The text to deidentify.

  • redact (bool) – Whether to redact the information.

Returns:

str – The deidentified text.

Return type:

str
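
The two replacement modes described above can be sketched without MedCAT. This is a minimal illustration only: the function name replace_entities and the entity dicts with 'start', 'end' and 'cui' keys are assumptions made for the example, and the wrapped model may format its replacements differently (for instance, it may not preserve the length of the hidden span).

```python
from typing import Dict, List


def replace_entities(text: str, entities: List[Dict], redact: bool = False) -> str:
    """Replace each entity span with asterisks or a bracketed label.

    `entities` is a hypothetical list of dicts with 'start', 'end' and
    'cui' keys; it is an assumption for illustration only.
    """
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, end = ent["start"], ent["end"]
        if redact:
            replacement = "*" * (end - start)
        else:
            replacement = f"[{ent['cui']}]"
        text = text[:start] + replacement + text[end:]
    return text


text = "Patient John Smith was seen today."
entities = [{"start": 8, "end": 18, "cui": "PATIENT"}]
labelled = replace_entities(text, entities)
redacted = replace_entities(text, entities, redact=True)
```

Replacing spans from right to left means the character offsets of the remaining entities do not need to be recomputed after each substitution.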

deid_multi_texts(texts, redact=False, addl_info=['cui2icd10', 'cui2ontologies', 'cui2snomed'], n_process=None, batch_size=None)

Deidentify multiple texts, optionally using multiple processes.

Parameters:
  • texts (Union[Iterable[str], Iterable[Tuple]]) – Text to be annotated

  • redact (bool) – Whether to redact the information.

  • addl_info (List[str], optional) – Additional info. Defaults to ['cui2icd10', 'cui2ontologies', 'cui2snomed'].

  • n_process (Optional[int], optional) – Number of processes. Defaults to None.

  • batch_size (Optional[int], optional) – The size of a batch. Defaults to None.

Raises:

ValueError – In case of unsupported input.

Returns:

List[str] – List of deidentified documents.

Return type:

List[str]

classmethod load_model_pack(model_pack_path, config=None)

Load DeId model from model pack.

The method first loads the CAT instance.

It then makes sure that the model pack corresponds to a valid DeId model.

Parameters:
  • config (Optional[Dict]) – Config for DeId model pack (primarily for stride of overlap window)

  • model_pack_path (str) – The model pack path.

Raises:

ValueError – If the model pack does not correspond to a DeId model.

Returns:

DeIdModel – The resulting DeId model.

Return type:

DeIdModel

classmethod _is_deid_model(cat)
Parameters:

cat (medcat.cat.CAT) –

Return type:

bool

classmethod _get_reason_not_deid(cat)
Parameters:

cat (medcat.cat.CAT) –

Return type:

str

medcat.utils.ner.deid.match_rules(rules, texts, cui2preferred_name)

Match a set of rules - pat / cui combos as post processing labels.

Uses the CUI to preferred name mapping of a CAT DeId model to set the pretty name of each match.

Parameters:
  • rules (List[Tuple[str, str]]) – List of tuples of pattern and cui

  • texts (List[str]) – List of texts to match rules on

  • cui2preferred_name (Dict[str, str]) – Dictionary of CUI to preferred name, likely to be cat.cdb.cui2preferred_name.

Return type:

List[List[Dict]]

Examples

>>> from medcat.cat import CAT
>>> from medcat.utils.ner.deid import match_rules
>>> cat = CAT.load_model_pack(model_pack_path)
...
>>> rules = [
...     ('(123) 456-7890', '134'),
...     ('1234567890', '134'),
...     ('123.456.7890', '134'),
... ]
>>> texts = [
...     'My phone number is (123) 456-7890',
...     'My phone number is 1234567890',
...     'My phone number is 123.456.7890',
... ]
>>> matches = match_rules(rules, texts, cat.cdb.cui2preferred_name)
Returns:

List[List[Dict]] – List of lists of predictions from match_rules
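
As a rough, standalone illustration of pattern / CUI rule matching, the sketch below assumes that patterns are treated as regular expressions and that each match is reported with the same keys used in the merge_preds example ('cui', 'start', 'end', 'acc', 'pretty_name'). The name match_rules_sketch is hypothetical and this is not the library's implementation. If patterns are regular expressions, literal strings containing parentheses or dots need escaping, which is shown with re.escape.

```python
import re
from typing import Dict, List, Tuple


def match_rules_sketch(
    rules: List[Tuple[str, str]],
    texts: List[str],
    cui2preferred_name: Dict[str, str],
) -> List[List[Dict]]:
    """Return, for each text, one dict per rule match (illustrative only)."""
    matches_per_text = []
    for text in texts:
        matches = []
        for pattern, cui in rules:
            # Assumption: each pattern is a regular expression.
            for m in re.finditer(pattern, text):
                matches.append({
                    "cui": cui,
                    "start": m.start(),
                    "end": m.end(),
                    "acc": 1.0,
                    "pretty_name": cui2preferred_name[cui],
                })
        matches_per_text.append(matches)
    return matches_per_text


# re.escape turns the literal parentheses into a safe pattern.
rules = [(re.escape("(123) 456-7890"), "134"), (r"\d{10}", "134")]
texts = ["My phone number is (123) 456-7890", "My phone number is 1234567890"]
matches = match_rules_sketch(rules, texts, {"134": "Phone Number"})
```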

medcat.utils.ner.deid.merge_all_preds(model_preds_by_text, rule_matches_per_text, accept_preds=True)

Convenience method to merge predictions from rule based and deID model predictions.

Parameters:
  • model_preds_by_text (List[Dict]) – List of predictions from cat.get_entities(), converted with [list(m['entities'].values()) for m in model_preds]

  • rule_matches_per_text (List[Dict]) – List of predictions from the output of running match_rules

  • accept_preds (bool) – If True, use the model's predicted label (from model_preds_by_text) over the rule matches where they overlap. Defaults to True.

Returns:

List[List[Dict]] – List of lists of predictions from merge_all_preds

Return type:

List[List[Dict]]
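
The input shapes can be easier to follow with mock data. The sketch below assumes that cat.get_entities() returns, per text, a dict whose 'entities' value is keyed by entity id, as the conversion in the parameter description implies. The helper merge_one is a hypothetical stand-in for the per-text merge and simply concatenates and sorts; it does not resolve overlaps.

```python
from typing import Dict, List

# Hypothetical get_entities() output for two texts.
model_preds = [
    {"entities": {0: {"cui": "134", "start": 19, "end": 29}}},
    {"entities": {}},
]

# The conversion quoted in the parameter description: one list of
# entity dicts per text.
model_preds_by_text = [list(m["entities"].values()) for m in model_preds]

# One list of rule matches per text, in the same order as the texts.
rule_matches_per_text = [
    [],
    [{"cui": "134", "start": 0, "end": 10}],
]


def merge_one(model: List[Dict], rules: List[Dict]) -> List[Dict]:
    """Stand-in for the per-text merge: concatenate and sort by start."""
    return sorted(model + rules, key=lambda p: p["start"])


# Pair up the predictions text by text and merge each pair.
merged = [
    merge_one(model, rules)
    for model, rules in zip(model_preds_by_text, rule_matches_per_text)
]
```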

medcat.utils.ner.deid.merge_preds(model_preds, rule_matches, accept_preds=True)

Merge predictions from rule based and deID model predictions.

Parameters:
  • model_preds (List[Dict]) – predictions from cat.get_entities()

  • rule_matches (List[Dict]) – predictions from output of running match_rules on a text

  • accept_preds (bool) – If True, use the model's predicted label (from model_preds) over the rule matches where they overlap. Defaults to True.

Return type:

List[Dict]

Examples

>>> # predictions for one text, taken from `cat.get_entities()`
>>> model_preds = [
...     {'cui': '134', 'start': 10, 'end': 20, 'acc': 1.0,
...      'pretty_name': 'Phone Number'},
...     {'cui': '134', 'start': 25, 'end': 35, 'acc': 1.0,
...      'pretty_name': 'Phone Number'},
... ]
>>> # predictions for the same text, from `match_rules`
>>> rule_matches = [
...     {'cui': '134', 'start': 10, 'end': 20, 'acc': 1.0,
...      'pretty_name': 'Phone Number'},
...     {'cui': '134', 'start': 25, 'end': 35, 'acc': 1.0,
...      'pretty_name': 'Phone Number'},
... ]
>>> merged_preds = merge_preds(model_preds, rule_matches)
Returns:

List[Dict] – List of predictions from merge_preds
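
The effect of accept_preds on overlapping spans can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the names overlaps and merge_preds_sketch are hypothetical, offsets are treated as half-open (start inclusive, end exclusive), and a non-preferred span is dropped as soon as it overlaps any preferred span.

```python
from typing import Dict, List


def overlaps(a: Dict, b: Dict) -> bool:
    """True if two half-open spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_preds_sketch(
    model_preds: List[Dict],
    rule_matches: List[Dict],
    accept_preds: bool = True,
) -> List[Dict]:
    # Pick which source wins when spans overlap.
    if accept_preds:
        preferred, other = model_preds, rule_matches
    else:
        preferred, other = rule_matches, model_preds
    # Keep non-preferred spans only when they overlap no preferred span.
    kept = [o for o in other if not any(overlaps(o, p) for p in preferred)]
    return sorted(preferred + kept, key=lambda p: p["start"])


model_preds = [{"cui": "NAME", "start": 10, "end": 20}]
rule_matches = [
    {"cui": "134", "start": 15, "end": 25},  # overlaps the model prediction
    {"cui": "134", "start": 30, "end": 40},  # no overlap
]
merged_model_first = merge_preds_sketch(model_preds, rule_matches)
merged_rules_first = merge_preds_sketch(model_preds, rule_matches, accept_preds=False)
```

With the model preferred, the overlapping rule match is discarded; with the rules preferred, the overlapping model prediction is discarded instead.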
