medcat.utils.relation_extraction.rel_dataset

Module Contents

Classes

RelData

An abstract class representing a Dataset.

class medcat.utils.relation_extraction.rel_dataset.RelData(tokenizer, config, cdb=CDB())

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up loading of batched samples. This method accepts a list of sample indices for a batch and returns a list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
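The note above can be illustrated with a dependency-free sketch. KeyedDataset and KeySampler are hypothetical names that are not part of MedCAT or PyTorch, and the class deliberately does not subclass torch.utils.data.Dataset so the example runs without torch installed. It only shows the protocol: a map-style dataset with string keys, plus a sampler that yields those keys instead of integer positions.

```python
class KeyedDataset:
    """Map-style dataset keyed by strings (stand-in for a Dataset subclass)."""

    def __init__(self, samples):
        self._samples = dict(samples)

    def __getitem__(self, key):
        # Fetch one sample by its (non-integer) key.
        return self._samples[key]

    def __len__(self):
        return len(self._samples)

    def keys(self):
        return list(self._samples)


class KeySampler:
    """Yields the dataset's own keys rather than integer indices."""

    def __init__(self, dataset):
        self._keys = dataset.keys()

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)


dataset = KeyedDataset({"doc-a": [1, 2, 3], "doc-b": [4, 5]})
batch = [dataset[key] for key in KeySampler(dataset)]
```

With PyTorch available, an object shaped like KeySampler would typically be passed through the sampler argument of DataLoader; that step is omitted here as an untested assumption.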

Parameters:
name = 'rel_dataset'
log
__init__(tokenizer, config, cdb=CDB())
Use this class to create a dataset for relation annotations from CSV exports, MedCAT exports or spaCy documents (assuming the documents were generated by MedCAT; if they were not, set the required parameters manually to match MedCAT output, see /medcat/cat.py#_add_nested_ent).

If you are using this to create relations from CSV, it is assumed that your entities/concepts of interest are surrounded by the special tokens; see the create_base_relations_from_csv doc.

Parameters:
  • tokenizer (TokenizerWrapperBERT) – tokenizer used to generate token ids from input text

  • config (ConfigRelCAT) – same config used in RelCAT

  • cdb (CDB) – Optional, used to add concept ids and types to detected ents, useful when creating datasets from MedCAT output. Defaults to CDB().

generate_base_relations(docs)

Utility function; use it if you want to train from spaCy docs.

Parameters:

docs (Iterable[Doc]) – Generate relations from Spacy CAT docs.

Returns:

output_relations – List[Dict]: [ {
    "output_relations": relation_instances, <-- see create_base_relations_from_doc/csv for data columns
    "nclasses": self.config.model.padding_idx, <-- dummy class
    "labels2idx": {},
    "idx2label": {}
} ]

Return type:

List[Dict]

create_base_relations_from_csv(csv_path)

Assumes the columns are as follows: ["relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2", "label", "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id", "ent1_cui", "ent2_cui", "doc_id", "sents"]. The last column is the actual source text.

The entities inside the text MUST be annotated with special tokens, i.e.:

…some text..[s1] first entity [e1]…..[s2] second entity [e2]……..

You have to store the start position, aka index position of token [e1] and also of token [e2] in the (ent1_ent2_start) column.

Parameters:

csv_path (str) – path to the CSV file; it must be tab-separated and contain the specific columns listed above.

Returns:
  • Dict – {
      "output_relations": relation_instances, <-- see create_base_relations_from_doc/csv for data columns
      "nclasses": self.config.model.padding_idx, <-- dummy class
      "labels2idx": {},
      "idx2label": {}
    }
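As a rough illustration of the input that create_base_relations_from_csv describes, the sketch below writes one tab-separated row using the documented column order. Everything beyond the column names and the marker tokens is an assumption: the helpers marker_token_indices, build_row and write_tsv are hypothetical, the sample sentence and the "treats" label are invented, positions are computed on whitespace tokens (the real indices depend on the tokenizer in use), and storing the pair as the string form of a tuple is a guess at the expected cell format.

```python
import csv
import io

# Column order as listed in the method description above.
COLUMNS = [
    "relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2",
    "label", "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id",
    "ent1_cui", "ent2_cui", "doc_id", "sents",
]


def marker_token_indices(text, first="[e1]", second="[e2]"):
    """Whitespace-token indices of the two markers (illustrative only)."""
    tokens = text.split()
    return tokens.index(first), tokens.index(second)


def build_row(text, ent1, ent2, label, label_id, doc_id):
    """Fill the columns this sketch knows about; the rest stay empty."""
    row = dict.fromkeys(COLUMNS, "")
    row.update({
        "ent1_ent2_start": str(marker_token_indices(text)),  # assumed format
        "ent1": ent1,
        "ent2": ent2,
        "label": label,
        "label_id": str(label_id),
        "doc_id": doc_id,
        "sents": text,
    })
    return row


def write_tsv(rows, handle):
    """Write a header plus rows, tab-separated."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)


text = "Patient took [s1] aspirin [e1] for [s2] headache [e2] yesterday ."
buffer = io.StringIO()
write_tsv([build_row(text, "aspirin", "headache", "treats", 0, "doc-0")], buffer)
```

Writing to a real file instead of io.StringIO would only require opening it with newline="" before passing the handle to write_tsv.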

create_base_relations_from_doc(doc, doc_id, ent1_ent2_tokens_start_pos=(-1, -1))

Creates a list of tuples based on pairs of entities detected (relation, ent1, ent2) for one spacy document or text string.

Parameters:
  • doc (Union[Doc, str]) – spaCy Doc or string of text; each is handled slightly differently

  • doc_id (str) – document id

  • ent1_ent2_tokens_start_pos (Union[List, Tuple], optional) – start positions of the [s1] and [s2] tokens; if left at the default, the input is assumed to be a spaCy Doc. Defaults to (-1, -1).

Returns:
  • Dict – {
      "output_relations": relation_instances, <-- see create_base_relations_from_doc/csv for data columns
      "nclasses": self.config.model.padding_idx, <-- dummy class
      "labels2idx": {},
      "idx2label": {}
    }

Return type:

Dict

create_relations_from_export(data)
Parameters:

data (Dict) – MedCAT Export data.

Returns:
  • Dict – {
      "output_relations": relation_instances, <-- see create_base_relations_from_doc/csv for data columns
      "nclasses": self.config.model.padding_idx, <-- dummy class
      "labels2idx": {},
      "idx2label": {}
    }

classmethod get_labels(relation_labels, config)

Updates the labels in config with previously unseen classes/labels (if any are encountered during training).

Parameters:
  • relation_labels (List[str]) – new labels to add

  • config (ConfigRelCAT) – config

Returns:

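Tuple – not described in the source docstring; judging by the return type, presumably the number of classes together with the label-to-index and index-to-label mappings.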

Return type:

Tuple[int, Dict[str, Any], Dict[int, Any]]
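The sketch below shows one way such a label update could work. It is not the library's implementation: update_label_maps is a hypothetical helper, the plain dicts stand in for whatever the config stores, and the choice to assign the lowest unused integer id to each new label is an assumption.

```python
from typing import Any, Dict, List, Tuple


def update_label_maps(
    relation_labels: List[str],
    labels2idx: Dict[str, Any],
    idx2label: Dict[int, Any],
) -> Tuple[int, Dict[str, Any], Dict[int, Any]]:
    """Add unseen labels to both mappings and return the new class count."""
    next_id = 0
    # dict.fromkeys de-duplicates while keeping first-seen order.
    for label in dict.fromkeys(raw.strip() for raw in relation_labels):
        if label in labels2idx:
            continue  # already known, keep its existing id
        while next_id in idx2label:
            next_id += 1  # skip ids that are already taken
        labels2idx[label] = next_id
        idx2label[next_id] = label
    return len(labels2idx), labels2idx, idx2label


nclasses, labels2idx, idx2label = update_label_maps(
    ["causes", "treats", "causes ", "prevents"],
    {"treats": 0},
    {0: "treats"},
)
```

Existing labels keep their ids, so a model trained against the earlier mapping would not see its class indices shift when new labels appear.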

__len__()
Returns:

int – number of relation records

Return type:

int

__getitem__(idx)
Parameters:

idx (int) – index of item in the dataset dict

Returns:

Tuple[torch.LongTensor, torch.LongTensor, torch.LongTensor] – long tensors for the following columns: input_ids, ent1 and ent2 token start position indices, label_ids

Return type:

Tuple[torch.LongTensor, torch.LongTensor, torch.LongTensor]
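To make the __len__ and __getitem__ contract concrete, here is a minimal sketch that uses plain lists in place of torch.LongTensor so it runs without torch. RelationRecords and the record keys (input_ids, ent1_ent2_start, label_id) are hypothetical; the actual internal layout of RelData records is not described in this section.

```python
from typing import Dict, List, Tuple


class RelationRecords:
    """Stand-in dataset that returns one 3-part sample per index."""

    def __init__(self, records: List[Dict[str, object]]) -> None:
        self._records = records

    def __len__(self) -> int:
        # Number of relation records held by the dataset.
        return len(self._records)

    def __getitem__(self, idx: int) -> Tuple[List[int], List[int], List[int]]:
        record = self._records[idx]
        return (
            list(record["input_ids"]),        # token ids
            list(record["ent1_ent2_start"]),  # entity start positions
            [record["label_id"]],             # label id, wrapped in a list
        )


records = [
    {"input_ids": [101, 7, 8, 102], "ent1_ent2_start": (1, 2), "label_id": 0},
    {"input_ids": [101, 9, 102], "ent1_ent2_start": (1, 1), "label_id": 1},
]
dataset = RelationRecords(records)
sample = dataset[1]
```

In a torch-based version, each of the three lists would be converted to a LongTensor before being returned.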