medcat.utils.relation_extraction.rel_dataset
Module Contents
Classes
RelData: An abstract class representing a Dataset.
- class medcat.utils.relation_extraction.rel_dataset.RelData(tokenizer, config, cdb=CDB())
Bases: torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up loading of batched samples. This method accepts a list of indices of the samples in a batch and returns a list of samples.
Note: DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- Parameters:
tokenizer (medcat.utils.relation_extraction.tokenizer.TokenizerWrapperBERT) –
config (medcat.config_rel_cat.ConfigRelCAT) –
cdb (medcat.cdb.CDB) –
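The map-style protocol described above can be illustrated without torch. This is a minimal sketch, not the RelData implementation: the class name ToyRelations and all row values are hypothetical, and a real dataset would subclass torch.utils.data.Dataset instead of being a plain class.

```python
from typing import List, Sequence, Tuple

# One sample: (input_ids, (ent1_start, ent2_start), label_id). Values are made up.
Row = Tuple[List[int], Tuple[int, int], int]


class ToyRelations:
    """Hypothetical map-style container: samples are fetched by integer key."""

    def __init__(self, rows: Sequence[Row]) -> None:
        self._rows = list(rows)

    def __len__(self) -> int:
        # Size of the dataset, as samplers expect.
        return len(self._rows)

    def __getitem__(self, idx: int) -> Row:
        # Fetch a single sample for a given key.
        return self._rows[idx]

    def __getitems__(self, indices: List[int]) -> List[Row]:
        # Optional batched fetch: list of indices in, list of samples out.
        return [self._rows[i] for i in indices]


toy = ToyRelations([
    ([101, 7, 8, 102], (1, 2), 0),
    ([101, 9, 5, 102], (1, 2), 1),
])
```

Indexing `toy[1]` returns one sample, while `toy.__getitems__([1, 0])` returns the two samples in the requested order.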
- name = 'rel_dataset'
- log
- __init__(tokenizer, config, cdb=CDB())
- Use this class to create a dataset for relation annotations from CSV exports, MedCAT exports or spaCy documents. The spaCy documents are assumed to have been generated by MedCAT; if they were not, set the required parameters manually to match MedCAT output (see /medcat/cat.py#_add_nested_ent).
If you are creating relations from CSV, your entities/concepts of interest are assumed to be surrounded by the special tokens; see the create_base_relations_from_csv doc.
- Parameters:
tokenizer (TokenizerWrapperBERT) – tokenizer used to generate token ids from input text
config (ConfigRelCAT) – same config used in RelCAT
cdb (CDB) – Optional, used to add concept ids and types to detected ents, useful when creating datasets from MedCAT output. Defaults to CDB().
- generate_base_relations(docs)
Util function, should be used if you want to train from spacy docs
- Parameters:
docs (Iterable[Doc]) – Generate relations from Spacy CAT docs.
- Returns:
output_relations – List[Dict], each entry of the form:
{
“output_relations”: relation_instances, <– see create_base_relations_from_doc/csv for data columns
“nclasses”: self.config.model.padding_idx, <– dummy class
“labels2idx”: {},
“idx2label”: {}
}
- Return type:
List[Dict]
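Because the return value is a list of dicts that each carry their own “output_relations”, a caller may want a single flat list of relation instances. The following is a minimal sketch under stated assumptions: collect_relation_instances is a hypothetical helper, the row contents are placeholders, and -1 merely stands in for whatever self.config.model.padding_idx holds.

```python
from typing import Any, Dict, List


def collect_relation_instances(per_doc: List[Dict[str, Any]]) -> List[Any]:
    """Hypothetical helper: concatenate the "output_relations" of every entry."""
    instances: List[Any] = []
    for entry in per_doc:
        instances.extend(entry["output_relations"])
    return instances


# Placeholder data shaped like the documented return value.
per_doc = [
    {"output_relations": [["row-a"], ["row-b"]], "nclasses": -1,
     "labels2idx": {}, "idx2label": {}},
    {"output_relations": [["row-c"]], "nclasses": -1,
     "labels2idx": {}, "idx2label": {}},
]
all_instances = collect_relation_instances(per_doc)
```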
- create_base_relations_from_csv(csv_path)
Assumes the columns are as follows [“relation_token_span_ids”, “ent1_ent2_start”, “ent1”, “ent2”, “label”, “label_id”, “ent1_type”, “ent2_type”, “ent1_id”, “ent2_id”, “ent1_cui”, “ent2_cui”, “doc_id”, “sents”], last column is the actual source text.
- The entities inside the text MUST be annotated with special tokens i.e:
…some text..[s1] first entity [e1]…..[s2] second entity [e2]……..
You have to store the start position, aka index position of token [e1] and also of token [e2] in the (ent1_ent2_start) column.
- Parameters:
csv_path (str) – path to the CSV file; it must be tab-separated and contain the columns listed above
- Returns:
Dict – {
“output_relations”: relation_instances, <– see create_base_relations_from_doc/csv for data columns
“nclasses”: self.config.model.padding_idx, <– dummy class
“labels2idx”: {},
“idx2label”: {}
}
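Since the method expects a tab-separated file with a fixed set of columns and marked-up text, it can help to check a file before handing it over. This is a minimal pre-flight sketch using only the standard library: check_relation_csv is a hypothetical helper (not part of MedCAT), and it only verifies that the columns exist and that each row's "sents" text contains the four special tokens; it does not validate the ent1_ent2_start positions.

```python
import csv
import os
import tempfile
from typing import List

REQUIRED_COLUMNS = [
    "relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2", "label",
    "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id",
    "ent1_cui", "ent2_cui", "doc_id", "sents",
]
SPECIAL_TOKENS = ["[s1]", "[e1]", "[s2]", "[e2]"]


def check_relation_csv(csv_path: str) -> List[str]:
    """Hypothetical helper: return a list of problems; empty means none found."""
    problems: List[str] = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            return [f"missing columns: {missing}"]
        for row_number, row in enumerate(reader, start=1):
            text = row["sents"] or ""
            absent = [t for t in SPECIAL_TOKENS if t not in text]
            if absent:
                problems.append(f"row {row_number}: missing tokens {absent}")
    return problems


# Demo with made-up rows: the first is annotated, the second is not.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "relations.tsv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS, delimiter="\t")
        writer.writeheader()
        good = dict.fromkeys(REQUIRED_COLUMNS, "")
        good["sents"] = "pain in [s1] left knee [e1] after [s2] surgery [e2]"
        bad = dict(good, sents="pain in left knee after surgery")
        writer.writerows([good, bad])
    problems = check_relation_csv(path)
```

In the demo only the second data row is reported, because its text carries none of the special tokens.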
- create_base_relations_from_doc(doc, doc_id, ent1_ent2_tokens_start_pos=(-1, -1))
Creates a list of tuples based on pairs of entities detected (relation, ent1, ent2) for one spacy document or text string.
- Parameters:
doc (Union[Doc, str]) – SpacyDoc or string of text, each will get handled slightly differently
doc_id (str) – document id
ent1_ent2_tokens_start_pos (Union[List, Tuple], optional) – start positions of the [s1] and [s2] tokens; if left at the default, the input is assumed to be a spaCy Doc. Defaults to (-1, -1).
- Returns:
Dict – {
“output_relations”: relation_instances, <– see create_base_relations_from_doc/csv for data columns
“nclasses”: self.config.model.padding_idx, <– dummy class
“labels2idx”: {},
“idx2label”: {}
}
- Return type:
Dict
- create_relations_from_export(data)
- Parameters:
data (Dict) – MedCAT Export data.
- Returns:
Dict – {
“output_relations”: relation_instances, <– see create_base_relations_from_doc/csv for data columns
“nclasses”: self.config.model.padding_idx, <– dummy class
“labels2idx”: {},
“idx2label”: {}
}
- classmethod get_labels(relation_labels, config)
Updates the labels in config with previously unencountered classes/labels (if any are encountered during training).
- Parameters:
relation_labels (List[str]) – new labels to add
config (ConfigRelCAT) – config
- Returns:
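Tuple[int, Dict[str, Any], Dict[int, Any]] – judging by the return type, presumably the number of classes together with the label-to-index and index-to-label mappings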
- Return type:
Tuple[int, Dict[str, Any], Dict[int, Any]]
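The idea of extending a label mapping with unseen labels can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: add_unseen_labels is a hypothetical function that works on two plain dicts instead of a ConfigRelCAT, and assigning each new label the lowest free integer id is an assumed policy.

```python
from typing import Any, Dict, List, Tuple


def add_unseen_labels(
    relation_labels: List[str],
    labels2idx: Dict[str, Any],
    idx2label: Dict[int, Any],
) -> Tuple[int, Dict[str, Any], Dict[int, Any]]:
    """Hypothetical sketch: give each not-yet-known label the lowest free id."""
    next_id = 0
    for label in relation_labels:
        label = label.strip()
        if label in labels2idx:
            continue  # already known, keep its existing id
        while next_id in idx2label:
            next_id += 1  # skip ids that are already taken
        labels2idx[label] = next_id
        idx2label[next_id] = label
    return len(labels2idx), labels2idx, idx2label


# "treats" is already known; "causes" is new and appears twice.
nclasses, labels2idx, idx2label = add_unseen_labels(
    ["causes", "treats", "causes"], {"treats": 0}, {0: "treats"}
)
```

Existing labels keep their ids, so mappings built during earlier training runs stay valid while new labels are appended.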
- __len__()
- Returns:
int – number of relation records
- Return type:
int
- __getitem__(idx)
- Parameters:
idx (int) – index of item in the dataset dict
- Returns:
Tuple[torch.LongTensor, torch.LongTensor, torch.LongTensor] – long tensors for the following columns: input_ids, ent1 and ent2 token start position indices, label_ids
- Return type:
Tuple[torch.LongTensor, torch.LongTensor, torch.LongTensor]