medcat.utils.relation_extraction.rel_dataset
Module Contents
Classes
RelData – An abstract class representing a Dataset.
- class medcat.utils.relation_extraction.rel_dataset.RelData(tokenizer, config, cdb=CDB())
Bases: torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up loading of batched samples. This method accepts a list of indices of the samples in a batch and returns a list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- Parameters:
tokenizer (medcat.utils.relation_extraction.tokenizer.BaseTokenizerWrapper_RelationExtraction) –
config (medcat.config_rel_cat.ConfigRelCAT) –
cdb (medcat.cdb.CDB) –
- name = 'rel_dataset'
- log
- __init__(tokenizer, config, cdb=CDB())
- Use this class to create a dataset for relation annotations from CSV exports,
MedCAT exports or spaCy documents. The spaCy documents are assumed to have been generated by MedCAT; if they were not, set the required parameters manually to match MedCAT output (see /medcat/cat.py#_add_nested_ent).
If you use this class to create relations from CSV, your entities/concepts of interest are assumed to be surrounded by the special tokens; see the create_base_relations_from_csv doc.
- Parameters:
tokenizer (BaseTokenizerWrapper_RelationExtraction) – tokenizer used to generate token ids from input text
config (ConfigRelCAT) – same config used in RelCAT
cdb (CDB) – Optional, used to add concept ids and types to detected ents, useful when creating datasets from MedCAT output. Defaults to CDB().
- generate_base_relations(docs)
Utility function; use it if you want to train from spaCy docs.
- Parameters:
docs (Iterable[Doc]) – spaCy docs produced by CAT, from which the relations are generated.
- Returns:
output_relations – List[Dict]: [ {
“output_relations”: relation_instances, <– see create_base_relations_from_doc/csv for data columns
“nclasses”: self.config.model.padding_idx, <– dummy class
“labels2idx”: {},
“idx2label”: {}
} ]
- Return type:
List[Dict]
- create_base_relations_from_csv(csv_path, keep_source_text=False)
Assumes the columns are as follows: [“relation_token_span_ids”, “ent1_ent2_start”, “ent1”, “ent2”, “label”, “label_id”, “ent1_type”, “ent2_type”, “ent1_id”, “ent2_id”, “ent1_cui”, “ent2_cui”, “doc_id”, “sents”]. The last column is the actual source text.
- The entities inside the text MUST be annotated with special tokens, i.e.:
…some text..[s1] first entity [e1]…..[s2] second entity [e2]……..
You have to store the start position, i.e. the index position, of token [e1] and also of token [e2] in the ent1_ent2_start column.
- Parameters:
csv_path (str) – path to the CSV file; it must be tab-separated and have the specific columns listed above.
keep_source_text (bool) – whether the text column should be retained in the ‘sents’ df column; used for debugging or creating custom datasets.
- Returns:
Dict – {
“output_relations”: relation_instances, <– see create_base_relations_from_doc/csv for data columns
“nclasses”: self.config.model.padding_idx, <– dummy class
“labels2idx”: {},
“idx2label”: {}
}
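The marker convention and column layout described for create_base_relations_from_csv can be illustrated with a small standard-library sketch. It is not MedCAT code: `mark_entities` and `marker_positions` are hypothetical helpers, the positions are counted on whitespace-split tokens (real positions should come from the tokenizer you pass to RelData), and the label value and the way cells are encoded as strings are made-up assumptions.

```python
import csv
import io

# Column order as listed in the description above.
COLUMNS = [
    "relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2", "label",
    "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id",
    "ent1_cui", "ent2_cui", "doc_id", "sents",
]


def mark_entities(text, ent1_span, ent2_span):
    """Hypothetical helper: wrap two (start, end) character spans in markers."""
    (s1, e1), (s2, e2) = ent1_span, ent2_span
    if not (s1 < e1 <= s2 < e2):
        raise ValueError("spans must be non-empty, ordered and non-overlapping")
    return (
        text[:s1] + "[s1] " + text[s1:e1] + " [e1]"
        + text[e1:s2] + "[s2] " + text[s2:e2] + " [e2]"
        + text[e2:]
    )


def marker_positions(marked_text, markers=("[e1]", "[e2]")):
    """Hypothetical helper: index of each marker among whitespace tokens.

    Assumption: whitespace tokenization stands in for the real tokenizer.
    """
    tokens = marked_text.split()
    return tuple(tokens.index(marker) for marker in markers)


text = "Patient given aspirin for headache today."
ent1 = (text.index("aspirin"), text.index("aspirin") + len("aspirin"))
ent2 = (text.index("headache"), text.index("headache") + len("headache"))

marked = mark_entities(text, ent1, ent2)
positions = marker_positions(marked)

# Fill only the cells this sketch can justify; the rest stay empty.
row = {name: "" for name in COLUMNS}
row.update({
    "ent1_ent2_start": str(positions),  # assumed cell encoding
    "ent1": "aspirin",
    "ent2": "headache",
    "label": "treats",                  # made-up label
    "doc_id": "doc-0",
    "sents": marked,
})

# Write a tab-separated table with a header line into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t")
writer.writeheader()
writer.writerow(row)
tsv_text = buffer.getvalue()
```

Writing `tsv_text` to disk would give a file with the expected header; whether the remaining columns may be left empty depends on the MedCAT version and should be checked against the source.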
- _create_relation_validation(text, doc_id, tokenized_text_data, ent1_start_char_pos, ent2_start_char_pos, ent1_end_char_pos, ent2_end_char_pos, ent1_token_start_pos=-1, ent2_token_start_pos=-1, ent1_token_end_pos=-1, ent2_token_end_pos=-1, is_spacy_doc=False, is_mct_export=False)
This function checks whether the relation is actually valid according to distance criteria, TUIs and so on. It has different handling cases for text, spaCy docs and MCT exports.
- Parameters:
text (str) – doc text
doc_id (str) – doc id
tokenized_text_data (Dict[str, Any]) – tokenized text
ent1_start_char_pos (int) – ent1 start char pos
ent2_start_char_pos (int) – ent2 start char pos
ent1_end_char_pos (int) – ent1 end char pos
ent2_end_char_pos (int) – ent2 end char pos
ent1_token_start_pos (int) – ent1 token start pos. Defaults to -1.
ent2_token_start_pos (int) – ent2 token start pos. Defaults to -1.
ent1_token_end_pos (int) – ent1 token end pos. Defaults to -1.
ent2_token_end_pos (int) – ent2 token end pos. Defaults to -1.
is_spacy_doc (bool) – whether the doc is a spaCy doc. Defaults to False.
is_mct_export (bool) – whether the doc is an MCT export. Defaults to False.
- Returns:
List – row containing rel data [“relation_token_span_ids”, “ent1_ent2_start”, “ent1”, “ent2”, “label”,
“label_id”, “ent1_type”, “ent2_type”, “ent1_id”, “ent2_id”, “ent1_cui”, “ent2_cui”, “doc_id”, “sents”]
- Return type:
List
- create_base_relations_from_doc(doc, doc_id, ent1_ent2_tokens_start_pos=(-1, -1))
Creates a list of tuples based on pairs of entities detected (relation, ent1, ent2) for one spacy document or text string.
- Parameters:
doc (Union[Doc, str]) – spaCy Doc or string of text; each is handled slightly differently
doc_id (str) – document id
ent1_ent2_tokens_start_pos (Union[List, Tuple], optional) – start of [s1][s2] tokens, if left default we assume we are dealing with a SpacyDoc. Defaults to (-1, -1).
- Returns:
Dict – {
“output_relations”: relation_instances, <– see create_base_relations_from_doc/csv for data columns
“nclasses”: self.config.model.padding_idx, <– dummy class
“labels2idx”: {},
“idx2label”: {}
}
- Return type:
Dict
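To make the returned structure easier to inspect, the following sketch pairs each row of “output_relations” with the column names given for the relation rows. It assumes every row is a 14-item sequence in that order; `rows_to_records` is a hypothetical helper and `example_result` is made-up stand-in data, not real MedCAT output.

```python
# Column order as listed for the relation rows in this module's docs.
COLUMNS = [
    "relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2", "label",
    "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id",
    "ent1_cui", "ent2_cui", "doc_id", "sents",
]


def rows_to_records(result):
    """Hypothetical helper: turn positional rows into name -> value dicts."""
    records = []
    for row in result["output_relations"]:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
        records.append(dict(zip(COLUMNS, row)))
    return records


# Made-up stand-in for a returned dict; all values are illustrative only.
example_result = {
    "output_relations": [
        [[101, 7, 8, 102], (1, 2), "aspirin", "headache", "treats", 0,
         "drug", "symptom", 0, 1, "C001", "C002", "doc-0", ""],
    ],
    "nclasses": 0,
    "labels2idx": {},
    "idx2label": {},
}

records = rows_to_records(example_result)
```

Each record can then be read by name, for example `records[0]["ent1"]`, which is less error-prone than indexing rows by position.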
- create_relations_from_export(data)
- Parameters:
data (Dict) – MedCAT Export data.
- Returns:
Dict – {
“output_relations”: relation_instances, <– see create_base_relations_from_doc/csv for data columns
“nclasses”: self.config.model.padding_idx, <– dummy class
“labels2idx”: {},
“idx2label”: {}
}
- classmethod get_labels(relation_labels, config)
Updates the labels in the config with previously unencountered classes/labels (if any are encountered during training).
- Parameters:
relation_labels (List[str]) – new labels to add
config (ConfigRelCAT) – config
- Returns:
Tuple[int, Dict[str, int], Dict[int, str]] – label count, labels2idx mapping, idx2label mapping
- Return type:
Tuple[int, Dict[str, int], Dict[int, str]]
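The kind of bookkeeping this return value implies can be sketched in plain Python. This is not the MedCAT implementation: `extend_label_maps` is a hypothetical function, it takes the two mappings directly instead of a ConfigRelCAT, and it assumes existing indices are contiguous from 0.

```python
def extend_label_maps(new_labels, labels2idx, idx2label):
    """Hypothetical sketch: add unseen labels and return the updated maps.

    Assumption: existing indices run 0..n-1, so the next free index is
    simply the current number of labels.
    """
    labels2idx = dict(labels2idx)  # copy so the caller's maps are untouched
    idx2label = dict(idx2label)
    for label in new_labels:
        if label not in labels2idx:
            idx = len(labels2idx)
            labels2idx[label] = idx
            idx2label[idx] = label
    return len(labels2idx), labels2idx, idx2label


nclasses, labels2idx, idx2label = extend_label_maps(
    ["treats", "causes", "treats"],  # made-up labels; duplicates are ignored
    {"treats": 0},
    {0: "treats"},
)
```

Labels that are already known keep their index, so previously trained class ids stay stable when new labels appear.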
- __len__()
- Returns:
int – number of relation records
- Return type:
int
- __getitem__(idx)
- Parameters:
idx (int) – index of item in the dataset dict
- Returns:
Tuple[torch.LongTensor, torch.LongTensor, torch.LongTensor] – long tensors for the following columns: input_ids, ent1 and ent2 token start position indices, label_ids
- Return type:
Tuple[torch.LongTensor, torch.LongTensor, torch.LongTensor]
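The map-style protocol behind __len__() and __getitem__() can be shown with a dependency-free stand-in. `ToyRelationDataset` is hypothetical and returns plain Python values; the real class returns torch.LongTensor objects, and the record keys used here are assumptions chosen to mirror the three returned columns.

```python
class ToyRelationDataset:
    """Hypothetical stand-in for a map-style dataset (no torch required)."""

    def __init__(self, records):
        # Each record is assumed to hold the three fields returned per item.
        self._records = list(records)

    def __len__(self):
        # Number of relation records.
        return len(self._records)

    def __getitem__(self, idx):
        # Return (input_ids, entity start positions, label id) for one record.
        record = self._records[idx]
        return record["input_ids"], record["ent1_ent2_start"], record["label_id"]


# Made-up records; the token ids and positions are illustrative only.
dataset = ToyRelationDataset([
    {"input_ids": [101, 5, 6, 102], "ent1_ent2_start": [1, 2], "label_id": [0]},
    {"input_ids": [101, 8, 9, 102], "ent1_ent2_start": [1, 2], "label_id": [1]},
])

items = [dataset[i] for i in range(len(dataset))]
```

Because indexing uses integers from 0 to len(dataset) - 1, a dataset with this shape works with an index sampler that yields integral indices.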