:py:mod:`medcat.utils.relation_extraction.rel_dataset`
======================================================

.. py:module:: medcat.utils.relation_extraction.rel_dataset


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.relation_extraction.rel_dataset.RelData


.. py:class:: RelData(tokenizer, config, cdb = CDB())

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass
   it. All subclasses should override :meth:`__getitem__`, supporting fetching a
   data sample for a given key. Subclasses could also optionally override
   :meth:`__len__`, which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options
   of :class:`~torch.utils.data.DataLoader`. Subclasses could also
   optionally implement :meth:`__getitems__` to speed up batched sample loading.
   This method accepts a list of sample indices for a batch and returns a list of samples.

   .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index
      sampler that yields integral indices. To make it work with a map-style
      dataset with non-integral indices/keys, a custom sampler must be provided.

   .. py:attribute:: name
      :value: 'rel_dataset'

   .. py:attribute:: log

   .. py:method:: __init__(tokenizer, config, cdb = CDB())

      Use this class to create a dataset for relation annotations from CSV exports,
      MedCAT exports or spaCy documents. The documents are assumed to have been
      generated by MedCAT; if they were not, set the required parameters manually
      to match MedCAT output (see /medcat/cat.py#_add_nested_ent).

      If you are using this to create relations from CSV, it is assumed that your
      entities/concepts of interest are surrounded by the special tokens; see the
      create_base_relations_from_csv documentation.

      :param tokenizer: tokenizer used to generate token ids from input text
      :type tokenizer: BaseTokenizerWrapper_RelationExtraction
      :param config: same config used in RelCAT
      :type config: ConfigRelCAT
      :param cdb: Optional, used to add concept ids and types to detected ents, useful when creating datasets from MedCAT output. Defaults to CDB().
      :type cdb: CDB

   .. py:method:: generate_base_relations(docs)

      Utility function; use it if you want to train from spaCy docs.

      :param docs: spaCy CAT docs from which relations are generated.
      :type docs: Iterable[Doc]

      :Returns: **List[Dict]** -- ``output_relations``, a list of dicts with the keys:

         * ``"output_relations"``: ``relation_instances`` (see create_base_relations_from_doc/csv for data columns)
         * ``"nclasses"``: ``self.config.model.padding_idx`` (dummy class)
         * ``"labels2idx"``: ``{}``
         * ``"idx2label"``: ``{}``

   .. py:method:: create_base_relations_from_csv(csv_path, keep_source_text = False)

      Assumes the columns are as follows:
      ``["relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2", "label", "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id", "ent1_cui", "ent2_cui", "doc_id", "sents"]``.
      The last column is the actual source text.

      The entities inside the text MUST be annotated with special tokens, i.e.
      ``...some text..[s1] first entity [e1].....[s2] second entity [e2]........``
      You have to store the start position, aka index position, of token [e1] and
      also of token [e2] in the (ent1_ent2_start) column.

      :param csv_path: path to the CSV file; it must be tab-separated and contain the columns listed above.
      :type csv_path: str
      :param keep_source_text: whether the text column should be retained in the 'sents' DataFrame column; useful for debugging or creating custom datasets.
      :type keep_source_text: bool

      :Returns: **Dict** -- a dict with the keys:

         * ``"output_relations"``: ``relation_instances`` (see create_base_relations_from_doc/csv for data columns)
         * ``"nclasses"``: ``self.config.model.padding_idx`` (dummy class)
         * ``"labels2idx"``: ``{}``
         * ``"idx2label"``: ``{}``

   .. py:method:: _create_relation_validation(text, doc_id, tokenized_text_data, ent1_start_char_pos, ent2_start_char_pos, ent1_end_char_pos, ent2_end_char_pos, ent1_token_start_pos = -1, ent2_token_start_pos = -1, ent1_token_end_pos = -1, ent2_token_end_pos = -1, is_spacy_doc = False, is_mct_export = False)

      Checks whether the relation is actually valid by distance criteria, TUIs and so on.
      Text, spaCy docs and MCT exports are each handled differently.

      :param text: doc text
      :type text: str
      :param doc_id: doc id
      :type doc_id: str
      :param tokenized_text_data: tokenized text
      :type tokenized_text_data: Dict[str, Any]
      :param ent1_start_char_pos: ent1 start char pos
      :type ent1_start_char_pos: int
      :param ent2_start_char_pos: ent2 start char pos
      :type ent2_start_char_pos: int
      :param ent1_end_char_pos: ent1 end char pos
      :type ent1_end_char_pos: int
      :param ent2_end_char_pos: ent2 end char pos
      :type ent2_end_char_pos: int
      :param ent1_token_start_pos: ent1 token start pos. Defaults to -1.
      :type ent1_token_start_pos: int
      :param ent2_token_start_pos: ent2 token start pos. Defaults to -1.
      :type ent2_token_start_pos: int
      :param ent1_token_end_pos: ent1 token end pos. Defaults to -1.
      :type ent1_token_end_pos: int
      :param ent2_token_end_pos: ent2 token end pos. Defaults to -1.
      :type ent2_token_end_pos: int
      :param is_spacy_doc: whether the doc is a spaCy doc. Defaults to False.
      :type is_spacy_doc: bool
      :param is_mct_export: whether the doc is an MCT export. Defaults to False.
      :type is_mct_export: bool

      :Returns: **List** -- row containing the relation data, with the columns
         ``["relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2", "label", "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id", "ent1_cui", "ent2_cui", "doc_id", "sents"]``

   .. py:method:: create_base_relations_from_doc(doc, doc_id, ent1_ent2_tokens_start_pos = (-1, -1))

      Creates a list of tuples (relation, ent1, ent2) based on the pairs of entities
      detected in one spaCy document or text string.

      :param doc: SpacyDoc or string of text; each is handled slightly differently
      :type doc: Union[Doc, str]
      :param doc_id: document id
      :type doc_id: str
      :param ent1_ent2_tokens_start_pos: start of [s1][s2] tokens; if left at the default, we assume we are dealing with a SpacyDoc. Defaults to (-1, -1).
      :type ent1_ent2_tokens_start_pos: Union[List, Tuple], optional

      :Returns: **Dict** -- a dict with the keys:

         * ``"output_relations"``: ``relation_instances`` (see create_base_relations_from_doc/csv for data columns)
         * ``"nclasses"``: ``self.config.model.padding_idx`` (dummy class)
         * ``"labels2idx"``: ``{}``
         * ``"idx2label"``: ``{}``

   .. py:method:: create_relations_from_export(data)

      :param data: MedCAT Export data.
      :type data: Dict

      :Returns: **Dict** -- a dict with the keys:

         * ``"output_relations"``: ``relation_instances`` (see create_base_relations_from_doc/csv for data columns)
         * ``"nclasses"``: ``self.config.model.padding_idx`` (dummy class)
         * ``"labels2idx"``: ``{}``
         * ``"idx2label"``: ``{}``

   .. py:method:: get_labels(relation_labels, config)
      :classmethod:

      Updates the labels in the config with previously unencountered classes/labels
      (if any are encountered during training).

      :param relation_labels: new labels to add
      :type relation_labels: List[str]
      :param config: config
      :type config: ConfigRelCAT

      :Returns: **Tuple[int, Dict[str, int], Dict[int, str]]** -- label count, labels2idx mapping, idx2label mapping

   .. py:method:: __len__()

      :Returns: **int** -- number of relation records

   .. py:method:: __getitem__(idx)

      :param idx: index of item in the dataset dict
      :type idx: int

      :Returns: **Tuple[torch.LongTensor, torch.LongTensor, torch.LongTensor]** -- long tensors of the following columns: input_ids, ent1 and ent2 token start position indices, label_ids
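
The label bookkeeping described for ``get_labels`` (adding unencountered labels and returning a label count plus two mappings) can be illustrated with a minimal sketch. This is not MedCAT's implementation: ``update_label_maps`` is a hypothetical helper, it takes the two mappings directly instead of a ``ConfigRelCAT``, and it assumes that existing ids are contiguous and start at 0.

```python
from typing import Dict, List, Tuple


def update_label_maps(
    relation_labels: List[str],
    labels2idx: Dict[str, int],
    idx2label: Dict[int, str],
) -> Tuple[int, Dict[str, int], Dict[int, str]]:
    """Hypothetical helper: add unseen labels to existing mappings.

    Not MedCAT's code; it only mirrors the documented return shape
    (label count, labels2idx mapping, idx2label mapping).
    """
    # Copy so the caller's mappings are left untouched.
    labels2idx = dict(labels2idx)
    idx2label = dict(idx2label)
    for label in relation_labels:
        if label not in labels2idx:
            # Assumption: ids are contiguous from 0, so the next free id
            # equals the current number of known labels.
            new_id = len(labels2idx)
            labels2idx[label] = new_id
            idx2label[new_id] = label
    return len(labels2idx), labels2idx, idx2label


# "treats" is already known, "causes" is new, the repeated "treats" is ignored.
nclasses, labels2idx, idx2label = update_label_maps(
    ["treats", "causes", "treats"], {"treats": 0}, {0: "treats"}
)
print(nclasses, labels2idx, idx2label)
```

Keeping both directions of the mapping in sync makes it cheap to encode labels for training and to decode predicted ids back to label names.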