:py:mod:`medcat.utils.relation_extraction.rel_dataset`
======================================================

.. py:module:: medcat.utils.relation_extraction.rel_dataset


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.relation_extraction.rel_dataset.RelData


.. py:class:: RelData(tokenizer, config, cdb = CDB())

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass
   it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
   data sample for a given key. Subclasses could also optionally overwrite
   :meth:`__len__`, which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options
   of :class:`~torch.utils.data.DataLoader`. Subclasses could also
   optionally implement :meth:`__getitems__` to speed up loading of batched
   samples. This method accepts a list of sample indices for a batch and returns
   a list of samples.

   .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index
      sampler that yields integral indices. To make it work with a map-style
      dataset with non-integral indices/keys, a custom sampler must be provided.

   .. py:attribute:: name
      :value: 'rel_dataset'

   .. py:attribute:: log

   .. py:method:: __init__(tokenizer, config, cdb = CDB())

      Use this class to create a dataset for relation annotations from CSV exports,
      MedCAT exports or spaCy documents. The spaCy documents are assumed to have
      been generated by MedCAT; if they were not, set the required parameters
      manually to match MedCAT output (see /medcat/cat.py#_add_nested_ent).

      If you are using this to create relations from CSV, it is assumed that your
      entities/concepts of interest are surrounded by the special tokens; see the
      ``create_base_relations_from_csv`` documentation.


      :param tokenizer: tokenizer used to generate token ids from input text
      :type tokenizer: TokenizerWrapperBERT
      :param config: same config used in RelCAT
      :type config: ConfigRelCAT
      :param cdb: Optional, used to add concept ids and types to detected ents,
                  useful when creating datasets from MedCAT output. Defaults to CDB().
      :type cdb: CDB

   .. py:method:: generate_base_relations(docs)

      Utility function; use it if you want to train from spaCy docs.

      :param docs: spaCy docs (as produced by CAT) from which to generate relations.
      :type docs: Iterable[Doc]

      :Returns: **output_relations** -- ``List[Dict]``; each entry has the
                following form::

                    {
                        "output_relations": relation_instances,  # see create_base_relations_from_doc/csv for data columns
                        "nclasses": self.config.model.padding_idx,  # dummy class
                        "labels2idx": {},
                        "idx2label": {}
                    }

   .. py:method:: create_base_relations_from_csv(csv_path)

      Assumes the columns are as follows:
      ["relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2", "label",
      "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id", "ent1_cui",
      "ent2_cui", "doc_id", "sents"]. The last column is the actual source text.

      The entities inside the text MUST be annotated with special tokens, i.e.::

          ...some text..[s1] first entity [e1].....[s2] second entity [e2]........

      You have to store the start position, aka index position, of token [e1] and
      also of token [e2] in the (ent1_ent2_start) column.

      :param csv_path: path to the CSV file; it must be tab separated and have the
                       specific columns listed above.
      :type csv_path: str

      :Returns: **Dict** -- a dictionary of the
                following form::

                    {
                        "output_relations": relation_instances,  # see create_base_relations_from_doc/csv for data columns
                        "nclasses": self.config.model.padding_idx,  # dummy class
                        "labels2idx": {},
                        "idx2label": {}
                    }

   .. py:method:: create_base_relations_from_doc(doc, doc_id, ent1_ent2_tokens_start_pos = (-1, -1))

      Creates a list of tuples based on pairs of entities detected
      (relation, ent1, ent2) for one spaCy document or text string.


      :param doc: spaCy Doc or string of text; the two are handled slightly differently
      :type doc: Union[Doc, str]
      :param doc_id: document id
      :type doc_id: str
      :param ent1_ent2_tokens_start_pos: start positions of the [s1] and [s2] tokens; if
                                         left at the default, the input is assumed to be a
                                         spaCy Doc. Defaults to (-1, -1).
      :type ent1_ent2_tokens_start_pos: Union[List, Tuple], optional

      :Returns: **Dict** -- a dictionary of the
                following form::

                    {
                        "output_relations": relation_instances,  # see create_base_relations_from_doc/csv for data columns
                        "nclasses": self.config.model.padding_idx,  # dummy class
                        "labels2idx": {},
                        "idx2label": {}
                    }

   .. py:method:: create_relations_from_export(data)

      :param data: MedCAT export data.
      :type data: Dict

      :Returns: **Dict** -- a dictionary of the
                following form::

                    {
                        "output_relations": relation_instances,  # see create_base_relations_from_doc/csv for data columns
                        "nclasses": self.config.model.padding_idx,  # dummy class
                        "labels2idx": {},
                        "idx2label": {}
                    }

   .. py:method:: get_labels(relation_labels, config)
      :classmethod:

      Updates the labels in ``config`` with any previously unseen classes/labels
      (if any are encountered during training).

      :param relation_labels: new labels to add
      :type relation_labels: List[str]
      :param config: config
      :type config: ConfigRelCAT

      :Returns: **Any** -- _description_

   .. py:method:: __len__()

      :Returns: **int** -- number of relation records

   .. py:method:: __getitem__(idx)

      :param idx: index of the item in the dataset dict
      :type idx: int

      :Returns: **Tuple[torch.LongTensor, torch.LongTensor, torch.LongTensor]** -- long
                tensors for the following columns: input_ids, ent1 and ent2 token
                start position indices, label_ids
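
The CSV layout expected by ``create_base_relations_from_csv`` can be checked before loading. The sketch below is only an illustration built on the standard library: ``check_relation_tsv`` is a hypothetical helper that is not part of MedCAT, and it assumes the file starts with a header row naming the documented columns. It verifies the header and that every ``sents`` cell contains both special-token pairs, with each opening token before its closing token. It does not validate the positional columns, since those depend on the tokenizer in use.

```python
import csv
import io

# Column order as listed in the create_base_relations_from_csv docstring.
EXPECTED_COLUMNS = [
    "relation_token_span_ids", "ent1_ent2_start", "ent1", "ent2", "label",
    "label_id", "ent1_type", "ent2_type", "ent1_id", "ent2_id",
    "ent1_cui", "ent2_cui", "doc_id", "sents",
]

# Each entity is wrapped in an opening and a closing special token.
MARKER_PAIRS = (("[s1]", "[e1]"), ("[s2]", "[e2]"))


def check_relation_tsv(handle):
    """Hypothetical helper (not part of MedCAT): list problems in a TSV file.

    Assumption: the first row is a header naming the columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames!r}"]

    problems = []
    for row_no, row in enumerate(reader, start=1):
        text = row["sents"] or ""  # a short row yields None for missing cells
        for start_tok, end_tok in MARKER_PAIRS:
            start, end = text.find(start_tok), text.find(end_tok)
            # Both tokens must be present, with the opening one first.
            if start == -1 or end == -1 or start > end:
                problems.append(
                    f"data row {row_no}: bad {start_tok}/{end_tok} markers"
                )
    return problems


header = "\t".join(EXPECTED_COLUMNS)
text = "...some text..[s1] first entity [e1].....[s2] second entity [e2]........"
# All columns except "sents" hold a placeholder here, purely for brevity.
row = "\t".join(["-"] * (len(EXPECTED_COLUMNS) - 1) + [text])
sample = io.StringIO(header + "\n" + row + "\n")
print(check_relation_tsv(sample))
```

An empty list means the header and markers look as documented; any other result lists one message per problem found.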