:py:mod:`medcat.utils.data_utils`
=================================

.. py:module:: medcat.utils.data_utils


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.data_utils.MetaAnnotationDS


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.utils.data_utils.set_all_seeds
   medcat.utils.data_utils.count_annotations_project
   medcat.utils.data_utils.load_data
   medcat.utils.data_utils.count_annotations
   medcat.utils.data_utils.get_doc_from_project
   medcat.utils.data_utils.get_ann_from_doc
   medcat.utils.data_utils.meta_ann_from_ann
   medcat.utils.data_utils.are_anns_same
   medcat.utils.data_utils.get_same_anns
   medcat.utils.data_utils.print_consolid_stats
   medcat.utils.data_utils.check_differences
   medcat.utils.data_utils.consolidate_double_annotations
   medcat.utils.data_utils.validate_ner_data
   medcat.utils.data_utils.prepare_from_json_hf
   medcat.utils.data_utils.prepare_from_json_chars
   medcat.utils.data_utils.make_mc_train_test
   medcat.utils.data_utils.get_false_positives


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.utils.data_utils.logger


.. py:data:: logger


.. py:function:: set_all_seeds(seed)


.. py:function:: count_annotations_project(project, cnt_per_cui=None)


.. py:function:: load_data(data_path, require_annotations = True, order_by_num_ann = True)

   Load data.

   :param data_path: The path to the data to load.
   :type data_path: str
   :param require_annotations: If True, require annotations at the project level: a project is kept only if at least one of its documents has annotations.
   :type require_annotations: bool
   :param order_by_num_ann: Whether to order by number of annotations. Defaults to True.
   :type order_by_num_ann: bool

   :Returns: **Dict** -- The loaded data.


.. py:function:: count_annotations(data_path)


.. py:function:: get_doc_from_project(project, doc_id)


.. py:function:: get_ann_from_doc(document, start, end)


.. py:function:: meta_ann_from_ann(ann, meta_name)


.. py:function:: are_anns_same(ann, ann2, meta_names = [], require_double_inner = True)


.. py:function:: get_same_anns(document, document2, require_double_inner = True, ann_stats = [], meta_names = [])


.. py:function:: print_consolid_stats(ann_stats = [], meta_names = [])


.. py:function:: check_differences(data_path, cat, cntx_size=30, min_acc=0.2, ignore_already_done=False, only_start=False, only_saved=False)


.. py:function:: consolidate_double_annotations(data_path, out_path, require_double = True, require_double_inner = False, meta_anns_to_match = [])

   Consolidate a dataset that was multi-annotated (the same documents annotated twice).

   :param data_path: Output from MedCATtrainer. Projects containing the same documents must have the same name.
   :type data_path: str
   :param out_path: Where the consolidated data will be saved. Usually this contains only annotations on which both annotators agree.
   :type out_path: str
   :param require_double: If True, everything must be double annotated, meaning there must be two projects of the same name for each name. If False, projects that do not have double annotations are included as is, and projects that do have them are still checked.
   :type require_double: bool
   :param require_double_inner: If False, some entities may be annotated by only one annotator and not the other, while annotations that exist for both must still be the same.
   :type require_double_inner: bool
   :param meta_anns_to_match: List of meta annotations that must match for two annotations to be considered the same. If empty, only the mention level is checked.
   :type meta_anns_to_match: List

   :Returns: **Dict** -- The consolidated annotations.


.. py:function:: validate_ner_data(data_path, cdb, cntx_size = 70, status_only = False, ignore_if_already_done = False)

   Please just ignore this function, I'm afraid to even look at it.

   :param data_path: The data path.
   :type data_path: str
   :param cdb: The concept database.
   :type cdb: CDB
   :param cntx_size: The context size. Defaults to 70.
   :type cntx_size: int
   :param status_only: Whether to only consider status. Defaults to False.
   :type status_only: bool
   :param ignore_if_already_done: Whether to ignore if already done. Defaults to False.
   :type ignore_if_already_done: bool


.. py:class:: MetaAnnotationDS(data, category_map)

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass
   it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
   data sample for a given key. Subclasses could also optionally overwrite
   :meth:`__len__`, which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options
   of :class:`~torch.utils.data.DataLoader`. Subclasses could also
   optionally implement :meth:`__getitems__` to speed up batched sample loading.
   This method accepts a list of sample indices for a batch and returns a list of samples.

   .. note::
     :class:`~torch.utils.data.DataLoader` by default constructs an index
     sampler that yields integral indices. To make it work with a map-style
     dataset with non-integral indices/keys, a custom sampler must be provided.

   .. py:method:: __init__(data, category_map)

      Create MetaAnnotationDS.

      :param data: Dictionary of data values.
      :type data: Dict
      :param category_map: Map from category name to id.
      :type category_map: Dict


   .. py:method:: __getitem__(idx)


   .. py:method:: __len__()



.. py:function:: prepare_from_json_hf(data_path, cntx_left, cntx_right, tokenizer, cui_filter = None, replace_center = None)


.. py:function:: prepare_from_json_chars(data, cntx_left, cntx_right, tokenizer, cui_filter = None, replace_center = None)

   Convert the data from a JSON format into a CSV-like format for training.

   :param data: The JSON data from MedCAT.
   :type data: Dict
   :param cntx_left: The size of the context to the left of the concept.
   :type cntx_left: int
   :param cntx_right: The size of the context to the right of the concept.
   :type cntx_right: int
   :param tokenizer: A tokenizer instance from Hugging Face.
   :type tokenizer: Any
   :param cui_filter: The CUI filter. Defaults to None.
   :type cui_filter: Optional[Dict], optional
   :param replace_center: If not None, the center word (concept) is replaced with the given value. Defaults to None.
   :type replace_center: Optional[Dict], optional

   :Returns: **Dict** -- {'category_name': [('category_value', 'tokens', 'center_token'), ...], ...}


.. py:function:: make_mc_train_test(data, cdb, test_size = 0.2)

   Make train set.
   This is a disaster.

   :param data: The data.
   :type data: Dict
   :param cdb: The concept database.
   :type cdb: CDB
   :param test_size: The test size. Defaults to 0.2.
   :type test_size: float

   :Returns: **Tuple** -- The train set, the test set, the test annotations, and the total annotations


.. py:function:: get_false_positives(doc, spacy_doc)
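
As an illustration of the agreement rule described for ``consolidate_double_annotations``, the following is a minimal sketch in plain Python. It does not call MedCAT and is not the library's implementation. The helper names ``anns_match`` and ``consolidate_pair``, the flat annotation dicts (``start``, ``end``, ``cui``, ``meta``), and the choice to treat span plus CUI as the "mention level" are all assumptions made for this example.

```python
from typing import Dict, List, Sequence

Ann = Dict[str, object]  # hypothetical, simplified annotation record


def anns_match(a: Ann, b: Ann, meta_names: Sequence[str] = ()) -> bool:
    """Compare two annotations at the mention level (assumed here to be
    span + CUI), then on every requested meta annotation."""
    if (a["start"], a["end"], a["cui"]) != (b["start"], b["end"], b["cui"]):
        return False
    return all(a["meta"].get(m) == b["meta"].get(m) for m in meta_names)


def consolidate_pair(
    anns_a: List[Ann],
    anns_b: List[Ann],
    require_double_inner: bool = True,
    meta_names: Sequence[str] = (),
) -> List[Ann]:
    """Keep annotations on which both annotators agree.

    If require_double_inner is False, annotations made by only one
    annotator are kept too; spans annotated by both must still match.
    """
    by_span_b = {(b["start"], b["end"]): b for b in anns_b}
    seen = set()
    kept: List[Ann] = []
    for a in anns_a:
        span = (a["start"], a["end"])
        seen.add(span)
        b = by_span_b.get(span)
        if b is None:
            # Only annotator A marked this span.
            if not require_double_inner:
                kept.append(a)
        elif anns_match(a, b, meta_names):
            kept.append(a)
    if not require_double_inner:
        # Spans that only annotator B marked.
        kept.extend(b for b in anns_b if (b["start"], b["end"]) not in seen)
    return kept


# Toy data: both annotators mark 0-5 but disagree on the "Status" meta value.
anns_a = [
    {"start": 0, "end": 5, "cui": "C1", "meta": {"Status": "Confirmed"}},
    {"start": 10, "end": 15, "cui": "C2", "meta": {}},
]
anns_b = [
    {"start": 0, "end": 5, "cui": "C1", "meta": {"Status": "Other"}},
    {"start": 20, "end": 25, "cui": "C3", "meta": {}},
]

strict = consolidate_pair(anns_a, anns_b)
lenient = consolidate_pair(anns_a, anns_b, require_double_inner=False)
with_meta = consolidate_pair(anns_a, anns_b, meta_names=("Status",))
```

With an empty ``meta_names`` only the mention level is compared, so the shared span survives in ``strict``; passing ``("Status",)`` makes the same pair disagree and drops it.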