:py:mod:`medcat.utils.data_utils`
=================================

.. py:module:: medcat.utils.data_utils


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.data_utils.MetaAnnotationDS


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.utils.data_utils.set_all_seeds
   medcat.utils.data_utils.count_annotations_project
   medcat.utils.data_utils.load_data
   medcat.utils.data_utils.count_annotations
   medcat.utils.data_utils.get_doc_from_project
   medcat.utils.data_utils.get_ann_from_doc
   medcat.utils.data_utils.meta_ann_from_ann
   medcat.utils.data_utils.are_anns_same
   medcat.utils.data_utils.get_same_anns
   medcat.utils.data_utils.print_consolid_stats
   medcat.utils.data_utils.check_differences
   medcat.utils.data_utils.consolidate_double_annotations
   medcat.utils.data_utils.validate_ner_data
   medcat.utils.data_utils.prepare_from_json_hf
   medcat.utils.data_utils.prepare_from_json_chars
   medcat.utils.data_utils.make_mc_train_test
   medcat.utils.data_utils.get_false_positives


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.utils.data_utils.logger


.. py:data:: logger

   
.. py:function:: set_all_seeds(seed)


.. py:function:: count_annotations_project(project, cnt_per_cui=None)


.. py:function:: load_data(data_path, require_annotations = True, order_by_num_ann = True)

   Load data.

   :param data_path: The path to the data to load.
   :type data_path: str
   :param require_annotations: If True, keep only projects in which at least one document has annotations. The check is made per project, not per document.
   :type require_annotations: bool
   :param order_by_num_ann: Whether to order by number of annotations. Defaults to True.
   :type order_by_num_ann: bool

   :Returns: **Dict** -- The loaded data.
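
   The filtering and ordering described above can be sketched in plain Python. This is a minimal illustration, not the MedCAT implementation: ``load_projects`` is a hypothetical name, the ``projects`` / ``documents`` / ``annotations`` layout of the export is assumed, and putting the projects with the most annotations first is also an assumption.

   .. code-block:: python

      import json
      import os
      import tempfile


      def load_projects(data_path, require_annotations=True, order_by_num_ann=True):
          """Hypothetical stand-in for illustration, not the MedCAT function."""
          with open(data_path) as f:
              data = json.load(f)
          projects = data["projects"]
          if require_annotations:
              # Keep a project if at least one of its documents has annotations.
              projects = [
                  p for p in projects
                  if any(len(d["annotations"]) > 0 for d in p["documents"])
              ]
          if order_by_num_ann:
              # Assumed ordering: projects with the most annotations come first.
              projects = sorted(
                  projects,
                  key=lambda p: sum(len(d["annotations"]) for d in p["documents"]),
                  reverse=True,
              )
          return {"projects": projects}


      # Made-up export with three projects of different sizes.
      example = {
          "projects": [
              {"name": "empty", "documents": [{"annotations": []}]},
              {"name": "small", "documents": [{"annotations": [{"cui": "C1"}]}]},
              {"name": "large", "documents": [{"annotations": [{"cui": "C1"}, {"cui": "C2"}]}]},
          ]
      }
      with tempfile.TemporaryDirectory() as tmp:
          path = os.path.join(tmp, "export.json")
          with open(path, "w") as f:
              json.dump(example, f)
          loaded = load_projects(path)

      print([p["name"] for p in loaded["projects"]])  # ['large', 'small']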


.. py:function:: count_annotations(data_path)


.. py:function:: get_doc_from_project(project, doc_id)


.. py:function:: get_ann_from_doc(document, start, end)


.. py:function:: meta_ann_from_ann(ann, meta_name)


.. py:function:: are_anns_same(ann, ann2, meta_names = [], require_double_inner = True)


.. py:function:: get_same_anns(document, document2, require_double_inner = True, ann_stats = [], meta_names = [])


.. py:function:: print_consolid_stats(ann_stats = [], meta_names = [])


.. py:function:: check_differences(data_path, cat, cntx_size=30, min_acc=0.2, ignore_already_done=False, only_start=False, only_saved=False)


.. py:function:: consolidate_double_annotations(data_path, out_path, require_double = True, require_double_inner = False, meta_anns_to_match = [])

   Consolidate a dataset that was multi-annotated (the same documents annotated twice).

   :param data_path: Output from MedCATtrainer. Projects containing the same documents must have the same name.
   :type data_path: str
   :param out_path: The consolidated data will be saved here. Usually this contains only annotations on which both annotators agree.
   :type out_path: str
   :param require_double: If True, everything must be double annotated, meaning there must be two projects
                          for each project name. If False, projects that do not have double annotations are
                          included as is, and projects that do have them are still checked.
   :type require_double: bool
   :param require_double_inner: If False, some entities may be annotated by only one annotator and not the
                                other, while annotations that exist in both must still be the same.
   :type require_double_inner: bool
   :param meta_anns_to_match: List of meta annotations that must match for two annotations to be considered
                              the same. If empty, only the mention level is checked.
   :type meta_anns_to_match: List

   :Returns: **Dict** -- The consolidated annotations.
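
   The roles of ``require_double_inner`` and ``meta_anns_to_match`` can be illustrated on the annotations of a single document. The sketch below is not the library's algorithm: ``consolidate_annotations`` is a hypothetical helper, annotations are assumed to be dicts with ``start``, ``end``, ``cui`` and a simplified ``meta_anns`` mapping of name to value, and two mentions are assumed to agree when they share the same span and CUI.

   .. code-block:: python

      def consolidate_annotations(anns_a, anns_b, require_double_inner=False,
                                  meta_anns_to_match=()):
          """Hypothetical helper: merge two annotators' annotations for one document."""

          def span(ann):
              return (ann["start"], ann["end"])

          def agree(a, b):
              # Mention level: same CUI. Meta level: every listed meta annotation matches.
              if a["cui"] != b["cui"]:
                  return False
              return all(
                  a["meta_anns"].get(name) == b["meta_anns"].get(name)
                  for name in meta_anns_to_match
              )

          by_span_b = {span(b): b for b in anns_b}
          spans_a = set()
          out = []
          for a in anns_a:
              spans_a.add(span(a))
              b = by_span_b.get(span(a))
              if b is None:
                  # Only annotator A marked this span.
                  if not require_double_inner:
                      out.append(a)
              elif agree(a, b):
                  out.append(a)
          if not require_double_inner:
              # Spans that only annotator B marked.
              out.extend(b for b in anns_b if span(b) not in spans_a)
          return out


      anns_a = [
          {"start": 0, "end": 5, "cui": "C1", "meta_anns": {"Status": "Confirmed"}},
          {"start": 10, "end": 15, "cui": "C2", "meta_anns": {"Status": "Confirmed"}},
          {"start": 20, "end": 25, "cui": "C3", "meta_anns": {"Status": "Other"}},
      ]
      anns_b = [
          {"start": 0, "end": 5, "cui": "C1", "meta_anns": {"Status": "Confirmed"}},
          {"start": 10, "end": 15, "cui": "C2", "meta_anns": {"Status": "Other"}},
      ]

      strict = consolidate_annotations(anns_a, anns_b, require_double_inner=True,
                                       meta_anns_to_match=["Status"])
      lenient = consolidate_annotations(anns_a, anns_b, require_double_inner=False,
                                        meta_anns_to_match=["Status"])
      mention_only = consolidate_annotations(anns_a, anns_b, require_double_inner=True)

      print([a["cui"] for a in strict])        # ['C1']
      print([a["cui"] for a in lenient])       # ['C1', 'C3']
      print([a["cui"] for a in mention_only])  # ['C1', 'C2']

   In this sketch a span on which the annotators disagree is dropped in every mode. Only spans seen by a single annotator are affected by ``require_double_inner``.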


.. py:function:: validate_ner_data(data_path, cdb, cntx_size = 70, status_only = False, ignore_if_already_done = False)

   Please just ignore this function, I'm afraid to even look at it.

   :param data_path: The data path.
   :type data_path: str
   :param cdb: The concept database.
   :type cdb: CDB
   :param cntx_size: The context size. Defaults to 70.
   :type cntx_size: int
   :param status_only: Whether to only consider status. Defaults to False.
   :type status_only: bool
   :param ignore_if_already_done: Whether to ignore if already done. Defaults to False.
   :type ignore_if_already_done: bool


.. py:class:: MetaAnnotationDS(data, category_map)


   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass
   it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
   data sample for a given key. Subclasses could also optionally overwrite
   :meth:`__len__`, which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options
   of :class:`~torch.utils.data.DataLoader`. Subclasses could also
   optionally implement :meth:`__getitems__` to speed up loading of batched
   samples. This method accepts a list of sample indices for a batch and
   returns a list of samples.

   .. note::
     :class:`~torch.utils.data.DataLoader` by default constructs an index
     sampler that yields integral indices.  To make it work with a map-style
     dataset with non-integral indices/keys, a custom sampler must be provided.

   .. py:method:: __init__(data, category_map)

      Create a MetaAnnotationDS.

      :param data: Dictionary of data values.
      :type data: Dict
      :param category_map: Map from category name to id.
      :type category_map: Dict
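
      How a dictionary of data values and a category map can combine into a map-style dataset is sketched below. The class is a torch-free stand-in written for illustration. The name ``MetaAnnotationSketch`` and the keys ``input_ids`` and ``label`` are assumptions, and a real dataset would subclass :class:`torch.utils.data.Dataset`.

      .. code-block:: python

         class MetaAnnotationSketch:
             """Torch-free stand-in for a map-style dataset (hypothetical)."""

             def __init__(self, data, category_map):
                 # data: dict of parallel lists, one entry per sample (assumed layout).
                 self.data = data
                 # category_map: category name -> integer id.
                 self.category_map = category_map
                 self.len = len(data["input_ids"])

             def __getitem__(self, idx):
                 # Copy every field for this sample, then map the label name to its id.
                 item = {key: values[idx] for key, values in self.data.items()
                         if key != "label"}
                 item["label"] = self.category_map[self.data["label"][idx]]
                 return item

             def __len__(self):
                 return self.len


         data = {
             "input_ids": [[101, 7, 102], [101, 9, 102]],
             "label": ["Confirmed", "Other"],
         }
         ds = MetaAnnotationSketch(data, {"Confirmed": 1, "Other": 0})
         print(len(ds), ds[1])  # 2 {'input_ids': [101, 9, 102], 'label': 0}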


   .. py:method:: __getitem__(idx)


   .. py:method:: __len__()


.. py:function:: prepare_from_json_hf(data_path, cntx_left, cntx_right, tokenizer, cui_filter = None, replace_center = None)


.. py:function:: prepare_from_json_chars(data, cntx_left, cntx_right, tokenizer, cui_filter = None, replace_center = None)

   Convert the data from a json format into a CSV-like format for training.

   :param data: The json file from MedCAT.
   :type data: Dict
   :param cntx_left: The size of the context to the left of the concept.
   :type cntx_left: int
   :param cntx_right: The size of the context to the right of the concept.
   :type cntx_right: int
   :param tokenizer: An instance of the <FastTokenizer> class from Hugging Face.
   :type tokenizer: Any
   :param cui_filter: The CUI filter. Defaults to None.
   :type cui_filter: Optional[Dict], optional
   :param replace_center: If not None the center word (concept) will be
                          replaced with whatever is set. Defaults to None.
   :type replace_center: Optional[Dict], optional

   :Returns: **Dict** -- {'category_name': [('category_value', 'tokens', 'center_token'), ...], ...}
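
   The shape of the returned dictionary can be illustrated with a small, self-contained sketch. It is not the MedCAT implementation: the whitespace tokenizer stands in for the Hugging Face tokenizer, the context is assumed to be counted in tokens, ``tokenize_with_offsets`` and ``context_sample`` are hypothetical helpers, and the ``meta_anns`` mapping on the annotation is a simplified, assumed layout.

   .. code-block:: python

      import re


      def tokenize_with_offsets(text):
          """Hypothetical whitespace tokenizer returning (token, start, end) triples."""
          return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


      def context_sample(text, ann_start, cntx_left, cntx_right):
          """Return the tokens around an annotation and the center token itself."""
          tokens = tokenize_with_offsets(text)
          # Index of the token whose character span contains the annotation start.
          center = next(i for i, (_, s, e) in enumerate(tokens) if s <= ann_start < e)
          lo = max(0, center - cntx_left)
          hi = min(len(tokens), center + 1 + cntx_right)
          window = [tok for tok, _, _ in tokens[lo:hi]]
          return window, window[center - lo]


      text = "Patient has a history of diabetes and asthma"
      annotation = {"start": text.index("diabetes"), "meta_anns": {"Status": "Confirmed"}}

      out = {}
      window, center_token = context_sample(text, annotation["start"], cntx_left=2, cntx_right=1)
      for category_name, category_value in annotation["meta_anns"].items():
          out.setdefault(category_name, []).append((category_value, window, center_token))

      print(out)
      # {'Status': [('Confirmed', ['history', 'of', 'diabetes', 'and'], 'diabetes')]}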


.. py:function:: make_mc_train_test(data, cdb, test_size = 0.2)

   Make train and test sets.

   This is a disaster.

   :param data: The data.
   :type data: Dict
   :param cdb: The concept database.
   :type cdb: CDB
   :param test_size: The test size. Defaults to 0.2.
   :type test_size: float

   :Returns: **Tuple** -- The train set, the test set, the test annotations, and the total annotations


.. py:function:: get_false_positives(doc, spacy_doc)