:py:mod:`medcat.utils.meta_cat.data_utils`
==========================================

.. py:module:: medcat.utils.meta_cat.data_utils


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.meta_cat.data_utils.Empty
   medcat.utils.meta_cat.data_utils.Span
   medcat.utils.meta_cat.data_utils.Doc


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.utils.meta_cat.data_utils.prepare_from_json
   medcat.utils.meta_cat.data_utils.prepare_for_oversampled_data
   medcat.utils.meta_cat.data_utils.encode_category_values
   medcat.utils.meta_cat.data_utils.json_to_fake_spacy


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.utils.meta_cat.data_utils.logger


.. py:data:: logger

   
.. py:function:: prepare_from_json(data, cntx_left, cntx_right, tokenizer, cui_filter = None, replace_center = None, prerequisites = {}, lowercase = True)

   Convert the data from JSON format into a CSV-like format for training. This function is not very efficient; the one
   that works with spaCy documents as part of the meta_cat.pipe method is much faster. If your dataset has more than 1M
   documents, consider rewriting this function, although a manually annotated dataset of that size would be unusual.

   :param data: Loaded output of MedCATtrainer. If we have a `my_export.json` from MedCATtrainer, then data = json.load(<my_export>).
   :type data: Dict
   :param cntx_left: Size of context to get from the left of the concept
   :type cntx_left: int
   :param cntx_right: Size of context to get from the right of the concept
   :type cntx_right: int
   :param tokenizer: Tokenizer used to split text into tokens for the meta models (LSTM, BERT, etc.).
   :type tokenizer: TokenizerWrapperBase
   :param replace_center: If not None the center word (concept) will be replaced with whatever this is.
   :type replace_center: Optional[str]
   :param prerequisites: A map of prerequisites. For example, suppose the data has two meta-annotations (experiencer, negation) and we want
                         to create a dataset for `negation` restricted to cases where `experiencer=patient`. The prerequisites would then be
                         `{'Experiencer': 'Patient'}`. The case must match whatever is in the data. Defaults to `{}`.
   :type prerequisites: Dict
   :param lowercase: Should the text be lowercased before tokenization. Defaults to True.
   :type lowercase: bool
   :param cui_filter: CUI filter if set. Defaults to None.
   :type cui_filter: Optional[set]

   :Returns: **out_data** (*dict*) -- Example: {'category_name': [('<category_value>', '<[tokens]>', '<center_token>'), ...], ...}
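
   The roles of ``cntx_left``, ``cntx_right`` and ``prerequisites`` can be illustrated with a minimal sketch. The helpers
   ``meets_prerequisites`` and ``context_window`` below are hypothetical and are not part of MedCAT; the flat
   name-to-value view of the meta-annotations and the whitespace tokenization are simplifying assumptions.

   .. code-block:: python

      def meets_prerequisites(meta_anns, prerequisites):
          """Hypothetical helper: True if every prerequisite matches exactly (case-sensitive)."""
          return all(meta_anns.get(name) == value for name, value in prerequisites.items())


      def context_window(tokens, center, cntx_left, cntx_right):
          """Hypothetical helper: tokens in the window around index `center`."""
          start = max(0, center - cntx_left)
          end = min(len(tokens), center + cntx_right + 1)
          return tokens[start:end]


      # Assumed, simplified view of one annotation: meta-annotation name -> value.
      meta_anns = {"Experiencer": "Patient", "Negation": "Yes"}

      print(meets_prerequisites(meta_anns, {"Experiencer": "Patient"}))  # True
      print(meets_prerequisites(meta_anns, {"Experiencer": "patient"}))  # False, case differs

      # Whitespace tokenization stands in for a real tokenizer here.
      tokens = "the patient has no history of diabetes mellitus".split()
      window = context_window(tokens, center=6, cntx_left=2, cntx_right=1)
      print(window)  # ['history', 'of', 'diabetes', 'mellitus']

   An empty ``prerequisites`` map matches every annotation in this sketch, which mirrors the documented default of `{}`.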


.. py:function:: prepare_for_oversampled_data(data, tokenizer)

   Convert the data from JSON format into a CSV-like format for training. This function is not very efficient; the one
   that works with spaCy documents as part of the meta_cat.pipe method is much faster. If your dataset has more than 1M
   documents, consider rewriting this function, although a manually annotated dataset of that size would be unusual.

   :param data: Oversampled data, expected in the following format:
                [[['text','of','the','document'], [index of medical entity], "label"],
                 [['text','of','the','document'], [index of medical entity], "label"]]
   :type data: List
   :param tokenizer: Tokenizer used to split text into tokens for the meta models (LSTM, BERT, etc.).
   :type tokenizer: TokenizerWrapperBase

   :Returns: **data_sampled** (*list*) -- The processed data in a format that can be merged with the output from prepare_from_json:
             [[<[tokens]>, [index of medical entity], "label"],
              [<[tokens]>, [index of medical entity], "label"]]
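
   As an illustration of the expected input shape, the following sketch builds a small oversampled dataset and checks
   that each sample is a ``[tokens, [index], label]`` triple. The ``is_valid_sample`` helper, the label strings and the
   assumption that the index points into the token list are all hypothetical and not taken from MedCAT.

   .. code-block:: python

      def is_valid_sample(sample):
          """Hypothetical check for one [tokens, [center index], label] sample."""
          if len(sample) != 3:
              return False
          tokens, center, label = sample
          return (
              isinstance(tokens, list)
              and all(isinstance(t, str) for t in tokens)
              and isinstance(center, list)
              # Assumption: each index refers to a position in `tokens`.
              and all(isinstance(i, int) and 0 <= i < len(tokens) for i in center)
              and isinstance(label, str)
          )


      # Illustrative data only; labels are made up for the example.
      oversampled = [
          [["patient", "denies", "chest", "pain"], [3], "Negated"],
          [["history", "of", "asthma"], [2], "Affirmed"],
      ]

      print(all(is_valid_sample(s) for s in oversampled))  # True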


.. py:function:: encode_category_values(data, existing_category_value2id = None, category_undersample = None)

   Converts the category values in the data output by `prepare_from_json`
   into integer values.

   :param data: Output of `prepare_from_json`.
   :type data: Dict
   :param existing_category_value2id: Map from category_value to id (old/existing).
   :type existing_category_value2id: Optional[Dict]
   :param category_undersample: Name of the class that should be used to undersample the data (for two-phase learning).

   :Returns: * **dict** -- New data with integers in place of strings for category values.
             * **dict** -- New undersampled data (for two-phase learning) with integers in place of strings for category values.
             * **dict** -- Map from category value to ID for all categories in the data.
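
   The core idea, replacing string category values with integer ids while honoring an existing mapping, can be
   sketched as follows. ``encode_values`` is a hypothetical, simplified stand-in that works on a flat list of values;
   it does not reproduce the function's data layout or its undersampling step, and it assumes existing ids are
   contiguous starting from 0.

   .. code-block:: python

      def encode_values(values, existing_value2id=None):
          """Hypothetical helper: map string category values to integer ids."""
          # Copy so the caller's mapping is not modified.
          value2id = dict(existing_value2id or {})
          ids = []
          for value in values:
              if value not in value2id:
                  # Assumption: existing ids are 0..n-1, so len() is the next free id.
                  value2id[value] = len(value2id)
              ids.append(value2id[value])
          return ids, value2id


      ids, value2id = encode_values(["Affirmed", "Negated", "Affirmed"], {"Negated": 0})
      print(ids)       # [1, 0, 1]
      print(value2id)  # {'Negated': 0, 'Affirmed': 1}

   Passing a previously built mapping keeps ids stable across runs, which matters when a trained model already
   associates each id with a class.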


.. py:function:: json_to_fake_spacy(data, id2text)

   Creates a generator of fake spaCy documents, used for running
   the meta_cat pipe separately from the main cat pipeline.

   :param data: Output from cat formatted as: {<id>: <output of get_entities>, ...}.
   :type data: Dict
   :param id2text: Map from document id to text of that document.
   :type id2text: Dict

   :Yields: *Generator* -- Generator of spaCy-like documents that can be fed into meta_cat.pipe.
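
   A minimal sketch of the idea follows. ``FakeSpan``, ``FakeDoc`` and ``to_fake_docs`` are hypothetical stand-ins,
   not the module's own classes, and the ``entities``, ``start``, ``end`` and ``id`` keys are an assumption about
   the shape of the ``get_entities`` output.

   .. code-block:: python

      class FakeSpan:
          """Hypothetical stand-in for an entity span."""

          def __init__(self, start_char, end_char, id_):
              self.start_char = start_char
              self.end_char = end_char
              self.id = id_


      class FakeDoc:
          """Hypothetical stand-in for a document holding text and entity spans."""

          def __init__(self, text, id_):
              self.text = text
              self.id = id_
              self.ents = []


      def to_fake_docs(data, id2text):
          """Yield one FakeDoc per document id, with one FakeSpan per entity."""
          for doc_id, output in data.items():
              doc = FakeDoc(id2text[doc_id], doc_id)
              # Assumed layout: output["entities"] maps entity id -> entity dict.
              for ent in output["entities"].values():
                  doc.ents.append(FakeSpan(ent["start"], ent["end"], ent["id"]))
              yield doc


      data = {"doc1": {"entities": {0: {"start": 11, "end": 19, "id": 0}}}}
      id2text = {"doc1": "History of diabetes."}

      docs = list(to_fake_docs(data, id2text))
      first = docs[0]
      print(first.text[first.ents[0].start_char:first.ents[0].end_char])  # diabetes

   Because the documents are produced lazily, a large collection can be streamed without holding every document in
   memory at once.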


.. py:class:: Empty


   Bases: :py:obj:`object`

   .. py:method:: __init__()


.. py:class:: Span(start_char, end_char, id_)


   Bases: :py:obj:`object`

   .. py:method:: __init__(start_char, end_char, id_)


.. py:class:: Doc(text, id_)


   Bases: :py:obj:`object`

   .. py:method:: __init__(text, id_)