:py:mod:`medcat.vocab`
======================

.. py:module:: medcat.vocab


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.vocab.Vocab


.. py:class:: Vocab


   Bases: :py:obj:`object`

   Vocabulary used to store word embeddings for context similarity
   calculation. Also used by the spell checker, but only to check whether
   a word is spelled correctly, not to fix the spelling.

   Properties:
       vocab (dict):
           Map from word to attributes, e.g. {'house': {'vec': <np.array>, 'cnt': <int>, ...}, ...}
       index2word (dict):
           Map from index to word - used for negative sampling
       vec_index2word (dict):
           Same as index2word but only words that have vectors
       unigram_table (dict):
           Table of word indices used for negative sampling.
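
   The relationship between these properties can be pictured with plain
   dictionaries. This is only an illustrative sketch, not MedCAT code: the
   ``ind`` key and the list-based vectors are assumptions made for the
   example, since the attribute list above is elided.

   ```python
   # Hypothetical stand-in for the layout described above; not MedCAT code.
   # Assumption: each word stores its index under an "ind" key.
   vocab = {
       "house": {"vec": [0.32, 0.12, 1.23], "cnt": 34444, "ind": 0},
       "the": {"vec": None, "cnt": 120000, "ind": 1},
   }

   # index -> word, for every word in the vocabulary
   index2word = {attrs["ind"]: word for word, attrs in vocab.items()}

   # index -> word, restricted to words that have a vector
   vec_index2word = {
       attrs["ind"]: word
       for word, attrs in vocab.items()
       if attrs["vec"] is not None
   }
   ```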

   .. py:method:: __init__()


   .. py:method:: inc_or_add(word, cnt = 1, vec = None)

      Add a word or increase its count.

      :param word: Word to be added
      :type word: str
      :param cnt: Amount by which to increase the count, or the initial
                  count if the word is new. (Default value = 1)
      :type cnt: int
      :param vec: Word vector (Default value = None)
      :type vec: Optional[np.ndarray]
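
      The add-or-increment behaviour can be sketched with a plain dictionary.
      This is a minimal stand-in, not the library's implementation; it
      assumes that ``vec`` is only used when the word is new.

      ```python
      def inc_or_add(vocab, word, cnt=1, vec=None):
          """Add `word` with count `cnt`, or raise its existing count by `cnt`."""
          if word not in vocab:
              # New word: the count is set to cnt and the vector is stored.
              vocab[word] = {"cnt": cnt, "vec": vec}
          else:
              # Known word: only the count changes (assumption: vec is ignored).
              vocab[word]["cnt"] += cnt

      vocab = {}
      inc_or_add(vocab, "house")         # new word, count set to 1
      inc_or_add(vocab, "house", cnt=4)  # existing word, count grows by 4
      ```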


   .. py:method:: remove_all_vectors()

      Remove all stored vector representations.


   .. py:method:: remove_words_below_cnt(cnt)

      Remove all words with frequency below cnt.

      :param cnt: Word count limit.
      :type cnt: int
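
      As a rough illustration of the pruning rule, the sketch below keeps
      every word whose count is at least ``cnt``. The helper name, the dict
      layout and the re-indexing step are assumptions for the example, not
      confirmed MedCAT behaviour.

      ```python
      def prune_below_cnt(vocab, cnt):
          """Return a new dict without the words whose count is below `cnt`."""
          kept = {
              word: dict(attrs)  # copy so the input dict is left untouched
              for word, attrs in vocab.items()
              if attrs["cnt"] >= cnt
          }
          # Assumption: indices are reassigned so they stay contiguous.
          for new_ind, attrs in enumerate(kept.values()):
              attrs["ind"] = new_ind
          return kept

      vocab = {
          "a": {"cnt": 1, "ind": 0},
          "b": {"cnt": 5, "ind": 1},
          "c": {"cnt": 3, "ind": 2},
      }
      pruned = prune_below_cnt(vocab, 3)  # "c" has exactly 3 and is kept
      ```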


   .. py:method:: inc_wc(word, cnt = 1)

      Increase word count by cnt.

      :param word: For which word to increase the count
      :type word: str
      :param cnt: By how much to increase the count (Default value = 1)
      :type cnt: int


   .. py:method:: add_vec(word, vec)

      Add vector to a word.

      :param word: To which word to add the vector.
      :type word: str
      :param vec: The vector to add.
      :type vec: np.ndarray


   .. py:method:: reset_counts(cnt = 1)

      Reset the count for all words to cnt.

      :param cnt: New count for all words in the vocab. (Default value = 1)
      :type cnt: int


   .. py:method:: update_counts(tokens)

      Given a list of tokens, update counts for words in the vocab.

      :param tokens: Usually a large block of text split into tokens/words.
      :type tokens: List[str]


   .. py:method:: add_word(word, cnt = 1, vec = None, replace = True)

      Add a word to the vocabulary.

      :param word: The word to be added; it should be lemmatized and lowercased
      :type word: str
      :param cnt: Count of this word in your dataset (Default value = 1)
      :type cnt: int
      :param vec: The vector representation of the word (Default value = None)
      :type vec: Optional[np.ndarray]
      :param replace: Whether to replace the existing vector representation (Default value = True)
      :type replace: bool


   .. py:method:: add_words(path, replace = True)

      Add words to the vocab from a file. The file is required to have
      the following format (vec being optional)::

          <word>      <cnt>[  <vec_space_separated>]

      For example, one line for the word house with a 3-dimensional vector::

          house   34444   0.3232 0.123213 1.231231

      :param path: Path to the file with words and vectors
      :type path: str
      :param replace: Whether existing words in the vocabulary will be replaced (Default value = True)
      :type replace: bool
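
      To make the line format concrete, here is a small parsing sketch. The
      helper ``parse_vocab_line`` is hypothetical and splits on any
      whitespace, which is an assumption; the actual field separator used by
      the library is not stated here.

      ```python
      def parse_vocab_line(line):
          """Split one line into (word, cnt, vec); vec is None when absent."""
          parts = line.split()  # assumption: fields are whitespace-separated
          word = parts[0]
          cnt = int(parts[1])
          # Everything after the count is the vector; an empty list becomes None.
          vec = [float(x) for x in parts[2:]] or None
          return word, cnt, vec

      with_vec = parse_vocab_line("house   34444   0.3232 0.123213 1.231231")
      without_vec = parse_vocab_line("house   34444")
      ```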


   .. py:method:: make_unigram_table(table_size = 100000000)

      Make the unigram table for negative sampling; see the paper for
      details.

      :param table_size: The size of the table (Default value = 100000000)
      :type table_size: int
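
      The docstring does not name the paper. The sketch below assumes it
      refers to the word2vec negative-sampling scheme, in which a word's
      share of the table is proportional to its count raised to the power
      3/4. The helper name and the small table size are illustrative only.

      ```python
      def build_unigram_table(counts, table_size=1000, power=0.75):
          """Build a list of word indices for sampling.

          Index i fills a share of the table proportional to counts[i] ** power.
          """
          weights = [c ** power for c in counts]
          total = sum(weights)
          table = []
          for ind, weight in enumerate(weights):
              # Rounding means the final length can differ slightly from table_size.
              table.extend([ind] * round(table_size * weight / total))
          return table

      # Word 0 occurs 16 times, word 1 once; the weights are 8 and 1.
      table = build_unigram_table([16, 1], table_size=900)
      ```

      Raising counts to a power below 1 flattens the distribution, so rare
      words are drawn more often than their raw frequency would suggest.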


   .. py:method:: get_negative_samples(n = 6, ignore_punct_and_num = False)

      Get n negative samples.

      :param n: How many words to return (Default value = 6)
      :type n: int
      :param ignore_punct_and_num: Whether to ignore punctuation and numbers. (Default value = False)
      :type ignore_punct_and_num: bool

      :raises Exception: If no unigram table is present.

      :Returns: **List[int]** -- Indices for words in this vocabulary.
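
      Drawing from a unigram table amounts to picking random positions in
      it. The following is a minimal sketch under that assumption, using a
      plain list as the table; it leaves out the ``ignore_punct_and_num``
      filtering and is not the library's implementation.

      ```python
      import random

      def draw_negative_samples(unigram_table, n=6):
          """Return n word indices drawn uniformly from the table entries."""
          if not unigram_table:
              # Mirrors the documented behaviour of raising when no table exists.
              raise Exception("No unigram table present, build one first")
          return [random.choice(unigram_table) for _ in range(n)]

      # Index 0 fills 80% of this toy table, so it is drawn more often.
      toy_table = [0] * 8 + [1] * 2
      samples = draw_negative_samples(toy_table, n=6)
      ```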


   .. py:method:: __getitem__(word)


   .. py:method:: vec(word)


   .. py:method:: count(word)


   .. py:method:: item(word)


   .. py:method:: __contains__(word)


   .. py:method:: save(path)


   .. py:method:: load(path)
      :classmethod: