:py:mod:`medcat.utils.make_vocab`
=================================

.. py:module:: medcat.utils.make_vocab


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.make_vocab.MakeVocab


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.utils.make_vocab.logger


.. py:data:: logger

   
.. py:class:: MakeVocab(config, cdb=None, vocab=None, word_tokenizer=None)


   Bases: :py:obj:`object`

   Create a new vocab from a text file.

   :param config: Global configuration for medcat.
   :type config: medcat.config.Config
   :param cdb: The concept database whose words will be added on top of the Vocab built from the text file.
   :type cdb: medcat.cdb.CDB
   :param vocab: Vocabulary to be extended; leave as None to create a new Vocab. Default: None
   :type vocab: medcat.vocab.Vocab, optional
   :param word_tokenizer: A custom tokenizer for word splitting, used if embeddings are BERT or similar.
                          Default: None
   :type word_tokenizer: Callable, optional

   .. rubric:: Examples

   To make a vocab and train word embeddings:

   >>> cdb = <your existing cdb>
   >>> maker = MakeVocab(cdb=cdb, config=config)
   >>> maker.make(data_iterator, out_folder="./output/")
   >>> maker.add_vectors(in_path="./output/data.txt")

   .. py:method:: __init__(config, cdb=None, vocab=None, word_tokenizer=None)


   .. py:method:: _tok(text)


   .. py:method:: make(iter_data, out_folder, join_cdb=True, normalize_tokens=False)

      Make a vocab, initially without vectors. This will create two files in ``out_folder``:

      - ``vocab.dat``: the vocabulary without vectors.
      - ``data.txt``: the tokenized dataset, prepared for training word2vec or similar embeddings.

      :param iter_data: An iterator over sentences or documents. Can also be a simple array of text documents/sentences.
      :type iter_data: Iterator
      :param out_folder: A path to a folder where all the results will be saved.
      :type out_folder: string
      :param join_cdb: Whether the words from the CDB should be added to the Vocab. Default: True.
      :type join_cdb: bool
      :param normalize_tokens: If set, tokens will be lemmatized. This tends to work better in cases where differences
                               such as plural/singular should be ignored, but it matters less if the dataset is big enough.
      :type normalize_tokens: bool, defaults to False
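
      As a minimal sketch of what ``iter_data`` could look like for a corpus that is too large to hold in memory, the generator below yields one document per line of a text file. The helper name ``iter_documents`` and the one-document-per-line layout are assumptions made for illustration; they are not part of MedCAT.

      .. code-block:: python

         from pathlib import Path
         from typing import Iterator, Union


         def iter_documents(path: Union[str, Path], encoding: str = "utf-8") -> Iterator[str]:
             """Yield one non-empty, stripped document per line of a text file.

             Hypothetical helper: assumes the corpus stores one document per line.
             """
             with open(path, encoding=encoding) as handle:
                 for line in handle:
                     text = line.strip()
                     if text:  # skip blank lines
                         yield text

      The result could then be passed as ``maker.make(iter_documents("corpus.txt"), out_folder="./output/")``, where ``corpus.txt`` is a placeholder path.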


   .. py:method:: add_vectors(in_path = None, w2v = None, overwrite = False, data_iter = None, workers = 14, epochs = 2, min_count = 10, window = 10, vector_size = 300, unigram_table_size = 100000000)

      Add vectors to an existing vocabulary and save changes to the vocab_path.

      :param in_path: Path to the data.txt that was created by the MakeVocab.make() function.
      :type in_path: Optional[str]
      :param w2v: An existing word2vec instance. Default: None
      :type w2v: Optional[Word2Vec]
      :param overwrite: If True it will overwrite existing vectors in the vocabulary. Default: False
      :type overwrite: bool
      :param data_iter: A custom iterator over the data. If provided, in_path is not needed.
      :type data_iter: Optional[Iterator]
      :param workers: Number of workers for Word2Vec. Defaults to 14.
      :type workers: int
      :param epochs: Number of epochs for Word2Vec. Defaults to 2.
      :type epochs: int
      :param min_count: Minimum count for Word2Vec. Defaults to 10.
      :type min_count: int
      :param window: Window size for Word2Vec. Defaults to 10.
      :type window: int
      :param vector_size: Vector size for Word2Vec. Defaults to 300.
      :type vector_size: int
      :param unigram_table_size: Unigram table size for vocab. Defaults to 100_000_000.
      :type unigram_table_size: int

      :raises ValueError: In case of unknown input.

      :Returns: **Word2Vec** -- A trained word2vec model.
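
      gensim's ``Word2Vec`` reads its training corpus more than once (one pass to build its vocabulary, then one pass per epoch), so a one-shot generator is not suitable as a corpus. The sketch below shows a restartable iterable that might serve as ``data_iter``. It assumes that ``data_iter`` is handed to ``Word2Vec`` as its corpus and that the file holds one whitespace-tokenized document per line; neither is stated in this reference, and the class name ``TokenizedLines`` is hypothetical.

      .. code-block:: python

         from typing import Iterator, List


         class TokenizedLines:
             """Restartable iterable over a whitespace-tokenized text file.

             Hypothetical helper: assumes one tokenized document per line.
             """

             def __init__(self, path: str, encoding: str = "utf-8") -> None:
                 self.path = path
                 self.encoding = encoding

             def __iter__(self) -> Iterator[List[str]]:
                 # Re-open the file on every pass so the data can be read repeatedly.
                 with open(self.path, encoding=self.encoding) as handle:
                     for line in handle:
                         tokens = line.split()
                         if tokens:  # skip empty lines
                             yield tokens

      Under those assumptions, a call might look like ``maker.add_vectors(data_iter=TokenizedLines("./output/data.txt"))``.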


   .. py:method:: destroy_pipe()