:py:mod:`medcat.utils.make_vocab`
=================================

.. py:module:: medcat.utils.make_vocab


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.utils.make_vocab.MakeVocab


Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.utils.make_vocab.logger


.. py:data:: logger


.. py:class:: MakeVocab(config, cdb=None, vocab=None, word_tokenizer=None)

   Bases: :py:obj:`object`

   Create a new vocab from a text file.

   :param config: Global configuration for medcat.
   :type config: medcat.config.Config
   :param cdb: The concept database that will be added on top of the Vocab built from the text file.
   :type cdb: medcat.cdb.CDB
   :param vocab: Vocabulary to be extended. Leave as None to make a new Vocab. Default: None
   :type vocab: medcat.vocab.Vocab, optional
   :param word_tokenizer: A custom tokenizer for word splitting, used if the embeddings are BERT or similar. Default: None
   :type word_tokenizer:

   .. rubric:: Examples

   To make a vocab and train word embeddings:

   >>> cdb = ...  # placeholder: an existing medcat.cdb.CDB instance
   >>> maker = MakeVocab(cdb=cdb, config=config)
   >>> maker.make(data_iterator, out_folder="./output/")
   >>> maker.add_vectors(in_path="./output/data.txt")

   .. py:method:: __init__(config, cdb=None, vocab=None, word_tokenizer=None)


   .. py:method:: _tok(text)


   .. py:method:: make(iter_data, out_folder, join_cdb=True, normalize_tokens=False)

      Make a vocab, initially without vectors. This will create two files in the out_folder:

      - vocab.dat -> The vocabulary without vectors
      - data.txt -> The tokenized dataset prepared for training of word2vec or similar embeddings.

      :param iter_data: An iterator over sentences or documents. Can also be a simple array of text documents/sentences.
      :type iter_data: Iterator
      :param out_folder: A path to a folder where all the results will be saved.
      :type out_folder: string
      :param join_cdb: Whether the words from the CDB should be added to the Vocab. Default: True.
      :type join_cdb: bool
      :param normalize_tokens: If set, tokens will be lemmatized. This tends to work better in cases where the
                               difference between e.g. plural and singular forms should be ignored, but it is
                               generally less important if the dataset is big enough.
      :type normalize_tokens: bool, defaults to False


   .. py:method:: add_vectors(in_path = None, w2v = None, overwrite = False, data_iter = None, workers = 14, epochs = 2, min_count = 10, window = 10, vector_size = 300, unigram_table_size = 100000000)

      Add vectors to an existing vocabulary and save changes to the vocab_path.

      :param in_path: Path to the data.txt that was created by the MakeVocab.make() function.
      :type in_path: Optional[str]
      :param w2v: An existing word2vec instance. Default: None
      :type w2v: Optional[Word2Vec]
      :param overwrite: If True, existing vectors in the vocabulary will be overwritten. Default: False
      :type overwrite: bool
      :param data_iter: A custom iterator over the data. If provided, in_path is not needed.
      :type data_iter: Optional[Iterator]
      :param workers: Number of workers for Word2Vec. Defaults to 14.
      :type workers: int
      :param epochs: Number of epochs for Word2Vec. Defaults to 2.
      :type epochs: int
      :param min_count: Minimum count for Word2Vec. Defaults to 10.
      :type min_count: int
      :param window: Window size for Word2Vec. Defaults to 10.
      :type window: int
      :param vector_size: Vector size for Word2Vec. Defaults to 300.
      :type vector_size: int
      :param unigram_table_size: Unigram table size for vocab. Defaults to 100_000_000.
      :type unigram_table_size: int

      :raises ValueError: In case of unknown input.

      :Returns: **Word2Vec** -- A trained word2vec model.


   .. py:method:: destroy_pipe()
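
The two-step flow documented above (first collect words and a tokenized corpus, then train embeddings on that corpus with a minimum word count) can be illustrated with a small standard-library sketch. This is not MedCAT's implementation: ``tokenize``, ``prepare_corpus`` and ``filter_by_min_count`` are hypothetical helpers, the regex tokenizer is a stand-in for whatever tokenizer ``MakeVocab`` actually uses, and the one-document-per-line layout is an assumption about what a file such as ``data.txt`` could look like.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    # Hypothetical stand-in tokenizer: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())


def prepare_corpus(iter_data: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Collect word counts and one tokenized line per document.

    Accepts any iterable of strings, mirroring the documented ``iter_data``
    argument (an iterator or a plain list of documents/sentences).
    """
    counts: Counter = Counter()
    lines: List[str] = []
    for doc in iter_data:
        tokens = tokenize(doc)
        if not tokens:
            continue  # skip documents with no tokens
        counts.update(tokens)
        lines.append(" ".join(tokens))
    return counts, lines


def filter_by_min_count(counts: Counter, min_count: int) -> Set[str]:
    # Keep only words seen at least ``min_count`` times, which is the role
    # a minimum-count threshold plays when training word embeddings.
    return {word for word, n in counts.items() if n >= min_count}


docs = ["Patient has diabetes.", "The patient denies chest pain."]
counts, lines = prepare_corpus(iter(docs))
frequent = filter_by_min_count(counts, min_count=2)
```

With the documented default of ``min_count = 10``, a toy corpus like this one would keep no words at all, so a lower threshold is worth considering when experimenting on small datasets.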