medcat.utils.make_vocab

Module Contents

Classes

MakeVocab

Create a new vocab from a text file.

Attributes

logger

medcat.utils.make_vocab.logger
class medcat.utils.make_vocab.MakeVocab(config, cdb=None, vocab=None, word_tokenizer=None)

Bases: object

Create a new vocab from a text file.

Parameters:
  • config (medcat.config.Config) – Global configuration for medcat.

  • cdb (medcat.cdb.CDB) – The concept database that will be added on top of the Vocab built from the text file.

  • vocab (medcat.vocab.Vocab, optional) – Vocabulary to be extended, leave as None if you want to make a new Vocab. Default: None

  • word_tokenizer (<function>, optional) – A custom tokenizer for word splitting, used if embeddings are BERT or similar. Default: None

Examples

To make a vocab and train word embeddings:

>>> cdb = <your existing cdb>
>>> maker = MakeVocab(cdb=cdb, config=config)
>>> maker.make(data_iterator, out_folder="./output/")
>>> maker.add_vectors(in_path="./output/data.txt")
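If the embeddings come from BERT or a similar model, a custom word_tokenizer can be passed in. A minimal sketch, assuming the tokenizer is simply a callable that takes a text string and returns a list of token strings (that exact interface is an assumption, not confirmed by this page):

>>> def my_word_tokenizer(text):  # hypothetical: plain whitespace splitting
...     return text.split()
>>> maker = MakeVocab(cdb=cdb, config=config, word_tokenizer=my_word_tokenizer)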
__init__(config, cdb=None, vocab=None, word_tokenizer=None)
_tok(text)
make(iter_data, out_folder, join_cdb=True, normalize_tokens=False)

Make a vocab, initially without vectors. This will create two files in the out_folder:
  • vocab.dat -> The vocabulary without vectors
  • data.txt -> The tokenized dataset prepared for training of word2vec or similar embeddings.

Parameters:
  • iter_data (Iterator) – An iterator over sentences or documents. Can also be a simple array of text documents/sentences.

  • out_folder (string) – A path to a folder where all the results will be saved.

  • join_cdb (bool) – Whether the words from the CDB should be added to the Vocab. Default: True.

  • normalize_tokens (bool, defaults to False) – If set, tokens will be lemmatized. This tends to work better in cases where differences such as plural/singular should be ignored, but is generally not important if the dataset is big enough.
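A minimal sketch of calling make() with an in-memory list of documents (the document strings and output folder are illustrative only):

>>> documents = ["Patient presents with chest pain.", "No prior history of diabetes."]
>>> maker.make(iter(documents), out_folder="./output/", join_cdb=True)
>>> # ./output/ should now contain vocab.dat and data.txt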

add_vectors(in_path=None, w2v=None, overwrite=False, data_iter=None, workers=14, epochs=2, min_count=10, window=10, vector_size=300, unigram_table_size=100000000)

Add vectors to an existing vocabulary and save changes to the vocab_path.

Parameters:
  • in_path (Optional[str]) – Path to the data.txt that was created by the MakeVocab.make() function.

  • w2v (Optional[Word2Vec]) – An existing word2vec instance. Default: None

  • overwrite (bool) – If True it will overwrite existing vectors in the vocabulary. Default: False

  • data_iter (Optional[Iterator]) – If you want to provide a custom iterator over the data, use this. If provided, in_path is not needed.

  • workers (int) – Number of workers for Word2Vec. Defaults to 14.

  • epochs (int) – Number of epochs for Word2Vec. Defaults to 2.

  • min_count (int) – Minimum count for Word2Vec. Defaults to 10.

  • window (int) – Window size for Word2Vec. Defaults to 10.

  • vector_size (int) – Vector size for Word2Vec. Defaults to 300.

  • unigram_table_size (int) – Unigram table size for vocab. Defaults to 100_000_000.

Raises:

ValueError – In case of unknown input.

Returns:

Word2Vec – A trained word2vec model.

Return type:

gensim.models.Word2Vec
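
A minimal sketch of adding vectors from the data.txt produced by make(); the keyword values shown simply repeat the documented defaults and the path is illustrative:

>>> w2v = maker.add_vectors(in_path="./output/data.txt", overwrite=False,
...                         epochs=2, min_count=10, window=10, vector_size=300)
>>> # w2v is the trained gensim.models.Word2Vec instance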

destroy_pipe()