medcat.utils.make_vocab
Module Contents
Classes
MakeVocab – Create a new vocab from a text file.
Attributes
- medcat.utils.make_vocab.logger
- class medcat.utils.make_vocab.MakeVocab(config, cdb=None, vocab=None, word_tokenizer=None)
Bases: object
Create a new vocab from a text file.
- Parameters:
config (medcat.config.Config) – Global configuration for medcat.
cdb (medcat.cdb.CDB, optional) – The concept database that will be added on top of the Vocab built from the text file. Default: None
vocab (medcat.vocab.Vocab, optional) – Vocabulary to be extended; leave as None to make a new Vocab. Default: None
word_tokenizer (<function>, optional) – A custom tokenizer for word splitting, used if embeddings are BERT or similar. Default: None
Examples
To make a vocab and train word embeddings:
>>> from medcat.utils.make_vocab import MakeVocab
>>> cdb = <your existing cdb>
>>> maker = MakeVocab(cdb=cdb, config=config)
>>> maker.make(data_iterator, out_folder="./output/")
>>> maker.add_vectors(in_path="./output/data.txt")
- __init__(config, cdb=None, vocab=None, word_tokenizer=None)
- _tok(text)
- make(iter_data, out_folder, join_cdb=True, normalize_tokens=False)
Make a vocab, without vectors initially. This will create two files in the out_folder:
- vocab.dat -> The vocabulary without vectors
- data.txt -> The tokenized dataset prepared for training of word2vec or similar embeddings.
- Parameters:
iter_data (Iterator) – An iterator over sentences or documents. Can also be a simple array of text documents/sentences.
out_folder (string) – A path to a folder where all the results will be saved.
join_cdb (bool) – Whether the words from the CDB should be added to the Vocab. Default: True.
normalize_tokens (bool) – If set, tokens will be lemmatized. This tends to work better in cases where the difference between e.g. plural and singular should be ignored, but it matters less if the dataset is big enough. Default: False.
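Since iter_data only needs to be an iterator over sentences or documents, a small generator is usually enough. The following is a minimal sketch, not part of medcat: the helper name iter_documents, the folder layout and the one-document-per-line convention are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator


def iter_documents(folder: str, pattern: str = "*.txt") -> Iterator[str]:
    """Yield one document per non-empty line from every matching file.

    Hypothetical helper: assumes each line of each file is one document.
    """
    # Sort the paths so the documents come out in a reproducible order.
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                text = line.strip()
                if text:  # skip blank lines
                    yield text
```

Under these assumptions the result could be passed as maker.make(iter_documents("./corpus/"), out_folder="./output/"). A generator can be consumed only once, so create a fresh one for every call.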
- add_vectors(in_path=None, w2v=None, overwrite=False, data_iter=None, workers=14, epochs=2, min_count=10, window=10, vector_size=300, unigram_table_size=100000000)
Add vectors to an existing vocabulary and save changes to the vocab_path.
- Parameters:
in_path (Optional[str]) – Path to the data.txt that was created by the MakeVocab.make() function.
w2v (Optional[Word2Vec]) – An existing word2vec instance. Default: None
overwrite (bool) – If True it will overwrite existing vectors in the vocabulary. Default: False
data_iter (Optional[Iterator]) – A custom iterator over the data. If provided, in_path is not needed.
workers (int) – Number of workers for Word2Vec. Defaults to 14.
epochs (int) – Number of epochs for Word2Vec. Defaults to 2.
min_count (int) – Minimum count for Word2Vec. Defaults to 10.
window (int) – Window size for Word2Vec. Defaults to 10.
vector_size (int) – Vector size for Word2Vec. Defaults to 300.
unigram_table_size (int) – Unigram table size for vocab. Defaults to 100_000_000.
- Raises:
ValueError – In case of unknown input.
- Returns:
Word2Vec – A trained word2vec model.
- Return type:
gensim.models.Word2Vec
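When supplying data_iter instead of in_path, note that gensim's Word2Vec expects a restartable iterable of token lists rather than a one-shot generator, because it makes several passes over the data. The sketch below shows one way to build such an iterable. It rests on two assumptions that this reference does not state: that data_iter is consumed the same way Word2Vec consumes its sentences, and that the input file holds one whitespace-tokenized document per line. The class name TokenizedLines is hypothetical.

```python
from typing import Iterator, List


class TokenizedLines:
    """Restartable iterable over a whitespace-tokenized text file.

    Hypothetical helper: assumes one tokenized document per line.
    """

    def __init__(self, path: str) -> None:
        self.path = path

    def __iter__(self) -> Iterator[List[str]]:
        # Re-open the file on every pass so each epoch reads it from the start.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens
```

With those assumptions in place, a call might look like maker.add_vectors(data_iter=TokenizedLines("./output/data.txt"), epochs=2).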
- destroy_pipe()