:py:mod:`medcat.vocab`
======================

.. py:module:: medcat.vocab


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medcat.vocab.Vocab


.. py:class:: Vocab

   Bases: :py:obj:`object`

   Vocabulary used to store word embeddings for context similarity
   calculation. Also used by the spell checker, but only to check whether a
   word is spelled correctly, not to fix the spelling.

   Properties:
       vocab (dict):
           Map from word to attributes, e.g.
           ``{'house': {'vec': <np.ndarray>, 'cnt': <int>, ...}, ...}``
       index2word (dict):
           Map from index to word, used for negative sampling
       vec_index2word (dict):
           Same as index2word, but only for words that have vectors
       unigram_table (dict):
           Table used for negative sampling

   .. py:method:: __init__()


   .. py:method:: inc_or_add(word, cnt = 1, vec = None)

      Add a word or increase its count.

      :param word: Word to be added
      :type word: str
      :param cnt: By how much the count should be increased, or what it
                  should be set to for a new word. (Default value = 1)
      :type cnt: int
      :param vec: Word vector (Default value = None)
      :type vec: Optional[np.ndarray]


   .. py:method:: remove_all_vectors()

      Remove all stored vector representations.


   .. py:method:: remove_words_below_cnt(cnt)

      Remove all words with frequency below cnt.

      :param cnt: Word count limit.
      :type cnt: int


   .. py:method:: inc_wc(word, cnt = 1)

      Increase word count by cnt.

      :param word: The word whose count to increase
      :type word: str
      :param cnt: By how much to increase the count (Default value = 1)
      :type cnt: int


   .. py:method:: add_vec(word, vec)

      Add vector to a word.

      :param word: To which word to add the vector.
      :type word: str
      :param vec: The vector to add.
      :type vec: np.ndarray


   .. py:method:: reset_counts(cnt = 1)

      Reset the count for all words to cnt.

      :param cnt: New count for all words in the vocab. (Default value = 1)
      :type cnt: int


   .. py:method:: update_counts(tokens)

      Given a list of tokens, update counts for words in the vocab.

      :param tokens: Usually a large block of text split into tokens/words.
      :type tokens: List[str]


   .. py:method:: add_word(word, cnt = 1, vec = None, replace = True)

      Add a word to the vocabulary.

      :param word: The word to be added; it should be lemmatized and lowercased
      :type word: str
      :param cnt: Count of this word in your dataset (Default value = 1)
      :type cnt: int
      :param vec: The vector representation of the word (Default value = None)
      :type vec: Optional[np.ndarray]
      :param replace: Whether to replace the old vector representation
                      (Default value = True)
      :type replace: bool


   .. py:method:: add_words(path, replace = True)

      Add words to the vocab from a file. The file is required to have the
      following format (vec being optional): ``<word> <cnt> [<vec>]``,
      e.g. one line for the word house with a 3-dimensional vector:
      ``house 34444 0.3232 0.123213 1.231231``

      :param path: Path to the file with words and vectors
      :type path: str
      :param replace: Whether existing words in the vocabulary will be
                      replaced (Default value = True)
      :type replace: bool


   .. py:method:: make_unigram_table(table_size = 100000000)

      Make the unigram table for negative sampling; see the paper for details.

      :param table_size: The size of the table (Default value = 100000000)
      :type table_size: int


   .. py:method:: get_negative_samples(n = 6, ignore_punct_and_num = False)

      Get n negative samples.

      :param n: How many words to return (Default value = 6)
      :type n: int
      :param ignore_punct_and_num: Whether to ignore punctuation and numbers.
                                   (Default value = False)
      :type ignore_punct_and_num: bool

      :raises Exception: If no unigram table is present.

      :Returns: **List[int]** -- Indices for words in this vocabulary.


   .. py:method:: __getitem__(word)


   .. py:method:: vec(word)


   .. py:method:: count(word)


   .. py:method:: item(word)


   .. py:method:: __contains__(word)


   .. py:method:: save(path)


   .. py:method:: load(path)
      :classmethod:
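
The descriptions of ``make_unigram_table`` and ``get_negative_samples`` leave the sampling scheme to the paper. The following is a minimal, standard-library sketch of one common way to build such a table, not MedCAT's actual implementation. The helper names ``build_unigram_table`` and ``sample_negatives`` are hypothetical, and the ``0.75`` smoothing exponent is an assumption borrowed from word2vec-style negative sampling.

```python
import random
from typing import Dict, List


def build_unigram_table(counts: Dict[str, int], table_size: int = 1000,
                        power: float = 0.75) -> List[int]:
    """Return word indices, each repeated in proportion to count ** power.

    Hypothetical helper; the exponent 0.75 is an assumption, not taken
    from the documentation above.
    """
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    total = sum(weights)
    table: List[int] = []
    for index, weight in enumerate(weights):
        # Each word receives a share of slots proportional to its
        # smoothed frequency.
        table.extend([index] * int(round(weight / total * table_size)))
    return table


def sample_negatives(table: List[int], n: int = 6, seed: int = 0) -> List[int]:
    """Draw n word indices uniformly from the table."""
    if not table:
        # Mirrors the documented behaviour of raising when no table exists.
        raise ValueError("No unigram table is present")
    rng = random.Random(seed)
    return [rng.choice(table) for _ in range(n)]


counts = {"house": 100, "the": 10000, "aorta": 1}
table = build_unigram_table(counts)
samples = sample_negatives(table, n=6)
print(samples)
```

Because frequent words occupy more slots, drawing a uniform random slot returns them more often, while the exponent keeps rare words from disappearing entirely. The documented default ``table_size`` is far larger than the one used here, which is kept small for illustration.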
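
To make the file layout expected by ``add_words`` concrete, here is a small sketch that parses one line into a word, a count, and an optional vector. It is an illustration only: ``parse_vocab_line`` is a hypothetical helper, and splitting on arbitrary whitespace is an assumption, since the real loader may require a specific delimiter.

```python
from typing import List, Optional, Tuple


def parse_vocab_line(line: str) -> Tuple[str, int, Optional[List[float]]]:
    """Split one vocabulary line into (word, count, vector or None).

    Hypothetical helper; assumes whitespace-separated fields.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"Expected at least a word and a count: {line!r}")
    word, cnt = parts[0], int(parts[1])
    # Everything after the count is treated as the vector; it may be absent.
    vec = [float(x) for x in parts[2:]] or None
    return word, cnt, vec


with_vec = parse_vocab_line("house 34444 0.3232 0.123213 1.231231")
without_vec = parse_vocab_line("garden 12")
print(with_vec)
print(without_vec)
```

Each parsed triple corresponds to the ``word``, ``cnt`` and ``vec`` arguments documented for ``add_word``.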