medcat.vocab
Module Contents
Classes
Vocab – Vocabulary used to store word embeddings for context similarity calculation.
- class medcat.vocab.Vocab
Bases:
object
Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker, but only to check whether a word is spelled correctly, not to fix the spelling.
- Properties:
- vocab (dict):
Map from word to attributes, e.g. {'house': {'vec': <np.array>, 'cnt': <int>, …}, …}
- index2word (dict):
Map from index to word, used for negative sampling
- vec_index2word (dict):
Same as index2word, but restricted to words that have vectors
- unigram_table (dict):
Table used for negative sampling.
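As a rough illustration of how these structures relate to one another, here is a plain-Python sketch. The literal values, the 'ind' key, and the use of lists in place of NumPy arrays are assumptions made for illustration; the sketch also assumes index2word maps an index to a word, as its name suggests. It is not MedCAT's actual storage.

```python
# Toy stand-ins for the Vocab properties; all values are made up.
vocab = {
    "house": {"vec": [0.3, 0.1, 1.2], "cnt": 34444, "ind": 0},  # 'ind' is hypothetical
    "the": {"vec": None, "cnt": 90000, "ind": 1},
}

# Index -> word for every entry in the vocabulary.
index2word = {item["ind"]: word for word, item in vocab.items()}

# Index -> word, restricted to entries that have a vector.
vec_index2word = {
    item["ind"]: word
    for word, item in vocab.items()
    if item["vec"] is not None
}
```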
- __init__()
- Return type:
None
- inc_or_add(word, cnt=1, vec=None)
Add a word or increase its count.
- Parameters:
word (str) – Word to be added
cnt (int) – Amount by which to increase the count, or the initial count if the word is new. (Default value = 1)
vec (Optional[np.ndarray]) – Word vector (Default value = None)
- Return type:
None
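The add-or-increment behaviour described above can be sketched on a plain dict. The free function and the dict layout below are hypothetical stand-ins written for illustration, not the method's real implementation.

```python
from typing import Dict, List, Optional


def inc_or_add(vocab: Dict[str, dict], word: str, cnt: int = 1,
               vec: Optional[List[float]] = None) -> None:
    """Add `word` with count `cnt`, or increase its count if it already exists."""
    if word in vocab:
        vocab[word]["cnt"] += cnt
    else:
        vocab[word] = {"cnt": cnt, "vec": vec}


vocab: Dict[str, dict] = {}
inc_or_add(vocab, "house", cnt=3)  # new word: count is set to 3
inc_or_add(vocab, "house")         # existing word: count is increased by 1
```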
- remove_all_vectors()
Remove all stored vector representations.
- Return type:
None
- remove_words_below_cnt(cnt)
Remove all words with frequency below cnt.
- Parameters:
cnt (int) – Word count limit.
- Return type:
None
- inc_wc(word, cnt=1)
Increase word count by cnt.
- Parameters:
word (str) – The word whose count to increase
cnt (int) – Amount by which to increase the count (Default value = 1)
- Return type:
None
- add_vec(word, vec)
Add vector to a word.
- Parameters:
word (str) – To which word to add the vector.
vec (np.ndarray) – The vector to add.
- Return type:
None
- reset_counts(cnt=1)
Reset the count for all words to cnt.
- Parameters:
cnt (int) – New count for all words in the vocab. (Default value = 1)
- Return type:
None
- update_counts(tokens)
Given a list of tokens, update the counts for words in the vocab.
- Parameters:
tokens (List[str]) – Usually a large block of text split into tokens/words.
- Return type:
None
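A minimal sketch of this kind of count update, assuming that tokens missing from the vocabulary are skipped rather than added (the description only mentions words already in the vocab). The function and dict layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def update_counts(vocab: Dict[str, dict], tokens: List[str]) -> None:
    """Increase the count of each known word by its number of occurrences."""
    for token, occurrences in Counter(tokens).items():
        if token in vocab:  # assumption: unknown tokens are ignored
            vocab[token]["cnt"] += occurrences


vocab = {"house": {"cnt": 1}}
update_counts(vocab, "the house is a house".split())
```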
- add_word(word, cnt=1, vec=None, replace=True)
Add a word to the vocabulary.
- Parameters:
word (str) – The word to be added; it should be lemmatized and lowercased
cnt (int) – Count of this word in your dataset (Default value = 1)
vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)
replace (bool) – Whether to replace an existing vector representation (Default value = True)
- Return type:
None
- add_words(path, replace=True)
Adds words to the vocab from a file. Each line of the file must have the following format (vec being optional):
<word> <cnt>[ <vec_space_separated>]
- e.g. one line for the word house with a 3-dimensional vector:
house 34444 0.3232 0.123213 1.231231
- Parameters:
path (str) – Path to the file with words and vectors
replace (bool) – Whether existing words in the vocabulary will be replaced (Default value = True)
- Return type:
None
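To make the line format concrete, the sketch below parses one line into a word, a count, and an optional vector. The helper name parse_vocab_line is hypothetical, and splitting on any whitespace is an assumption; the delimiter the library actually expects between fields may differ.

```python
from typing import List, Optional, Tuple


def parse_vocab_line(line: str) -> Tuple[str, int, Optional[List[float]]]:
    """Split '<word> <cnt>[ <vec_space_separated>]' into its parts."""
    parts = line.split()
    word, cnt = parts[0], int(parts[1])
    # The vector is optional: anything after the count is read as floats.
    vec = [float(x) for x in parts[2:]] if len(parts) > 2 else None
    return word, cnt, vec


with_vec = parse_vocab_line("house 34444 0.3232 0.123213 1.231231")
without_vec = parse_vocab_line("the 5")
```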
- make_unigram_table(table_size=100000000)
Make the unigram table for negative sampling; see the paper for details.
- Parameters:
table_size (int) – The size of the table (Default value = 100000000)
- Return type:
None
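For readers unfamiliar with unigram tables, here is a small sketch of the general idea: a table is filled with word indices so that frequent words occupy more slots. The 0.75 exponent is borrowed from the common word2vec formulation and is an assumption here, as are the function signature and the tiny table size; the source does not state how MedCAT builds its table.

```python
from typing import Dict, List


def make_unigram_table(counts: Dict[int, int], table_size: int = 1000,
                       power: float = 0.75) -> List[int]:
    """Fill a table with word indices in proportion to count ** power."""
    weights = {ind: cnt ** power for ind, cnt in counts.items()}
    total = sum(weights.values())
    table: List[int] = []
    for ind, weight in weights.items():
        # Each index gets a share of the slots matching its smoothed frequency.
        table.extend([ind] * int(round(weight / total * table_size)))
    return table


# Hypothetical counts keyed by word index.
table = make_unigram_table({0: 100, 1: 10, 2: 1})
```

Raising counts to a power below 1 flattens the distribution, so rare words are sampled somewhat more often than their raw frequency would suggest.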
- get_negative_samples(n=6, ignore_punct_and_num=False)
Get n negative samples.
- Parameters:
n (int) – How many words to return (Default value = 6)
ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. (Default value = False)
- Raises:
Exception – If no unigram table is present.
- Returns:
List[int] – Indices for words in this vocabulary.
- Return type:
List[int]
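Drawing negative samples from such a table amounts to picking random slots. The sketch below is a hypothetical stand-alone function operating on a plain list; it leaves out the ignore_punct_and_num filtering and only mirrors the documented behaviour of raising when no table is present.

```python
import random
from typing import List


def get_negative_samples(unigram_table: List[int], n: int = 6) -> List[int]:
    """Return n word indices drawn uniformly from the table's slots."""
    if not unigram_table:
        raise Exception("No unigram table present; build one first.")
    return [random.choice(unigram_table) for _ in range(n)]


random.seed(0)  # fixed seed only to make the illustration repeatable
samples = get_negative_samples([0, 0, 0, 1, 1, 2], n=6)
```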
- __getitem__(word)
- Parameters:
word (str) – The word to look up.
- Return type:
int
- vec(word)
- Parameters:
word (str) – The word whose vector to return.
- Return type:
numpy.ndarray
- count(word)
- Parameters:
word (str) – The word whose count to return.
- Return type:
int
- item(word)
- Parameters:
word (str) – The word whose attributes to return.
- Return type:
Dict
- __contains__(word)
- Parameters:
word (str) – The word to check for.
- Return type:
bool
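The lookup methods above carry no descriptions, so the following stand-in class is only a guess at how they might relate, based on their names and return types. TinyVocab is hypothetical; in particular, having indexing return the word count is an assumption inferred from the int return type.

```python
from typing import Dict, List, Optional


class TinyVocab:
    """Hypothetical stand-in for illustration; not medcat.vocab.Vocab."""

    def __init__(self) -> None:
        self.vocab: Dict[str, dict] = {}

    def add_word(self, word: str, cnt: int = 1,
                 vec: Optional[List[float]] = None) -> None:
        self.vocab[word] = {"cnt": cnt, "vec": vec}

    def __getitem__(self, word: str) -> int:
        # Assumption: indexing yields the count, matching the int return type.
        return self.vocab[word]["cnt"]

    def vec(self, word: str) -> Optional[List[float]]:
        return self.vocab[word]["vec"]

    def count(self, word: str) -> int:
        return self.vocab[word]["cnt"]

    def item(self, word: str) -> Dict:
        return self.vocab[word]

    def __contains__(self, word: str) -> bool:
        return word in self.vocab


tiny = TinyVocab()
tiny.add_word("house", cnt=34444, vec=[0.3232, 0.123213, 1.231231])
```

With this layout, `"house" in tiny` checks membership without raising, while the other lookups raise KeyError for unknown words.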
- save(path)
- Parameters:
path (str) – Path to save the vocabulary to.
- Return type:
None