`medcat.vocab`

Module Contents

Classes

Vocab

Vocabulary used to store word embeddings for context similarity

class medcat.vocab.Vocab

Bases: object

Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker, though only to check whether a word is spelled correctly, not to fix the spelling.

Properties:

vocab (dict):: Map from word to attributes, e.g. {'house': {'vec': <np.array>, 'cnt': <int>, …}, …}
index2word (dict):: Map from index to word - used for negative sampling
vec_index2word (dict):: Same as index2word but only words that have vectors
unigram_table (dict):: Table used for negative sampling.
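
As a rough illustration of how these properties relate, the sketch below builds the three mappings by hand, with plain Python lists standing in for NumPy arrays. It takes index2word to map an index to its word, as the name suggests. The 'ind' key and the helper name add are assumptions made for this example and do not come from the text above.

```python
# Hypothetical stand-in for the Vocab properties; lists replace np.ndarray.
vocab = {}           # word -> attributes
index2word = {}      # index -> word
vec_index2word = {}  # index -> word, only for words that have a vector


def add(word, cnt=1, vec=None):
    """Register a new word and keep the index maps in sync (illustrative helper)."""
    ind = len(vocab)
    vocab[word] = {"vec": vec, "cnt": cnt, "ind": ind}
    index2word[ind] = word
    if vec is not None:
        vec_index2word[ind] = word


add("house", cnt=34444, vec=[0.3232, 0.123213, 1.231231])
add("the", cnt=120000)  # no vector supplied

print(index2word)      # {0: 'house', 1: 'the'}
print(vec_index2word)  # {0: 'house'}
```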

__init__()

Return type:: None

inc_or_add(word, cnt=1, vec=None)

Add a word or increase its count.

Parameters:

word (str) – Word to be added
cnt (int) – By how much should the count be increased, or to what should it be set if a new word. (Default value = 1)
vec (Optional[np.ndarray]) – Word vector (Default value = None)

Return type:

None
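
The add-or-increase behaviour can be sketched on a plain dict as follows. This is a minimal illustration of the semantics described above, not MedCAT's implementation; the free function taking the dict as its first argument is an assumption made to keep the example self-contained.

```python
def inc_or_add(vocab, word, cnt=1, vec=None):
    """Increase the count of a known word, or add it with count `cnt`."""
    if word in vocab:
        vocab[word]["cnt"] += cnt
    else:
        vocab[word] = {"cnt": cnt, "vec": vec}


vocab = {}
inc_or_add(vocab, "house")         # new word, count set to 1
inc_or_add(vocab, "house", cnt=4)  # known word, count increased by 4
print(vocab["house"]["cnt"])       # 5
```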

remove_all_vectors()

Remove all stored vector representations.

Return type:: None

remove_words_below_cnt(cnt)

Remove all words with frequency below cnt.

Parameters:: cnt (int) – Word count limit.
Return type:: None
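
A minimal sketch of this kind of pruning on a plain dict is shown below. It keeps words whose count is at least cnt and then renumbers the survivors; the renumbering step and the 'ind' key are assumptions of this example, since the text above only states that low-frequency words are removed.

```python
def remove_words_below_cnt(vocab, cnt):
    """Return a pruned copy of `vocab` plus a rebuilt index -> word map."""
    kept = {w: dict(item) for w, item in vocab.items() if item["cnt"] >= cnt}
    index2word = {}
    for ind, (word, item) in enumerate(kept.items()):
        item["ind"] = ind          # assumed: indices are made contiguous again
        index2word[ind] = word
    return kept, index2word


vocab = {
    "house": {"cnt": 5, "ind": 0},
    "rare": {"cnt": 1, "ind": 1},
    "red": {"cnt": 3, "ind": 2},
}
pruned, index2word = remove_words_below_cnt(vocab, 3)
print(index2word)  # {0: 'house', 1: 'red'}
```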

inc_wc(word, cnt=1)

Increase word count by cnt.

Parameters:

word (str) – For which word to increase the count
cnt (int) – By how much to increase the count (Default value = 1)

Return type:

None

add_vec(word, vec)

Add vector to a word.

Parameters:

word (str) – To which word to add the vector.
vec (np.ndarray) – The vector to add.

Return type:

None

reset_counts(cnt=1)

Reset the count for all words to cnt.

Parameters:: cnt (int) – New count for all words in the vocab. (Default value = 1)
Return type:: None

update_counts(tokens)

Given a list of tokens update counts for words in the vocab.

Parameters:: tokens (List[str]) – Usually a large block of text split into tokens/words.
Return type:: None
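
The counting step can be sketched with collections.Counter. The example reads "words in the vocab" as meaning that tokens not already in the vocabulary are skipped, which is an assumption; the dict-based vocab is likewise a stand-in for the class.

```python
from collections import Counter


def update_counts(vocab, tokens):
    """Add token frequencies to the counts of words already in `vocab`."""
    for token, n in Counter(tokens).items():
        if token in vocab:  # assumed: unknown tokens are ignored
            vocab[token]["cnt"] += n


vocab = {"house": {"cnt": 1}, "red": {"cnt": 1}}
update_counts(vocab, "the red house by the other house".split())
print(vocab["house"]["cnt"], vocab["red"]["cnt"])  # 3 2
```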

add_word(word, cnt=1, vec=None, replace=True)

Add a word to the vocabulary.

Parameters:

word (str) – The word to be added; it should be lemmatized and lowercased
cnt (int) – Count of this word in your dataset (Default value = 1)
vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)
replace (bool) – Will replace old vector representation (Default value = True)

Return type:

None

add_words(path, replace=True)

Adds words to the vocab from a file. Each line of the file must have the following format (vec is optional):

<word> <cnt>[ <vec_space_separated>]

For example, a line for the word house with a 3-dimensional vector: house 34444 0.3232 0.123213 1.231231

Parameters:

path (str) – path to the file with words and vectors
replace (bool) – existing words in the vocabulary will be replaced (Default value = True)

Return type:

None
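
To make the line format concrete, here is a small parser for a single line. It is a sketch only: the function name parse_vocab_line is hypothetical, splitting on any whitespace is an assumption about the separator, and the vector is returned as a plain list rather than a NumPy array.

```python
def parse_vocab_line(line):
    """Parse '<word> <cnt>[ <vec...>]' into (word, cnt, vec or None)."""
    parts = line.strip().split()
    word = parts[0]
    cnt = int(parts[1])
    # Everything after the count is the vector; an empty list becomes None.
    vec = [float(x) for x in parts[2:]] or None
    return word, cnt, vec


print(parse_vocab_line("house 34444 0.3232 0.123213 1.231231"))
# ('house', 34444, [0.3232, 0.123213, 1.231231])
print(parse_vocab_line("the 120000"))
# ('the', 120000, None)
```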

make_unigram_table(table_size=100000000)

Make a unigram table for negative sampling; see the paper for details.

Parameters:: table_size (int) – The size of the table (Defaults to 100 000 000)
Return type:: None
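
The text does not name the paper or describe the construction, so the following is only a sketch of one common approach: the word2vec-style table, in which each word index fills a share of the table proportional to its count raised to the power 0.75. The exponent, the list-based table, and the much smaller table size are all assumptions of this example.

```python
def make_unigram_table(counts, table_size=1000, power=0.75):
    """Build a list of word indices; `counts[i]` is the count of word i.

    Rounding means the result can differ slightly from `table_size`.
    """
    weights = [c ** power for c in counts]
    total = sum(weights)
    table = []
    for ind, weight in enumerate(weights):
        table.extend([ind] * round(weight / total * table_size))
    return table


# Word 0 occurs 16 times, word 1 once: weights are 8 and 1.
table = make_unigram_table([16, 1], table_size=900)
print(table.count(0), table.count(1))  # 800 100
```

Raising counts to a power below 1 flattens the distribution, so rare words are sampled more often than their raw frequency would suggest.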

get_negative_samples(n=6, ignore_punct_and_num=False)

Get N negative samples.

Parameters:

n (int) – How many words to return (Default value = 6)
ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. (Default value = False)

Raises:

Exception – If no unigram table is present.

Returns:

List[int] – Indices for words in this vocabulary.

Return type:

List[int]
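
Drawing samples from such a table can be sketched as uniform random picks from a list of word indices. This is an illustration under assumptions: the table is passed in as a plain list, the rng parameter is added here for reproducibility, and the ignore_punct_and_num filter is left out.

```python
import random


def get_negative_samples(unigram_table, n=6, rng=random):
    """Return `n` word indices drawn uniformly from the unigram table."""
    if not unigram_table:
        raise Exception("No unigram table present, build it first")
    return [rng.choice(unigram_table) for _ in range(n)]


table = [0] * 8 + [1] * 1  # word 0 is eight times as likely as word 1
samples = get_negative_samples(table, n=6, rng=random.Random(0))
print(len(samples))  # 6
```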

__getitem__(word)

Parameters:: word (str) –
Return type:: int

vec(word)

Parameters:: word (str) –
Return type:: numpy.ndarray

count(word)

Parameters:: word (str) –
Return type:: int

item(word)

Parameters:: word (str) –
Return type:: Dict

__contains__(word)

Parameters:: word (str) –
Return type:: bool

save(path)

Parameters:: path (str) –
Return type:: None

classmethod load(path)

Parameters:: path (str) –
Return type:: Vocab
