medcat.vocab

Module Contents

Classes

Vocab

Vocabulary used to store word embeddings for context similarity

class medcat.vocab.Vocab

Bases: object

Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker, but only to check whether a word is spelled correctly, not to fix the spelling.

Properties:
vocab (dict):

Map from word to attributes, e.g. {'house': {'vec': <np.array>, 'cnt': <int>, ...}, ...}

index2word (dict):

Map from an index to a word; used for negative sampling

vec_index2word (dict):

Same as index2word, but only for words that have vectors

unigram_table (dict):

Table used for negative sampling.
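
As a rough illustration of how these properties relate, the sketch below builds the same shapes with plain dictionaries. It assumes, as the names suggest, that index2word maps an index to a word. The words, counts, and list-based vectors are made up for illustration and are not taken from the MedCAT source.

```python
# Illustrative stand-ins for the documented properties; not the MedCAT implementation.
vocab = {
    "house": {"vec": [0.3, 0.1, 1.2], "cnt": 34444},
    "the": {"vec": None, "cnt": 90000},  # a word without a vector
}

# index -> word for every word in the vocabulary (assumed direction)
index2word = dict(enumerate(vocab))

# Same mapping, restricted to words that have a vector
vec_index2word = {
    i: w for i, w in index2word.items() if vocab[w]["vec"] is not None
}
```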

__init__()
Return type:

None

inc_or_add(word, cnt=1, vec=None)

Add a word or increase its count.

Parameters:
  • word (str) – Word to be added

  • cnt (int) – By how much to increase the count, or the initial count if the word is new. (Default value = 1)

  • vec (Optional[np.ndarray]) – Word vector (Default value = None)

Return type:

None
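
The behaviour described above can be sketched on a plain dictionary. This is a minimal stand-in under the assumption that a new word simply gets `cnt` as its starting count; the free function `inc_or_add` below is hypothetical and only mirrors the documented method name.

```python
from typing import Dict, Optional, Sequence


def inc_or_add(vocab: Dict[str, dict], word: str, cnt: int = 1,
               vec: Optional[Sequence[float]] = None) -> None:
    """Hypothetical stand-in: bump the count of a known word, else add it."""
    if word in vocab:
        vocab[word]["cnt"] += cnt
    else:
        vocab[word] = {"cnt": cnt, "vec": vec}


vocab: Dict[str, dict] = {}
inc_or_add(vocab, "house", cnt=3)  # new word: count is set to 3
inc_or_add(vocab, "house")         # known word: count goes up by 1
```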

remove_all_vectors()

Remove all stored vector representations.

Return type:

None

remove_words_below_cnt(cnt)

Remove all words with frequency below cnt.

Parameters:

cnt (int) – Word count limit.

Return type:

None

inc_wc(word, cnt=1)

Increase word count by cnt.

Parameters:
  • word (str) – For which word to increase the count

  • cnt (int) – By how much to increase the count (Default value = 1)

Return type:

None

add_vec(word, vec)

Add vector to a word.

Parameters:
  • word (str) – To which word to add the vector.

  • vec (np.ndarray) – The vector to add.

Return type:

None

reset_counts(cnt=1)

Reset the count for all words to cnt.

Parameters:

cnt (int) – New count for all words in the vocab. (Default value = 1)

Return type:

None

update_counts(tokens)

Given a list of tokens update counts for words in the vocab.

Parameters:

tokens (List[str]) – Usually a large block of text split into tokens/words.

Return type:

None
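
A minimal sketch of this kind of count update, assuming that tokens missing from the vocabulary are skipped rather than added (the description above only mentions words already in the vocab). The function and the sample data are illustrative, not the library code.

```python
from collections import Counter
from typing import Dict, List


def update_counts(vocab: Dict[str, dict], tokens: List[str]) -> None:
    """Hypothetical stand-in: add token frequencies to known words only."""
    for word, n in Counter(tokens).items():
        if word in vocab:  # assumption: unknown tokens are ignored
            vocab[word]["cnt"] += n


vocab = {"house": {"cnt": 1}}
update_counts(vocab, "the house is a house".split())
```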

add_word(word, cnt=1, vec=None, replace=True)

Add a word to the vocabulary.

Parameters:
  • word (str) – The word to be added; it should be lemmatized and lowercased

  • cnt (int) – Count of this word in your dataset (Default value = 1)

  • vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)

  • replace (bool) – Whether to replace an existing vector representation (Default value = True)

Return type:

None

add_words(path, replace=True)

Add words to the vocab from a file. Each line of the file must have the following format (the vector is optional):

<word> <cnt>[ <vec_space_separated>]

For example, a line for the word house with a 3-dimensional vector:

house 34444 0.3232 0.123213 1.231231

Parameters:
  • path (str) – path to the file with words and vectors

  • replace (bool) – Whether existing words in the vocabulary will be replaced (Default value = True)

Return type:

None
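
To make the line format concrete, here is a small parser for a single line. It is a sketch only: `parse_vocab_line` is a hypothetical helper, and splitting on whitespace is an assumption about how the fields are separated.

```python
from typing import List, Optional, Tuple


def parse_vocab_line(line: str) -> Tuple[str, int, Optional[List[float]]]:
    """Split '<word> <cnt>[ <vec_space_separated>]' into its parts."""
    parts = line.strip().split()
    word, cnt = parts[0], int(parts[1])
    # Everything after the count is the optional vector
    vec = [float(x) for x in parts[2:]] if len(parts) > 2 else None
    return word, cnt, vec


word, cnt, vec = parse_vocab_line("house 34444 0.3232 0.123213 1.231231")
```

Each parsed triple could then be passed on to add_word(word, cnt, vec), with the vector converted to an np.ndarray first.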

make_unigram_table(table_size=100000000)

Make the unigram table for negative sampling; see the paper for details.

Parameters:

table_size (int) – The size of the table (Defaults to 100 000 000)

Return type:

None
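
The general idea of a unigram table can be sketched as follows: each word index is repeated in proportion to its count raised to a power, so that frequent words are sampled more often but less than linearly. The exponent 0.75 is the value commonly used in word2vec-style negative sampling and is an assumption here, as is the small table size; this is not the MedCAT implementation.

```python
from typing import Dict, List


def make_unigram_table(counts: Dict[int, int], table_size: int = 1000,
                       power: float = 0.75) -> List[int]:
    """Hypothetical stand-in: repeat each index proportionally to count**power."""
    weights = {i: c ** power for i, c in counts.items()}
    total = sum(weights.values())
    table: List[int] = []
    for i, w in weights.items():
        # Rounding means the final length can differ slightly from table_size
        table.extend([i] * round(table_size * w / total))
    return table


# Index 0 is 16 times more frequent than index 1, but gets only 8 times the slots
table = make_unigram_table({0: 16, 1: 1}, table_size=900)
```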

get_negative_samples(n=6, ignore_punct_and_num=False)

Get n negative samples.

Parameters:
  • n (int) – How many words to return (Default value = 6)

  • ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. (Default value = False)

Raises:

Exception – If no unigram table is present.

Returns:

List[int] – Indices for words in this vocabulary.

Return type:

List[int]
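
Drawing samples from such a table can be sketched as uniform random picks, with an error when no table has been built. The function below is a hypothetical stand-in that takes the table as an argument and leaves out the ignore_punct_and_num filtering.

```python
import random
from typing import List, Optional


def get_negative_samples(table: Optional[List[int]], n: int = 6) -> List[int]:
    """Hypothetical stand-in: pick n word indices at random from the table."""
    if not table:
        raise Exception("No unigram table present, build one first.")
    return [random.choice(table) for _ in range(n)]


samples = get_negative_samples([0, 0, 0, 1], n=6)
```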

__getitem__(word)
Parameters:

word (str) –

Return type:

int

vec(word)
Parameters:

word (str) –

Return type:

numpy.ndarray

count(word)
Parameters:

word (str) –

Return type:

int

item(word)
Parameters:

word (str) –

Return type:

Dict

__contains__(word)
Parameters:

word (str) –

Return type:

bool

save(path)
Parameters:

path (str) –

Return type:

None

classmethod load(path)
Parameters:

path (str) –

Return type:

Vocab