medcat.utils.vocab_utils

Module Contents

Functions

calc_matrix(vocab, target_size)

Calculate the transformation matrix based on the word vectors in the Vocab.

convert_vec(cur, matrix[, target_dtype])

Helper function to convert the vector.

convert_vocab(vocab, matrix[, unigram_table_size])

Use the transformation matrix to convert the word vectors.

convert_context_vectors(cdb, matrix)

Use the transformation matrix to convert the context vectors within the CDB.

convert_vocab_vector_size(cdb, vocab, vec_size)

Convert the vocab vector size to a smaller one.

Attributes

logger

medcat.utils.vocab_utils.logger
medcat.utils.vocab_utils.calc_matrix(vocab, target_size)

Calculate the transformation matrix based on the word vectors in the Vocab.

Performs Principal Component Analysis (PCA). The word vectors in the Vocab are first centered (their mean is subtracted). The covariance matrix is then computed, and its eigenvalues and eigenvectors are calculated. Finally, the target_size eigenvectors corresponding to the largest eigenvalues are selected to form the transformation matrix.

Parameters:
  • vocab (Vocab) – The Vocab.

  • target_size (int) – The target vector size.

Returns:

np.ndarray – The transformation matrix.

Return type:

numpy.ndarray
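
The PCA steps described above can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the function name pca_matrix_sketch and the toy input array are hypothetical, and it assumes the word vectors have already been collected into a single (n_words, N) array rather than read from a Vocab.

```python
import numpy as np


def pca_matrix_sketch(vectors: np.ndarray, target_size: int) -> np.ndarray:
    """Hypothetical stand-in for calc_matrix on a plain (n_words, N) array."""
    # Center the vectors by subtracting the per-dimension mean.
    centered = vectors - vectors.mean(axis=0)
    # Covariance matrix over the N dimensions, shape (N, N).
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Indices of the target_size largest eigenvalues, largest first.
    top = np.argsort(eigvals)[::-1][:target_size]
    # Eigenvectors are stored as columns; transpose to (target_size, N).
    return eigvecs[:, top].T


rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 8))  # 50 made-up word vectors of length 8
matrix = pca_matrix_sketch(toy_vectors, target_size=3)
```

With this layout, each row of the matrix is one principal direction, so multiplying the matrix by a length-N vector yields a length-target_size vector.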

medcat.utils.vocab_utils.convert_vec(cur, matrix, target_dtype=np.float32)

Helper function to convert the vector.

This also guarantees a uniform element type (np.float32 by default), since in our experience some vectors may have a different type beforehand (e.g. np.float64).

Parameters:
  • cur (np.ndarray) – The current vector.

  • matrix (np.ndarray) – The transformation matrix.

  • target_dtype (Type) – The target element data type. Defaults to np.float32.

Returns:

np.ndarray – The transformed vector.

Return type:

numpy.ndarray
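
As a rough illustration of what such a conversion involves, the sketch below projects a vector with a (vec_size, N) matrix and casts the result. The name convert_vec_sketch and the toy values are hypothetical, and the assumption that the conversion is a plain matrix-vector product followed by a cast is ours, not confirmed by this page.

```python
import numpy as np


def convert_vec_sketch(cur, matrix, target_dtype=np.float32):
    """Hypothetical stand-in: project cur with matrix, then cast the result."""
    return np.dot(matrix, cur).astype(target_dtype)


# Toy (2, 4) matrix that simply keeps the first two components.
matrix = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
# A float64 input vector, as may occur before conversion.
cur = np.array([0.5, -1.5, 2.0, 3.0], dtype=np.float64)
converted = convert_vec_sketch(cur, matrix)
```

The result has the reduced length and the target dtype regardless of the input's original dtype.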

medcat.utils.vocab_utils.convert_vocab(vocab, matrix, unigram_table_size=10000000)

Use the transformation matrix to convert the word vectors.

Parameters:
  • vocab (Vocab) – The Vocab.

  • matrix (np.ndarray) – The transformation matrix.

  • unigram_table_size (int) – The unigram table size. Defaults to 10 000 000.

Return type:

None

medcat.utils.vocab_utils.convert_context_vectors(cdb, matrix)

Use the transformation matrix to convert the context vectors within the CDB.

Parameters:
  • cdb (CDB) – The Concept Database.

  • matrix (np.ndarray) – The transformation matrix.

Return type:

None

medcat.utils.vocab_utils.convert_vocab_vector_size(cdb, vocab, vec_size)

Convert the vocab vector size to a smaller one.

This uses Principal Component Analysis (PCA). The idea is that we first center all the word vectors (in Vocab), then compute the covariance matrix, then find the eigenvalues and eigenvectors, and then we select the top vec_size eigenvectors. This produces a transformation matrix of shape (vec_size, N), where N is the current vector length in the vocab.

After that, we perform the transformation. First we transform all the vectors in the Vocab, and then we transform all the context vectors defined within the CDB.

NOTE: This requires the CDB as well, since the per-concept context vectors stored within it are based on the vectors in the vocab and thus also need to be transformed.

Parameters:
  • cdb (CDB) – The Concept Database.

  • vocab (Vocab) – The Vocab.

  • vec_size (int) – The target vector size.
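
The order of operations described above (build the matrix from the word vectors, convert the word vectors, then convert the context vectors with the same matrix) can be sketched on toy data. Everything here is a hypothetical stand-in: plain dictionaries replace the Vocab and the CDB, the names word_vecs and context_vecs and the concept ID are made up, and the real objects store their vectors differently.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins: word -> vector, and concept ID -> context type -> vector.
word_vecs = {f"w{i}": rng.normal(size=6) for i in range(20)}
context_vecs = {"C001": {"short": rng.normal(size=6), "long": rng.normal(size=6)}}
vec_size = 2

# Step 1: build the (vec_size, N) matrix from the word vectors only.
stacked = np.stack(list(word_vecs.values()))
centered = stacked - stacked.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
matrix = eigvecs[:, np.argsort(eigvals)[::-1][:vec_size]].T

# Step 2: transform the word vectors.
word_vecs = {w: (matrix @ v).astype(np.float32) for w, v in word_vecs.items()}

# Step 3: transform the context vectors with the same matrix.
context_vecs = {
    cid: {ctx: (matrix @ v).astype(np.float32) for ctx, v in per_type.items()}
    for cid, per_type in context_vecs.items()
}
```

Using one matrix for both steps keeps the word vectors and the context vectors in the same reduced space, which is why the CDB has to be converted together with the Vocab.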