medcat.utils.vocab_utils
Module Contents
Functions
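calc_matrix(vocab, target_size) | Calculate the transformation matrix based on the word vectors in the Vocab.
convert_vec(cur, matrix, target_dtype=np.float32) | Helper function to convert the vector.
convert_vocab(vocab, matrix, unigram_table_size=10000000) | Use the transformation matrix to convert the word vectors.
convert_context_vectors(cdb, matrix) | Use the transformation matrix to convert the context vectors within the CDB.
convert_vocab_vector_size(cdb, vocab, vec_size) | Convert the vocab vector size to a smaller one.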
Attributes
- medcat.utils.vocab_utils.logger
- medcat.utils.vocab_utils.calc_matrix(vocab, target_size)
Calculate the transformation matrix based on the word vectors in the Vocab.
Performs Principal Component Analysis (PCA). This first mean-centers all the word vectors in the Vocab. It then computes the covariance matrix of the centered vectors. After that, the eigenvalues and eigenvectors of the covariance matrix are calculated. Finally, the target_size eigenvectors corresponding to the largest eigenvalues are selected to create the transformation matrix.
- Parameters:
vocab (Vocab) – The Vocab.
target_size (int) – The target vector size.
- Returns:
np.ndarray – The transformation matrix.
- Return type:
numpy.ndarray
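The PCA steps described for calc_matrix can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: the function name pca_matrix and the random input array are assumptions, and a 2-D array of shape (num_words, N) stands in for the vectors held by a real Vocab.

```python
import numpy as np


def pca_matrix(vectors: np.ndarray, target_size: int) -> np.ndarray:
    """Hypothetical helper: build a (target_size, N) projection matrix
    from an array of word vectors with shape (num_words, N)."""
    # Center the vectors by subtracting the per-dimension mean.
    centered = vectors - vectors.mean(axis=0)
    # Covariance between dimensions; rowvar=False treats columns as variables.
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Indices of the target_size largest eigenvalues, largest first.
    top = np.argsort(eigenvalues)[::-1][:target_size]
    # Eigenvectors are stored as columns; transpose so each row is a component.
    return eigenvectors[:, top].T


# Assumed toy data: 200 "word vectors" of length 8.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 8))
matrix = pca_matrix(vectors, 3)
```

The rows of the resulting matrix are orthonormal, so multiplying a length-8 vector by it yields its coordinates along the three directions of largest variance.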
- medcat.utils.vocab_utils.convert_vec(cur, matrix, target_dtype=np.float32)
Helper function to convert the vector.
This also guarantees uniform typing (np.float32 by default), since in our experience some vectors may have a different type beforehand (e.g. np.float64).
- Parameters:
cur (np.ndarray) – The current vector.
matrix (np.ndarray) – The transformation matrix.
target_dtype (Type) – The target element data type. Defaults to np.float32.
- Returns:
np.ndarray – The transformed vector.
- Return type:
numpy.ndarray
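Converting a single vector presumably amounts to a matrix-vector product followed by a dtype cast. The sketch below shows that idea under this assumption; project_vector is a hypothetical name, and the real convert_vec may differ in detail.

```python
import numpy as np


def project_vector(cur: np.ndarray, matrix: np.ndarray,
                   target_dtype=np.float32) -> np.ndarray:
    """Hypothetical stand-in for the conversion step."""
    # (target_size, N) @ (N,) -> (target_size,), then cast to a uniform dtype.
    return np.dot(matrix, cur).astype(target_dtype)


# Toy matrix that simply keeps the first two of three dimensions.
matrix = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
cur = np.array([0.5, -2.0, 7.0], dtype=np.float64)
out = project_vector(cur, matrix)
```

Here a float64 input of length 3 comes out as a float32 vector of length 2, which is the uniform typing the description refers to.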
- medcat.utils.vocab_utils.convert_vocab(vocab, matrix, unigram_table_size=10000000)
Use the transformation matrix to convert the word vectors.
- Parameters:
vocab (Vocab) – The Vocab.
matrix (np.ndarray) – The transformation matrix.
unigram_table_size (int) – The unigram table size. Defaults to 10 000 000.
- Return type:
None
- medcat.utils.vocab_utils.convert_context_vectors(cdb, matrix)
Use the transformation matrix to convert the context vectors within the CDB.
- Parameters:
cdb (CDB) – The Context Database.
matrix (np.ndarray) – The transformation matrix.
- Return type:
None
- medcat.utils.vocab_utils.convert_vocab_vector_size(cdb, vocab, vec_size)
Convert the vocab vector size to a smaller one.
This uses Principal Component Analysis (PCA). We first center all the word vectors in the Vocab, compute their covariance matrix, and find its eigenvalues and eigenvectors. We then select the vec_size eigenvectors with the largest eigenvalues. This produces a transformation matrix of shape (vec_size, N), where N is the current vector length in the vocab.
After that, we perform the transformation: first we transform all the vectors in the Vocab, and then all the context vectors defined within the CDB.
NOTE: This requires the CDB as well, since the per-concept context vectors stored within it are based on the vectors in the vocab and therefore also need to be transformed.
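The two-stage flow (build the matrix from the word vectors, then apply it to both the word vectors and the context vectors) can be illustrated without real MedCAT objects. In this sketch the dictionaries word_vectors and context_vectors, their keys, and the vector sizes are all assumptions standing in for the contents of a Vocab and a CDB.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins: plain dicts instead of real Vocab / CDB objects.
word_vectors = {w: rng.normal(size=6)
                for w in ("heart", "lung", "kidney", "liver", "brain")}
context_vectors = {"C001": {"short": rng.normal(size=6),
                            "long": rng.normal(size=6)}}

vec_size = 2

# Build the (vec_size, N) PCA matrix from the word vectors only.
stacked = np.stack(list(word_vectors.values()))
centered = stacked - stacked.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
top = np.argsort(eigenvalues)[::-1][:vec_size]
matrix = eigenvectors[:, top].T

# Step 1: transform every word vector.
word_vectors = {w: (matrix @ v).astype(np.float32)
                for w, v in word_vectors.items()}

# Step 2: transform every context vector with the same matrix, so that
# both sets of vectors stay in the same reduced space.
context_vectors = {
    cui: {name: (matrix @ v).astype(np.float32) for name, v in ctx.items()}
    for cui, ctx in context_vectors.items()
}
```

Using one matrix for both steps is the point of the NOTE above: context vectors projected with a different matrix, or left at the old size, would no longer be comparable with the word vectors.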