:py:mod:`medcat.utils.vocab_utils`
==================================

.. py:module:: medcat.utils.vocab_utils


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   medcat.utils.vocab_utils.calc_matrix
   medcat.utils.vocab_utils.convert_vec
   medcat.utils.vocab_utils.convert_vocab
   medcat.utils.vocab_utils.convert_context_vectors
   medcat.utils.vocab_utils.convert_vocab_vector_size



Attributes
~~~~~~~~~~

.. autoapisummary::

   medcat.utils.vocab_utils.logger


.. py:data:: logger


.. py:function:: calc_matrix(vocab, target_size)

   Calculate the transformation matrix based on the word vectors in the Vocab.

   Performs Principal Component Analysis (PCA).
   It first centers the word vectors in the Vocab by subtracting their mean.
   It then computes the covariance matrix.
   After that, the eigenvalues and eigenvectors are calculated.
   Finally, the `target_size` eigenvectors corresponding to the largest
   eigenvalues are selected to create the transformation matrix.

   :param vocab: The Vocab.
   :type vocab: Vocab
   :param target_size: The target vector size.
   :type target_size: int

   :Returns: **np.ndarray** -- The transformation matrix.


.. py:function:: convert_vec(cur, matrix, target_dtype = np.float32)

   Helper function to convert the vector.

   This also guarantees uniform typing (np.float32 by default), since in our
   experience some vectors may have a different type beforehand
   (e.g. np.float64).

   :param cur: The current vector.
   :type cur: np.ndarray
   :param matrix: The transformation matrix.
   :type matrix: np.ndarray
   :param target_dtype: The target element data type. Defaults to np.float32.
   :type target_dtype: Type

   :Returns: **np.ndarray** -- The transformed vector.


.. py:function:: convert_vocab(vocab, matrix, unigram_table_size = 10000000)

   Use the transformation matrix to convert the word vectors.

   :param vocab: The Vocab.
   :type vocab: Vocab
   :param matrix: The transformation matrix.
   :type matrix: np.ndarray
   :param unigram_table_size: The unigram table size. Defaults to 10 000 000.
   :type unigram_table_size: int


.. py:function:: convert_context_vectors(cdb, matrix)

   Use the transformation matrix to convert the context vectors within the CDB.

   :param cdb: The Concept Database.
   :type cdb: CDB
   :param matrix: The transformation matrix.
   :type matrix: np.ndarray


.. py:function:: convert_vocab_vector_size(cdb, vocab, vec_size)

   Convert the vocab vector size to a smaller one.

   This uses Principal Component Analysis (PCA).
   We first center all the word vectors (in the Vocab), then compute the
   covariance matrix, then find its eigenvalues and eigenvectors, and finally
   select the top `vec_size` eigenvectors.
   This produces a transformation matrix of shape (vec_size, N), where N
   is the current vector length in the vocab.

   After that, we perform the transformation. First we transform all
   the vectors in the Vocab. Then we transform all the context
   vectors defined within the CDB.

   NOTE: This requires the CDB as well, since the per-concept
   context vectors stored within it are based on the vectors
   in the vocab and thus also need to be transformed.

   :param cdb: The Concept Database.
   :type cdb: CDB
   :param vocab: The Vocab.
   :type vocab: Vocab
   :param vec_size: The target vector size.
   :type vec_size: int
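
The PCA steps described for ``calc_matrix`` and ``convert_vec`` can be illustrated with plain NumPy. The following is a minimal sketch under stated assumptions, not MedCAT's actual implementation: the names ``pca_matrix`` and ``project`` are hypothetical, the input is a plain array of toy vectors rather than a ``Vocab``, and applying the matrix with a single dot product (without re-centering the vector) is an assumption based on the documented signature.

```python
import numpy as np


def pca_matrix(vectors: np.ndarray, target_size: int) -> np.ndarray:
    """Return a (target_size, N) matrix for vectors of shape (num_words, N).

    Hypothetical stand-in for the documented calc_matrix steps.
    """
    # 1. Center the data: subtract the per-dimension mean.
    centered = vectors - vectors.mean(axis=0)
    # 2. Covariance matrix across dimensions, shape (N, N).
    cov = np.cov(centered, rowvar=False)
    # 3. Eigen-decomposition; eigh is meant for symmetric matrices and
    #    returns eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the eigenvectors belonging to the largest eigenvalues.
    top = np.argsort(eigvals)[::-1][:target_size]
    return eigvecs[:, top].T


def project(vec: np.ndarray, matrix: np.ndarray, target_dtype=np.float32) -> np.ndarray:
    """Apply the matrix and cast to a uniform dtype (assumed behaviour)."""
    return np.dot(matrix, vec).astype(target_dtype)


rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(50, 8))  # 50 toy float64 vectors of length 8
matrix = pca_matrix(word_vectors, target_size=3)
reduced = project(word_vectors[0], matrix)
print(matrix.shape, reduced.shape, reduced.dtype)  # (3, 8) (3,) float32
```

The matrix has shape (target_size, N), matching the (vec_size, N) shape mentioned for ``convert_vocab_vector_size``, so the same matrix can be applied to every vector of length N, whether it comes from the vocab or from the context vectors derived from it.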