medcat.utils.model_creator

Module Contents

Functions

create_cdb(concept_csv_file, medcat_config)

Create concept database from csv.

create_vocab(cdb, training_data_list, medcat_config, ...)

Create vocabulary for word embeddings and spell check from list of training documents and CDB.

train_unsupervised(cdb, vocab, config, output_dir, ...)

Perform unsupervised training and save updated CDB.

create_models(config_file)

Create MedCAT CDB and Vocabulary models.

main(config_file)

Attributes

DEFAULT_UNIGRAM_TABLE_SIZE

logger

parser

medcat.utils.model_creator.DEFAULT_UNIGRAM_TABLE_SIZE = 100000000
medcat.utils.model_creator.logger
medcat.utils.model_creator.create_cdb(concept_csv_file, medcat_config)

Create concept database from csv.

Parameters:
  • concept_csv_file (pathlib.Path) – Path to CSV file containing all concepts and synonyms.

  • medcat_config (medcat.config.Config) – MedCAT configuration object.

Returns:

CDB – MedCAT concept database containing list of entities and synonyms, without context embeddings.

Return type:

medcat.cdb.CDB
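
As a rough illustration of preparing the input for create_cdb, the sketch below writes a small concept CSV and hands it over together with a default configuration. The column names cui and name, the sample identifiers, and the helper names write_concept_csv and build_cdb are assumptions made for this example, not part of the documented API; check the CSV layout expected by your MedCAT version.

```python
import csv
from pathlib import Path


def write_concept_csv(path, concepts):
    """Write (cui, name) pairs to a CSV file, one row per name or synonym."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Assumed header; the exact columns expected by create_cdb may differ.
        writer.writerow(["cui", "name"])
        writer.writerows(concepts)
    return path


def build_cdb(csv_path):
    """Hypothetical wrapper: build a CDB from the CSV using a default Config."""
    # Imported lazily so the CSV helper above works without MedCAT installed.
    from medcat.config import Config
    from medcat.utils.model_creator import create_cdb

    return create_cdb(Path(csv_path), Config())
```

Synonyms are expressed here by repeating the same concept identifier on several rows, which is one common way to list "all concepts and synonyms" in a flat file.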

medcat.utils.model_creator.create_vocab(cdb, training_data_list, medcat_config, output_dir, unigram_table_size)

Create vocabulary for word embeddings and spell check from list of training documents and CDB.

Parameters:
  • cdb (medcat.cdb.CDB) – MedCAT concept database containing list of entities and synonyms.

  • training_data_list (list) – List of example documents.

  • medcat_config (medcat.config.Config) – MedCAT configuration object.

  • output_dir (pathlib.Path) – Output directory to which the vocabulary and data.txt (required to create the vocabulary) are written.

  • unigram_table_size (int) – Size of unigram table to be initialized before creating vocabulary.

Returns:

medcat.vocab.Vocab – MedCAT vocabulary created from CDB and training documents.
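
A minimal sketch of assembling training_data_list and calling create_vocab follows. The one-document-per-line file format, the helper names load_documents and build_vocab, and the reduced table size of 1,000,000 are assumptions for illustration only; the module default is DEFAULT_UNIGRAM_TABLE_SIZE (100000000).

```python
from pathlib import Path


def load_documents(text_file):
    """Read one document per non-empty line (an assumed input format)."""
    lines = Path(text_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_vocab(cdb, docs, config, output_dir, table_size=1_000_000):
    """Hypothetical wrapper around create_vocab with an illustrative table size."""
    # Imported lazily so load_documents works without MedCAT installed.
    from medcat.utils.model_creator import create_vocab

    # Arguments are passed positionally in the documented order.
    return create_vocab(cdb, docs, config, Path(output_dir), table_size)
```

A smaller unigram table may be convenient for quick experiments on small corpora, but whether it is appropriate for a real model is a judgment to make against your own data.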

medcat.utils.model_creator.train_unsupervised(cdb, vocab, config, output_dir, training_data_list)

Perform unsupervised training and save updated CDB.

The CDB passed in is updated in place with context embeddings during training.

Parameters:
  • cdb (medcat.cdb.CDB) – MedCAT concept database containing list of entities and synonyms.

  • vocab (medcat.vocab.Vocab) – MedCAT vocabulary created from CDB and training documents.

  • config (medcat.config.Config) – MedCAT configuration object.

  • output_dir (pathlib.Path) – Output directory to which the updated CDB is written.

  • training_data_list (list) – List of example documents.

Returns:

medcat.cdb.CDB – MedCAT concept database containing list of entities and synonyms, as well as context embeddings.
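
The call itself can be sketched as below. The helper names prepare_output_dir and train are hypothetical, and creating the output directory beforehand is a precaution assumed here, since the documentation does not say whether train_unsupervised creates it.

```python
from pathlib import Path


def prepare_output_dir(path):
    """Create the output directory (and any parents) if it does not exist."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    return out


def train(cdb, vocab, config, docs, output_dir):
    """Hypothetical wrapper: run unsupervised training and return the CDB."""
    # Imported lazily so prepare_output_dir works without MedCAT installed.
    from medcat.utils.model_creator import train_unsupervised

    out = prepare_output_dir(output_dir)
    # Arguments follow the documented order: cdb, vocab, config, output_dir, docs.
    return train_unsupervised(cdb, vocab, config, out, docs)
```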

medcat.utils.model_creator.create_models(config_file)

Create MedCAT CDB and Vocabulary models.

Parameters:

config_file (pathlib.Path) – Path to the model creator configuration file, which specifies the input, output and MedCAT configuration.

Returns:

medcat.cat.CAT – CAT instance containing the CDB, Vocab and Config.
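
The signatures on this page suggest a natural ordering of the three building blocks: the vocabulary needs a CDB, and training needs both. The sketch below chains them in that order. It is not the implementation of create_models; build_models and its steps parameter are hypothetical names introduced so the flow can be exercised with stand-in functions.

```python
from pathlib import Path


def build_models(concept_csv, docs, config, output_dir, steps=None):
    """Chain create_cdb, create_vocab and train_unsupervised in dependency order.

    `steps` is any object exposing those three functions plus
    DEFAULT_UNIGRAM_TABLE_SIZE; by default the real module is used.
    """
    if steps is None:
        # Imported lazily so the flow can be tested without MedCAT installed.
        from medcat.utils import model_creator as steps

    out = Path(output_dir)
    cdb = steps.create_cdb(Path(concept_csv), config)
    vocab = steps.create_vocab(
        cdb, docs, config, out, steps.DEFAULT_UNIGRAM_TABLE_SIZE
    )
    cdb = steps.train_unsupervised(cdb, vocab, config, out, docs)
    return cdb, vocab
```

Passing the steps explicitly keeps the ordering easy to check in isolation; in normal use the argument would simply be omitted.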

medcat.utils.model_creator.main(config_file)
medcat.utils.model_creator.parser