medcat.utils.relation_extraction.utils

Module Contents

Functions

split_list_train_test_by_class(data[, test_size, shuffle])

Splits the relation instances into train and test sets by class.

load_bin_file(file_name[, path])

save_bin_file(file_name, data[, path])

save_state(model, optimizer, scheduler[, epoch, ...])

Used by RelCAT.save() and RelCAT.train()

load_state(model, optimizer, scheduler[, path, ...])

Used by RelCAT.load() and RelCAT.train()

save_results(data[, model_name, path, file_prefix])

load_results(path[, model_name, file_prefix])

put_blanks(relation_data[, blanking_threshold])

Puts blanks randomly in the relation. Used for pre-training.

create_tokenizer_pretrain(tokenizer, tokenizer_path)

This method simply adds special tokens that we encounter

tokenize(relations_dataset, tokenizer[, mask_probability])

medcat.utils.relation_extraction.utils.split_list_train_test_by_class(data, test_size=0.2, shuffle=True)
Parameters:
  • data (List) – the “output_relations” list of relation instances; see create_base_relations_from_doc/csv for the data columns.

  • test_size (float) – fraction of the data to place in the test set. Defaults to 0.2.

  • shuffle (bool) – shuffle data randomly. Defaults to True.

Returns:

Tuple[List, List] – train and test datasets

Return type:

Tuple[List, List]
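The idea of a per-class split can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name split_by_class, the label_of callback, the seed argument, and the toy (text, label) row layout are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple


def split_by_class(
    data: List[Any],
    label_of: Callable[[Any], Any],
    test_size: float = 0.2,
    shuffle: bool = True,
    seed: int = 0,
) -> Tuple[List[Any], List[Any]]:
    """Hypothetical helper: each class gives about `test_size` of its items to the test set."""
    # Group the items by their class label.
    by_class: Dict[Any, List[Any]] = defaultdict(list)
    for item in data:
        by_class[label_of(item)].append(item)

    rng = random.Random(seed)  # seeded so the example is repeatable
    train: List[Any] = []
    test: List[Any] = []
    for items in by_class.values():
        if shuffle:
            rng.shuffle(items)
        n_test = int(len(items) * test_size)  # rounds down per class
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy rows of the form (text, label); the real column layout is not shown here.
rows = [(f"sent {i}", "treats") for i in range(10)]
rows += [(f"sent {i}", "causes") for i in range(10, 15)]
train, test = split_by_class(rows, label_of=lambda row: row[1])
```

Splitting class by class keeps rare relation types represented in both sets, which a single global shuffle does not guarantee.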

medcat.utils.relation_extraction.utils.load_bin_file(file_name, path='./')
Return type:

Any

medcat.utils.relation_extraction.utils.save_bin_file(file_name, data, path='./')
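The entries above do not state the on-disk format. As a rough sketch of a save/load pair with the same argument order, assuming Python's pickle as the serializer (an assumption, not something this reference confirms) and using the hypothetical names save_pickle and load_pickle:

```python
import os
import pickle
import tempfile
from typing import Any


def save_pickle(file_name: str, data: Any, path: str = "./") -> None:
    """Hypothetical helper: serialize `data` to path/file_name with pickle."""
    with open(os.path.join(path, file_name), "wb") as f:
        pickle.dump(data, f)


def load_pickle(file_name: str, path: str = "./") -> Any:
    """Hypothetical helper: read back an object written by save_pickle."""
    with open(os.path.join(path, file_name), "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    save_pickle("relations.bin", {"labels": ["treats", "causes"]}, path=tmp)
    restored = load_pickle("relations.bin", path=tmp)
```
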
medcat.utils.relation_extraction.utils.save_state(model, optimizer, scheduler, epoch=1, best_f1=0.0, path='./', model_name='BERT', task='train', is_checkpoint=False, final_export=False)
Used by RelCAT.save() and RelCAT.train()

Saves the RelCAT model state. When checkpointing, multiple files are created (best_f1, loss, etc.). To export the model after training, set final_export=True and leave is_checkpoint=False.

Parameters:
  • model (BertModel_RelationExtraction) – model

  • optimizer (torch.optim.Adam) – optimizer.

  • scheduler (torch.optim.lr_scheduler.MultiStepLR) – scheduler.

  • epoch (int) – Defaults to 1.

  • best_f1 (float) – Defaults to 0.0.

  • path (str) – Defaults to “./”.

  • model_name (str) – Defaults to “BERT”. Used for checkpointing only.

  • task (str) – Defaults to “train”. Used for checkpointing only.

  • is_checkpoint (bool) – Defaults to False.

  • final_export (bool) – Defaults to False. If True, is_checkpoint must be False. Exports model.state_dict() into “model.dat”.

Return type:

None

medcat.utils.relation_extraction.utils.load_state(model, optimizer, scheduler, path='./', model_name='BERT', file_prefix='train', load_best=False, device=torch.device('cpu'), config=ConfigRelCAT())

Used by RelCAT.load() and RelCAT.train()

Parameters:
  • model (BertModel_RelationExtraction) – the model; it must be initialized via BertModel_RelationExtraction(…) before calling this method.

  • optimizer (torch.optim.Adam) – optimizer

  • scheduler (torch.optim.lr_scheduler.MultiStepLR) – scheduler

  • path (str, optional) – Defaults to “./”.

  • model_name (str, optional) – Defaults to “BERT”.

  • file_prefix (str, optional) – Defaults to “train”.

  • load_best (bool, optional) – Defaults to False.

  • device (torch.device, optional) – Defaults to torch.device(“cpu”).

  • config (ConfigRelCAT) – Defaults to ConfigRelCAT().

Returns:

Tuple[int, int] – the last epoch and the F1 score.

Return type:

Tuple[int, int]

medcat.utils.relation_extraction.utils.save_results(data, model_name='BERT', path='./', file_prefix='train')
Parameters:
  • model_name (str) – Defaults to “BERT”.

  • path (str) – Defaults to “./”.

  • file_prefix (str) – Defaults to “train”.

medcat.utils.relation_extraction.utils.load_results(path, model_name='BERT', file_prefix='train')
Parameters:
  • model_name (str) – Defaults to “BERT”.

  • file_prefix (str) – Defaults to “train”.

Return type:

Tuple[List, List, List]

medcat.utils.relation_extraction.utils.put_blanks(relation_data, blanking_threshold=0.5)
Puts blanks randomly in the relation. Used for pre-training.

Parameters:
  • relation_data (List) – tuple containing token (sentence_token_span, ent1, ent2).

  • blanking_threshold (float) – % threshold to blank token ids. Defaults to 0.5.

Returns:

List – data

Return type:

List
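Random blanking against a threshold can be illustrated with a small stand-alone sketch. It is not the library's code: the function name blank_entities, the “[BLANK]” placeholder, the toy (tokens, ent1, ent2) layout with string tokens, and the rule that an entity is blanked when a uniform draw is at or above the threshold are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple

# Toy layout: (sentence tokens, ent1, ent2). The real tuple holds token spans.
Relation = Tuple[List[str], str, str]

BLANK = "[BLANK]"  # hypothetical placeholder token


def blank_entities(
    relation: Relation,
    blanking_threshold: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Relation:
    """Hypothetical helper: independently blank each entity with a random draw."""
    rng = rng or random.Random()
    tokens, ent1, ent2 = relation
    tokens = list(tokens)  # copy so the caller's list is not modified

    # Assumed rule: blank an entity when its draw in [0, 1) reaches the threshold.
    if rng.random() >= blanking_threshold:
        tokens = [BLANK if tok == ent1 else tok for tok in tokens]
        ent1 = BLANK
    if rng.random() >= blanking_threshold:
        tokens = [BLANK if tok == ent2 else tok for tok in tokens]
        ent2 = BLANK
    return tokens, ent1, ent2


relation = (["aspirin", "treats", "headache"], "aspirin", "headache")
always_blank = blank_entities(relation, blanking_threshold=0.0)
never_blank = blank_entities(relation, blanking_threshold=1.0)
```

Under this assumed rule a threshold of 0.0 blanks both entities and a threshold of 1.0 leaves the relation unchanged; values in between blank each entity independently.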

medcat.utils.relation_extraction.utils.create_tokenizer_pretrain(tokenizer, tokenizer_path)

This method simply adds special tokens that we encounter

Parameters:
  • tokenizer (TokenizerWrapperBERT) – BERT tokenizer.

  • tokenizer_path (str) – path where tokenizer is to be saved.

medcat.utils.relation_extraction.utils.tokenize(relations_dataset, tokenizer, mask_probability=0.5)
Return type:

Tuple