medcat.rel_cat
Module Contents
Classes
BalancedBatchSampler | Base class for all Samplers.
RelCAT | The RelCAT class used for training 'Relation-Annotation' models, i.e., annotation of relations between clinical concepts.
- class medcat.rel_cat.BalancedBatchSampler(dataset, classes, batch_size, max_samples, max_minority)
Bases: torch.utils.data.Sampler
Base class for all Samplers.
Every Sampler subclass has to provide an __iter__() method, providing a way to iterate over indices or lists of indices (batches) of dataset elements, and may provide a __len__() method that returns the length of the returned iterators.
- Parameters:
data_source (Dataset) – This argument is not used and will be removed in 2.2.0. You may still have custom implementation that utilizes it.
Example
>>> # xdoctest: +SKIP
>>> class AccedingSequenceLengthSampler(Sampler[int]):
>>>     def __init__(self, data: List[str]) -> None:
>>>         self.data = data
>>>
>>>     def __len__(self) -> int:
>>>         return len(self.data)
>>>
>>>     def __iter__(self) -> Iterator[int]:
>>>         sizes = torch.tensor([len(x) for x in self.data])
>>>         yield from torch.argsort(sizes).tolist()
>>>
>>> class AccedingSequenceLengthBatchSampler(Sampler[List[int]]):
>>>     def __init__(self, data: List[str], batch_size: int) -> None:
>>>         self.data = data
>>>         self.batch_size = batch_size
>>>
>>>     def __len__(self) -> int:
>>>         return (len(self.data) + self.batch_size - 1) // self.batch_size
>>>
>>>     def __iter__(self) -> Iterator[List[int]]:
>>>         sizes = torch.tensor([len(x) for x in self.data])
>>>         for batch in torch.chunk(torch.argsort(sizes), len(self)):
>>>             yield batch.tolist()
Note
The __len__() method isn’t strictly required by DataLoader, but is expected in any calculation involving the length of a DataLoader.
- __init__(dataset, classes, batch_size, max_samples, max_minority)
- __len__()
- __iter__()
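The Sampler protocol described above (an __iter__() that yields batches of indices, plus an optional __len__()) can be illustrated with a minimal class-balancing sketch. This is not MedCAT's implementation of BalancedBatchSampler: the class name SimpleBalancedBatchSampler, its parameters (labels, batch_size, seed), and the round-robin oversampling strategy are all assumptions made for illustration, and the sketch does not model max_samples or max_minority. It is written in plain Python so it runs without torch; DataLoader accepts any iterable of index lists as batch_sampler.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence


class SimpleBalancedBatchSampler:
    """Hypothetical sketch: yields index batches with classes drawn round-robin.

    Indices are sampled with replacement, so minority classes are oversampled.
    """

    def __init__(self, labels: Sequence[int], batch_size: int, seed: int = 0) -> None:
        self.batch_size = batch_size
        self.n_samples = len(labels)
        self.rng = random.Random(seed)
        # Group dataset indices by their class label.
        self.by_class: Dict[int, List[int]] = defaultdict(list)
        for idx, label in enumerate(labels):
            self.by_class[label].append(idx)

    def __len__(self) -> int:
        # Number of batches per epoch (ceiling division).
        return (self.n_samples + self.batch_size - 1) // self.batch_size

    def __iter__(self) -> Iterator[List[int]]:
        classes = sorted(self.by_class)
        for _ in range(len(self)):
            batch = []
            for pos in range(self.batch_size):
                # Cycle through the classes so each batch is as even as possible.
                label = classes[pos % len(classes)]
                batch.append(self.rng.choice(self.by_class[label]))
            yield batch


labels = [0] * 8 + [1] * 2  # imbalanced toy labels
sampler = SimpleBalancedBatchSampler(labels, batch_size=4)
batches = list(sampler)
```

With torch installed, such an object could be passed as DataLoader(dataset, batch_sampler=sampler).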
- class medcat.rel_cat.RelCAT(cdb, config=ConfigRelCAT(), task='train', init_model=False)
Bases: medcat.pipeline.pipe_runner.PipeRunner
The RelCAT class used for training ‘Relation-Annotation’ models, i.e., annotation of relations between clinical concepts.
- Parameters:
cdb (CDB) – the concept database, used when creating relation datasets.
tokenizer (TokenizerWrapperBERT) – The Huggingface tokenizer instance. This can be a pre-trained tokenizer instance from a BERT-style model. For now, only BERT models are supported.
config (ConfigRelCAT) – the configuration for RelCAT. Param descriptions available in ConfigRelCAT class docs.
task (str, optional) – the task this model is supposed to handle. Defaults to “train”.
init_model (bool, optional) – whether to load the default model. Defaults to False.
- name = 'rel_cat'
- log
- __init__(cdb, config=ConfigRelCAT(), task='train', init_model=False)
- Parameters:
cdb (medcat.cdb.CDB) –
config (medcat.config_rel_cat.ConfigRelCAT) –
- save(save_path='./')
- Parameters:
save_path (str) –
- Return type:
None
- __call__(doc)
- Parameters:
doc (spacy.tokens.Doc) –
- Return type:
spacy.tokens.Doc
- _create_test_train_datasets(data, split_sets=False)
- Parameters:
data (Dict) –
split_sets (bool) –
- train(export_data_path='', train_csv_path='', test_csv_path='', checkpoint_path='./')
- Parameters:
export_data_path (str) –
train_csv_path (str) –
test_csv_path (str) –
checkpoint_path (str) –
- evaluate_(output_logits, labels, ignore_idx)
- evaluate_results(data_loader, pad_id)
- pipe(stream, *args, **kwargs)
- Parameters:
stream (Iterable[spacy.tokens.Doc]) –
- Return type:
Iterator[spacy.tokens.Doc]
- predict_text_with_anns(text, annotations)
Creates a spaCy doc from the text and annotation input, then predicts relations using self.__call__.
- Parameters:
text (str) – text
annotations (Dict) –
entities from an NER step of your choosing, in the following format (the “cui” key is optional):
[
    {
        "cui": "202099003",
        "value": "discoid lateral meniscus",
        "start": 294,
        "end": 318
    },
    {
        "cui": "202099003",
        "value": "Discoid lateral meniscus",
        "start": 1905,
        "end": 1929
    }
]
- Returns:
Doc – spacy doc with the relations.
- Return type:
spacy.tokens.Doc
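The annotation format above can be built and sanity-checked with a short sketch. The helpers make_annotation and check_annotations are hypothetical (they are not part of MedCAT), and the sketch assumes that “start” and “end” are character offsets with an exclusive end, which matches the documented example (318 - 294 equals the length of “discoid lateral meniscus”). The final call to predict_text_with_anns is left as a comment because it needs a loaded RelCAT instance.

```python
from typing import Dict, List, Optional


def make_annotation(text: str, value: str, cui: Optional[str] = None,
                    start_from: int = 0) -> Dict:
    """Hypothetical helper: locate `value` in `text` and build one annotation."""
    start = text.index(value, start_from)  # raises ValueError if value is absent
    ann: Dict = {"value": value, "start": start, "end": start + len(value)}
    if cui is not None:
        ann["cui"] = cui  # "cui" is optional in the documented format
    return ann


def check_annotations(text: str, annotations: List[Dict]) -> None:
    """Hypothetical check: every span must match its stated value."""
    for ann in annotations:
        span = text[ann["start"]:ann["end"]]
        if span != ann["value"]:
            raise ValueError(f"offsets {ann['start']}-{ann['end']} give {span!r}, "
                             f"expected {ann['value']!r}")


text = "MRI shows a discoid lateral meniscus with a tear of the posterior horn."
annotations = [
    make_annotation(text, "discoid lateral meniscus", cui="202099003"),
    make_annotation(text, "tear"),  # no CUI supplied
]
check_annotations(text, annotations)

# With a loaded RelCAT instance (here called rel_cat, an assumed name):
# doc = rel_cat.predict_text_with_anns(text=text, annotations=annotations)
```

Validating offsets before prediction catches mismatches early, for example when the text was normalised after the NER step produced its spans.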