medcat.datasets.transformers_ner

Module Contents

Classes

MedCATAnnotationsConfig

BuilderConfig for MedCATNER.

TransformersDatasetNER

MedCATNER: Output of MedCATtrainer

Attributes

_CITATION

_DESCRIPTION

medcat.datasets.transformers_ner._CITATION = Multiline-String
"""@misc{kraljevic2020multidomain,
      title={Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit},
      author={Zeljko Kraljevic and Thomas Searle and Anthony Shek and Lukasz Roguski and Kawsar Noor and Daniel Bean and Aurelie Mascio and Leilei Zhu and Amos A Folarin and Angus Roberts and Rebecca Bendayan and Mark P Richardson and Robert Stewart and Anoop D Shah and Wai Keong Wong and Zina Ibrahim and James T Teo and Richard JB Dobson},
      year={2020},
      eprint={2010.01165},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
"""
medcat.datasets.transformers_ner._DESCRIPTION = 'Takes as input a json export from medcattrainer.'
class medcat.datasets.transformers_ner.MedCATAnnotationsConfig

Bases: datasets.BuilderConfig

BuilderConfig for MedCATNER.

Parameters:

**kwargs – keyword arguments forwarded to super.

class medcat.datasets.transformers_ner.TransformersDatasetNER(cache_dir=None, dataset_name=None, config_name=None, hash=None, base_path=None, info=None, features=None, token=None, use_auth_token='deprecated', repo_id=None, data_files=None, data_dir=None, storage_options=None, writer_batch_size=None, name='deprecated', **config_kwargs)

Bases: datasets.GeneratorBasedBuilder

MedCATNER: Output of MedCATtrainer

Parameters:
  • cache_dir (Optional[str]) –

  • dataset_name (Optional[str]) –

  • config_name (Optional[str]) –

  • hash (Optional[str]) –

  • base_path (Optional[str]) –

  • info (Optional[datasets.info.DatasetInfo]) –

  • features (Optional[datasets.features.Features]) –

  • token (Optional[Union[bool, str]]) –

  • repo_id (Optional[str]) –

  • data_files (Optional[Union[str, list, dict, datasets.data_files.DataFilesDict]]) –

  • data_dir (Optional[str]) –

  • storage_options (Optional[dict]) –

  • writer_batch_size (Optional[int]) –

BUILDER_CONFIGS
_info()

Construct the DatasetInfo object. See DatasetInfo for details.

Warning: This function is only called once and the result is cached for all following .info() calls.

Returns:

info – (DatasetInfo) The dataset information
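
The caching behaviour described in the warning can be illustrated with a small stand-in class. This is a sketch only, not the `datasets` implementation: `SketchBuilder`, `info_calls`, and the use of `functools.cached_property` are assumptions chosen for illustration.

```python
from functools import cached_property


class SketchBuilder:
    """Stand-in for a dataset builder; not the `datasets` implementation."""

    def __init__(self):
        self.info_calls = 0  # counts how often _info() actually runs

    def _info(self):
        self.info_calls += 1
        return {"description": "Takes as input a json export from medcattrainer."}

    @cached_property
    def info(self):
        # First access runs _info(); later accesses reuse the stored result.
        return self._info()


builder = SketchBuilder()
first = builder.info
second = builder.info
```

The practical consequence is that anything computed inside `_info()` should not be expected to change between accesses on the same builder instance.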

_split_generators(dl_manager)

Returns SplitGenerators.

_generate_examples(filepaths)

Default function generating examples for each SplitGenerator.

This function preprocesses the examples from the raw data into the preprocessed dataset files. It is called once for each SplitGenerator defined in _split_generators. The examples yielded here will be written to disk.

Parameters:

**kwargs (additional keyword arguments) – Arguments forwarded from the SplitGenerator.gen_kwargs

Yields:

key – str or int, a unique deterministic example identification key.
  • Unique: An error will be raised if two examples are yielded with the same key.
  • Deterministic: When generating the dataset twice, the same example should have the same key.

Good keys can be the image id, or the line number if examples are extracted from a text file. The key will be hashed and sorted to shuffle examples deterministically, so that generating the dataset multiple times keeps examples in the same order.

example – dict<str feature_name, feature_value>, a feature dictionary ready to be encoded and written to disk. The example will be encoded with self.info.features.encode_example({…}).
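
The key and example contract above can be sketched as a plain generator. This is a minimal illustration, not the method's actual body: the export layout (`projects`, `documents`, `annotations`), the output field names (`ent_starts`, `ent_ends`, `ent_cuis`), the placeholder concept id, and the function name `generate_examples` are all assumptions, and the input is an in-memory JSON string rather than a file path.

```python
import json

# Hypothetical export layout, assumed for illustration only.
SAMPLE_EXPORT = json.dumps({
    "projects": [
        {"documents": [
            {"id": 10, "text": "Patient has diabetes.",
             "annotations": [{"start": 12, "end": 20, "cui": "PLACEHOLDER_CUI"}]},
            {"id": 11, "text": "No fever.", "annotations": []},
        ]}
    ]
})


def generate_examples(raw_exports):
    """Yield (key, example) pairs from a list of JSON export strings."""
    counter = 0
    for raw in raw_exports:
        for project in json.loads(raw)["projects"]:
            for doc in project["documents"]:
                counter += 1
                example = {
                    "id": doc["id"],
                    "text": doc["text"],
                    "ent_starts": [a["start"] for a in doc["annotations"]],
                    "ent_ends": [a["end"] for a in doc["annotations"]],
                    "ent_cuis": [a["cui"] for a in doc["annotations"]],
                }
                # A running counter is unique within one run and identical
                # across runs, provided the inputs arrive in the same order.
                yield str(counter), example


pairs = list(generate_examples([SAMPLE_EXPORT]))
```

A running counter satisfies both key requirements only while the input order is stable; a document identifier is an alternative when it is guaranteed to be unique across all exports.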