medcat.utils.preprocess_umls

Module Contents

Classes

UMLS

Pre-process UMLS release files.

Attributes

_DEFAULT_COLUMNS

_DEFAULT_SEM_TYPE_COLUMNS

_DEFAULT_MRHIER_COLUMNS

medcat_csv_mapper

umls

medcat.utils.preprocess_umls._DEFAULT_COLUMNS: list = ['CUI', 'LAT', 'TS', 'LUI', 'STT', 'SUI', 'ISPREF', 'AUI', 'SAUI', 'SCUI', 'SDUI', 'SAB', 'TTY',...
medcat.utils.preprocess_umls._DEFAULT_SEM_TYPE_COLUMNS: list = ['CUI', 'TUI', 'STN', 'STY', 'ATUI', 'CVF']
medcat.utils.preprocess_umls._DEFAULT_MRHIER_COLUMNS: list = ['CUI', 'AUI', 'CXN', 'PAUI', 'SAB', 'RELA', 'PTR', 'HCD', 'CVF']
medcat.utils.preprocess_umls.medcat_csv_mapper: dict
class medcat.utils.preprocess_umls.UMLS(main_file_name, sem_types_file, allow_languages=['ENG'], sep='|')

Pre-process UMLS release files.

Parameters:
  • main_file_name (str) – Path to the main file (probably MRCONSO.RRF).

  • sem_types_file (str) – Path to the semantic types file (probably MRSTY.RRF).

  • allow_languages (list) – Languages to keep; rows in any other language are filtered out. Defaults to just English (['ENG']).

  • sep (str) – The separator used within the files. Defaults to '|'.
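To make the roles of allow_languages and sep concrete, here is a minimal standard-library sketch of reading pipe-separated rows and keeping only the allowed languages. It is not the class's actual implementation: the read_rrf helper, the reduced COLUMNS list, and the sample rows are all hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical, reduced column set for illustration only;
# real MRCONSO.RRF rows carry more fields than this.
COLUMNS = ["CUI", "LAT", "SAB", "SCUI", "STR"]


def read_rrf(handle, columns, allow_languages=("ENG",), sep="|"):
    """Yield one dict per row whose LAT value is in allow_languages."""
    reader = csv.reader(handle, delimiter=sep, quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Rows end with a trailing separator, which produces one extra
        # empty field; zip() drops it because columns is shorter.
        row = dict(zip(columns, fields))
        if row.get("LAT") in allow_languages:
            yield row


# Made-up sample data in the same pipe-separated layout.
sample = io.StringIO(
    "C0000001|ENG|SRC_A|111|heart attack|\n"
    "C0000001|GER|SRC_B|222|Herzinfarkt|\n"
)
rows = list(read_rrf(sample, COLUMNS))
print([row["STR"] for row in rows])
```

Only the row whose LAT is in the allowed list survives, which is the behaviour the allow_languages parameter name suggests.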

__init__(main_file_name, sem_types_file, allow_languages=['ENG'], sep='|')
Parameters:
  • main_file_name (str) –

  • sem_types_file (str) –

  • allow_languages (list) –

  • sep (str) –

to_concept_df()

Create a concept DataFrame. The default column names are expected.

Returns:

pd.DataFrame – The resulting DataFrame

Return type:

pandas.DataFrame

map_umls2snomed()

Map to SNOMED-CT.

Currently, this uses the SCUI column. At the time of writing, SCUI is equal to the CODE column, but this may not be the case in the future.

Returns:

pd.DataFrame – DataFrame that contains the SCUI (source CUI) as well as the UMLS CUI for each applicable concept

Return type:

pandas.DataFrame

map_umls2icd10()

Map to ICD-10.

Available SABs that contain 'ICD10':
  • CCSR_ICD10CM (Clinical Classifications Software Refined for ICD-10-CM)

  • CCSR_ICD10PCS (Clinical Classifications Software Refined for ICD-10-PCS)

  • DMDICD10 (ICD-10 German)

  • ICD10AE (ICD-10, American English Equivalents)

  • ICD10AMAE (ICD-10, Australian Modification, Americanized English Equivalents)

  • ICD10AM (ICD-10, Australian Modification)

  • ICD10DUT (ICD10, Dutch Translation)

  • ICD10PCS (ICD-10 Procedure Coding System)

  • ICD10 (International Classification of Diseases and Related Health Problems, Tenth Revision)

  • ICPC2ICD10DUT (ICPC2-ICD10 Thesaurus, Dutch Translation)

  • ICPC2ICD10ENG (ICPC2-ICD10 Thesaurus)

  • MTHICPC2ICD10AE (ICPC2E-ICD10 Thesaurus, American English Equivalents)

Currently, only the 'ICD10' source is used, although others may be relevant as well.

To map to one of the other sources listed above, use the map_umls2source method instead.

Returns:

pd.DataFrame – DataFrame that has the ICD-10 codes

Return type:

pandas.DataFrame

map_umls2source(sources)

Allows mapping to an arbitrary source or set of sources.

Parameters:

sources (Union[str, List[str]]) – The source or sources to include.

Returns:

pd.DataFrame – DataFrame that has the target source codes

Return type:

pandas.DataFrame
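Because sources accepts either a single string or a list of strings, a mapping of this kind has to normalise its argument before filtering on the SAB column. The sketch below illustrates that idea on plain dictionaries. It is an assumption about the general approach, not the method's actual code, and select_sources as well as the sample rows and code values are hypothetical.

```python
from typing import Dict, Iterable, List, Union


def select_sources(
    rows: Iterable[Dict[str, str]],
    sources: Union[str, List[str]],
) -> List[Dict[str, str]]:
    """Keep only the rows whose SAB is one of the requested sources."""
    # Accept a single source name as well as a list of names.
    wanted = {sources} if isinstance(sources, str) else set(sources)
    return [row for row in rows if row["SAB"] in wanted]


# Made-up rows; the identifiers are placeholders, not real UMLS content.
rows = [
    {"CUI": "C0000001", "SAB": "ICD10", "SCUI": "X1"},
    {"CUI": "C0000002", "SAB": "ICD10AM", "SCUI": "X2"},
    {"CUI": "C0000003", "SAB": "SRC_X", "SCUI": "X3"},
]

icd10_only = select_sources(rows, "ICD10")
both = select_sources(rows, ["ICD10", "ICD10AM"])
print(len(icd10_only), len(both))
```

Wrapping a bare string in a set avoids the common pitfall of iterating over the string's characters when a list was expected.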

get_pt2ch()

Generates a parent to children dict.

It goes through all the < # TODO

The resulting dictionary maps a CUI to a list of CUIs that consider that CUI as their parent.

Note: this expects the MRHIER.RRF file to exist in the same folder as the MRCONSO.RRF file.

Raises:

ValueError – If the MRHIER.RRF file wasn’t found

Returns:

dict – A dictionary mapping each parent CUI to a list of its child CUIs.

Return type:

dict
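The shape of the returned dictionary, and the requirement that MRHIER.RRF sits next to the main file, can be sketched as follows. This is one possible approach under stated assumptions, not the method's actual implementation: it assumes each hierarchy row carries a CUI and a parent atom identifier (PAUI), and that a separate AUI-to-CUI lookup is available. The names hier_path_for, build_pt2ch, and aui2cui, and all sample identifiers, are hypothetical.

```python
import os
from collections import defaultdict


def hier_path_for(main_file_name):
    """Return the MRHIER.RRF path next to the main file, or raise ValueError."""
    path = os.path.join(os.path.dirname(main_file_name), "MRHIER.RRF")
    if not os.path.isfile(path):
        raise ValueError("MRHIER.RRF not found next to " + main_file_name)
    return path


def build_pt2ch(hier_rows, aui2cui):
    """Map each parent CUI to the sorted list of CUIs that name it as parent."""
    children = defaultdict(set)
    for row in hier_rows:
        parent_cui = aui2cui.get(row["PAUI"])
        # Skip rows whose parent atom cannot be resolved to a CUI,
        # and rows where the parent resolves to the concept itself.
        if parent_cui is None or parent_cui == row["CUI"]:
            continue
        children[parent_cui].add(row["CUI"])
    return {parent: sorted(kids) for parent, kids in children.items()}


# Made-up identifiers for illustration.
aui2cui = {"A1": "C1", "A2": "C2", "A3": "C3"}
hier_rows = [
    {"CUI": "C1", "AUI": "A1", "PAUI": ""},
    {"CUI": "C2", "AUI": "A2", "PAUI": "A1"},
    {"CUI": "C3", "AUI": "A3", "PAUI": "A1"},
]
pt2ch = build_pt2ch(hier_rows, aui2cui)
print(pt2ch)
```

Collecting children in a set before sorting keeps the result free of duplicates when several atoms of the same concept share a parent.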

medcat.utils.preprocess_umls.umls