medcat.utils.preprocess_snomed

Module Contents

Classes

Snomed

Pre-process SNOMED CT release files.

Functions

parse_file(filename[, first_row_header, columns])

get_all_children(sctid, pt2ch)

Retrieves all the children of a given SNOMED CT ID (SCTID) from a given parent-to-child mapping (pt2ch) via the "IS A" relationship.

get_direct_refset_mapping(in_dict)

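Takes the output of Snomed.map_snomed2icd10 or Snomed.map_snomed2opcs4, removes the metadata, and maps each SNOMED CUI to the prioritised list of target ontology CUIs.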

medcat.utils.preprocess_snomed.parse_file(filename, first_row_header=True, columns=None)
medcat.utils.preprocess_snomed.get_all_children(sctid, pt2ch)

Retrieves all the children of a given SNOMED CT ID (SCTID) from a given parent-to-child mapping (pt2ch) via the “IS A” relationship. pt2ch can be found in a MedCAT model in the additional info via the call: cat.cdb.addl_info['pt2ch']

Parameters:
  • sctid (int) – The SCTID whose children need to be retrieved.

  • pt2ch (dict) – A dictionary containing the parent-to-child relationships in the form {parent_sctid: [list of child sctids]}.

Returns:

list – A list of unique SCTIDs that are children of the given SCTID.
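The traversal described above can be sketched with an explicit stack. This is an illustrative sketch, not MedCAT's implementation: the helper name `collect_descendants` and the toy identifiers are hypothetical, and since this page does not say whether the library's result includes the starting SCTID, the sketch leaves it out.

```python
from typing import Dict, List


def collect_descendants(sctid: int, pt2ch: Dict[int, List[int]]) -> List[int]:
    """Return every SCTID reachable from `sctid` through parent-to-child links."""
    seen = set()
    stack = list(pt2ch.get(sctid, []))  # start from the direct children
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a concept can have several parents; visit it only once
        seen.add(current)
        stack.extend(pt2ch.get(current, []))
    return sorted(seen)


# Toy hierarchy with made-up identifiers: 1 -> {2, 3}, 2 -> {4}, 3 -> {4}
toy_pt2ch = {1: [2, 3], 2: [4], 3: [4]}
print(collect_descendants(1, toy_pt2ch))  # [2, 3, 4]
```

Tracking visited concepts keeps the result unique even when a concept is reachable through more than one parent, as concept 4 is here.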

medcat.utils.preprocess_snomed.get_direct_refset_mapping(in_dict)

This function takes the output of Snomed.map_snomed2icd10 or Snomed.map_snomed2opcs4, removes the metadata, and maps each SNOMED CUI to the prioritised list of target ontology CUIs.

The input dict is expected to be in the following format:

  • Keys are SNOMED CT CUIs.

  • Values are lists of dictionaries, where each list item (at least):

      • Has a key ‘code’ that specifies the target ontology CUI

      • Has a key ‘mapPriority’ that specifies the priority

Parameters:

in_dict (dict) – The input dict.

Returns:

dict – The map from each SNOMED CUI to the prioritised list of target ontology CUIs.

Return type:

dict
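The transformation described above might look roughly like the following. This is a sketch under stated assumptions rather than the library's code: `direct_mapping_sketch` is a hypothetical name, the identifiers and codes are made up, and the ordering rule (numeric 'mapPriority', lowest value first) is an assumption that this page does not confirm.

```python
from typing import Dict, List


def direct_mapping_sketch(in_dict: Dict[str, List[dict]]) -> Dict[str, List[str]]:
    """Drop the metadata and keep target codes ordered by 'mapPriority'."""
    out = {}
    for cui, entries in in_dict.items():
        # Assumption: priorities are numeric (possibly stored as strings)
        # and a lower value means the code comes first.
        ordered = sorted(entries, key=lambda entry: int(entry["mapPriority"]))
        out[cui] = [entry["code"] for entry in ordered]
    return out


# Made-up identifiers and codes, for illustration only.
example = {
    "1001": [
        {"code": "B20", "mapPriority": "2", "mapGroup": "1"},
        {"code": "A10", "mapPriority": "1", "mapGroup": "1"},
    ]
}
print(direct_mapping_sketch(example))  # {'1001': ['A10', 'B20']}
```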

class medcat.utils.preprocess_snomed.Snomed(data_path, uk_ext=False, uk_drug_ext=False, au_ext=False)

Pre-process SNOMED CT release files.

This class is used to create a SNOMED CT concept DataFrame ready for MedCAT CDB creation.

Parameters:

au_ext (bool) –

data_path

Path to the unzipped SNOMED CT folder.

Type:

str

release

Release of SNOMED CT folder.

Type:

str

uk_ext

Specifies whether the version is a SNOMED UK extension released after 2021. Defaults to False.

Type:

bool, optional

uk_drug_ext

Specifies whether the version is a SNOMED UK drug extension. Defaults to False.

Type:

bool, optional

au_ext

Specifies whether the version is an AU release. Defaults to False.

Type:

bool, optional

__init__(data_path, uk_ext=False, uk_drug_ext=False, au_ext=False)
Parameters:

au_ext (bool) – Specifies whether the version is an AU release. Defaults to False.

to_concept_df()

Create a SNOMED CT concept DataFrame.

Creates a SNOMED CT concept DataFrame ready for MedCAT CDB creation. Checks whether the version is a UK extension release and sets the file names for the concept and description snapshots accordingly. The uk_drug_ext variable additionally handles the divergent release format of the UK Drug Extension >v2021.

Returns:

pandas.DataFrame – SNOMED CT concept DataFrame.
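A minimal usage sketch based only on the signatures listed on this page. The wrapper name `build_concept_df` and the example folder path are hypothetical; actually calling it requires MedCAT to be installed and an unzipped SNOMED CT release on disk.

```python
def build_concept_df(data_path: str, uk_ext: bool = False):
    """Build the concept DataFrame for the release stored under `data_path`."""
    # Imported lazily so that defining this helper does not require MedCAT.
    from medcat.utils.preprocess_snomed import Snomed

    snomed = Snomed(data_path, uk_ext=uk_ext)
    return snomed.to_concept_df()


# Example call (hypothetical path, left commented out):
# df = build_concept_df("/path/to/unzipped/SnomedCT_release")
```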

list_all_relationships()

List all SNOMED CT relationships.

SNOMED CT provides a rich set of inter-relationships between concepts.

Returns:

list – List of all SNOMED CT relationships.

relationship2json(relationshipcode, output_jsonfile)

Convert a single relationship map structure to a JSON file.

Parameters:
  • relationshipcode (str) – A single SCTID or unique concept identifier of the relationship type.

  • output_jsonfile (str) – Name of JSON file output.

Returns:

file – JSON file of relationship mapping.

map_snomed2icd10()

This function maps SNOMED CT concepts to ICD-10 codes using the refset mappings provided in the SNOMED CT release package.

Returns:

dict – A dictionary containing the SNOMED CT to ICD-10 mappings including metadata.

map_snomed2opcs4()

This function maps SNOMED CT concepts to OPCS-4 codes using the refset mappings provided in the SNOMED CT release package.

It calls the internal function _map_snomed2refset() to get the DataFrame containing the OPCS-4 mappings, then converts that DataFrame to a dictionary using the internal function _refset_df2dict().

Raises:

AttributeError – If OPCS-4 mappings aren’t available.

Returns:

dict – A dictionary containing the SNOMED CT to OPCS-4 mappings including metadata.

Return type:

dict

_check_path_and_release()

This function checks the path and release of the SNOMED CT data provided. It looks for the “Snapshot” folder within the data path, and if it’s not found, it looks for any folder containing the name “SnomedCT”. It then stores the path and release in separate lists. If no valid paths are found, it raises a FileNotFoundError.

Returns:

tuple – A tuple containing two lists: the first is a list of the paths where the data is located, and the second is a list of the releases of the data.

Raises:

FileNotFoundError – If the path to the SNOMED CT directory is incorrect.
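The directory search described above could be sketched as follows. This is not the library's code: `find_release_dirs` is a hypothetical name, the sketch returns only the paths (how the release string is derived is not described on this page), and it assumes that a matching "SnomedCT" folder must itself contain a "Snapshot" folder.

```python
import os
import tempfile
from typing import List


def find_release_dirs(data_path: str) -> List[str]:
    """Return the directories under `data_path` that hold a 'Snapshot' folder."""
    if os.path.isdir(os.path.join(data_path, "Snapshot")):
        return [data_path]
    found = []
    for name in sorted(os.listdir(data_path)):
        candidate = os.path.join(data_path, name)
        # Assumption: only folders named like a release and containing 'Snapshot' count.
        if "SnomedCT" in name and os.path.isdir(os.path.join(candidate, "Snapshot")):
            found.append(candidate)
    if not found:
        raise FileNotFoundError(f"No SNOMED CT Snapshot folder found under {data_path!r}")
    return found


# Demonstration on a throwaway directory with a made-up release folder name.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "SnomedCT_Example", "Snapshot"))
    print([os.path.basename(path) for path in find_release_dirs(root)])  # ['SnomedCT_Example']
```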

_refset_df2dict(refset_df)

This function takes a SNOMED refset DataFrame as input and converts it into a dictionary. The DataFrame should contain the columns 'referencedComponentId', 'mapTarget', 'mapGroup', 'mapPriority', 'mapRule', 'mapAdvice'.

Parameters:

refset_df (pd.DataFrame) – DataFrame containing the refset data

Returns:

dict – A mapping with SNOMED CT codes as keys and lists of refset metadata dictionaries as values.

Return type:

dict
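To illustrate the shape of this conversion without requiring pandas, the sketch below groups plain row dictionaries instead of DataFrame rows. The name `refset_rows_to_dict` is hypothetical and the row values are made up. Storing 'mapTarget' under the key 'code' is an assumption, inferred from the input format documented for get_direct_refset_mapping.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

METADATA_COLUMNS = ("mapGroup", "mapPriority", "mapRule", "mapAdvice")


def refset_rows_to_dict(rows: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group refset rows by 'referencedComponentId'."""
    grouped = defaultdict(list)
    for row in rows:
        # Assumption: the target code is exposed under the key 'code'.
        entry = {"code": row["mapTarget"]}
        entry.update({column: row[column] for column in METADATA_COLUMNS})
        grouped[row["referencedComponentId"]].append(entry)
    return dict(grouped)


# Made-up rows, for illustration only.
rows = [
    {"referencedComponentId": "1001", "mapTarget": "A10", "mapGroup": "1",
     "mapPriority": "1", "mapRule": "TRUE", "mapAdvice": "ALWAYS A10"},
    {"referencedComponentId": "1001", "mapTarget": "B20", "mapGroup": "1",
     "mapPriority": "2", "mapRule": "OTHERWISE TRUE", "mapAdvice": "ALWAYS B20"},
]
result = refset_rows_to_dict(rows)
print([entry["code"] for entry in result["1001"]])  # ['A10', 'B20']
```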

_map_snomed2refset()

Maps SNOMED CT concepts to refset mappings provided in the SNOMED CT release package.

This function maps SNOMED CT concepts using the refset mappings in the Snapshot/Refset/Map directory. The refset mappings can be either ICD-10 codes in international releases or, if available, OPCS-4 codes in the SNOMED UK extension.

Returns:
  • pd.DataFrame – DataFrame containing SNOMED CT to refset mappings and metadata; or

  • tuple – Tuple of DataFrames containing SNOMED CT to refset mappings and metadata (ICD-10, OPCS-4), if uk_ext is True.