medcat.utils.regression.category_separation

Module Contents

Classes

CategoryDescription

A descriptor for a category.

Category

The category base class.

AllPartsCategory

Represents a category which only fits a regression case if it matches all parts of the category description.

AnyPartOfCategory

Represents a category which fits a regression case that matches any part of its category description.

SeparationObserver

Keeps track of which case is separated into which category/categories.

StrategyType

Describes the types of strategies one can employ for separation.

SeparatorStrategy

The strategy according to which the separation takes place.

SeparateToFirst

Separator strategy that separates each case to its first match.

SeparateToAll

A separator strategy that allows separation to all matching categories.

RegressionCheckerSeparator

Regression checker separator.

Functions

get_random_str([length])

get_strategy(strategy_type)

Get the separator strategy from the strategy type.

get_separator(categories, strategy_type[, ...])

Get the regression checker separator for the list of categories and the specified strategy.

get_description(cat_description)

Get the description from its dict representation.

get_category(cat_name, cat_description)

Get the category of the specified name from the dict.

read_categories(yaml_file)

Read categories from a YAML file.

separate_categories(category_yaml, strategy_type, ...)

Separate categories based on simple input.

Attributes

logger

medcat.utils.regression.category_separation.logger
class medcat.utils.regression.category_separation.CategoryDescription

Bases: pydantic.BaseModel

A descriptor for a category.

Parameters:
  • target_cuis (Set[str]) – The set of target CUIs

  • target_names (Set[str]) – The set of target names

  • target_tuis (Set[str]) – The set of target type IDs

  • allow_everything (bool) – Matches any CUI/NAME/TUI. Defaults to False

target_cuis: Set[str]
target_names: Set[str]
target_tuis: Set[str]
allow_everything: bool = False
_get_required_filter(case, target_filter)
Return type:

Optional[medcat.utils.regression.checking.TypedFilter]

_has_specific_from(case, targets, target_filter)
has_cui_from(case)

Check if the description has a CUI from the specified regression case.

Parameters:

case (RegressionCase) – The regression case to check

Returns:

bool – True if the description has a CUI from the regression case

Return type:

bool

has_name_from(case)

Check if the description has a name from the specified regression case.

Parameters:

case (RegressionCase) – The regression case to check

Returns:

bool – True if the description has a name from the regression case

Return type:

bool

has_tui_from(case)

Check if the description has a target type ID (TUI) from the specified regression case.

Parameters:

case (RegressionCase) – The regression case to check

Returns:

bool – True if the description has a target type ID (TUI) from the regression case

Return type:

bool

__hash__()
Return type:

int

__eq__(other)
Parameters:

other (Any) –

Return type:

bool

classmethod anything_goes()
Return type:

CategoryDescription

class medcat.utils.regression.category_separation.Category(name)

Bases: abc.ABC

The category base class.

A category defines which regression cases fit in it.

Parameters:

name (str) – The name of the category

__init__(name)
Parameters:

name (str) –

Return type:

None

abstract fits(case)

Check if a particular regression case fits in this category.

Parameters:

case (RegressionCase) – The regression case.

Returns:

bool – Whether the case is in this category.

Return type:

bool

class medcat.utils.regression.category_separation.AllPartsCategory(name, descr)

Bases: Category

Represents a category which only fits a regression case if it matches all parts of the category description.

That is, in order for a regression case to match, it would need to match a CUI, a name and a TUI specified in the category description.

Parameters:
  • name (str) – The name of the category

  • descr (CategoryDescription) – The description of the category

__init__(name, descr)
Return type:

None

fits(case)

Check if a particular regression case fits in this category.

Parameters:

case (RegressionCase) – The regression case.

Returns:

bool – Whether the case is in this category.

Return type:

bool

__eq__(__o)

Return self==value.

Parameters:

__o (object) –

Return type:

bool

__hash__()

Return hash(self).

Return type:

int

__str__()

Return str(self).

Return type:

str

__repr__()

Return repr(self).

Return type:

str

class medcat.utils.regression.category_separation.AnyPartOfCategory(name, descr)

Bases: Category

Represents a category which fits a regression case that matches any part of its category description.

That is, any case that matches either a CUI, a name or a TUI within the category description, will fit.

Parameters:
  • name (str) – The name of the category

  • descr (CategoryDescription) – The description of the category
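The difference between "any part" and "all parts" matching can be sketched with plain sets. This is only an illustration under simplifying assumptions, not MedCAT's implementation: `FakeCase`, `FakeDescription`, `fits_all` and `fits_any` are hypothetical stand-ins, the identifiers are placeholder values, and a real RegressionCase is matched through its filters rather than through bare sets.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FakeCase:
    # Hypothetical stand-in for a RegressionCase: only the values it targets.
    cuis: Set[str] = field(default_factory=set)
    names: Set[str] = field(default_factory=set)
    tuis: Set[str] = field(default_factory=set)


@dataclass
class FakeDescription:
    # Mirrors the three target sets documented for CategoryDescription.
    target_cuis: Set[str]
    target_names: Set[str]
    target_tuis: Set[str]

    def parts_matched(self, case: FakeCase) -> List[bool]:
        # One boolean per part, in the order CUI, name, TUI.
        return [
            bool(self.target_cuis & case.cuis),
            bool(self.target_names & case.names),
            bool(self.target_tuis & case.tuis),
        ]


def fits_all(descr: FakeDescription, case: FakeCase) -> bool:
    # "All parts" semantics: CUI, name and TUI must each match.
    return all(descr.parts_matched(case))


def fits_any(descr: FakeDescription, case: FakeCase) -> bool:
    # "Any part" semantics: a single matching part is enough.
    return any(descr.parts_matched(case))


# Placeholder identifiers, chosen only for illustration.
descr = FakeDescription({"C001"}, {"some name"}, {"T01"})
cui_only = FakeCase(cuis={"C001"})
full = FakeCase(cuis={"C001"}, names={"some name"}, tuis={"T01"})
```

Under these assumptions `cui_only` fits the "any part" check but not the "all parts" check, while `full` fits both.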

__init__(name, descr)
Return type:

None

fits(case)

Check if a particular regression case fits in this category.

Parameters:

case (RegressionCase) – The regression case.

Returns:

bool – Whether the case is in this category.

Return type:

bool

__eq__(__o)

Return self==value.

Parameters:

__o (object) –

Return type:

bool

__hash__()

Return hash(self).

Return type:

int

__str__()

Return str(self).

Return type:

str

__repr__()

Return repr(self).

Return type:

str

class medcat.utils.regression.category_separation.SeparationObserver

Keeps track of which case is separated into which category/categories.

It also keeps track of which cases have been observed as separated and into which category.

__init__()
Return type:

None

observe(case, category)

Observe the specified regression case in the specified category.

Parameters:
  • case (RegressionCase) – The regression case to observe

  • category (Category) – The category to link the case to

Return type:

None

has_observed(case)

Check if the case has already been observed.

Parameters:

case (RegressionCase) – The case to check

Returns:

bool – True if the case had been observed, False otherwise

Return type:

bool

reset()

Allows resetting the state of the observer.

Return type:

None
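The observe / has_observed / reset contract described above can be sketched as follows. `SketchObserver` and its attribute names are hypothetical; the sketch assumes cases and categories are hashable and says nothing about how MedCAT stores its state internally.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Set


class SketchObserver:
    """Hypothetical stand-in; not MedCAT's SeparationObserver."""

    def __init__(self) -> None:
        self.reset()

    def observe(self, case: Hashable, category: str) -> None:
        # Remember both the case-to-category link and that the case was seen.
        self.cases_by_category[category].add(case)
        self.seen.add(case)

    def has_observed(self, case: Hashable) -> bool:
        return case in self.seen

    def reset(self) -> None:
        # Drop all recorded state so the observer can be reused.
        self.cases_by_category: DefaultDict[str, Set[Hashable]] = defaultdict(set)
        self.seen: Set[Hashable] = set()


observer = SketchObserver()
observer.observe("case-1", "cat-a")
seen_before_reset = observer.has_observed("case-1")
observer.reset()
seen_after_reset = observer.has_observed("case-1")
```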

class medcat.utils.regression.category_separation.StrategyType

Bases: enum.Enum

Describes the types of strategies one can employ for separation.

FIRST
ALL
class medcat.utils.regression.category_separation.SeparatorStrategy(observer)

Bases: abc.ABC

The strategy according to which the separation takes place.

The separation strategy relies on the mutable separation observer instance.

Parameters:

observer (SeparationObserver) –

__init__(observer)
Parameters:

observer (SeparationObserver) –

Return type:

None

abstract can_separate(case)

Check if the separator strategy can separate the specified regression case

Parameters:

case (RegressionCase) – The regression case to check

Returns:

bool – True if the strategy allows separation, False otherwise

Return type:

bool

abstract separate(case, category)

Separate the regression case

Parameters:
  • case (RegressionCase) – The regression case to separate

  • category (Category) – The category to separate to

Return type:

None

reset()

Allows resetting the state of the separator strategy.

Return type:

None

class medcat.utils.regression.category_separation.SeparateToFirst(observer)

Bases: SeparatorStrategy

Separator strategy that separates each case to its first match.

That is to say, any subsequently matching categories are ignored. This means that no regression case gets duplicated. It also means that the total number of cases across all categories will be the same as the initial number of cases.

Parameters:

observer (SeparationObserver) –

can_separate(case)

Check if the separator strategy can separate the specified regression case

Parameters:

case (RegressionCase) – The regression case to check

Returns:

bool – True if the strategy allows separation, False otherwise

Return type:

bool

separate(case, category)

Separate the regression case

Parameters:
  • case (RegressionCase) – The regression case to separate

  • category (Category) – The category to separate to

Return type:

None

class medcat.utils.regression.category_separation.SeparateToAll(observer)

Bases: SeparatorStrategy

A separator strategy that allows separation to all matching categories.

This means that when one regression case fits into multiple categories, it will be saved in each such category. That is, some cases may be duplicated.

Parameters:

observer (SeparationObserver) –
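The contrast between separating to the first match and separating to all matches can be sketched with a small stand-alone function. This is an illustration only: `separate_sketch`, the `first_only` flag and the `matches` mapping are hypothetical, and the sketch assumes that the "first" behaviour amounts to refusing any case that has already been observed.

```python
from typing import Dict, List, Set


def separate_sketch(
    cases: List[str],
    matches: Dict[str, List[str]],
    first_only: bool,
) -> Dict[str, List[str]]:
    """Assign cases to categories.

    `matches` maps each case to the categories it fits, in category order.
    """
    result: Dict[str, List[str]] = {}
    observed: Set[str] = set()
    for case in cases:
        for category in matches.get(case, []):
            # "First" refuses an already observed case; "all" never refuses.
            can_separate = not (first_only and case in observed)
            if can_separate:
                result.setdefault(category, []).append(case)
                observed.add(case)
    return result


matches = {"case-1": ["cat-a", "cat-b"], "case-2": ["cat-b"]}
first = separate_sketch(["case-1", "case-2"], matches, first_only=True)
every = separate_sketch(["case-1", "case-2"], matches, first_only=False)
```

With `first_only=True` each case lands in exactly one category, so the totals add up to the number of input cases; with `first_only=False` the case that fits two categories appears in both.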

can_separate(case)

Check if the separator strategy can separate the specified regression case

Parameters:

case (RegressionCase) – The regression case to check

Returns:

bool – True if the strategy allows separation, False otherwise

Return type:

bool

separate(case, category)

Separate the regression case

Parameters:
  • case (RegressionCase) – The regression case to separate

  • category (Category) – The category to separate to

Return type:

None

medcat.utils.regression.category_separation.get_random_str(length=8)
class medcat.utils.regression.category_separation.RegressionCheckerSeparator

Bases: pydantic.BaseModel

Regression checker separator.

It is able to separate cases in a regression checker into multiple different sets of regression cases based on the given list of categories and the specified strategy.

Parameters:
  • categories (List[Category]) – The categories to separate into

  • strategy (SeparatorStrategy) – The strategy for separation

  • overflow_category (bool) – Whether to use an overflow category for cases that don’t fit in other categories. Defaults to False.

class Config
arbitrary_types_allowed = True
categories: List[Category]
strategy: SeparatorStrategy
overflow_category: bool = False
_attempt_category_for(cat, case)
find_categories_for(case)

Find the categories for a specific regression case

Parameters:

case (RegressionCase) – The regression case to check

Raises:

ValueError – If no category found.

separate(checker)

Separate the specified regression checker into multiple sets of cases.

Each case may be associated with no category, one category, or multiple categories. The specifics depend on overflow_category and strategy.

Parameters:

checker (RegressionChecker) – The input regression checker

Return type:

None

save(prefix, metadata, overwrite=False)

Save the results of the separation in different files.

This needs to be called after the separate method has been called.

Each separated category (that has any cases registered to it) will be saved in a separate file with the specified prefix and the category name.

Parameters:
  • prefix (str) – The prefix for the saved file(s)

  • metadata (MetaData) – The metadata for the regression suite

  • overwrite (bool) – Whether to overwrite file(s) if/when needed. Defaults to False.

Raises:
  • ValueError – If the method is called before separation or no separation was done

  • ValueError – If a file already exists and is not allowed to be overwritten

Return type:

None

medcat.utils.regression.category_separation.get_strategy(strategy_type)

Get the separator strategy from the strategy type.

Parameters:

strategy_type (StrategyType) – The type of strategy

Raises:

ValueError – If an unknown strategy is provided

Returns:

SeparatorStrategy – The resulting separator strategy

Return type:

SeparatorStrategy

medcat.utils.regression.category_separation.get_separator(categories, strategy_type, overflow_category=False)

Get the regression checker separator for the list of categories and the specified strategy.

Parameters:
  • categories (List[Category]) – The list of categories to include

  • strategy_type (StrategyType) – The strategy for separation

  • overflow_category (bool) – Whether to use an overflow category for items that don’t go in other categories. Defaults to False.

Returns:

RegressionCheckerSeparator – The resulting separator

Return type:

RegressionCheckerSeparator

medcat.utils.regression.category_separation.get_description(cat_description)

Get the description from its dict representation.

The dict is expected to have the following keys: ‘cuis’, ‘tuis’, and ‘names’. Each one should have a list of strings as its value.

Parameters:

cat_description (dict) – The dict representation

Returns:

CategoryDescription – The resulting category description

Return type:

CategoryDescription

medcat.utils.regression.category_separation.get_category(cat_name, cat_description)

Get the category of the specified name from the dict.

The dict is expected to be in the form:

    type: <category type>  # either any or all
    cuis: []  # list of CUIs in category
    names: []  # list of names in category
    tuis: []  # list of type IDs in category

Parameters:
  • cat_name (str) – The name of the category

  • cat_description (dict) – The dict describing the category

Raises:

ValueError – If an unknown type is specified.

Returns:

Category – The resulting category

Return type:

Category
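A dict in the documented form, and the kind of validation it implies, might look like the sketch below. `parse_category_sketch` is a hypothetical stand-in written for illustration; it does not call MedCAT and only assumes what is stated above: a `type` of either any or all, list-valued `cuis`, `names` and `tuis`, and a ValueError for an unknown type.

```python
from typing import Any, Dict, Set, Tuple


def parse_category_sketch(
    cat_name: str,
    cat_description: Dict[str, Any],
) -> Tuple[str, str, Dict[str, Set[str]]]:
    """Hypothetical stand-in that validates a dict in the documented form."""
    cat_type = cat_description["type"]
    if cat_type not in ("any", "all"):
        # The documented behaviour for an unknown type is a ValueError.
        raise ValueError(f"Unknown category type: {cat_type!r}")
    # Lists from the dict become sets, as in the description's target sets.
    targets = {key: set(cat_description[key]) for key in ("cuis", "names", "tuis")}
    return cat_name, cat_type, targets


# Placeholder identifiers, chosen only for illustration.
example = {"type": "any", "cuis": ["C001", "C002"], "names": [], "tuis": ["T01"]}
name, cat_type, targets = parse_category_sketch("my-category", example)
```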

medcat.utils.regression.category_separation.read_categories(yaml_file)

Read categories from a YAML file.

The YAML is assumed to be in the format:

    categories:
      category-name:
        type: <category type>
        cuis: [<target cui 1>, <target cui 2>, …]
        names: [<target name 1>, <target name 2>, …]
        tuis: [<target tui 1>, <target tui 2>, …]
      other-category-name:
        …  # and so on

Parameters:

yaml_file (str) – The yaml file location

Returns:

List[Category] – The resulting categories

Return type:

List[Category]

medcat.utils.regression.category_separation.separate_categories(category_yaml, strategy_type, regression_suite_yaml, target_file_prefix, overwrite=False, overflow_category=False)

Separate categories based on simple input.

The categories are read from the provided file and the regression suite from its corresponding YAML. The separated regression suites are saved in accordance with the defined prefix.

Parameters:
  • category_yaml (str) – The name of the YAML file describing the categories

  • strategy_type (StrategyType) – The strategy for separation

  • regression_suite_yaml (str) – The regression suite YAML

  • target_file_prefix (str) – The target file prefix

  • overwrite (bool) – Whether to overwrite file(s) if/when needed. Defaults to False.

  • overflow_category (bool) – Whether to use an overflow category for items that don’t go in other categories. Defaults to False.

Return type:

None
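A call following the signature above might look like the sketch below. The file names and the prefix are hypothetical placeholders, and the import is kept inside the function so that the snippet can be loaded without MedCAT installed; actually calling `run_separation` requires MedCAT and the two YAML files to exist.

```python
def run_separation() -> None:
    # Imported lazily so that defining this function does not require MedCAT.
    from medcat.utils.regression.category_separation import (
        StrategyType,
        separate_categories,
    )

    separate_categories(
        category_yaml="categories.yml",          # hypothetical path
        strategy_type=StrategyType.FIRST,        # each case goes to its first match
        regression_suite_yaml="regression.yml",  # hypothetical path
        target_file_prefix="separated_",         # hypothetical prefix
        overwrite=False,
        overflow_category=True,
    )
```

Passing `StrategyType.ALL` instead would allow a case to be written to every category it fits.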