4. Cross-Validation with Instance-dependent Costs#

Cost-sensitive models, samplers, and metrics depend on instance-dependent costs. In a simple train-validation-test split, using instance-dependent costs is straightforward: they can be passed directly to the fit, fit_resample, or score methods. When performing cross-validation, however, the costs change for each fold, which requires special handling; this is done through metadata routing.

As of scikit-learn 1.4.0, some cross-validation methods support metadata routing. This allows instance-dependent costs to be passed to estimators, samplers, and scorers, with the costs split accordingly for each fold. For the list of cross-validation methods that support metadata routing, refer to the scikit-learn User Guide. Please note that metadata routing is an experimental feature and needs to be enabled manually.

4.1. Enabling Metadata Routing#

To enable metadata routing in scikit-learn, use the following code snippet:

from sklearn import set_config

set_config(enable_metadata_routing=True)

Alternatively, you can use the context manager to enable metadata routing only within a limited scope, rather than globally:

from sklearn import config_context

with config_context(enable_metadata_routing=True):
    # code that uses metadata routing
    ...

4.2. What is Metadata Routing#

A full explanation of how metadata routing works can be found in sklearn’s User Guide.

In brief, for the purposes of using the tools in Empulse: methods on an estimator, such as fit, fit_resample, and score, can request metadata. Metadata routers like cross_val_score or GridSearchCV pass metadata on to these metadata requesters. A requester asks for metadata by calling the corresponding set_***_request method.

For example, a cost-sensitive model can request the fp_cost metadata for its fit method like this:

from empulse.models import CSLogitClassifier

cslogit = CSLogitClassifier().set_fit_request(fp_cost=True)

A sampler can request the fp_cost metadata for its fit_resample method like this:

from empulse.samplers import CostSensitiveSampler

sampler = CostSensitiveSampler().set_fit_resample_request(fp_cost=True)

Note

When using a sampler inside a pipeline, use an imbalanced-learn Pipeline; otherwise, the metadata will not be passed to the sampler.

A scorer can request the fp_cost metadata for its score method like this:

from empulse.metrics import expected_savings_score
from sklearn.metrics import make_scorer

scorer = make_scorer(
    expected_savings_score,
    greater_is_better=True,
    response_method='predict_proba',
).set_score_request(fp_cost=True)

Then, when using cross_val_score or GridSearchCV, you can pass the metadata to the method like this:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification()
fp_cost = np.random.rand(X.shape[0])  # instance-dependent costs

cross_val_score(cslogit, X, y, scoring=scorer, params={"fp_cost": fp_cost})

Now the fp_cost metadata is passed to the fit method of the CSLogitClassifier and to the score method of the expected_savings_score scorer, split per fold.

4.3. Grid Search Example#

In this example we want to train a cost-sensitive logistic regression model. We will find the best hyperparameters using a grid search optimizing the expected cost loss. The model and scorer are set up to request the instance-dependent costs.

import numpy as np
from sklearn import set_config
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from empulse.models import CSLogitClassifier
from empulse.metrics import expected_cost_loss

set_config(enable_metadata_routing=True)

X, y = make_classification()
fp_cost = np.random.rand(X.shape[0])
fn_cost = np.random.rand(X.shape[0])

scorer = make_scorer(
    expected_cost_loss,
    greater_is_better=False,
    response_method='predict_proba',
).set_score_request(fp_cost=True, fn_cost=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', CSLogitClassifier().set_fit_request(fp_cost=True, fn_cost=True))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {'model__C': [0.1, 1]}
grid_search = GridSearchCV(pipe, param_grid, cv=cv, scoring=scorer)
grid_search.fit(X, y, fp_cost=fp_cost, fn_cost=fn_cost)