CostSensitiveSampler

class empulse.samplers.CostSensitiveSampler(method='rejection sampling', *, oversampling_norm=0.1, percentile_threshold=0.975, random_state=None, fp_cost=0.0, fn_cost=0.0)

Sampler which performs cost-proportionate resampling.

This method adjusts the sampling probability of each sample based on the cost of misclassification. This is done either by rejection sampling [1] or oversampling [2].

Read more in the User Guide.
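The rejection-sampling idea can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the toy data, the cost values, and the rule that positives carry fn_cost while negatives carry fp_cost are assumptions made for illustration. Each sample is kept with a probability proportional to its misclassification cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical values): 1 = positive, 0 = negative.
y = np.array([0, 0, 0, 1, 1, 0, 1, 0])
X = np.arange(len(y) * 2).reshape(len(y), 2)

# Assumption: a negative can only become a false positive and a positive
# can only become a false negative, so each sample has one relevant cost.
fp_cost, fn_cost = 1.0, 5.0
cost = np.where(y == 1, fn_cost, fp_cost)

# Keep each sample with probability cost / max(cost).
accept_prob = cost / cost.max()
keep = rng.random(len(y)) <= accept_prob

X_re, y_re = X[keep], y[keep]
```

In this sketch the most expensive samples are always kept, while cheaper ones are dropped at random, so the resampled set is never larger than the original.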

Parameters:
method: {‘rejection sampling’, ‘oversampling’}, default=‘rejection sampling’

Method used for cost-proportionate sampling, either ‘rejection sampling’ or ‘oversampling’.

oversampling_norm: float, default=0.1

Oversampling norm for the cost. The smaller the oversampling_norm, the more samples are generated.
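The effect of oversampling_norm can be sketched as follows. The replication rule used here, ceil(weight / oversampling_norm) copies per sample, is an assumption modeled on costcla-style oversampling and is not confirmed by this page; replication_counts and the weights are hypothetical.

```python
import numpy as np

def replication_counts(weights, oversampling_norm):
    """Copies per sample under the assumed rule ceil(weight / norm)."""
    return np.ceil(np.asarray(weights) / oversampling_norm).astype(int)

# Hypothetical normalized costs in (0, 1].
weights = np.array([0.2, 0.2, 1.0, 0.5])

counts_coarse = replication_counts(weights, 0.5)
counts_fine = replication_counts(weights, 0.1)

# Row indices of the oversampled dataset for the finer norm.
indices = np.repeat(np.arange(len(weights)), counts_fine)
```

Under this assumption a smaller norm yields more copies of every sample, and costly samples are replicated more often than cheap ones.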

percentile_threshold: float, default=0.975

Outlier adjustment for the cost. Costs are normalized, and cost values above the percentile_threshold quantile (the 97.5th percentile with the default of 0.975) are set to 1.
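A small sketch of this outlier adjustment, assuming the normalization divides each cost by the cost at the chosen percentile and clips the result at 1. The cost values and the threshold of 0.75 are illustrative only.

```python
import numpy as np

# Hypothetical costs with one extreme outlier.
cost = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])
percentile_threshold = 0.75  # illustrative; the class default is 0.975

# Cost value at the chosen percentile (np.percentile expects 0-100).
threshold_cost = np.percentile(cost, percentile_threshold * 100)

# Normalize by that value and cap everything above it at 1.
weights = np.minimum(cost / threshold_cost, 1.0)
```

Normalizing by the maximum instead would push every weight except the outlier's close to zero, which is the situation the threshold is meant to avoid.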

random_state: int or numpy.random.RandomState, optional

Random number generator seed for reproducibility.

fp_cost: float or array-like, shape=(n_samples,), default=0.0

Cost of false positives. If a float, all false positives have the same cost. If array-like, it gives the cost of each false positive classification. This value is overridden if fp_cost is passed to the fit_resample method.

Note

It is not recommended to pass instance-dependent costs to the __init__ method. Instead, pass them to the fit_resample method.

fn_cost: float or array-like, shape=(n_samples,), default=0.0

Cost of false negatives. If a float, all false negatives have the same cost. If array-like, it gives the cost of each false negative classification. This value is overridden if fn_cost is passed to the fit_resample method.

Note

It is not recommended to pass instance-dependent costs to the __init__ method. Instead, pass them to the fit_resample method.
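The difference between a scalar cost and an instance-dependent cost array can be sketched with plain NumPy. The helper per_sample_cost is hypothetical, and assigning fn_cost to positives and fp_cost to negatives is an assumption about how the two costs combine into one cost per sample.

```python
import numpy as np

def per_sample_cost(y, fp_cost, fn_cost):
    """Combine scalar or array-like costs into one cost per sample."""
    y = np.asarray(y)
    n = len(y)
    # broadcast_to accepts a scalar or an array of length n and raises
    # ValueError for any other length.
    fp = np.broadcast_to(np.asarray(fp_cost, dtype=float), (n,))
    fn = np.broadcast_to(np.asarray(fn_cost, dtype=float), (n,))
    return np.where(y == 1, fn, fp)

y = np.array([0, 1, 0, 1])

# Class-dependent costs: one value per error type.
uniform = per_sample_cost(y, fp_cost=1.0, fn_cost=5.0)

# Instance-dependent false positive costs, scalar false negative cost.
mixed = per_sample_cost(y, fp_cost=[1.0, 2.0, 3.0, 4.0], fn_cost=5.0)
```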

Attributes:
sample_indices_: numpy.ndarray

Indices of the samples that were selected.
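One possible use of such indices is to resample companion arrays that were not passed to the sampler. The sketch below assumes the indices are row positions in the original data; sample_indices is a hypothetical stand-in for a fitted sampler's sample_indices_, and segment is an invented auxiliary column.

```python
import numpy as np

X = np.arange(8).reshape(4, 2)
segment = np.array(["a", "b", "c", "d"])  # hypothetical per-row metadata

# Stand-in for sampler.sample_indices_: row 1 dropped, row 2 duplicated.
sample_indices = np.array([0, 2, 2, 3])

# Apply the same selection to the data and to the companion array.
X_re = X[sample_indices]
segment_re = segment[sample_indices]
```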

Notes

Code modified from costcla.sampling.cost_sampling.

References

[1]

B. Zadrozny, J. Langford, N. Abe, “Cost-sensitive learning by cost-proportionate example weighting”, in Proceedings of the Third IEEE International Conference on Data Mining, 435-442, 2003.

[2]

C. Elkan, “The foundations of Cost-Sensitive Learning”, in Seventeenth International Joint Conference on Artificial Intelligence, 973-978, 2001.

Examples

import numpy as np
from empulse.samplers import CostSensitiveSampler
from sklearn.datasets import make_classification

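# Toy binary classification data
X, y = make_classification(random_state=42)
# Instance-dependent costs: each false positive costs 10, each false negative 1
fp_cost = np.ones_like(y) * 10
fn_cost = np.ones_like(y)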

sampler = CostSensitiveSampler(method='oversampling', random_state=42)
X_re, y_re = sampler.fit_resample(X, y, fp_cost=fp_cost, fn_cost=fn_cost)

fit(X, y, **params)

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:
X: {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Data array.

y: array-like of shape (n_samples,)

Target array.

Returns:
self: object

Return the instance itself.

fit_resample(X, y, *, fp_cost=Parameter.UNCHANGED, fn_cost=Parameter.UNCHANGED)

Resample the dataset.

Parameters:
X: array-like of shape (n_samples, n_features)
y: array-like of shape (n_samples,)
fp_cost: float or array-like, shape=(n_samples,), default=$UNCHANGED$

Cost of false positives. If float, then all false positives have the same cost. If array-like, then it is the cost of each false positive classification.

fn_cost: float or array-like, shape=(n_samples,), default=$UNCHANGED$

Cost of false negatives. If float, then all false negatives have the same cost. If array-like, then it is the cost of each false negative classification.

Returns:
X_resampled: ndarray of shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled: ndarray of shape (n_samples_new,)

The corresponding labels of X_resampled.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:
input_features: array-like of str or None, default=None

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
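The generated fallback names described above follow a simple pattern, sketched here in plain Python. The helper default_feature_names is hypothetical and only imitates the documented naming scheme; it is not part of the library.

```python
def default_feature_names(n_features_in):
    """Build the fallback names x0, x1, ..., x(n_features_in - 1)."""
    return [f"x{i}" for i in range(n_features_in)]

names = default_feature_names(3)
```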

Returns:
feature_names_out: ndarray of str objects

Same as input features.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing: MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params: dict

Parameter names mapped to their values.

set_fit_resample_request(*, fn_cost='$UNCHANGED$', fp_cost='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit_resample method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit_resample if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit_resample.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
fn_cost: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for fn_cost parameter in fit_resample.

fp_cost: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for fp_cost parameter in fit_resample.

Returns:
self: object

The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
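The <component>__<parameter> naming convention can be illustrated with a small string-handling sketch. The helper split_param_name and the step name "sampler" are hypothetical; the sketch only shows how such a key separates into a component and a parameter, not how the estimator applies the value.

```python
def split_param_name(name):
    """Split 'component__parameter' at the first double underscore."""
    component, delimiter, parameter = name.partition("__")
    if not delimiter:
        # No double underscore: the parameter belongs to the object itself.
        return None, name
    return component, parameter

nested = split_param_name("sampler__oversampling_norm")
simple = split_param_name("oversampling_norm")
```

Splitting at the first delimiter leaves any deeper nesting in the remainder, so a key such as a__b__c would be passed on to component a as b__c.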

Parameters:
**params: dict

Estimator parameters.

Returns:
self: estimator instance

Estimator instance.