BiasResampler#

class empulse.samplers.BiasResampler(*, strategy='statistical parity', transform_feature=None, random_state=None)[source]#

Sampler which resamples instances to remove bias against a subgroup.

Read more in the User Guide.

Parameters:
strategy{'statistical parity', 'demographic parity'} or Callable, default='statistical parity'

Determines how the group weights are computed. Group weights determine how much to over- or undersample each combination of target and sensitive feature. For example, a weight of 2 for the pair (y_true == 1, sensitive_feature == 0) means that the resampled dataset should contain twice as many instances with y_true == 1 and sensitive_feature == 0 as the original dataset.

  • 'statistical parity' or 'demographic parity': the probability of a positive prediction is equal across the subgroups of the sensitive feature.

  • Callable: a function that computes the group weights from the target and the sensitive feature. It accepts two arguments, y_true and sensitive_feature, and returns the group weights as a 2x2 matrix in which the rows represent the target variable and the columns represent the sensitive feature. The element at position (i, j) is the weight for the pair (y_true == i, sensitive_feature == j).
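As an illustration of the Callable option, the sketch below builds a 2x2 weight matrix with plain NumPy. The function name independence_weights and the expected/observed weighting rule are assumptions made for this example; they are not taken from the library and are not necessarily how the built-in 'statistical parity' strategy is computed. The sketch also assumes both y_true and sensitive_feature are binary (0/1).

```python
import numpy as np


def independence_weights(y_true: np.ndarray, sensitive_feature: np.ndarray) -> np.ndarray:
    """Hypothetical strategy: weight each (target, group) cell by expected / observed count.

    The expected count is the count the cell would have if the target and the
    sensitive feature were statistically independent.
    """
    y_true = np.asarray(y_true).astype(int)
    sensitive_feature = np.asarray(sensitive_feature).astype(int)
    n = y_true.size
    weights = np.ones((2, 2))  # rows: target value, columns: sensitive feature value
    for i in (0, 1):
        for j in (0, 1):
            observed = np.sum((y_true == i) & (sensitive_feature == j))
            expected = np.sum(y_true == i) * np.sum(sensitive_feature == j) / n
            if observed > 0:  # leave the weight at 1 for empty cells
                weights[i, j] = expected / observed
    return weights


# Toy data: the positive class is over-represented in group 1.
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
s = np.array([1, 1, 0, 1, 0, 0, 0, 0])
weights = independence_weights(y, s)

# A weight above 1 asks for oversampling of that cell, below 1 for undersampling.
counts = np.array([[np.sum((y == i) & (s == j)) for j in (0, 1)] for i in (0, 1)])
target_counts = weights * counts
```

A function with this signature could then be passed as BiasResampler(strategy=independence_weights), following the parameter description above.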

transform_featureOptional[Callable], default=None

Function which transforms sensitive_feature before resampling the training data. The function takes in the sensitive feature in the form of a numpy.ndarray and outputs the transformed sensitive feature as a numpy.ndarray. This can be useful if you want to transform a continuous variable to a binary variable at fit time.

random_stateint or numpy.random.RandomState, default=None

Seed or random number generator used to make the resampling reproducible.

Attributes:
sample_indices_numpy.ndarray

Indices of the samples that were selected.
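The selected indices can be useful for carrying along arrays that the sampler does not return, such as the sensitive feature itself. The sketch below uses a hand-written index array as a stand-in for sample_indices_; it assumes the indices are row positions in the original X and that a repeated index marks an oversampled row.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
clv = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Hypothetical stand-in for sampler.sample_indices_ after fit_resample:
# row 0 is drawn twice, rows 1 and 4 are dropped.
sample_indices = np.array([0, 0, 2, 3, 5])

# Index auxiliary arrays with the same positions to keep them aligned
# with the resampled rows.
X_resampled = X[sample_indices]
clv_resampled = clv[sample_indices]
```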

References

[1]

Rahman, S., Janssens, B., & Bogaert, M. (2025). Profit-driven pre-processing in B2B customer churn modeling using fairness techniques. Journal of Business Research, 189, 115159. doi:10.1016/j.jbusres.2024.115159

Examples

import numpy as np
from empulse.samplers import BiasResampler
from sklearn.datasets import make_classification

X, y = make_classification()
high_clv = np.random.randint(0, 2, y.shape)

sampler = BiasResampler()
# fit_resample returns the resampled training data and target values
X_resampled, y_resampled = sampler.fit_resample(X, y, sensitive_feature=high_clv)

Example of passing the high-CLV indicator through cross-validation:

import numpy as np
from empulse.samplers import BiasResampler
from imblearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

set_config(enable_metadata_routing=True)

X, y = make_classification()
high_clv = np.random.randint(0, 2, y.shape)

pipeline = Pipeline([
    ('sampler', BiasResampler().set_fit_resample_request(sensitive_feature=True)),
    ('model', LogisticRegression())
])

cross_val_score(pipeline, X, y, params={'sensitive_feature': high_clv})

Example of passing CLV through a grid search and determining high-CLV customers dynamically from the training data:

import numpy as np
from empulse.samplers import BiasResampler
from imblearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

set_config(enable_metadata_routing=True)

X, y = make_classification()
clv = np.random.rand(y.size)

def to_high_clv(clv: np.ndarray) -> np.ndarray:
    return (clv > np.median(clv)).astype(np.int8)

pipeline = Pipeline([
    ('sampler', BiasResampler(
        transform_feature=to_high_clv
    ).set_fit_resample_request(sensitive_feature=True)),
    ('model', LogisticRegression())
])
param_grid = {'model__C': np.logspace(-5, 2, 10)}

grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y, sensitive_feature=clv)

fit(X, y, **params)#

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:
X{array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Data array.

yarray-like of shape (n_samples,)

Target array.

Returns:
selfobject

Return the instance itself.

fit_resample(X, y, *, sensitive_feature=None)[source]#

Resample the data according to the strategy.

Parameters:
X2D array-like, shape=(n_samples, n_features)

Training data.

y1D array-like, shape=(n_samples,)

Target values.

sensitive_feature1D array-like, shape=(n_samples,)

Sensitive attribute used to determine which instances to resample.

Returns:
X2D array-like, shape=(n_samples, n_features)

Resampled training data.

y1D array-like, shape=(n_samples,)

Resampled target values.

get_feature_names_out(input_features=None)#

Get output feature names for transformation.

Parameters:
input_featuresarray-like of str or None, default=None

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:
feature_names_outndarray of str objects

Same as input features.

get_metadata_routing()#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

set_fit_resample_request(*, sensitive_feature='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the fit_resample method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit_resample if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit_resample.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
sensitive_featurestr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sensitive_feature parameter in fit_resample.

Returns:
selfobject

The updated object.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.