BiasResampler
- class empulse.samplers.BiasResampler(*, strategy='statistical parity', transform_feature=None, random_state=None)
Sampler which resamples instances to remove bias against a subgroup.
Read more in the User Guide.
- Parameters:
- strategy : {'statistical parity', 'demographic parity'} or Callable, default='statistical parity'
Determines how the group weights are computed. Group weights determine how much to over- or undersample each combination of target and sensitive feature. For example, a weight of 2 for the pair (y_true == 1, sensitive_feature == 0) means that the resampled dataset should have twice as many instances with y_true == 1 and sensitive_feature == 0 as the original dataset.
  - 'statistical parity' or 'demographic parity': the probability of a positive prediction is equal between subgroups of the sensitive feature.
  - Callable: function which computes the group weights from the target and sensitive feature. It accepts two arguments, y_true and sensitive_feature, and returns the group weights as a 2x2 matrix in which rows represent the target variable and columns represent the sensitive feature. The element at position (i, j) is the weight for the pair (y_true == i, sensitive_feature == j).
- transform_feature : Optional[Callable], default=None
Function which transforms sensitive_feature before resampling the training data. The function takes the sensitive feature as a numpy.ndarray and returns the transformed sensitive feature as a numpy.ndarray. This is useful if you want to convert a continuous variable to a binary variable at fit time.
- random_state : int or numpy.random.RandomState, optional
Random number generator seed for reproducibility.
- Attributes:
- sample_indices_ : numpy.ndarray
Indices of the samples that were selected.
References
[1] Rahman, S., Janssens, B., & Bogaert, M. (2025). Profit-driven pre-processing in B2B customer churn modeling using fairness techniques. Journal of Business Research, 189, 115159. doi:10.1016/j.jbusres.2024.115159
Examples
import numpy as np
from empulse.samplers import BiasResampler
from sklearn.datasets import make_classification

X, y = make_classification()
high_clv = np.random.randint(0, 2, y.shape)

sampler = BiasResampler()
sampler.fit_resample(X, y, sensitive_feature=high_clv)
Example with passing high-clv indicator through cross-validation:
import numpy as np
from empulse.samplers import BiasResampler
from imblearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

set_config(enable_metadata_routing=True)

X, y = make_classification()
high_clv = np.random.randint(0, 2, y.shape)

pipeline = Pipeline([
    ('sampler', BiasResampler().set_fit_resample_request(sensitive_feature=True)),
    ('model', LogisticRegression())
])

cross_val_score(pipeline, X, y, params={'sensitive_feature': high_clv})
Example with passing clv through a grid search and dynamically determining high_clv customers based on the training data:
import numpy as np
from empulse.samplers import BiasResampler
from imblearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

set_config(enable_metadata_routing=True)

X, y = make_classification()
clv = np.random.rand(y.size)

def to_high_clv(clv: np.ndarray) -> np.ndarray:
    return (clv > np.median(clv)).astype(np.int8)

pipeline = Pipeline([
    ('sampler', BiasResampler(
        transform_feature=to_high_clv
    ).set_fit_resample_request(sensitive_feature=True)),
    ('model', LogisticRegression())
])
param_grid = {'model__C': np.logspace(-5, 2, 10)}

grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y, sensitive_feature=clv)
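Example sketch of a custom strategy callable. The strategy parameter accepts a function of y_true and sensitive_feature that returns a 2x2 weight matrix. The function below is only an illustration: the name independence_weights and the weighting rule (scale each group to the size it would have if target and sensitive feature were independent) are assumptions made for this sketch, not necessarily how the built-in 'statistical parity' strategy is computed.

```python
import numpy as np


def independence_weights(y_true: np.ndarray, sensitive_feature: np.ndarray) -> np.ndarray:
    """Hypothetical strategy: W[i, j] rescales the group (y_true == i, sensitive_feature == j)."""
    y_true = np.asarray(y_true).astype(int)
    sensitive_feature = np.asarray(sensitive_feature).astype(int)
    n = y_true.size

    # Observed group sizes: rows index the target, columns index the sensitive feature.
    observed = np.zeros((2, 2))
    for i in (0, 1):
        for j in (0, 1):
            observed[i, j] = np.sum((y_true == i) & (sensitive_feature == j))

    # Group sizes expected if target and sensitive feature were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

    # Assumes every group is non-empty; an empty group would divide by zero.
    return expected / observed


# Toy data: positives are over-represented where sensitive_feature == 1.
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
s = np.array([1, 1, 0, 1, 0, 0, 0, 0])
weights = independence_weights(y, s)
print(weights)
```

A function with this signature could then be passed as BiasResampler(strategy=independence_weights). A weight above 1 marks a group to oversample and a weight below 1 a group to undersample.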
- fit(X, y, **params)
Check inputs and statistics of the sampler.
You should use fit_resample in all cases.
- Parameters:
- X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Data array.
- y : array-like of shape (n_samples,)
Target array.
- Returns:
- self : object
Return the instance itself.
- fit_resample(X, y, *, sensitive_feature=None)
Resample the data according to the strategy.
- Parameters:
- X : 2D array-like, shape=(n_samples, n_features)
- y : 1D array-like, shape=(n_samples,)
- sensitive_feature : 1D array-like, shape=(n_samples,)
Sensitive attribute used to determine which instances to resample.
- Returns:
- X : 2D array-like, shape=(n_samples, n_features)
Resampled training data.
- y : 1D array-like, shape=(n_samples,)
Resampled target values.
- get_feature_names_out(input_features=None)
Get output feature names for transformation.
- Parameters:
- input_features : array-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns:
- feature_names_out : ndarray of str objects
Same as input features.
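The name-resolution rules above can be summarised in a few lines of plain Python. This is only a sketch of the documented behaviour. The helper resolve_feature_names and its arguments are hypothetical stand-ins for the estimator's feature_names_in_ and n_features_in_ attributes, and the error message is illustrative.

```python
from typing import List, Optional, Sequence


def resolve_feature_names(
    input_features: Optional[Sequence[str]],
    feature_names_in: Optional[Sequence[str]],
    n_features_in: int,
) -> List[str]:
    """Hypothetical helper mirroring the documented rules for input feature names."""
    if input_features is None:
        if feature_names_in is not None:
            # Fall back to the names seen during fit.
            return list(feature_names_in)
        # No names available: generate x0, x1, ..., x(n_features_in - 1).
        return [f"x{i}" for i in range(n_features_in)]
    if feature_names_in is not None and list(input_features) != list(feature_names_in):
        raise ValueError("input_features does not match feature_names_in_")
    return list(input_features)


names = resolve_feature_names(None, None, 3)
print(names)
```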
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- set_fit_resample_request(*, sensitive_feature='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit_resample method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit_resample if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit_resample.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
- sensitive_feature : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sensitive_feature parameter in fit_resample.
- Returns:
- self : object
The updated object.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.