3.2. Cost-Proportionate Sampling

The CostSensitiveSampler lets you resample the training data to minimize the cost_loss. Two sampling techniques are available:

1. Rejection sampling
2. Oversampling

3.2.1. Rejection Sampling

Cost-proportionate rejection sampling draws examples independently from a distribution and then keeps (accepts) each one with a probability proportional to its cost [1]. Examples with a high cost are therefore more likely to end up in the resampled data than examples with a low cost.

This can be done using the CostSensitiveSampler with the method parameter set to ‘rejection sampling’.

from sklearn.datasets import make_classification
from empulse.samplers import CostSensitiveSampler

X, y = make_classification(random_state=42)

sampler = CostSensitiveSampler(method='rejection sampling')
X_resampled, y_resampled = sampler.fit_resample(X, y)
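
The acceptance step behind rejection sampling can be illustrated with a minimal NumPy sketch. This is not the CostSensitiveSampler implementation: the function name rejection_sample and the per-example costs array are hypothetical, and the sketch assumes the acceptance probability is each cost divided by the largest cost.

```python
import numpy as np


def rejection_sample(X, y, costs, random_state=None):
    """Keep each example with probability cost / max(cost).

    Hypothetical helper for illustration only; `costs` is an assumed
    array holding one non-negative cost per example.
    """
    rng = np.random.default_rng(random_state)
    costs = np.asarray(costs, dtype=float)
    # Scale costs into [0, 1] so they can be used as probabilities.
    accept_prob = costs / costs.max()
    # Draw one uniform number per example and accept when it falls
    # below that example's acceptance probability.
    keep = rng.random(costs.shape[0]) < accept_prob
    return X[keep], y[keep]


X_demo = np.arange(10).reshape(5, 2)
y_demo = np.array([0, 1, 0, 1, 1])
costs_demo = np.array([0.0, 5.0, 0.0, 5.0, 2.5])
X_kept, y_kept = rejection_sample(X_demo, y_demo, costs_demo, random_state=0)
```

Under these assumptions, an example with the maximum cost is always kept and an example with zero cost is never kept, so the resampled set is typically smaller than the original.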

3.2.2. Oversampling

Cost-proportionate oversampling draws examples from a distribution with replacement. The number of times an example is drawn is proportional to its cost [2]. To configure the degree of oversampling, set the oversampling_norm parameter: the smaller the oversampling norm, the more oversampling is done.

To indicate that you want to use oversampling, set the method parameter to ‘oversampling’.

from sklearn.datasets import make_classification
from empulse.samplers import CostSensitiveSampler

X, y = make_classification(random_state=42)

sampler = CostSensitiveSampler(method='oversampling', oversampling_norm=0.2)
X_resampled, y_resampled = sampler.fit_resample(X, y)
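
To show why a smaller norm leads to more oversampling, here is a rough NumPy sketch. It is an assumption about one way the norm could act, not the library's code: the helper oversample, its norm argument, and the costs array are hypothetical, and the sketch repeats each example deterministically instead of drawing at random.

```python
import numpy as np


def oversample(X, y, costs, norm=0.2):
    """Repeat each example ceil((cost / max(cost)) / norm) times.

    Hypothetical helper for illustration only; `norm` plays the role
    that the text attributes to oversampling_norm.
    """
    costs = np.asarray(costs, dtype=float)
    # Scale costs into [0, 1].
    weights = costs / costs.max()
    # Dividing by a smaller norm yields larger repetition counts.
    counts = np.ceil(weights / norm).astype(int)
    # Build an index array in which example i appears counts[i] times.
    idx = np.repeat(np.arange(costs.shape[0]), counts)
    return X[idx], y[idx]


X_demo = np.arange(6).reshape(3, 2)
y_demo = np.array([0, 1, 1])
costs_demo = np.array([1.0, 2.0, 4.0])
X_coarse, y_coarse = oversample(X_demo, y_demo, costs_demo, norm=0.5)
X_fine, y_fine = oversample(X_demo, y_demo, costs_demo, norm=0.25)
```

In this sketch, halving the norm from 0.5 to 0.25 increases the size of the resampled set, and the most expensive example is repeated most often.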

3.2.3. Outlier Robustness

When computing the probability of keeping a sample (in the case of rejection sampling) or the number of times a sample is drawn (in the case of oversampling), costs above the 97.5th percentile are truncated to the 97.5th percentile to reduce the influence of outliers. To change this behavior, set the percentile_threshold parameter to a number between 0 and 1 (for example, 0.9 for the 90th percentile).

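from sklearn.datasets import make_classification
from empulse.samplers import CostSensitiveSampler

X, y = make_classification(random_state=42)

# Truncate costs at the 90th percentile instead of the 97.5th
sampler = CostSensitiveSampler(method='rejection sampling', percentile_threshold=0.9)
X_resampled, y_resampled = sampler.fit_resample(X, y)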
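
The truncation itself can be sketched in a few lines of NumPy. The helper clip_costs and the costs array are hypothetical and only illustrate the idea of capping costs at a percentile; the sampler's internal handling may differ in detail (for example, in how the percentile is interpolated).

```python
import numpy as np


def clip_costs(costs, percentile_threshold=0.975):
    """Cap every cost at the given quantile of the cost distribution.

    Hypothetical helper for illustration only.
    """
    costs = np.asarray(costs, dtype=float)
    # np.quantile expects a fraction in [0, 1], matching the 0-1 scale
    # described for percentile_threshold.
    cap = np.quantile(costs, percentile_threshold)
    # Costs above the cap are replaced by the cap; others are unchanged.
    return np.minimum(costs, cap)


costs_demo = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])
clipped = clip_costs(costs_demo, percentile_threshold=0.75)
```

Without the cap, the single cost of 1000 would dominate any normalization by the maximum cost, leaving the other examples with near-zero weight.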

3.2.4. Using the Cost-Sensitive Sampler in a Pipeline

This sampler can be used inside an imbalanced-learn imblearn.pipeline.Pipeline (note that the scikit-learn sklearn.pipeline.Pipeline does not support samplers):

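from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from empulse.samplers import CostSensitiveSampler

X, y = make_classification(random_state=42)

# The sampler resamples the training data before the classifier is fitted
pipeline = Pipeline([
    ('sampler', CostSensitiveSampler(method='rejection sampling')),
    ('classifier', LogisticRegression())
])
pipeline.fit(X, y)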