5.2. Churn in a TV Subscription Company

5.2.1. Summary

This is a private dataset provided by a TV cable provider [1]. It covers customers who were active during the first half of 2014. The dataset contains 9,410 individual records, each with 45 attributes, including a churn label indicating whether a customer is a churner. This label was created internally by the company and can be regarded as highly accurate. Only 455 customers in the dataset are churners, giving a churn ratio of about 4.8 %.

The feature names are anonymized to protect the privacy of the customers.

Classes       | 2
Churners      | 455
Non-churners  | 8955
Samples       | 9410
Features      | 45
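The class balance reported above can be checked with a few lines of arithmetic. The sketch below uses only the counts from the summary table; the variable names are illustrative and not part of the library.

```python
# Counts taken from the summary table above.
n_churners = 455
n_non_churners = 8955

n_samples = n_churners + n_non_churners
churn_ratio = n_churners / n_samples

print(f"samples: {n_samples}")
print(f"churn ratio: {churn_ratio:.2%}")
print(f"non-churners per churner: {n_non_churners / n_churners:.1f}")
```

With roughly twenty non-churners for every churner, the dataset is strongly imbalanced, which is one reason the per-outcome costs described in the cost matrix matter more than plain accuracy.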

5.2.2. Using the Dataset

The dataset can be loaded through the load_churn_tv_subscriptions function. This returns a Dataset object with the following attributes:

  • data: the feature matrix

  • target: the target vector

  • tp_cost: the cost of a true positive

  • fp_cost: the cost of a false positive

  • fn_cost: the cost of a false negative

  • tn_cost: the cost of a true negative

  • feature_names: the feature names

  • target_names: the target names

  • DESCR: the full description of the dataset

from empulse.datasets import load_churn_tv_subscriptions

dataset = load_churn_tv_subscriptions()

Alternatively, the load function can return the features, target, and costs separately when return_X_y_costs=True is set. Setting as_frame=True additionally returns the output in pandas.DataFrame format.

The following code snippet demonstrates how to load the dataset and fit a model using the CSLogitClassifier:

from empulse.datasets import load_churn_tv_subscriptions
from empulse.models import CSLogitClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the features, the target, and the four cost components separately.
X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_churn_tv_subscriptions(
    return_X_y_costs=True,
    as_frame=True,
)

# Standardize the features, then fit the cost-sensitive logistic model.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', CSLogitClassifier()),
])

# The model__ prefix routes each cost to the fit method of the 'model' step.
pipeline.fit(
    X,
    y,
    model__tp_cost=tp_cost,
    model__fp_cost=fp_cost,
    model__fn_cost=fn_cost,
    model__tn_cost=tn_cost,
)

5.2.3. Cost Matrix

                                      | Actual positive \(y_i = 1\)                                  | Actual negative \(y_i = 0\)
Predicted positive \(\hat{y}_i = 1\)  | tp_cost \(= \gamma_i d_i + (1 - \gamma_i) (CLV_i + c_i)\)    | fp_cost \(= d_i + c_i\)
Predicted negative \(\hat{y}_i = 0\)  | fn_cost \(= CLV_i\)                                          | tn_cost \(= 0\)

with
  • \(\gamma_i\) : probability of the customer accepting the retention offer

  • \(CLV_i\) : customer lifetime value of the retained customer

  • \(d_i\) : cost of incentive offered to the customer

  • \(c_i\) : cost of contacting the customer
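To make the cost matrix concrete, the following minimal sketch evaluates the four formulas for a single customer and sums the cost incurred by a set of predictions. It is plain Python and does not use the library: the names ChurnCosts, churn_cost_matrix, and total_cost are hypothetical helpers, and the numeric inputs are made up for illustration rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChurnCosts:
    """Costs of the four possible outcomes for a single customer."""
    tp_cost: float
    fp_cost: float
    fn_cost: float
    tn_cost: float


def churn_cost_matrix(gamma: float, clv: float, d: float, c: float) -> ChurnCosts:
    """Evaluate the cost matrix formulas for one customer.

    gamma: probability that the customer accepts the retention offer
    clv:   customer lifetime value of the retained customer
    d:     cost of the incentive offered to the customer
    c:     cost of contacting the customer
    """
    return ChurnCosts(
        tp_cost=gamma * d + (1 - gamma) * (clv + c),
        fp_cost=d + c,
        fn_cost=clv,
        tn_cost=0.0,
    )


def total_cost(y_true, y_pred, costs):
    """Sum the cost of each prediction, given one ChurnCosts per customer."""
    total = 0.0
    for y, y_hat, k in zip(y_true, y_pred, costs):
        if y == 1:
            total += k.tp_cost if y_hat == 1 else k.fn_cost
        else:
            total += k.fp_cost if y_hat == 1 else k.tn_cost
    return total


# Hypothetical values, chosen only for illustration (not from the dataset).
costs = churn_cost_matrix(gamma=0.3, clv=200.0, d=10.0, c=1.0)
print(costs)

# Two customers with identical costs: a churner who was missed (false
# negative) and a non-churner who was targeted by mistake (false positive).
print(total_cost(y_true=[1, 0], y_pred=[0, 1], costs=[costs, costs]))
```

Under these assumed inputs, missing a churner costs the full customer lifetime value, while wrongly targeting a loyal customer costs only the incentive plus the contact cost. That asymmetry is what a cost-sensitive classifier is meant to exploit.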

5.2.4. References