5.3. Credit Risk Assessment on a Private Label Credit Card Application

5.3.1. Summary

This dataset comes from the private label credit card operation of a major Brazilian retail chain [1] and was collected during a period of stable inflation (2003-2008). The goal is to predict whether a customer will default on their credit card payment.

Credit scoring involves several difficulties that modelers often overlook:

In general, modeling data are available only for the company’s accepted clients, not for the rejected applicants. The accepted clients are a strongly biased sample of the potential clients (the market), because they were selected by a systematic procedure aimed at the very target being modeled (payment default).

Also, the data were collected over a past time interval in order to develop a model that will be applied in the future. Even without any drastic change in the economy, gradual market changes occur and reduce the performance of the model estimated on the modeling data set.

Another important aspect, often overlooked by scientists, is the series of impacts of recalibrating a solution already in operation (retraining and tuning the decision threshold) to correct the degradation and improve or, at least, preserve the existing performance. This task is particularly risky when credit is granted for long-term repayment (such as mortgages). Furthermore, the score generated by such a model may be used within the company for several decision-making processes: for instance, by the marketing/sales department when trading off market expansion against increased risk, or as input to other decision systems such as debt collection scoring models.

Classes	2
Defaulters	7743
Non-defaulters	31195
Samples	38938
Features	25
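
The class counts above imply that roughly one in five customers defaulted. A quick arithmetic check in plain Python, using only the counts from the table (the variable names are illustrative):

```python
# Class counts taken from the summary table above.
n_defaulters = 7743
n_non_defaulters = 31195
n_samples = n_defaulters + n_non_defaulters

# Share of each class in the data set.
default_rate = n_defaulters / n_samples
non_default_rate = n_non_defaulters / n_samples

print(n_samples)                  # 38938
print(f"{default_rate:.1%}")      # 19.9%
print(f"{non_default_rate:.1%}")  # 80.1%
```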

5.3.2. Using the Dataset

The dataset can be loaded through the load_credit_scoring_pakdd function. This returns a Dataset object with the following attributes:

data: the feature matrix
target: the target vector
tp_cost: the cost of a true positive
fp_cost: the cost of a false positive
fn_cost: the cost of a false negative
tn_cost: the cost of a true negative
feature_names: the feature names
target_names: the target names
DESCR: the full description of the dataset

from empulse.datasets import load_credit_scoring_pakdd

dataset = load_credit_scoring_pakdd()

Alternatively, the load function can also return the features, target, and costs separately, by setting return_X_y_costs=True. Additionally, you can specify that you want the output in a pandas.DataFrame format, by setting as_frame=True.

The following code snippet demonstrates how to load the dataset and fit a model using the CSLogitClassifier:

from empulse.datasets import load_credit_scoring_pakdd
from empulse.models import CSLogitClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, TargetEncoder

# Load the features, target, and costs separately; as_frame=True requests pandas output.
X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_credit_scoring_pakdd(
    return_X_y_costs=True,
    as_frame=True
)
# Scale the numeric columns and target-encode the categorical columns.
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), X.select_dtypes(include=['number']).columns),
        ('cat', TargetEncoder(), X.select_dtypes(include=['category']).columns)
    ])),
    ('model', CSLogitClassifier())
])
# The model__ prefix routes each cost to the fit method of the 'model' step.
pipeline.fit(
    X,
    y,
    model__tp_cost=tp_cost,
    model__fp_cost=fp_cost,
    model__fn_cost=fn_cost,
    model__tn_cost=tn_cost
)
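
Once a model produces class predictions, the four cost entries can be combined into a single total cost. The following is a minimal sketch in plain Python with made-up labels and costs. `total_cost` is a hypothetical helper, not part of empulse, and it assumes that all four costs are given as per-instance sequences (scalar costs would have to be repeated for each instance first).

```python
def total_cost(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost):
    """Sum the instance-dependent cost of each prediction (hypothetical helper)."""
    total = 0.0
    for yt, yp, tp, fp, fn, tn in zip(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost):
        if yp == 1 and yt == 1:
            total += tp  # predicted default, actual default
        elif yp == 1 and yt == 0:
            total += fp  # predicted default, actual non-default
        elif yp == 0 and yt == 1:
            total += fn  # predicted non-default, actual default
        else:
            total += tn  # predicted non-default, actual non-default
    return total


# Made-up example with four applicants (1 = default, 0 = no default).
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
tp_cost = [0.0, 0.0, 0.0, 0.0]
fp_cost = [120.0, 80.0, 150.0, 60.0]
fn_cost = [750.0, 1500.0, 3000.0, 375.0]
tn_cost = [0.0, 0.0, 0.0, 0.0]

cost = total_cost(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost)
print(cost)  # 3080.0 (one false positive of 80 plus one false negative of 3000)
```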

5.3.3. Cost Matrix

	Actual positive $y_i = 1$	Actual negative $y_i = 0$
Predicted positive $\hat{y}_i = 1$	`tp_cost` $= 0$	`fp_cost` $= r_i - \bar{r} \cdot \pi_0 + \bar{Cl} \cdot L_{gd} \cdot \pi_1$
Predicted negative $\hat{y}_i = 0$	`fn_cost` $= Cl_i \cdot L_{gd}$	`tn_cost` $= 0$

with

$r_i$ : loss in profit by rejecting what would have been a good loan
$\bar{r}$ : average loss in profit by rejecting what would have been a good loan
$\pi_0$ : percentage of non-defaulters
$\pi_1$ : percentage of defaulters
$Cl_i$ : credit line of the client
$\bar{Cl}$ : average credit line
$L_{gd}$ : the fraction of the loan amount which is lost if the client defaults
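
To make the two non-zero entries of the cost matrix concrete, the sketch below evaluates them for a single client. All numbers are hypothetical and chosen only for easy arithmetic, and the function names are illustrative; this is not how empulse computes the costs internally, only a direct transcription of the two formulas in the table.

```python
def false_negative_cost(credit_line, lgd):
    """fn_cost = Cl_i * L_gd: the share of the credit line lost on default."""
    return credit_line * lgd


def false_positive_cost(r_i, r_avg, pi_0, avg_credit_line, lgd, pi_1):
    """fp_cost = r_i - r_avg * pi_0 + Cl_avg * L_gd * pi_1."""
    # pi_0 weights the average profit term, pi_1 weights the average loss term.
    return r_i - r_avg * pi_0 + avg_credit_line * lgd * pi_1


# Hypothetical values for one client and for the portfolio averages.
r_i = 500.0      # profit lost by rejecting this client's good loan
r_avg = 400.0    # average profit lost per rejected good loan
cl_i = 4000.0    # credit line of this client
cl_avg = 3000.0  # average credit line
lgd = 0.75       # loss given default
pi_0, pi_1 = 0.8, 0.2  # class shares (made up, summing to 1)

fp = false_positive_cost(r_i, r_avg, pi_0, cl_avg, lgd, pi_1)
fn = false_negative_cost(cl_i, lgd)
print(f"fp_cost = {fp:.2f}")  # fp_cost = 630.00
print(f"fn_cost = {fn:.2f}")  # fn_cost = 3000.00
```

With these numbers, wrongly accepting a defaulter costs almost five times as much as wrongly rejecting a good client, which is the kind of asymmetry the instance-dependent costs are meant to capture.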

Using default parameters, it is assumed that the interest rate is 63%, the cost of running the fund is 16.5%, the maximum credit line is 25,000, the loss given default is 75%, the term length is 24 months, and the loan to income ratio is 3. The default parameters are based on [2].

These assumptions can be changed by passing your own values to the load_credit_scoring_pakdd function:

from empulse.datasets import load_credit_scoring_pakdd

X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_credit_scoring_pakdd(
    return_X_y_costs=True,
    interest_rate=0.63,
    fund_cost=0.165,
    max_credit_line=25000,
    loss_given_default=0.75,
    term_length_months=24,
    loan_to_income_ratio=3,
)

5.3.4. Data Description

Variable Name	Description	Type
age	Applicant’s age	numeric
personal_net_income	Applicant’s personal monthly net income in Brazilian currency (R$)	numeric
partner_income	Applicant’s partner monthly net income in Brazilian currency (R$)	numeric
months_in_residence	Time in the current residence in months	numeric
months_in_the_job	Time in the current job in months	numeric
payment_day	Fixed month day selected for the eventual monthly payment	numeric
n_banking_accounts	Quantity of applicant’s banking accounts	numeric
n_additional_cards	Quantity of additional cards asked for in the same application form	numeric
is_male	Whether the applicant is male (‘yes’ = 1, ‘no’ = 0)	binary
has_residential_phone	If the applicant possesses a residential phone (‘yes’ = 1, ‘no’ = 0)	binary
has_mobile_phone	If the applicant possesses a mobile phone (‘yes’ = 1, ‘no’ = 0)	binary
has_contact_phone	If the applicant possesses a contact phone (‘yes’ = 1, ‘no’ = 0)	binary
has_same_postal_address	If the applicant receives mail at the same address where they live (‘yes’ = 1, ‘no’ = 0)	binary
has_other_card	If the applicant possesses another credit or private label card (‘yes’ = 1, ‘no’ = 0)	binary
lives_in_working_town	If the applicant works in the same town where they live (‘yes’ = 1, ‘no’ = 0)	binary
lives_in_working_state	If the applicant works in the same state where they live (‘yes’ = 1, ‘no’ = 0)	binary
filled_in_mothers_name	If the applicant filled in the mother’s name on the form (‘yes’ = 1, ‘no’ = 0)	binary
filled_in_fathers_name	If the applicant filled in the father’s name on the form (‘yes’ = 1, ‘no’ = 0)	binary
shop_rank	Company’s rating for the shop in commercial terms	ordinal
marital_status	The marital status of the applicant (‘single’, ‘married’, ‘divorced’, ‘widow’, ‘other’)	categorical
residence_type	The type of the applicant’s residence (‘owned’, ‘rented’, ‘parents’, ‘other’)	categorical
area_code_residential_phone	Modified residential phone area code	categorical
shop_code	Shop code where the application has been made	categorical
application_booth_code	Booth code where application was handed in	categorical
profession_code	Applicant’s profession code	categorical
default	Has the applicant defaulted? (‘yes’ = 1, ‘no’ = 0)	binary
