5.3. Credit Risk Assessment on a Private Label Credit Card Application

5.3.1. Summary

This dataset comes from the private label credit card operation of a major Brazilian retail chain [1] and was collected during a period of stable inflation (2003-2008). The goal is to predict whether a customer will default on their credit card payment.

Several difficulties in credit scoring are often overlooked by modelers:

In general, modeling data are available only for the company’s existing clients, not for rejected applicants. Existing clients are a strongly biased sample of the potential market, because they were selected by a systematic procedure focused on the very target being modeled (payment default).

Also, the data were collected over a past time interval to develop a model that will be applied in the future. Even without any drastic change in the economy, gradual market changes occur and reduce the performance of a model estimated on the modeling data set.

Another important aspect, often overlooked by scientists, is the series of impacts of re-calibrating a solution already in operation (retraining and tuning the decision threshold) to correct the degradation and improve or, at least, preserve the existing performance. This task is particularly risky when credit is lent for long-term payment (such as mortgages). Furthermore, the score generated by such a model may be used within the company for several decision-making processes, for instance by the marketing/sales department to trade off market expansion against increased risk, or as input to other decision systems such as debt collection scoring models.

  • Classes: 2

  • Defaulters: 7743

  • Non-defaulters: 31195

  • Samples: 38938

  • Features: 25

5.3.2. Using the Dataset

The dataset can be loaded through the load_credit_scoring_pakdd function. This returns a Dataset object with the following attributes:

  • data: the feature matrix

  • target: the target vector

  • tp_cost: the cost of a true positive

  • fp_cost: the cost of a false positive

  • fn_cost: the cost of a false negative

  • tn_cost: the cost of a true negative

  • feature_names: the feature names

  • target_names: the target names

  • DESCR: the full description of the dataset

from empulse.datasets import load_credit_scoring_pakdd

dataset = load_credit_scoring_pakdd()

Alternatively, the load function can also return the features, target, and costs separately, by setting return_X_y_costs=True. Additionally, you can specify that you want the output in a pandas.DataFrame format, by setting as_frame=True.

The following code snippet demonstrates how to load the dataset and fit a model using the CSLogitClassifier:

from empulse.datasets import load_credit_scoring_pakdd
from empulse.models import CSLogitClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, TargetEncoder

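# Load the features (as a pandas.DataFrame), the target, and the four cost terms.
X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_credit_scoring_pakdd(
    return_X_y_costs=True,
    as_frame=True
)
# Scale numeric columns and target-encode categorical ones. Columns matched by
# neither selector are dropped (ColumnTransformer defaults to remainder='drop').
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), X.select_dtypes(include=['number']).columns),
        ('cat', TargetEncoder(), X.select_dtypes(include=['category']).columns)
    ])),
    ('model', CSLogitClassifier())
])
# Fit parameters prefixed with 'model__' are forwarded to the 'model' step.
pipeline.fit(
    X,
    y,
    model__tp_cost=tp_cost,
    model__fp_cost=fp_cost,
    model__fn_cost=fn_cost,
    model__tn_cost=tn_cost
)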
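Once a model is fitted, its predictions can be judged by the cost they incur rather than by accuracy alone. The sketch below is a library-independent illustration of that idea: `total_cost` and `_cost_at` are hypothetical helpers (not part of empulse), and the arrays are made-up numbers standing in for the model's predictions and the cost terms returned by the loader.

```python
def _cost_at(cost, i):
    """Return the cost for instance i, whether cost is a scalar or a sequence."""
    return cost if isinstance(cost, (int, float)) else cost[i]


def total_cost(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost):
    """Sum the cost of each prediction according to its confusion-matrix cell."""
    total = 0.0
    for i, (actual, predicted) in enumerate(zip(y_true, y_pred)):
        if actual == 1 and predicted == 1:
            total += _cost_at(tp_cost, i)   # defaulter correctly flagged
        elif actual == 0 and predicted == 1:
            total += _cost_at(fp_cost, i)   # good client wrongly flagged
        elif actual == 1 and predicted == 0:
            total += _cost_at(fn_cost, i)   # defaulter missed
        else:
            total += _cost_at(tn_cost, i)   # good client correctly accepted
    return total


# Made-up example: four applicants, instance-dependent error costs.
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
fp_cost = [120.0, 80.0, 95.0, 60.0]
fn_cost = [750.0, 300.0, 1500.0, 450.0]

cost = total_cost(y_true, y_pred, tp_cost=0.0, fp_cost=fp_cost,
                  fn_cost=fn_cost, tn_cost=0.0)
print(cost)  # 80.0 (one false positive) + 1500.0 (one false negative) = 1580.0
```

Comparing this total across candidate models or decision thresholds gives a cost-based ranking; in practice `y_pred` would come from something like `pipeline.predict(X_test)` on held-out data.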

5.3.3. Cost Matrix

| | Actual positive \(y_i = 1\) | Actual negative \(y_i = 0\) |
| --- | --- | --- |
| Predicted positive \(\hat{y}_i = 1\) | tp_cost \(= 0\) | fp_cost \(= r_i - \bar{r} \cdot \pi_0 + \bar{Cl} \cdot L_{gd} \cdot \pi_1\) |
| Predicted negative \(\hat{y}_i = 0\) | fn_cost \(= Cl_i \cdot L_{gd}\) | tn_cost \(= 0\) |

with
  • \(r_i\) : loss in profit by rejecting what would have been a good loan

  • \(\bar{r}\) : average loss in profit by rejecting what would have been a good loan

  • \(\pi_0\) : percentage of non-defaulters

  • \(\pi_1\) : percentage of defaulters

  • \(Cl_i\) : credit line of the client

  • \(\bar{Cl}\) : average credit line

  • \(L_{gd}\) : the fraction of the loan amount which is lost if the client defaults
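The two non-zero cells of the cost matrix can be sketched in a few lines of plain Python. This is a minimal illustration of the formulas, not the loader's implementation: `fp_costs` and `fn_costs` are hypothetical helpers, \(r_i\) is taken as given, \(\pi_0\) and \(\pi_1\) are passed in as defined in the list above, and all numbers are made up.

```python
from statistics import mean


def fp_costs(r, credit_lines, pi_0, pi_1, lgd):
    """fp_cost_i = r_i - mean(r) * pi_0 + mean(Cl) * L_gd * pi_1."""
    # The last two terms are the same for every applicant, so compute them once.
    offset = -mean(r) * pi_0 + mean(credit_lines) * lgd * pi_1
    return [r_i + offset for r_i in r]


def fn_costs(credit_lines, lgd):
    """fn_cost_i = Cl_i * L_gd."""
    return [cl_i * lgd for cl_i in credit_lines]


# Illustrative values only (not taken from the dataset).
r = [100.0, 200.0, 300.0]                 # lost profit per applicant
credit_lines = [1000.0, 2000.0, 3000.0]   # credit line per applicant

fp = fp_costs(r, credit_lines, pi_0=0.8, pi_1=0.2, lgd=0.75)
fn = fn_costs(credit_lines, lgd=0.75)
print(fp)
print(fn)
```

Only \(r_i\) and \(Cl_i\) vary per applicant; the averaged terms shift every false-positive cost by the same amount.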

Using default parameters, it is assumed that the interest rate is 63%, the cost of running the fund is 16.5%, the maximum credit line is 25,000, the loss given default is 75%, the term length is 24 months, and the loan to income ratio is 3. The default parameters are based on [2].

These assumptions can be changed by passing your own values to the load_credit_scoring_pakdd function:

from empulse.datasets import load_credit_scoring_pakdd

X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_credit_scoring_pakdd(
    return_X_y_costs=True,
    interest_rate=0.63,
    fund_cost=0.165,
    max_credit_line=25000,
    loss_given_default=0.75,
    term_length_months=24,
    loan_to_income_ratio=3,
)
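This section does not spell out how these parameters turn into \(Cl_i\) and \(r_i\). One plausible reading, offered here as an assumption rather than a description of the loader's internals, is standard annuity arithmetic: the credit line is monthly income times the loan to income ratio, capped at the maximum; the loan is repaid in equal monthly installments at the interest rate; and the lost profit is the present value of those installments, discounted at the cost of funds, minus the amount lent. All function names below are hypothetical, and annual rates are assumed to be converted to monthly rates by dividing by 12.

```python
def credit_line(monthly_income, loan_to_income_ratio=3, max_credit_line=25_000):
    """Assumed rule: income times the ratio, capped at the maximum credit line."""
    return min(monthly_income * loan_to_income_ratio, max_credit_line)


def monthly_installment(principal, annual_rate, n_months):
    """Equal monthly payment that repays `principal` over `n_months`."""
    m = annual_rate / 12  # assumption: simple monthly conversion
    growth = (1 + m) ** n_months
    return principal * m * growth / (growth - 1)


def present_value(installment, annual_rate, n_months):
    """Present value of `n_months` equal installments at the given rate."""
    m = annual_rate / 12
    return installment / m * (1 - (1 + m) ** -n_months)


def lost_profit(principal, interest_rate=0.63, fund_cost=0.165,
                term_length_months=24):
    """Assumed r_i: value of the repayments to the lender minus the amount lent."""
    installment = monthly_installment(principal, interest_rate, term_length_months)
    return present_value(installment, fund_cost, term_length_months) - principal


cl = credit_line(monthly_income=1_000)  # 3 * 1000, below the cap
r = lost_profit(cl)
print(f"credit line: {cl}, lost profit if rejected: {r:.2f}")
```

Under this reading, the lost profit is positive whenever the interest rate exceeds the cost of funds and vanishes when the two are equal, which is a quick sanity check on any custom parameter values.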

5.3.4. Data Description

| Variable Name | Description | Type |
| --- | --- | --- |
| age | Applicant’s age | numeric |
| personal_net_income | Applicant’s personal monthly net income in Brazilian currency (R$) | numeric |
| partner_income | Monthly net income of the applicant’s partner in Brazilian currency (R$) | numeric |
| months_in_residence | Time in the current residence in months | numeric |
| months_in_the_job | Time in the current job in months | numeric |
| payment_day | Fixed day of the month selected for the monthly payment | numeric |
| n_banking_accounts | Number of banking accounts held by the applicant | numeric |
| n_additional_cards | Number of additional cards requested in the same application form | numeric |
| is_male | Whether the applicant is male (‘yes’ = 1, ‘no’ = 0) | binary |
| has_residential_phone | Whether the applicant has a residential phone (‘yes’ = 1, ‘no’ = 0) | binary |
| has_mobile_phone | Whether the applicant has a mobile phone (‘yes’ = 1, ‘no’ = 0) | binary |
| has_contact_phone | Whether the applicant has a contact phone (‘yes’ = 1, ‘no’ = 0) | binary |
| has_same_postal_address | Whether the applicant receives mail at the address where they live (‘yes’ = 1, ‘no’ = 0) | binary |
| has_other_card | Whether the applicant has another credit or private label card (‘yes’ = 1, ‘no’ = 0) | binary |
| lives_in_working_town | Whether the applicant works in the town where they live (‘yes’ = 1, ‘no’ = 0) | binary |
| lives_in_working_state | Whether the applicant works in the state where they live (‘yes’ = 1, ‘no’ = 0) | binary |
| filled_in_mothers_name | Whether the applicant filled in the mother’s name on the form (‘yes’ = 1, ‘no’ = 0) | binary |
| filled_in_fathers_name | Whether the applicant filled in the father’s name on the form (‘yes’ = 1, ‘no’ = 0) | binary |
| shop_rank | Company’s rating for the shop in commercial terms | ordinal |
| marital_status | The marital status of the applicant (‘single’, ‘married’, ‘divorced’, ‘widow’, ‘other’) | categorical |
| residence_type | The type of the applicant’s residence (‘owned’, ‘rented’, ‘parents’, ‘other’) | categorical |
| area_code_residential_phone | Modified residential phone area code | categorical |
| shop_code | Code of the shop where the application was made | categorical |
| application_booth_code | Code of the booth where the application was handed in | categorical |
| profession_code | Applicant’s profession code | categorical |
| default | Whether the applicant defaulted (‘yes’ = 1, ‘no’ = 0) | binary |

5.3.5. References