5.3. Credit Risk Assessment on a Private Label Credit Card Application
5.3.1. Summary
This dataset comes from the private label credit card operation of a major Brazilian retail chain [1] and was collected during a period of stable inflation (2003-2008). The goal is to predict whether a customer will default on their credit card payment.
Several difficulties in credit scoring are often overlooked by modelers.
In general, data are available only for the company’s existing clients, not for rejected applicants. The clients are a strongly biased sample of the potential market, because they were selected by a systematic procedure aimed at the very target being modeled (payment default).
In addition, the data were collected over a past time interval, while the model will be applied in the future. Even without drastic changes in the economy, gradual market changes occur and reduce the performance of a model estimated on the modeling data set.
Another important aspect, often overlooked by scientists, is the series of impacts of re-calibrating a solution already in operation (retraining it and tuning the decision threshold) to correct this degradation and improve, or at least preserve, the existing performance. This task is particularly risky when credit is granted for long-term repayment (such as mortgages). Furthermore, the score generated by such a model may be used within the company for several decision-making processes: for instance, by the marketing/sales department to trade off market expansion against increased risk, or as input to other decision systems such as debt collection scoring models.
Classes | 2 |
Defaulters | 7743 |
Non-defaulters | 31195 |
Samples | 38938 |
Features | 25 |
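The class counts in the table imply a noticeably imbalanced target. As a quick check, the short sketch below derives the default rate and the imbalance ratio from those counts only; it does not load the dataset.

```python
# Class counts taken from the summary table above.
n_defaulters = 7743
n_non_defaulters = 31195
n_samples = n_defaulters + n_non_defaulters

# Share of defaulters among all samples.
default_rate = n_defaulters / n_samples
# Number of non-defaulters for every defaulter.
imbalance_ratio = n_non_defaulters / n_defaulters

print(f"samples: {n_samples}")
print(f"default rate: {default_rate:.4f}")
print(f"non-defaulters per defaulter: {imbalance_ratio:.2f}")
```

Roughly one in five customers defaults, so a model that always predicts the majority class would already be right about 80% of the time, which is one reason to evaluate on costs rather than accuracy.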
5.3.2. Using the Dataset
The dataset can be loaded through the load_credit_scoring_pakdd function.
This returns a Dataset object with the following attributes:
- data: the feature matrix
- target: the target vector
- tp_cost: the cost of a true positive
- fp_cost: the cost of a false positive
- fn_cost: the cost of a false negative
- tn_cost: the cost of a true negative
- feature_names: the feature names
- target_names: the target names
- DESCR: the full description of the dataset
from empulse.datasets import load_credit_scoring_pakdd
dataset = load_credit_scoring_pakdd()
Alternatively, the load function can also return the features, target, and costs separately, by setting return_X_y_costs=True.
Additionally, you can specify that you want the output in a pandas.DataFrame format, by setting as_frame=True.
The following code snippet demonstrates how to load the dataset and fit a model using the CSLogitClassifier:
from empulse.datasets import load_credit_scoring_pakdd
from empulse.models import CSLogitClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, TargetEncoder

# Load the features, the target, and the four cost components separately
X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_credit_scoring_pakdd(
    return_X_y_costs=True,
    as_frame=True
)

# Scale the numeric columns and target-encode the categorical columns
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), X.select_dtypes(include=['number']).columns),
        ('cat', TargetEncoder(), X.select_dtypes(include=['category']).columns)
    ])),
    ('model', CSLogitClassifier())
])

# The model__ prefix routes the costs to the 'model' step of the pipeline
pipeline.fit(
    X,
    y,
    model__tp_cost=tp_cost,
    model__fp_cost=fp_cost,
    model__fn_cost=fn_cost,
    model__tn_cost=tn_cost
)
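The pipeline above target-encodes the categorical columns, several of which (shop, booth, and profession codes) can take many distinct values. To illustrate the general idea, here is a minimal pure-Python sketch that replaces each category with a smoothed mean of the target. It is a simplification and not scikit-learn's implementation: the fixed `smoothing` weight, the function names, and the toy data are assumptions made for this example, and the cross-fitting that scikit-learn's TargetEncoder applies during fit_transform is omitted.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target.

    The smoothed mean shrinks rare categories towards the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return mapping, global_mean


def transform_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return [mapping.get(category, global_mean) for category in categories]


# Hypothetical toy data: shop codes and whether the applicant defaulted.
shop_codes = ["A", "A", "A", "B", "B", "C"]
defaults = [1, 0, 0, 1, 1, 0]

mapping, global_mean = fit_target_encoding(shop_codes, defaults, smoothing=2.0)
encoded = transform_target_encoding(["A", "B", "C", "Z"], mapping, global_mean)
print(encoded)
```

Category "C" appears only once, so its encoding is pulled strongly towards the global default rate, while the unseen category "Z" receives the global rate itself.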
5.3.3. Cost Matrix
 | Actual positive \(y_i = 1\) | Actual negative \(y_i = 0\) |
---|---|---|
Predicted positive \(\hat{y}_i = 1\) | tp_cost | fp_cost |
Predicted negative \(\hat{y}_i = 0\) | fn_cost | tn_cost |
with
- \(r_i\): loss in profit by rejecting what would have been a good loan
- \(\bar{r}\): average loss in profit by rejecting what would have been a good loan
- \(\pi_0\): percentage of defaulters
- \(\pi_1\): percentage of non-defaulters
- \(Cl_i\): credit line of the client
- \(\bar{Cl}\): average credit line
- \(L_{gd}\): the fraction of the loan amount which is lost if the client defaults
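Because the costs differ per client, a set of predictions is naturally scored by summing the cost of the cell each prediction falls into. The sketch below shows that bookkeeping in plain Python. The function name and all numeric values are hypothetical and chosen only for illustration; they are not taken from the dataset or from the library.

```python
def total_cost(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost):
    """Sum the instance-dependent cost of each prediction."""
    total = 0.0
    for i, (actual, predicted) in enumerate(zip(y_true, y_pred)):
        if predicted == 1:
            # Predicted defaulter: true positive or false positive.
            total += tp_cost[i] if actual == 1 else fp_cost[i]
        else:
            # Predicted non-defaulter: false negative or true negative.
            total += fn_cost[i] if actual == 1 else tn_cost[i]
    return total


# Hypothetical example with four clients (1 = defaulter).
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
tp_cost = [0.0, 0.0, 0.0, 0.0]
fp_cost = [50.0, 60.0, 70.0, 80.0]
fn_cost = [750.0, 1500.0, 2250.0, 3000.0]
tn_cost = [0.0, 0.0, 0.0, 0.0]

model_cost = total_cost(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost)
# Baseline: accept everyone, i.e. predict that nobody defaults.
accept_all_cost = total_cost(y_true, [0, 0, 0, 0], tp_cost, fp_cost, fn_cost, tn_cost)
print(model_cost, accept_all_cost)
```

Comparing against a simple baseline such as accepting every applicant gives a sense of how much cost the predictions actually save.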
Using default parameters, it is assumed that the interest rate is 63%, the cost of running the fund is 16.5%, the maximum credit line is 25,000, the loss given default is 75%, the term length is 24 months, and the loan to income ratio is 3. The default parameters are based on [2].
These assumptions can be changed by passing your own values to the load_credit_scoring_pakdd function:
from empulse.datasets import load_credit_scoring_pakdd

X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_credit_scoring_pakdd(
    return_X_y_costs=True,
    interest_rate=0.63,
    fund_cost=0.165,
    max_credit_line=25000,
    loss_given_default=0.75,
    term_length_months=24,
    loan_to_income_ratio=3,
)
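To get a feel for how some of these parameters interact, the sketch below works through a small calculation. It assumes that the credit line is the applicant's monthly income multiplied by the loan to income ratio, capped at the maximum credit line, and that the loss on default is the credit line multiplied by the loss given default. This capping rule is an assumption inferred from the parameter names and the approach in [2]; the helper functions are hypothetical and the library's actual computation may differ.

```python
def credit_line(monthly_income, loan_to_income_ratio=3, max_credit_line=25_000):
    """Assumed rule: income times the ratio, capped at the maximum credit line."""
    return min(loan_to_income_ratio * monthly_income, max_credit_line)


def loss_if_default(line, loss_given_default=0.75):
    """Amount lost if the client defaults on the full credit line."""
    return line * loss_given_default


# Hypothetical applicants with different monthly incomes (R$).
for income in (2_000, 10_000):
    line = credit_line(income)
    print(income, line, loss_if_default(line))
```

Under these assumptions, an income of 2,000 gives a credit line of 6,000 and a potential loss of 4,500, while an income of 10,000 hits the 25,000 cap and a potential loss of 18,750.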
5.3.4. Data Description
Variable Name | Description | Type |
---|---|---|
age | Applicant’s age | numeric |
personal_net_income | Applicant’s personal monthly net income in Brazilian currency (R$) | numeric |
partner_income | Applicant’s partner’s monthly net income in Brazilian currency (R$) | numeric |
months_in_residence | Time in the current residence in months | numeric |
months_in_the_job | Time in the current job in months | numeric |
payment_day | Fixed day of the month selected for the monthly payment | numeric |
n_banking_accounts | Number of banking accounts held by the applicant | numeric |
n_additional_cards | Number of additional cards requested in the same application form | numeric |
is_male | Whether the applicant is male (‘yes’ = 1, ‘no’ = 0) | binary |
has_residential_phone | Whether the applicant has a residential phone (‘yes’ = 1, ‘no’ = 0) | binary |
has_mobile_phone | Whether the applicant has a mobile phone (‘yes’ = 1, ‘no’ = 0) | binary |
has_contact_phone | Whether the applicant has a contact phone (‘yes’ = 1, ‘no’ = 0) | binary |
has_same_postal_address | Whether the applicant receives mail at the same address where they live (‘yes’ = 1, ‘no’ = 0) | binary |
has_other_card | Whether the applicant has another credit or private label card (‘yes’ = 1, ‘no’ = 0) | binary |
lives_in_working_town | Whether the applicant works in the same town where they live (‘yes’ = 1, ‘no’ = 0) | binary |
lives_in_working_state | Whether the applicant works in the same state where they live (‘yes’ = 1, ‘no’ = 0) | binary |
filled_in_mothers_name | Whether the applicant filled in the mother’s name on the form (‘yes’ = 1, ‘no’ = 0) | binary |
filled_in_fathers_name | Whether the applicant filled in the father’s name on the form (‘yes’ = 1, ‘no’ = 0) | binary |
shop_rank | Company’s rating for the shop in commercial terms | ordinal |
marital_status | The marital status of the applicant (‘single’, ‘married’, ‘divorced’, ‘widow’, ‘other’) | categorical |
residence_type | The type of the applicant’s residence (‘owned’, ‘rented’, ‘parents’, ‘other’) | categorical |
area_code_residential_phone | Modified residential phone area code | categorical |
shop_code | Code of the shop where the application was made | categorical |
application_booth_code | Code of the booth where the application was handed in | categorical |
profession_code | Applicant’s profession code | categorical |
default | Whether the applicant defaulted (‘yes’ = 1, ‘no’ = 0) | binary |