5.4. 2011 Kaggle competition Give Me Some Credit
5.4.1. Summary
This is a Kaggle dataset from the credit agency Credit Fusion [1]. The goal is to predict whether a customer will default on a loan in the next two years.
Banks play a crucial role in market economies. They decide who can get finance and on what terms, and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted.
Classes | 2 |
Defaulters | 7616 |
Non-defaulters | 105299 |
Samples | 112915 |
Features | 10 |
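The class counts in the summary table imply a heavily imbalanced problem. As a quick check, the short sketch below derives the default rate from those two counts alone; the variable names are illustrative and not part of the library.

```python
# Class counts taken from the summary table.
n_defaulters = 7616
n_non_defaulters = 105299

# The two classes should add up to the reported number of samples.
n_samples = n_defaulters + n_non_defaulters

# Share of customers who defaulted (the positive class).
default_rate = n_defaulters / n_samples

print(f"samples: {n_samples}")
print(f"default rate: {default_rate:.2%}")
```

With roughly one defaulter in fifteen customers, plain accuracy is a weak yardstick here, which is one reason the dataset ships with misclassification costs.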
5.4.2. Using the Dataset
The dataset can be loaded through the load_give_me_some_credit function.
This returns a Dataset object with the following attributes:
data : the feature matrix
target : the target vector
tp_cost : the cost of a true positive
fp_cost : the cost of a false positive
fn_cost : the cost of a false negative
tn_cost : the cost of a true negative
feature_names : the feature names
target_names : the target names
DESCR : the full description of the dataset
from empulse.datasets import load_give_me_some_credit
dataset = load_give_me_some_credit()
Alternatively, the load function can return the features, target, and costs separately by setting return_X_y_costs=True.
Additionally, you can request the output as a pandas.DataFrame by setting as_frame=True.
The following code snippet demonstrates how to load the dataset and fit a model using the CSLogitClassifier:
from empulse.datasets import load_give_me_some_credit
from empulse.models import CSLogitClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_give_me_some_credit(
    return_X_y_costs=True,
    as_frame=True
)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', CSLogitClassifier())
])
pipeline.fit(
    X,
    y,
    model__tp_cost=tp_cost,
    model__fp_cost=fp_cost,
    model__fn_cost=fn_cost,
    model__tn_cost=tn_cost
)
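To make the role of the four cost terms concrete, here is a minimal sketch of how they could be combined into an average cost for a set of decisions. The helper average_cost and the toy numbers are hypothetical and not part of empulse; the sketch assumes each cost is given per instance and that the positive class (1) is a defaulter, as in the target definition.

```python
def average_cost(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost):
    """Mean cost of binary decisions under instance-dependent costs.

    Hypothetical helper for illustration only. All arguments are
    equal-length sequences; label 1 means "defaulter".
    """
    total = 0.0
    rows = zip(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost)
    for actual, predicted, tp, fp, fn, tn in rows:
        if predicted == 1:
            # Predicted defaulter: correct (tp) or a rejected good loan (fp).
            total += tp if actual == 1 else fp
        else:
            # Predicted non-defaulter: missed defaulter (fn) or correct (tn).
            total += fn if actual == 1 else tn
    return total / len(y_true)


# Toy values, made up for the example.
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
tp_cost = [0, 0, 0, 0]
fp_cost = [100, 150, 120, 90]
fn_cost = [3000, 4500, 6000, 1500]
tn_cost = [0, 0, 0, 0]

print(average_cost(y_true, y_pred, tp_cost, fp_cost, fn_cost, tn_cost))
```

In this toy case the single missed defaulter dominates the total, which shows why a cost-sensitive model may accept more false positives than an accuracy-driven one.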
5.4.3. Cost Matrix
 | Actual positive \(y_i = 1\) | Actual negative \(y_i = 0\) |
---|---|---|
Predicted positive \(\hat{y}_i = 1\) | | |
Predicted negative \(\hat{y}_i = 0\) | | |
with
\(r_i\) : loss in profit by rejecting what would have been a good loan
\(\bar{r}\) : average loss in profit by rejecting what would have been a good loan
\(\pi_0\) : percentage of defaulters
\(\pi_1\) : percentage of non-defaulters
\(Cl_i\) : credit line of the client
\(\bar{Cl}\) : average credit line
\(L_{gd}\) : the fraction of the loan amount which is lost if the client defaults
With the default parameters, the interest rate is assumed to be 4.79%, the cost of running the fund 2.94%, the maximum credit line 25,000, the loss given default 75%, the term length 24 months, and the loan-to-income ratio 3. The default parameters are based on [2].
These assumptions can be changed by passing your own values to the load_give_me_some_credit function:
from empulse.datasets import load_give_me_some_credit

X, y, tp_cost, fp_cost, fn_cost, tn_cost = load_give_me_some_credit(
    return_X_y_costs=True,
    interest_rate=0.0479,
    fund_cost=0.0294,
    max_credit_line=25000,
    loss_given_default=0.75,
    term_length_months=24,
    loan_to_income_ratio=3,
)
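As a rough illustration of how some of these parameters can interact, the sketch below works through a deliberately simplified calculation. It assumes that the credit line is the loan-to-income multiple of monthly income, capped at the maximum credit line, and that the loss on a defaulting client is the credit line times the loss given default. Both formulas and both helper functions are assumptions made for this example; the computation inside load_give_me_some_credit may include further terms and constraints.

```python
def credit_line(monthly_income, loan_to_income_ratio=3, max_credit_line=25_000):
    """Simplified credit line: income multiple, capped at a maximum.

    Hypothetical helper; an assumption for illustration, not the
    library's implementation.
    """
    return min(loan_to_income_ratio * monthly_income, max_credit_line)


def loss_if_default(line, loss_given_default=0.75):
    """Amount lost if the client defaults on the given credit line (assumed)."""
    return line * loss_given_default


# Made-up monthly incomes to show the effect of the cap.
for income in (2_000, 5_000, 12_000):
    line = credit_line(income)
    print(income, line, loss_if_default(line))
```

Under these assumptions the cap only binds for higher incomes, so raising max_credit_line would mainly change the potential loss for those clients.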
5.4.4. Data Description
Variable Name | Description | Type |
---|---|---|
monthly_income | Monthly income of borrower | numeric |
debt_ratio | Monthly debt payments, alimony, and living costs divided by monthly gross income | numeric |
revolving_utilization | Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits | numeric |
age | Age of borrower in years | numeric |
n_dependents | Number of dependents in family excluding themselves (spouse, children, etc.) | numeric |
n_open_credit_lines | Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards) | numeric |
n_real_estate_loans | Number of mortgage and real estate loans including home equity lines of credit | numeric |
n_times_late_30_59_days | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | numeric |
n_times_late_60_89_days | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | numeric |
n_times_late_over_90_days | Number of times borrower has been 90 days or more past due | numeric |
default | Whether a person experienced 90 days past due delinquency or worse (‘yes’ = 1, ‘no’ = 0) | binary |
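Because debt_ratio is defined as monthly obligations divided by monthly gross income, the absolute monthly amount can be recovered by multiplying the two columns. The sketch below does this for a single made-up record; the values and the derived names monthly_obligations and ever_late are hypothetical and not part of the dataset.

```python
# One made-up record using column names from the table.
record = {
    "monthly_income": 4000.0,
    "debt_ratio": 0.25,
    "n_times_late_30_59_days": 1,
    "n_times_late_60_89_days": 0,
    "n_times_late_over_90_days": 0,
}

# debt_ratio = obligations / income, so obligations = debt_ratio * income.
monthly_obligations = record["debt_ratio"] * record["monthly_income"]

# Simple derived flag: has the borrower been past due in any of the buckets?
late_columns = (
    "n_times_late_30_59_days",
    "n_times_late_60_89_days",
    "n_times_late_over_90_days",
)
ever_late = any(record[column] > 0 for column in late_columns)

print(monthly_obligations, ever_late)
```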