load_credit_scoring_pakdd
- empulse.datasets.load_credit_scoring_pakdd(*, as_frame=False, return_X_y_costs=False, interest_rate=0.63, fund_cost=0.165, max_credit_line=25000, loss_given_default=0.75, term_length_months=24, loan_to_income_ratio=3)
Load the credit scoring PAKDD 2009 competition dataset (binary classification).
The goal is to predict whether a customer will default on a loan in the next two years. The target variable is whether the customer defaulted, ‘yes’ = 1 and ‘no’ = 0.
Only clients with a personal income between 100 and 10000 are considered.
For a full data description and additional information about the dataset, consult the User Guide.
Classes: 2
Defaulters: 7743
Non-defaulters: 31195
Samples: 38938
Features: 25
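The class counts above imply that roughly one in five clients defaulted. A quick check of that arithmetic, using only the figures from the summary (the variable names are illustrative):

```python
# Counts taken from the dataset summary above.
n_defaulters = 7743
n_non_defaulters = 31195
n_samples = n_defaulters + n_non_defaulters

# Share of each class in the full dataset.
default_rate = n_defaulters / n_samples
non_default_rate = n_non_defaulters / n_samples

print(n_samples)                    # 38938
print(round(default_rate, 3))       # 0.199
print(round(non_default_rate, 3))   # 0.801
```

The imbalance is moderate, so plain accuracy can be misleading on this dataset; a model that never predicts default would already be right about 80% of the time.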
- Parameters:
- as_frame : bool, default=False
If True, the output will be pandas DataFrames or Series instead of NumPy arrays.
- return_X_y_costs : bool, default=False
If True, return (data, target, tp_cost, fp_cost, tn_cost, fn_cost) instead of a Dataset object.
- interest_rate : float, default=0.63
Annual interest rate of the term deposit.
- fund_cost : float, default=0.165
Annual cost of funds.
- max_credit_line : float, default=25000
The maximum amount a client can borrow.
- loss_given_default : float, default=0.75
The fraction of the loan amount that is lost if the client defaults.
- term_length_months : int, default=24
The length of the loan term in months.
- loan_to_income_ratio : float, default=3
The ratio of the loan amount to the client’s income.
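As an illustration of how max_credit_line and loan_to_income_ratio could interact, the sketch below caps a multiple of the client's income at the maximum credit line. Both the rule and the helper name estimate_credit_line are assumptions made for illustration; this page does not spell out the exact computation, and the library may apply further constraints.

```python
def estimate_credit_line(income, loan_to_income_ratio=3, max_credit_line=25000):
    """Hypothetical helper (not part of empulse).

    Assumes the credit line is a fixed multiple of income,
    capped at the maximum amount a client can borrow.
    """
    return min(loan_to_income_ratio * income, max_credit_line)


# Incomes chosen inside the 100 to 10000 range kept in the dataset.
print(estimate_credit_line(2000))    # 6000
print(estimate_credit_line(10000))   # 25000
```

Under this assumption, any client with an income above max_credit_line / loan_to_income_ratio receives the same, capped credit line.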
- Returns:
- dataset : Dataset or tuple of (data, target, tp_cost, fp_cost, tn_cost, fn_cost)
Returns a Dataset object if return_X_y_costs=False (default), otherwise a tuple.
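A minimal sketch of the tuple form of the return value, wrapped in a hypothetical helper named load_pakdd_split. It assumes empulse and scikit-learn are installed, and that fp_cost and fn_cost come back as per-sample arrays (the cost matrix in the Notes suggests this, since both depend on client-specific quantities).

```python
def load_pakdd_split(test_size=0.25, random_state=42):
    """Load the dataset as a tuple and split features, target and costs together."""
    # Imported inside the function so the helper can be defined
    # even when the packages are not installed.
    from empulse.datasets import load_credit_scoring_pakdd
    from sklearn.model_selection import train_test_split

    X, y, tp_cost, fp_cost, tn_cost, fn_cost = load_credit_scoring_pakdd(
        return_X_y_costs=True
    )
    # Assumption: fp_cost and fn_cost hold one value per sample,
    # so they can be split alongside X and y.
    return train_test_split(
        X, y, fp_cost, fn_cost, test_size=test_size, random_state=random_state
    )
```

train_test_split returns a train and test part for each array passed in, so the helper yields eight arrays in the order X, y, fp_cost, fn_cost.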
Notes
Cost matrix
| | Actual positive \(y_i = 1\) | Actual negative \(y_i = 0\) |
|---|---|---|
| Predicted positive \(\hat{y}_i = 1\) | tp_cost \(= 0\) | fp_cost \(= r_i - \bar{r} \cdot \pi_0 + \bar{Cl} \cdot L_{gd} \cdot \pi_1\) |
| Predicted negative \(\hat{y}_i = 0\) | fn_cost \(= Cl_i \cdot L_{gd}\) | tn_cost \(= 0\) |

with:
\(r_i\) : loss in profit by rejecting what would have been a good loan
\(\bar{r}\) : average loss in profit by rejecting what would have been a good loan
\(\pi_0\) : percentage of non-defaulters (\(y_i = 0\))
\(\pi_1\) : percentage of defaulters (\(y_i = 1\))
\(Cl_i\) : credit line of the client
\(\bar{Cl}\) : average credit line
\(L_{gd}\) : the fraction of the loan amount which is lost if the client defaults
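The two non-zero entries of the cost matrix can be written out directly from the formulas above. The sketch below is plain Python with made-up input values; the function names are illustrative and are not part of empulse.

```python
def false_positive_cost(r_i, r_avg, cl_avg, lgd, pi_0, pi_1):
    """fp_cost = r_i - r_avg * pi_0 + cl_avg * lgd * pi_1."""
    return r_i - r_avg * pi_0 + cl_avg * lgd * pi_1


def false_negative_cost(cl_i, lgd):
    """fn_cost = Cl_i * L_gd."""
    return cl_i * lgd


# Hypothetical values, chosen only to make the arithmetic easy to follow.
fp = false_positive_cost(
    r_i=1200, r_avg=1000, cl_avg=8000, lgd=0.75, pi_0=0.8, pi_1=0.2
)
fn = false_negative_cost(cl_i=5000, lgd=0.75)

print(fp)  # 1600.0
print(fn)  # 3750.0
```

Because fp_cost depends on \(r_i\) and fn_cost on \(Cl_i\), both costs differ from client to client, while tp_cost and tn_cost are zero for everyone.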
References
[1] A. Correa Bahnsen, D. Aouada, B. Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.
Examples
from empulse.datasets import load_credit_scoring_pakdd
from sklearn.model_selection import train_test_split

dataset = load_credit_scoring_pakdd()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=42
)
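Once predictions are available, the cost matrix from the Notes can be used to score them. The following is a minimal pure-Python sketch with made-up labels and costs; total_cost is a hypothetical helper, not an empulse function, and it assumes per-sample fp_cost and fn_cost with constant tp_cost and tn_cost.

```python
def total_cost(y_true, y_pred, fp_cost, fn_cost, tp_cost=0.0, tn_cost=0.0):
    """Sum the example-dependent cost of each prediction."""
    total = 0.0
    for y, p, fp, fn in zip(y_true, y_pred, fp_cost, fn_cost):
        if y == 1 and p == 1:
            total += tp_cost   # defaulter correctly flagged
        elif y == 0 and p == 1:
            total += fp        # good client wrongly flagged
        elif y == 1 and p == 0:
            total += fn        # defaulter missed
        else:
            total += tn_cost   # good client correctly accepted
    return total


# Hypothetical labels, predictions and per-sample costs.
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
fp_costs = [1600.0, 1500.0, 1700.0, 1400.0]
fn_costs = [3750.0, 3000.0, 4500.0, 2250.0]

print(total_cost(y_true, y_pred, fp_costs, fn_costs))  # 6000.0
```

Here only the second sample (a false positive, 1500) and the third sample (a false negative, 4500) contribute to the total.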