load_give_me_some_credit
- empulse.datasets.load_give_me_some_credit(*, as_frame=False, return_X_y_costs=False, interest_rate=0.0479, fund_cost=0.0294, max_credit_line=25000, loss_given_default=0.75, term_length_months=24, loan_to_income_ratio=3)
Load the “Give Me Some Credit” Kaggle credit scoring competition dataset (binary classification).
The goal is to predict whether a customer will default on a loan in the next two years. The target variable indicates whether the customer defaulted (1 = yes, 0 = no).
Only customers with a positive monthly income and a debt ratio less than 1 are considered.
For a full data description and additional information about the dataset, consult the User Guide.
- Classes: 2
- Defaulters: 7616
- Non-defaulters: 105299
- Samples: 112915
- Features: 10
- Parameters:
- as_frame : bool, default=False
If True, the output will be pandas DataFrames or Series instead of numpy arrays.
- return_X_y_costs : bool, default=False
If True, return (data, target, tp_cost, fp_cost, tn_cost, fn_cost) instead of a Dataset object.
- interest_rate : float, default=0.0479
Annual interest rate charged on the loan.
- fund_cost : float, default=0.0294
Annual cost of funds.
- max_credit_line : float, default=25000
The maximum amount a client can borrow.
- loss_given_default : float, default=0.75
The fraction of the loan amount which is lost if the client defaults.
- term_length_months : int, default=24
The length of the loan term in months.
- loan_to_income_ratio : float, default=3
The ratio of the loan amount to the client’s income.
- Returns:
- dataset : Dataset or tuple of (data, target, tp_cost, fp_cost, tn_cost, fn_cost)
Returns a Dataset object if return_X_y_costs=False (default), otherwise a tuple.
Notes
Cost matrix
| | Actual positive \(y_i = 1\) | Actual negative \(y_i = 0\) |
|---|---|---|
| Predicted positive \(\hat{y}_i = 1\) | tp_cost \(= 0\) | fp_cost \(= r_i - \bar{r} \cdot \pi_0 + \bar{Cl} \cdot L_{gd} \cdot \pi_1\) |
| Predicted negative \(\hat{y}_i = 0\) | fn_cost \(= Cl_i \cdot L_{gd}\) | tn_cost \(= 0\) |

with
\(r_i\) : loss in profit by rejecting what would have been a good loan
\(\bar{r}\) : average loss in profit by rejecting what would have been a good loan
\(\pi_0\) : proportion of non-defaulters (\(y_i = 0\))
\(\pi_1\) : proportion of defaulters (\(y_i = 1\))
\(Cl_i\) : credit line of the client
\(\bar{Cl}\) : average credit line
\(L_{gd}\) : the fraction of the loan amount which is lost if the client defaults
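The cost matrix can be sketched in NumPy as follows. This is an illustration of the formulas only, not the library's implementation: `credit_costs` is a hypothetical helper, the values of `r` and `cl` are made up, and it assumes that the subscript of \(\pi\) matches the class label (so \(\pi_1\) is the share of defaulters) and that \(\bar{r}\) and \(\bar{Cl}\) are plain means over the clients passed in.

```python
import numpy as np

# Class counts reported for this dataset.
n_defaulters, n_non_defaulters = 7616, 105299
n_samples = n_defaulters + n_non_defaulters

# Assumption: pi_1 is the share of defaulters (y = 1),
# pi_0 the share of non-defaulters (y = 0).
pi_1 = n_defaulters / n_samples
pi_0 = n_non_defaulters / n_samples


def credit_costs(r, cl, pi_0, pi_1, loss_given_default=0.75):
    """Return (tp_cost, fp_cost, tn_cost, fn_cost) per the cost matrix (hypothetical helper)."""
    r = np.asarray(r, dtype=float)    # r_i: profit lost by rejecting a good loan
    cl = np.asarray(cl, dtype=float)  # Cl_i: credit line of each client
    # fp_cost = r_i - mean(r) * pi_0 + mean(Cl) * L_gd * pi_1
    fp_cost = r - r.mean() * pi_0 + cl.mean() * loss_given_default * pi_1
    # fn_cost = Cl_i * L_gd
    fn_cost = cl * loss_given_default
    return 0.0, fp_cost, 0.0, fn_cost


# Made-up per-client values, for illustration only.
r = np.array([300.0, 500.0, 700.0])
cl = np.array([6000.0, 9000.0, 15000.0])
tp_cost, fp_cost, tn_cost, fn_cost = credit_costs(r, cl, pi_0, pi_1)
```

Both correct predictions cost nothing, while `fp_cost` and `fn_cost` vary per client, which is what makes the costs example-dependent.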
References
[1] A. Correa Bahnsen, D. Aouada, B. Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.
Examples
from empulse.datasets import load_give_me_some_credit
from sklearn.model_selection import train_test_split

dataset = load_give_me_some_credit()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=42
)
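With return_X_y_costs=True, the loader returns the four cost terms alongside the data, and these can be used to score a set of predictions by total cost. The sketch below shows one way to do that. `total_cost` is a hypothetical helper and not part of empulse, and the arrays are toy values standing in for the loader's output.

```python
import numpy as np


def total_cost(y_true, y_pred, tp_cost=0.0, fp_cost=0.0, tn_cost=0.0, fn_cost=0.0):
    """Sum the cost of each prediction; costs may be scalars or per-sample arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cost = (
        y_true * (y_pred * tp_cost + (1 - y_pred) * fn_cost)
        + (1 - y_true) * (y_pred * fp_cost + (1 - y_pred) * tn_cost)
    )
    return float(np.sum(cost))


# With the real data, the inputs would come from:
# X, y, tp_cost, fp_cost, tn_cost, fn_cost = load_give_me_some_credit(return_X_y_costs=True)

# Toy values for illustration only.
y_true = np.array([1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0])
fp_cost = np.array([600.0, 800.0, 1000.0, 700.0])
fn_cost = np.array([4500.0, 6750.0, 11250.0, 3000.0])

# One false positive (800.0) plus one false negative (3000.0).
cost = total_cost(y_true, y_pred, fp_cost=fp_cost, fn_cost=fn_cost)
```

Because a missed defaulter costs a fraction of the full credit line, a single false negative can outweigh several false positives, which is why total cost is often a more informative score here than accuracy.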