load_give_me_some_credit#

empulse.datasets.load_give_me_some_credit(*, as_frame=False, return_X_y_costs=False, interest_rate=0.0479, fund_cost=0.0294, max_credit_line=25000, loss_given_default=0.75, term_length_months=24, loan_to_income_ratio=3)[source]#

Load the “Give Me Some Credit” Kaggle credit scoring competition dataset (binary classification).

The goal is to predict whether a customer will default on a loan in the next two years. The target variable is whether the customer defaulted, ‘yes’ = 1 and ‘no’ = 0.

Only customers with a positive monthly income and a debt ratio less than 1 are considered.

For a full data description and additional information about the dataset, consult the User Guide.

Classes	2
Defaulters	7616
Non-defaulters	105299
Samples	112915
Features	10

Parameters:

as_framebool, default=False: If True, the output will be a pandas DataFrames or Series instead of numpy arrays.
return_X_y_costsbool, default=False: If True, return (data, target, tp_cost, fp_cost, tn_cost, fn_cost) instead of a Dataset object.
interest_ratefloat, default=0.02463333: Annual interest rate of the term deposit.
fund_costfloat, default=0.0294: Annual cost of funds.
max_credit_linefloat, default=25000: The maximum amount a client can borrow.
loss_given_defaultfloat, default=0.75: The fraction of the loan amount which is lost if the client defaults.
term_length_monthsint, default=24: The length of the loan term in months.
loan_to_income_ratiofloat, default=3: The ratio of the loan amount to the client’s income.

Returns:

datasetDataset or tuple of (data, target, tp_cost, fp_cost, tn_cost, fn_cost): Returns a Dataset object if return_X_y_costs=False (default), otherwise a tuple.

Notes

Cost matrix

	Actual positive \(y_i = 1\)	Actual negative \(y_i = 0\)
Predicted positive \(\hat{y}_i = 1\)	`tp_cost` \(= 0\)	`fp_cost` \(= r_i + -\bar{r} \cdot \pi_0 + \bar{Cl} \cdot L_{gd} \cdot \pi_1\)
Predicted negative \(\hat{y}_i = 0\)	`fn_cost` \(= Cl_i \cdot L_{gd}\)	`tn_cost` \(= 0\)

with

\(r_i\) : loss in profit by rejecting what would have been a good loan
\(\bar{r}\) : average loss in profit by rejecting what would have been a good loan
\(\pi_0\) : percentage of defaulters
\(\pi_1\) : percentage of non-defaulters
\(Cl_i\) : credit line of the client
\(\bar{Cl}\) : average credit line
\(L_{gd}\) : the fraction of the loan amount which is lost if the client defaults

References

[1]

A. Correa Bahnsen, D.Aouada, B, Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.

Examples

from empulse.datasets import load_give_me_some_credit
from sklearn.model_selection import train_test_split

dataset = load_give_me_some_credit()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=42
)