load_give_me_some_credit

empulse.datasets.load_give_me_some_credit(*, as_frame=False, return_X_y_costs=False, interest_rate=0.0479, fund_cost=0.0294, max_credit_line=25000, loss_given_default=0.75, term_length_months=24, loan_to_income_ratio=3)

Load the “Give Me Some Credit” Kaggle credit scoring competition dataset (binary classification).

The goal is to predict whether a customer will default on a loan within the next two years. The target variable indicates whether the customer defaulted (‘yes’ = 1, ‘no’ = 0).

Only customers with a positive monthly income and a debt ratio less than 1 are considered.
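The filter described above can be illustrated with a minimal sketch in plain Python. The records and the field names monthly_income and debt_ratio are hypothetical and chosen only for illustration; the loader applies its own filtering internally, and its column names may differ.

```python
# Hypothetical applicant records; field names are illustrative only.
applicants = [
    {"monthly_income": 5400.0, "debt_ratio": 0.35},  # kept
    {"monthly_income": 0.0, "debt_ratio": 0.10},     # dropped: income is not positive
    {"monthly_income": 3200.0, "debt_ratio": 1.20},  # dropped: debt ratio is not below 1
]


def is_eligible(row):
    """Keep rows with a positive monthly income and a debt ratio below 1."""
    return row["monthly_income"] > 0 and row["debt_ratio"] < 1


eligible = [row for row in applicants if is_eligible(row)]
print(len(eligible), "of", len(applicants), "records kept")
```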

For a full data description and additional information about the dataset, consult the User Guide.

Classes: 2
Defaulters: 7616
Non-defaulters: 105299
Samples: 112915
Features: 10

Parameters:
as_frame : bool, default=False

If True, the output will be pandas DataFrames or Series instead of numpy arrays.

return_X_y_costs : bool, default=False

If True, return (data, target, tp_cost, fp_cost, tn_cost, fn_cost) instead of a Dataset object.

interest_rate : float, default=0.0479

Annual interest rate of the loan.

fund_cost : float, default=0.0294

Annual cost of funds.

max_credit_line : float, default=25000

The maximum amount a client can borrow.

loss_given_default : float, default=0.75

The fraction of the loan amount which is lost if the client defaults.

term_length_months : int, default=24

The length of the loan term in months.

loan_to_income_ratio : float, default=3

The ratio of the loan amount to the client’s income.

Returns:
dataset : Dataset or tuple of (data, target, tp_cost, fp_cost, tn_cost, fn_cost)

Returns a Dataset object if return_X_y_costs=False (default), otherwise a tuple.

Notes

Cost matrix

| | Actual positive \(y_i = 1\) | Actual negative \(y_i = 0\) |
|---|---|---|
| Predicted positive \(\hat{y}_i = 1\) | tp_cost \(= 0\) | fp_cost \(= r_i - \bar{r} \cdot \pi_0 + \bar{Cl} \cdot L_{gd} \cdot \pi_1\) |
| Predicted negative \(\hat{y}_i = 0\) | fn_cost \(= Cl_i \cdot L_{gd}\) | tn_cost \(= 0\) |

with
  • \(r_i\) : loss in profit by rejecting what would have been a good loan

  • \(\bar{r}\) : average loss in profit by rejecting what would have been a good loan

  • \(\pi_0\) : fraction of non-defaulters

  • \(\pi_1\) : fraction of defaulters

  • \(Cl_i\) : credit line of the client

  • \(\bar{Cl}\) : average credit line

  • \(L_{gd}\) : the fraction of the loan amount which is lost if the client defaults

References

[1]

A. Correa Bahnsen, D. Aouada, B. Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.

Examples

from empulse.datasets import load_give_me_some_credit
from sklearn.model_selection import train_test_split

dataset = load_give_me_some_credit()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=42
)