load_credit_scoring_pakdd

empulse.datasets.load_credit_scoring_pakdd(*, as_frame=False, return_X_y_costs=False, interest_rate=0.63, fund_cost=0.165, max_credit_line=25000, loss_given_default=0.75, term_length_months=24, loan_to_income_ratio=3)

Load the credit scoring PAKDD 2009 competition dataset (binary classification).

The goal is to predict whether a customer will default on a loan in the next two years. The target variable is whether the customer defaulted, ‘yes’ = 1 and ‘no’ = 0.

Only clients with a personal income between 100 and 10000 are considered.

For a full data description and additional information about the dataset, consult the User Guide.

Classes	2
Defaulters	7743
Non-defaulters	31195
Samples	38938
Features	25

Parameters:

as_frame (bool, default=False): If True, the output will be pandas DataFrames or Series instead of numpy arrays.
return_X_y_costs (bool, default=False): If True, return (data, target, tp_cost, fp_cost, tn_cost, fn_cost) instead of a Dataset object.
interest_rate (float, default=0.63): Annual interest rate of the term deposit.
fund_cost (float, default=0.165): Annual cost of funds.
max_credit_line (float, default=25000): The maximum amount a client can borrow.
loss_given_default (float, default=0.75): The fraction of the loan amount that is lost if the client defaults.
term_length_months (int, default=24): The length of the loan term in months.
loan_to_income_ratio (float, default=3): The ratio of the loan amount to the client’s income.
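
As an illustration of how loan_to_income_ratio and max_credit_line might interact, the sketch below derives a credit line from a client's income. It assumes a simplified rule (income times the ratio, capped at the maximum); this is an assumption made for illustration, not a confirmed description of how the library computes credit lines, and `credit_line` is a hypothetical helper that is not part of empulse.

```python
def credit_line(income, loan_to_income_ratio=3, max_credit_line=25000):
    """Hypothetical helper: credit line under an assumed simplified rule.

    Assumption: the loan amount is income * loan_to_income_ratio,
    capped at max_credit_line.
    """
    return min(loan_to_income_ratio * income, max_credit_line)


# A client with an income of 2000 stays below the cap.
low_income_line = credit_line(2000)

# A client with an income of 10000 would exceed the cap, so the cap applies.
high_income_line = credit_line(10000)
```

Under this assumed rule, the default values cap the credit line for any income above 25000 / 3.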

Returns:

dataset (Dataset or tuple of (data, target, tp_cost, fp_cost, tn_cost, fn_cost)): Returns a Dataset object if return_X_y_costs=False (default), otherwise a tuple.

Notes

Cost matrix

	Actual positive \(y_i = 1\)	Actual negative \(y_i = 0\)
Predicted positive \(\hat{y}_i = 1\)	`tp_cost` \(= 0\)	`fp_cost` \(= r_i - \bar{r} \cdot \pi_0 + \bar{Cl} \cdot L_{gd} \cdot \pi_1\)
Predicted negative \(\hat{y}_i = 0\)	`fn_cost` \(= Cl_i \cdot L_{gd}\)	`tn_cost` \(= 0\)

with

\(r_i\) : loss in profit by rejecting what would have been a good loan
\(\bar{r}\) : average loss in profit by rejecting what would have been a good loan
\(\pi_0\) : percentage of non-defaulters
\(\pi_1\) : percentage of defaulters
\(Cl_i\) : credit line of the client
\(\bar{Cl}\) : average credit line
\(L_{gd}\) : the fraction of the loan amount which is lost if the client defaults
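
The cost matrix above can be sketched for a single client as follows. This is a minimal illustration of the two non-zero entries, not the library's implementation: `cost_matrix_entries` is a hypothetical helper, \(r_i\) and \(\bar{r}\) are taken as given inputs because their computation is not shown here, \(\pi_0\) and \(\pi_1\) are used as defined in the list above, and all numbers are arbitrary values that do not come from the dataset.

```python
def cost_matrix_entries(r_i, r_avg, cl_i, cl_avg, lgd, pi_0, pi_1):
    """Hypothetical helper: return (tp_cost, fp_cost, tn_cost, fn_cost) for one client."""
    tp_cost = 0.0
    tn_cost = 0.0
    # Predicted negative, actual positive: Cl_i * L_gd
    fn_cost = cl_i * lgd
    # Predicted positive, actual negative: r_i - r_avg * pi_0 + Cl_avg * L_gd * pi_1
    fp_cost = r_i - r_avg * pi_0 + cl_avg * lgd * pi_1
    return tp_cost, fp_cost, tn_cost, fn_cost


# Arbitrary illustrative inputs (not taken from the dataset).
tp_cost, fp_cost, tn_cost, fn_cost = cost_matrix_entries(
    r_i=1200.0,
    r_avg=1000.0,
    cl_i=6000.0,
    cl_avg=5000.0,
    lgd=0.75,
    pi_0=0.8,
    pi_1=0.2,
)
```

With these inputs, fn_cost is 6000 · 0.75 = 4500 and fp_cost is 1200 − 800 + 750 = 1150, so in this example a false negative is roughly four times as costly as a false positive.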

References

[1] A. Correa Bahnsen, D. Aouada, B. Ottersten, “Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring”, in Proceedings of the International Conference on Machine Learning and Applications, 2014.

Examples

from empulse.datasets import load_credit_scoring_pakdd
from sklearn.model_selection import train_test_split

dataset = load_credit_scoring_pakdd()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=42
)