Preprocessing#

Loading the data and performing some initial exploration on it to acquire some basic knowledge about the data, how the various features are distributed.

Importing libraries and packages#

1# Mathematical operations and data manipulation
2import pandas as pd
3
4# Warnings
5import warnings
6
7warnings.filterwarnings("ignore")
8%matplotlib inline

Set paths#

1# Path to datasets directory
2data_path = "./datasets"
3# Path to assets directory (for saving results to)
4assets_path = "./assets"

Loading dataset#

1# load data
2dataset = pd.read_excel(f"{data_path}/default_credit.xls")
3dataset.head().T
0 1 2 3 4
ID 1 2 3 4 5
LIMIT_BAL 20000 120000 90000 50000 50000
SEX 2 2 2 2 1
EDUCATION 2 2 2 2 2
MARRIAGE 1 2 2 1 1
AGE 24 26 34 37 57
PAY_0 2 -1 0 0 -1
PAY_2 2 2 0 0 0
PAY_3 -1 0 0 0 -1
PAY_4 -1 0 0 0 0
PAY_5 -2 0 0 0 0
PAY_6 -2 2 0 0 0
BILL_AMT1 3913 2682 29239 46990 8617
BILL_AMT2 3102 1725 14027 48233 5670
BILL_AMT3 689 2682 13559 49291 35835
BILL_AMT4 0 3272 14331 28314 20940
BILL_AMT5 0 3455 14948 28959 19146
BILL_AMT6 0 3261 15549 29547 19131
PAY_AMT1 0 0 1518 2000 2000
PAY_AMT2 689 1000 1500 2019 36681
PAY_AMT3 0 1000 1000 1200 10000
PAY_AMT4 0 1000 1000 1100 9000
PAY_AMT5 0 0 1000 1069 689
PAY_AMT6 0 2000 5000 1000 679
default payment next month 1 1 0 0 0

Exploring dataset#

1# Printing dimensionality of the data, columns, types and missing values
2print(f"Data dimension: {dataset.shape}")
3for col in dataset.columns:
4    print(
5        f"Column: {col:35} | "
6        f"type: {str(dataset[col].dtype):7} | "
7        f"missing values: {dataset[col].isna().sum():3d}"
8    )
Data dimension: (30000, 25)
Column: ID                                  | type: int64   | missing values:   0
Column: LIMIT_BAL                           | type: int64   | missing values:   0
Column: SEX                                 | type: int64   | missing values:   0
Column: EDUCATION                           | type: int64   | missing values:   0
Column: MARRIAGE                            | type: int64   | missing values:   0
Column: AGE                                 | type: int64   | missing values:   0
Column: PAY_0                               | type: int64   | missing values:   0
Column: PAY_2                               | type: int64   | missing values:   0
Column: PAY_3                               | type: int64   | missing values:   0
Column: PAY_4                               | type: int64   | missing values:   0
Column: PAY_5                               | type: int64   | missing values:   0
Column: PAY_6                               | type: int64   | missing values:   0
Column: BILL_AMT1                           | type: int64   | missing values:   0
Column: BILL_AMT2                           | type: int64   | missing values:   0
Column: BILL_AMT3                           | type: int64   | missing values:   0
Column: BILL_AMT4                           | type: int64   | missing values:   0
Column: BILL_AMT5                           | type: int64   | missing values:   0
Column: BILL_AMT6                           | type: int64   | missing values:   0
Column: PAY_AMT1                            | type: int64   | missing values:   0
Column: PAY_AMT2                            | type: int64   | missing values:   0
Column: PAY_AMT3                            | type: int64   | missing values:   0
Column: PAY_AMT4                            | type: int64   | missing values:   0
Column: PAY_AMT5                            | type: int64   | missing values:   0
Column: PAY_AMT6                            | type: int64   | missing values:   0
Column: default payment next month          | type: int64   | missing values:   0
1# Computing statistics on numerical features
2dataset.describe().T
count mean std min 25% 50% 75% max
ID 30000.0 15000.500000 8660.398374 1.0 7500.75 15000.5 22500.25 30000.0
LIMIT_BAL 30000.0 167484.322667 129747.661567 10000.0 50000.00 140000.0 240000.00 1000000.0
SEX 30000.0 1.603733 0.489129 1.0 1.00 2.0 2.00 2.0
EDUCATION 30000.0 1.853133 0.790349 0.0 1.00 2.0 2.00 6.0
MARRIAGE 30000.0 1.551867 0.521970 0.0 1.00 2.0 2.00 3.0
AGE 30000.0 35.485500 9.217904 21.0 28.00 34.0 41.00 79.0
PAY_0 30000.0 -0.016700 1.123802 -2.0 -1.00 0.0 0.00 8.0
PAY_2 30000.0 -0.133767 1.197186 -2.0 -1.00 0.0 0.00 8.0
PAY_3 30000.0 -0.166200 1.196868 -2.0 -1.00 0.0 0.00 8.0
PAY_4 30000.0 -0.220667 1.169139 -2.0 -1.00 0.0 0.00 8.0
PAY_5 30000.0 -0.266200 1.133187 -2.0 -1.00 0.0 0.00 8.0
PAY_6 30000.0 -0.291100 1.149988 -2.0 -1.00 0.0 0.00 8.0
BILL_AMT1 30000.0 51223.330900 73635.860576 -165580.0 3558.75 22381.5 67091.00 964511.0
BILL_AMT2 30000.0 49179.075167 71173.768783 -69777.0 2984.75 21200.0 64006.25 983931.0
BILL_AMT3 30000.0 47013.154800 69349.387427 -157264.0 2666.25 20088.5 60164.75 1664089.0
BILL_AMT4 30000.0 43262.948967 64332.856134 -170000.0 2326.75 19052.0 54506.00 891586.0
BILL_AMT5 30000.0 40311.400967 60797.155770 -81334.0 1763.00 18104.5 50190.50 927171.0
BILL_AMT6 30000.0 38871.760400 59554.107537 -339603.0 1256.00 17071.0 49198.25 961664.0
PAY_AMT1 30000.0 5663.580500 16563.280354 0.0 1000.00 2100.0 5006.00 873552.0
PAY_AMT2 30000.0 5921.163500 23040.870402 0.0 833.00 2009.0 5000.00 1684259.0
PAY_AMT3 30000.0 5225.681500 17606.961470 0.0 390.00 1800.0 4505.00 896040.0
PAY_AMT4 30000.0 4826.076867 15666.159744 0.0 296.00 1500.0 4013.25 621000.0
PAY_AMT5 30000.0 4799.387633 15278.305679 0.0 252.50 1500.0 4031.50 426529.0
PAY_AMT6 30000.0 5215.502567 17777.465775 0.0 117.75 1500.0 4000.00 528666.0
default payment next month 30000.0 0.221200 0.415062 0.0 0.00 0.0 0.00 1.0

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

X2: Gender (1 = male; 2 = female).

X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

X4: Marital status (1 = married; 2 = single; 3 = others).

X5: Age (year).

X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; …;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; …; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; …; X17 = amount of bill statement in April, 2005.

X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; …;X23 = amount paid in April, 2005.

Preprocessing#

1print("SEX " + str(sorted(dataset["SEX"].unique())))
SEX [1, 2]
1print("PAY_0 " + str(sorted(dataset["PAY_0"].unique())))
PAY_0 [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8]
1# Printing the unique values in the default payment next month column,
2# sorted in asc order
3print(
4    "default.payment.next.month: "
5    + str(sorted(dataset["default payment next month"].unique()))
6)
default.payment.next.month: [0, 1]
1# Renaming the PAY_0 column to PAY_1 and the default payment next
2# month column to DEFAULT to maintain consistency with the naming
3# of other columns
4dataset = dataset.rename(
5    columns={"default payment next month": "DEFAULT", "PAY_0": "PAY_1"}
6)
1print("EDUCATION " + str(sorted(dataset["EDUCATION"].unique())))
EDUCATION [0, 1, 2, 3, 4, 5, 6]

The EDUCATION column has 7 unique values, but the data description lists 4 unique values.

1# Joining categories 0, 5, and 6 with category 4
2fill = (
3    (dataset.EDUCATION == 0)
4    | (dataset.EDUCATION == 5)
5    | (dataset.EDUCATION == 6)
6)
7dataset.loc[fill, "EDUCATION"] = 4
8
9print("EDUCATION " + str(sorted(dataset["EDUCATION"].unique())))
EDUCATION [1, 2, 3, 4]
1print("MARRIAGE " + str(sorted(dataset["MARRIAGE"].unique())))
MARRIAGE [0, 1, 2, 3]

The data description lists 3 unique values, this lists 4.

1# Combining category 0 with category 2 (Single)
2fill = dataset.MARRIAGE == 0
3dataset.loc[fill, "MARRIAGE"] = 2
1dataset.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 DEFAULT
0 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns

1dataset.to_csv(f"{data_path}/preprocessed_default_credit.csv", index=False)