Preprocessing

We load the data and run some initial exploration to get a basic understanding of it, including how the various features are distributed.

Importing libraries and packages

# Mathematical operations and data manipulation
import pandas as pd

# Warnings
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

Set paths

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Loading dataset

# Load data and show the first five rows, transposed for readability
dataset = pd.read_csv(f"{data_path}/online_retail_II.csv")
dataset.head().T
            | 0                                   | 1                  | 2                   | 3                           | 4
Invoice     | 489434                              | 489434             | 489434              | 489434                      | 489434
StockCode   | 85048                               | 79323P             | 79323W              | 22041                       | 21232
Description | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | PINK CHERRY LIGHTS | WHITE CHERRY LIGHTS | RECORD FRAME 7" SINGLE SIZE | STRAWBERRY CERAMIC TRINKET BOX
Quantity    | 12                                  | 12                 | 12                  | 48                          | 24
InvoiceDate | 01/12/2009 07:45                    | 01/12/2009 07:45   | 01/12/2009 07:45    | 01/12/2009 07:45            | 01/12/2009 07:45
Price       | 6.95                                | 6.75               | 6.75                | 2.1                         | 1.25
Customer ID | 13085.0                             | 13085.0            | 13085.0             | 13085.0                     | 13085.0
Country     | United Kingdom                      | United Kingdom     | United Kingdom      | United Kingdom              | United Kingdom

Exploring dataset

# Printing dimensionality of the data, columns, types and missing values
print(f"Data dimension: {dataset.shape}")
for col in dataset.columns:
    print(
        f"Column: {col:35} | "
        f"type: {str(dataset[col].dtype):7} | "
        f"missing values: {dataset[col].isna().sum():3d}"
    )
Data dimension: (525461, 8)
Column: Invoice                             | type: object  | missing values:   0
Column: StockCode                           | type: object  | missing values:   0
Column: Description                         | type: object  | missing values: 2928
Column: Quantity                            | type: int64   | missing values:   0
Column: InvoiceDate                         | type: object  | missing values:   0
Column: Price                               | type: float64 | missing values:   0
Column: Customer ID                         | type: float64 | missing values: 107927
Column: Country                             | type: object  | missing values:   0

The Description column has a small number of missing values (2,928, about 0.6% of rows), while Customer ID is missing in 107,927 rows (about 20.5%).
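
The share of missing values per column can also be computed directly instead of being read off the counts. The following is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical and only stand in for `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `dataset`; the values are made up.
toy = pd.DataFrame(
    {
        "Description": ["A", None, "C", "D", "E"],
        "Customer ID": [13085.0, np.nan, np.nan, 13078.0, 15362.0],
        "Quantity": [12, 12, 48, 24, 6],
    }
)

# isna().mean() gives the fraction of missing entries in each column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Applied to `dataset`, the same expression should report roughly 0.6% for Description and 20.5% for Customer ID, in line with the counts printed above.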

# Computing statistics on numerical features
dataset.describe().T
            |    count |         mean |         std |       min |      25% |     50% |      75% |      max
Quantity    | 525461.0 |    10.337667 |  107.424110 |  -9600.00 |     1.00 |     3.0 |    10.00 | 19152.00
Price       | 525461.0 |     4.688834 |  146.126914 | -53594.36 |     1.25 |     2.1 |     4.21 | 25111.09
Customer ID | 417534.0 | 15360.645478 | 1680.811316 |  12346.00 | 13983.00 | 15311.0 | 16799.00 | 18287.00
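
The summary shows negative minimum values for both Quantity and Price. This notebook does not say how such rows should be treated, so the following is only a sketch of one way to count and inspect them with boolean masks. The frame `toy` is a made-up stand-in for `dataset`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `dataset`; the values are made up.
toy = pd.DataFrame(
    {
        "Quantity": [12, -3, 48, 0, -1],
        "Price": [6.95, 2.10, -5.00, 1.25, 4.21],
    }
)

# Count the rows with a negative value in each numeric column.
negative_counts = (toy[["Quantity", "Price"]] < 0).sum()
print(negative_counts)

# Select rows where either column is negative, for manual inspection.
suspect_rows = toy[(toy["Quantity"] < 0) | (toy["Price"] < 0)]
print(suspect_rows)
```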

Preprocessing

# Rename columns to short snake_case names.
# Note: index=str also converts the row labels to strings.
dataset.rename(
    index=str,
    columns={
        "Invoice": "invoice",
        "StockCode": "stock_code",
        "Quantity": "quantity",
        "InvoiceDate": "date",
        "Price": "unit_price",
        "Country": "country",
        "Description": "desc",
        "Customer ID": "cust_id",
    },
    inplace=True,
)
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Index: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   invoice     525461 non-null  object 
 1   stock_code  525461 non-null  object 
 2   desc        522533 non-null  object 
 3   quantity    525461 non-null  int64  
 4   date        525461 non-null  object 
 5   unit_price  525461 non-null  float64
 6   cust_id     417534 non-null  float64
 7   country     525461 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 36.1+ MB
dataset.to_csv(f"{data_path}/preprocessed_retail.csv", index=False)
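
The saved file still holds `date` as plain text and `cust_id` as a float. If later steps need date arithmetic or integer customer IDs, the types could be converted as sketched below. This is not part of the original notebook: the frame `toy` is made up, and the day-first date format is an assumption based on the preview (the first rows read `01/12/2009 07:45`).

```python
import pandas as pd

# Hypothetical toy frame using the renamed columns; the values are made up
# but follow the date pattern seen in the data preview.
toy = pd.DataFrame(
    {
        "date": ["01/12/2009 07:45", "13/12/2009 10:30"],
        "cust_id": [13085.0, None],
    }
)

# Assumption: dates are day-first (dd/mm/yyyy HH:MM).
toy["date"] = pd.to_datetime(toy["date"], format="%d/%m/%Y %H:%M")

# The nullable integer dtype keeps missing customer IDs as <NA>
# instead of forcing the whole column to float.
toy["cust_id"] = toy["cust_id"].astype("Int64")

print(toy.dtypes)
```

Passing an explicit `format` avoids ambiguous month/day guessing, which matters for dates such as `01/12/2009`.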