Partitioning the wine dataset

There is no fixed ratio for partitioning data. Choose it by taking into account the amount of data available, the type of algorithm to be used, and the distribution of the data.

Importing libraries and packages

# Mathematical operations and data manipulation
import pandas as pd

# Dataset
from sklearn.datasets import load_wine

# Model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# Warnings
import warnings

warnings.filterwarnings("ignore")

Set paths

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Loading dataset

sklearn.datasets.load_wine - The output is a dictionary-like object that separates the features from the target and exposes them as two attributes: data (the features) and target (the class labels).
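As a quick way to see the structure just described, the following sketch inspects the returned object. It relies only on the attributes data, target, feature_names, and target_names of the object returned by load_wine; the variable name wine is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_wine

# Load the dataset; the result behaves like a dictionary whose keys
# can also be read as attributes.
wine = load_wine()

# Features and target live in separate NumPy arrays.
print("Features:", wine.data.shape)    # one row per sample, one column per feature
print("Target:  ", wine.target.shape)  # one class label per sample

# Attribute access and key access return the same object.
print(wine.data is wine["data"])

# Names that describe the feature columns and the classes.
print(list(wine.feature_names)[:3])
print(list(wine.target_names))
```

Converting wine.data and wine.target to DataFrames, as done in the next section, keeps the same row order, so row i of the features still corresponds to label i.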

dataset = load_wine()

Conventional partitioning

A 60/20/20% split into training, validation, and testing sets.

# Convert each attribute (data and target) into a Pandas DataFrame
X = pd.DataFrame(dataset.data)
Y = pd.DataFrame(dataset.target)

print("Shape of X: ", X.shape)
print("Shape of Y: ", Y.shape)
Shape of X:  (178, 13)
Shape of Y:  (178, 1)
# First split of the data using the train_test_split function
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of Y_train: ", Y_train.shape)
print("Shape of Y_test: ", Y_test.shape)
Shape of X_train:  (142, 13)
Shape of X_test:  (36, 13)
Shape of Y_train:  (142, 1)
Shape of Y_test:  (36, 1)
# Second split for a validation set (dev set): to obtain a dev set
# of the same size as the test set, first calculate the ratio of the
# test set size to the current training set size, then use that ratio
# as test_size in the second split.
dev_size = X_test.shape[0] / X_train.shape[0]
print(dev_size)
0.2535211267605634
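The arithmetic behind this ratio can be checked by hand. The sketch below uses only the sample counts shown in the outputs; it assumes that a fractional test_size is rounded up to a whole number of samples, which is consistent with the 36-row test set printed earlier. The variable names are illustrative.

```python
import math

n_total = 178          # samples in the wine dataset
test_fraction = 0.2    # fraction passed as test_size in the first split

# Assumption: a fractional test_size is rounded up to whole samples.
n_test = math.ceil(test_fraction * n_total)
n_train_full = n_total - n_test

# Fraction of the remaining training data that yields a dev set
# as large as the test set.
dev_size = n_test / n_train_full
n_dev = n_test
n_train = n_train_full - n_dev

for name, n in [("train", n_train), ("dev", n_dev), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / n_total:.1%} of the data)")
```

Taking 20% of the original data twice would not work here: the second split operates on the 142 remaining rows, so the fraction has to be expressed relative to that smaller set.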
X_train, X_dev, Y_train, Y_dev = train_test_split(
    X_train, Y_train, test_size=dev_size
)

print("Shape of X_train: ", X_train.shape)
print("Shape of Y_train: ", Y_train.shape)
print("Shape of X_dev: ", X_dev.shape)
print("Shape of Y_dev: ", Y_dev.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of Y_test: ", Y_test.shape)
Shape of X_train:  (106, 13)
Shape of Y_train:  (106, 1)
Shape of X_dev:  (36, 13)
Shape of Y_dev:  (36, 1)
Shape of X_test:  (36, 13)
Shape of Y_test:  (36, 1)
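Because the split ratio should also reflect the distribution of the data, one optional variation is to stratify both splits on the class label and to fix the random seed so the partition is reproducible. The sketch below is not part of the original workflow: stratify and random_state are existing train_test_split parameters, but the seed value 0 and the variable names with the _s suffix are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X_s = pd.DataFrame(wine.data)
Y_s = pd.DataFrame(wine.target)

# First split: hold out 20% for testing while keeping class proportions.
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(
    X_s, Y_s, test_size=0.2, stratify=Y_s, random_state=0
)

# Second split: an integer test_size requests an exact number of samples,
# so the dev set matches the test set without computing a fraction.
X_train_s, X_dev_s, Y_train_s, Y_dev_s = train_test_split(
    X_train_s, Y_train_s, test_size=len(X_test_s),
    stratify=Y_train_s, random_state=0,
)

# Class proportions should be similar across the three subsets.
for name, part in [("train", Y_train_s), ("dev", Y_dev_s), ("test", Y_test_s)]:
    proportions = part[0].value_counts(normalize=True).sort_index().round(2)
    print(name, len(part), proportions.to_dict())
```

Passing the test set length as an integer is an alternative to the dev_size ratio used above; both approaches aim for a dev set with the same number of rows as the test set.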

Cross-validation partitioning

print("Shape of X: ", X.shape)
print("Shape of Y: ", Y.shape)
Shape of X:  (178, 13)
Shape of Y:  (178, 1)
# Split the data into training and testing sets. X and Y are
# overwritten with the 90% that remains for cross-validation.
X, X_test, Y, Y_test = train_test_split(X, Y, test_size=0.10)
# Instantiate the KFold class with a 10-fold configuration
kf = KFold(n_splits=10)
# Apply the split method to the data in X.
# Output: a generator that yields, for each fold, the indices of the
# instances to be used as training and validation sets.
splits = kf.split(X)
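To make the output of split concrete, the sketch below runs KFold on a hypothetical toy array of 10 samples (not the wine data) and prints the index pairs for each fold. The names toy, kf_toy, and folds are illustrative. It also shows that the returned generator can be consumed only once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with a single feature.
toy = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5)
folds = kf_toy.split(toy)

# Each item is a pair of integer index arrays: (training, validation).
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")

# The generator is now exhausted; call split() again to iterate a second time.
leftover = list(folds)
print("items left after the first pass:", len(leftover))
```

Without shuffle=True, the folds are consecutive blocks of rows, so every sample lands in the validation set exactly once across the folds.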
# Loop through the different split configurations.
# In the loop body, create the variables that will hold the data
# for the training and validation sets.
for train_index, dev_index in splits:
    X_train, X_dev = X.iloc[train_index, :], X.iloc[dev_index, :]
    Y_train, Y_dev = Y.iloc[train_index, :], Y.iloc[dev_index, :]

    # The code to train and evaluate the model should be written here,
    # inside the loop body, because the objective of cross-validation
    # is to train and validate the model on each split configuration.
# After the loop, the variables hold the split from the last fold only.
print("Shape of X_train: ", X_train.shape)
print("Shape of Y_train: ", Y_train.shape)
print("Shape of X_dev: ", X_dev.shape)
print("Shape of Y_dev: ", Y_dev.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of Y_test: ", Y_test.shape)
Shape of X_train:  (144, 13)
Shape of Y_train:  (144, 1)
Shape of X_dev:  (16, 13)
Shape of Y_dev:  (16, 1)
Shape of X_test:  (18, 13)
Shape of Y_test:  (18, 1)
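The cross-validation loop above leaves the training step as a comment. A minimal sketch of what that step might look like follows. The choice of DecisionTreeClassifier, the seed value 0, and all variable names are assumptions made for illustration; the original does not specify a model. The target is stored as a Series rather than a one-column DataFrame so that the classifier receives a one-dimensional label array.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_all = pd.DataFrame(wine.data)
Y_all = pd.Series(wine.target)

# Hold out 10% for the final test, as in the walkthrough.
X_cv, X_hold, Y_cv, Y_hold = train_test_split(
    X_all, Y_all, test_size=0.10, random_state=0
)

kf_cv = KFold(n_splits=10)
fold_scores = []
for train_idx, dev_idx in kf_cv.split(X_cv):
    X_tr, X_dv = X_cv.iloc[train_idx], X_cv.iloc[dev_idx]
    Y_tr, Y_dv = Y_cv.iloc[train_idx], Y_cv.iloc[dev_idx]

    # Train a fresh model on this fold's training portion...
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_tr, Y_tr)
    # ...and record its accuracy on this fold's validation portion.
    fold_scores.append(model.score(X_dv, Y_dv))

print("Folds evaluated:", len(fold_scores))
print("Mean validation accuracy:", sum(fold_scores) / len(fold_scores))
```

The held-out rows in X_hold and Y_hold are not touched inside the loop; they stay reserved for a single final evaluation once a model configuration has been chosen.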