Evaluation metrics on classification of breast cancer dataset#

The confusion matrix is a table that summarizes the performance of a classifier: the columns represent the instances assigned to each predicted class, and the rows represent the instances that actually belong to each class (ground truth).

Confusion matrix

                 Predicted: True    Predicted: False
Actual: True     True Positives     False Negatives
Actual: False    False Positives    True Negatives

Accuracy measures the model’s ability to correctly classify all instances: the number of correct predictions divided by the total number of instances. It is not always a useful metric when the objective is to minimize or maximize the occurrence of one class regardless of the performance on the other classes.
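To make this caveat concrete, here is a minimal sketch on made-up labels. The 95/5 class split and the always-negative "model" are illustrative assumptions, not part of the breast cancer data. A classifier that never predicts the positive class still reaches high accuracy while finding none of the positives.

```python
# Hypothetical, hand-made labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the negative class.
y_pred = [0] * 100

# Accuracy: share of predictions that match the ground truth.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of the actual positives that the model managed to find.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
positives_found = true_positives / actual_positives

print("accuracy:", accuracy)                         # 0.95
print("share of positives found:", positives_found)  # 0.0
```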

Precision measures how many of the instances predicted as positive are actually positive. It is the ratio of true positives to the sum of true positives and false positives.

Recall measures how many of the actual positive instances the model predicts correctly. It is the ratio of true positives to the sum of true positives and false negatives.
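Written as formulas, the three metrics look as follows. The abbreviations TP, FP, FN and TN are introduced here for brevity and stand for the counts in the four cells of the confusion matrix (true positives, false positives, false negatives and true negatives):

```latex
\[
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN}
\end{aligned}
\]
```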

Importing libraries and packages#

# Mathematical operations and data manipulation
import pandas as pd

# Dataset
from sklearn.datasets import load_breast_cancer

# Model
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# Warnings
import warnings

warnings.filterwarnings("ignore")

Set paths#

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Loading dataset#

sklearn.datasets.load_breast_cancer returns a dictionary-like object that separates the features (accessible as data) from the target (accessible as target) into two attributes.

dataset = load_breast_cancer()

Partitioning and training#

# Convert each attribute (data and target) into a Pandas DataFrame
X = pd.DataFrame(dataset.data)
Y = pd.DataFrame(dataset.target)

print("Shape of X: ", X.shape)
print("Shape of Y: ", Y.shape)
Shape of X:  (569, 30)
Shape of Y:  (569, 1)
# Split the data into training and test sets using the train_test_split function
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=0
)

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of Y_train: ", Y_train.shape)
print("Shape of Y_test: ", Y_test.shape)
Shape of X_train:  (512, 30)
Shape of X_test:  (57, 30)
Shape of Y_train:  (512, 1)
Shape of Y_test:  (57, 1)
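The split sizes can be checked by hand. The sketch below assumes that, for a fractional test_size, the number of test rows is rounded up and the remaining rows go to the training set. That assumption matches the X_train and X_test shapes printed above.

```python
import math

n_samples = 569   # rows in X, from the shapes printed earlier
test_size = 0.1   # fraction passed to train_test_split

# Assumption: the test fraction is rounded up to a whole number of rows,
# and everything else is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("test rows:", n_test)    # 57
print("train rows:", n_train)  # 512
```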
# Train a decision tree on the training set, then predict the test labels
model = tree.DecisionTreeClassifier(random_state=0)
model = model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

Metrics#

Confusion matrix#

confusion_matrix(Y_test, Y_pred)
array([[21,  1],
       [ 6, 29]])
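Note that this array is not laid out like the table at the top of this section. scikit-learn orders the rows and columns by sorted label value, so with 0/1 labels the negative class comes first and the cells read [[TN, FP], [FN, TP]]. In this dataset, label 1 (the positive class) corresponds to benign tumours and label 0 to malignant ones. The sketch below hard-codes the printed array rather than re-running the model, and rearranges it into the positive-first layout used in the table. The variable names are illustrative.

```python
# Array printed by confusion_matrix(Y_test, Y_pred) above, copied by hand.
matrix = [[21, 1],
          [6, 29]]

# With labels sorted as [0, 1], rows are actual classes and columns are
# predicted classes, so the cells unpack as follows.
(tn, fp), (fn, tp) = matrix

# Rearranged to match the table layout used in this section (positive first).
positive_first = [[tp, fn],
                  [fp, tn]]

print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
print(positive_first)  # [[29, 6], [1, 21]]
```

Passing labels=[1, 0] to confusion_matrix should give the positive-first ordering directly, without rearranging by hand.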

Accuracy#

accuracy = accuracy_score(Y_test, Y_pred)
print("accuracy:", accuracy)
accuracy: 0.8771929824561403

Precision#

precision = precision_score(Y_test, Y_pred)
print("precision:", precision)
precision: 0.9666666666666667

Recall#

recall = recall_score(Y_test, Y_pred)
print("recall:", recall)
recall: 0.8285714285714286
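As a sanity check, the three scores can be recomputed by hand from the confusion matrix printed earlier. The sketch below hard-codes the four counts, reading the array as [[TN, FP], [FN, TP]] (an assumption about the layout that the matching results support), and compares them with the values reported above.

```python
import math

# Counts copied from the printed confusion matrix [[21, 1], [6, 29]],
# read as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 21, 1, 6, 29

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 50 / 57
precision = tp / (tp + fp)                  # 29 / 30
recall = tp / (tp + fn)                     # 29 / 35

# Compare with the values printed by accuracy_score, precision_score
# and recall_score in the cells above.
print(math.isclose(accuracy, 0.8771929824561403))
print(math.isclose(precision, 0.9666666666666667))
print(math.isclose(recall, 0.8285714285714286))
```

The 6 false negatives are what pull recall below precision: the model rarely labels a negative instance as positive (1 case), but it misses 6 of the 35 actual positives.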