Evaluation metrics on classification of breast cancer dataset#

The confusion matrix is a table that summarizes the performance of the model: the rows represent the instances that actually belong to each class (ground truth), and the columns represent the instances that the model predicted as each class.

Confusion matrix	Predicted: True	Predicted: False
Actual: True	True Positives	False Negatives
Actual: False	False Positives	True Negatives

Accuracy measures the model’s ability to correctly classify all instances. It is not always a useful metric when the objective is to minimize or maximize the occurrence of one class, regardless of the model’s performance on the other classes.
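To illustrate why accuracy alone can mislead, here is a minimal sketch with made-up labels (not the breast cancer data). It assumes a hypothetical model that always predicts the majority class: the accuracy looks high even though no minority-class instance is found.

```python
# Hypothetical, imbalanced labels: 95 instances of class 0, 5 of class 1
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class
y_pred = [0] * 100

# Accuracy: fraction of all predictions that match the ground truth
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class 1: fraction of actual class-1 instances that were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall_class_1 = true_positives / actual_positives

print("accuracy:", accuracy)
print("recall for class 1:", recall_class_1)
```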

Precision measures how many of the instances predicted as positive are actually positive. It is the ratio of true positives to the sum of true positives and false positives.

Recall measures how many of the actual positive instances the model correctly identifies. It is the ratio of true positives to the sum of true positives and false negatives.
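The two ratios can be sketched directly from the confusion matrix counts. The helper names and the counts below are hypothetical, chosen only for illustration; the notebook itself uses scikit-learn's precision_score and recall_score.

```python
def precision_from_counts(tp: int, fp: int) -> float:
    """True positives divided by everything predicted as positive."""
    return tp / (tp + fp)


def recall_from_counts(tp: int, fn: int) -> float:
    """True positives divided by everything that is actually positive."""
    return tp / (tp + fn)


# Hypothetical counts, not taken from the breast cancer results
tp, fn, fp, tn = 8, 2, 1, 9

print("precision:", precision_from_counts(tp, fp))  # 8 / (8 + 1)
print("recall:", recall_from_counts(tp, fn))        # 8 / (8 + 2)
```

Note that the true negatives do not enter either formula, which is why both metrics focus on the positive class only.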

Importing libraries and packages#

# Mathematical operations and data manipulation
import pandas as pd

# Dataset
from sklearn.datasets import load_breast_cancer

# Model
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# Warnings
import warnings

warnings.filterwarnings("ignore")

Set paths#

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Loading dataset#

sklearn.datasets.load_breast_cancer returns a dictionary-like object that stores the features and the target in two separate attributes, accessible as data and target.

dataset = load_breast_cancer()
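As a quick check of what the returned object holds, the sketch below prints the shapes of both attributes along with the class and feature names. It reloads the dataset so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Feature matrix: one row per sample, one column per feature
print("data shape:", dataset.data.shape)
# Target vector: one integer class label per sample
print("target shape:", dataset.target.shape)
# Class names, indexed by label value (label 0 is the first name)
print("target names:", dataset.target_names.tolist())
# Names of the first few feature columns
print("first features:", dataset.feature_names[:3].tolist())
```

Label 0 corresponds to malignant and label 1 to benign, which matters later when deciding which class counts as "positive".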

Partitioning and training#

# Convert each attribute (data and target) into a Pandas DataFrame
X = pd.DataFrame(dataset.data)
Y = pd.DataFrame(dataset.target)

print("Shape of X: ", X.shape)
print("Shape of Y: ", Y.shape)

Shape of X:  (569, 30)
Shape of Y:  (569, 1)

# First split of the data using the train_test_split function
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=0
)

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of Y_train: ", Y_train.shape)
print("Shape of Y_test: ", Y_test.shape)

Shape of X_train:  (512, 30)
Shape of X_test:  (57, 30)
Shape of Y_train:  (512, 1)
Shape of Y_test:  (57, 1)

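# Train a decision tree classifier on the training split
model = tree.DecisionTreeClassifier(random_state=0)
model = model.fit(X_train, Y_train)
# Predict class labels for the held-out test split
Y_pred = model.predict(X_test)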

Metrics#

Confusion matrix#

confusion_matrix(Y_test, Y_pred)

array([[21,  1],
       [ 6, 29]])
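scikit-learn orders the rows and columns by sorted label value, so with labels 0 and 1 the output reads [[TN, FP], [FN, TP]] when label 1 is treated as the positive class. That is the reverse of the layout in the table at the top of this page. A minimal sketch, copying the matrix printed above into a NumPy array instead of re-running the model:

```python
import numpy as np

# Matrix copied from the output above; rows are actual labels, columns are predicted labels
cm = np.array([[21, 1],
               [6, 29]])

# With label 1 as the positive class, flattening yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("accuracy:", (tp + tn) / cm.sum())
print("precision:", tp / (tp + fp))
print("recall:", tp / (tp + fn))
```

Computing the three metrics by hand this way is a useful sanity check against the values returned by accuracy_score, precision_score and recall_score.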

Accuracy#

accuracy = accuracy_score(Y_test, Y_pred)
print("accuracy:", accuracy)

accuracy: 0.8771929824561403

Precision#

precision = precision_score(Y_test, Y_pred)
print("precision:", precision)

precision: 0.9666666666666667

Recall#

recall = recall_score(Y_test, Y_pred)
print("recall:", recall)

recall: 0.8285714285714286
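By default, precision_score and recall_score treat label 1 as the positive class, which in this dataset is the benign class. If the malignant class (label 0) is the one of interest, the pos_label parameter can be set accordingly. The sketch below rebuilds label lists from the confusion matrix above rather than reusing the original test rows, so it is an illustration of the parameter, not a re-run of the model.

```python
from sklearn.metrics import precision_score, recall_score

# Labels rebuilt from the confusion matrix above (not the original test rows):
# 22 actual class-0 samples (21 predicted as 0, 1 predicted as 1)
# 35 actual class-1 samples (6 predicted as 0, 29 predicted as 1)
y_true = [0] * 22 + [1] * 35
y_pred = [0] * 21 + [1] * 1 + [0] * 6 + [1] * 29

# pos_label=0 treats the malignant class as the positive class
precision_malignant = precision_score(y_true, y_pred, pos_label=0)
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

print("precision (malignant):", precision_malignant)  # 21 / (21 + 6)
print("recall (malignant):", recall_malignant)        # 21 / (21 + 1)
```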