Evaluation metrics on classification of breast cancer dataset
The confusion matrix is a table that summarizes the performance of the model: the columns represent the instances assigned to each predicted class, and the rows represent the instances that actually belong to each class (ground truth).
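As a minimal sketch of that layout, the snippet below builds a 2x2 confusion matrix by hand. The labels are hypothetical toy values, not taken from the breast cancer dataset, and the row/column convention (rows: actual class, columns: predicted class) follows the description above.

```python
# Toy labels (hypothetical, for illustration only)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

# matrix[i][j] counts instances whose actual class is i and predicted class is j
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true, y_pred):
    matrix[actual][predicted] += 1

# With class 1 as the positive class, the four cells unpack as follows
(tn, fp), (fn, tp) = matrix

print(matrix)  # [[2, 1], [1, 4]]
```

Reading the result row by row: two actual negatives were predicted correctly and one was predicted as positive, while one actual positive was missed and four were found.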
Accuracy measures the model’s ability to classify all instances correctly, expressed as the proportion of correct predictions among all predictions. It is not always a useful metric when the objective is to minimize or maximize the occurrence of one class regardless of the model’s performance on the other classes.
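A short sketch of why accuracy can mislead, using a hypothetical imbalanced sample (95 negatives, 5 positives) and a degenerate "model" that always predicts the majority class. The numbers are made up for illustration only.

```python
# Hypothetical imbalanced sample: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100

# Accuracy: share of predictions that match the ground truth
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of the actual positives that the model managed to find
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
positives_found_rate = positives_found / actual_positives

print(accuracy)              # 0.95
print(positives_found_rate)  # 0.0
```

The accuracy looks high even though not a single positive instance is detected, which is the situation the paragraph above warns about.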
Precision measures how many of the instances predicted as positive are actually positive. It is the ratio of true positives to the sum of true positives and false positives.
Recall measures how many of the actual positive instances the model correctly predicts as positive. It is the ratio of true positives to the sum of true positives and false negatives.
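Written as formulas, with TP, TN, FP and FN standing for the counts of true positives, true negatives, false positives and false negatives (notation introduced here for convenience), the three metrics described above are:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
```

Precision and recall share the same numerator and differ only in which kind of error appears in the denominator: false positives for precision, false negatives for recall.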
Importing libraries and packages
# Mathematical operations and data manipulation
import pandas as pd

# Dataset
from sklearn.datasets import load_breast_cancer

# Model
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# Warnings
import warnings

warnings.filterwarnings("ignore")

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"
sklearn.datasets.load_breast_cancer - The output is a dictionary-like object that stores the features and the target in two separate attributes: data for the features and target for the class labels.
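A quick way to inspect that object is sketched below. Besides data and target, it reads the target_names attribute, which is not mentioned above but is worth checking because the 0/1 encoding determines which class the precision and recall scores treat as positive (class 1 by default in scikit-learn's binary metrics).

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Features and target live in separate attributes of the returned object
print(dataset.data.shape)    # (569, 30)
print(dataset.target.shape)  # (569,)

# Class encoding: label 0 is the first name, label 1 is the second
class_names = list(dataset.target_names)
print(class_names)           # ['malignant', 'benign']

# Number of samples per class
n_malignant = int((dataset.target == 0).sum())
n_benign = int((dataset.target == 1).sum())
print(n_malignant, n_benign)  # 212 357
```

Since label 1 corresponds to the benign class, scores computed with the default settings describe how well benign cases are identified.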
dataset = load_breast_cancer()
Partitioning and training
# Convert each attribute (data and target) into a Pandas DataFrame
X = pd.DataFrame(dataset.data)
Y = pd.DataFrame(dataset.target)

print("Shape of X: ", X.shape)
print("Shape of Y: ", Y.shape)

Shape of X: (569, 30)
Shape of Y: (569, 1)
# First split of the data using the train_test_split function
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=0
)

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of Y_train: ", Y_train.shape)
print("Shape of Y_test: ", Y_test.shape)

Shape of X_train: (512, 30)
Shape of X_test: (57, 30)
Shape of Y_train: (512, 1)
Shape of Y_test: (57, 1)
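The 512/57 split can be reproduced with a little arithmetic. The sketch below assumes that a fractional test_size is rounded up to a whole number of samples and that the remainder goes to the training set, which is consistent with the shapes of X_train and X_test printed above.

```python
import math

n_samples = 569
test_size = 0.1

# Assumption: the fractional test size is rounded up to whole samples
n_test = math.ceil(test_size * n_samples)
# The remaining samples form the training set
n_train = n_samples - n_test

print(n_test, n_train)  # 57 512
```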
model = tree.DecisionTreeClassifier(random_state=0)
model = model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
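# Confusion matrix of the test-set predictions (rows: actual class, columns: predicted class)
confusion_matrix(Y_test, Y_pred)

array([[21,  1],
       [ 6, 29]])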
accuracy = accuracy_score(Y_test, Y_pred)
print("accuracy:", accuracy)

precision = precision_score(Y_test, Y_pred)
print("precision:", precision)

recall = recall_score(Y_test, Y_pred)
print("recall:", recall)
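The three scores can be cross-checked by hand. The sketch below assumes that the 2x2 array shown earlier is the confusion matrix of Y_test against Y_pred (rows: actual class, columns: predicted class) and that class 1 is the positive class, which is the default for scikit-learn's binary precision and recall. The values are derived from those counts, not copied from a recorded run.

```python
# Counts copied from the confusion matrix shown earlier
# (rows: actual class, columns: predicted class)
matrix = [[21, 1],
          [6, 29]]

# Unpack with class 1 treated as the positive class
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(round(accuracy, 4))   # 0.8772
print(round(precision, 4))  # 0.9667
print(round(recall, 4))     # 0.8286

# Same counts with class 0 treated as the positive class instead
precision_class0 = tn / (tn + fn)
recall_class0 = tn / (tn + fp)

print(round(precision_class0, 4))  # 0.7778
print(round(recall_class0, 4))     # 0.9545
```

The gap between the two views shows why the choice of positive class matters: the same predictions give a recall of about 0.83 for class 1 and about 0.95 for class 0.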