Evaluation metrics for regression on the Boston dataset

Mean Absolute Error (MAE) measures the average absolute difference between the predictions and the ground truth, ignoring the direction of each error. Every error contributes in proportion to its size, so MAE gives all errors the same weight.
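
As a minimal sketch of that definition, MAE can be computed by hand as the mean of the absolute differences. The function name and the sample values below are illustrative and are not part of the original notebook.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Mean of |y_true - y_pred| over all paired samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Made-up values: the absolute errors are 0.5, 0.0, 2.0 and 1.0
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
mae = mean_absolute_error_manual(y_true, y_pred)
print("MAE:", mae)  # 3.5 / 4 = 0.875
```

Over-predicting and under-predicting by the same amount produce the same MAE, which is what "ignoring the direction" means in practice.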

Root Mean Squared Error (RMSE) is a quadratic metric that also measures the average magnitude of the error between the ground truth and the predictions. Because the errors are squared before they are averaged, larger errors receive more weight. This makes RMSE especially useful when large errors, such as those caused by outliers, should count heavily against the model.
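
The effect of squaring can be illustrated with a small sketch. The two error sets below are made up so that they share the same MAE; only the RMSE separates the set that contains one large error. The helper names are illustrative, not taken from the notebook.

```python
import math


def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up error sets with the same total absolute error
even_errors = [2.0, 2.0, 2.0, 2.0]     # four moderate errors
outlier_errors = [0.0, 0.0, 0.0, 8.0]  # one large error

print(mae(even_errors), rmse(even_errors))        # 2.0 2.0
print(mae(outlier_errors), rmse(outlier_errors))  # 2.0 4.0
```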

Importing libraries and packages

# Mathematical operations and data manipulation
import numpy as np
import pandas as pd

# Dataset: load_boston was removed in scikit-learn 1.2, so a stand-in is
# defined here from the loading snippet in scikit-learn's removal notice
from sklearn.utils import Bunch


def load_boston():
    """Fetch the Boston housing data from its original source."""
    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
    # Each record spans two rows of the raw file: the second row holds
    # two more features followed by the target
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]
    return Bunch(data=data, target=target)


# Model
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Warnings
import warnings

warnings.filterwarnings("ignore")

Note: load_boston has been removed from scikit-learn since version 1.2. According to the removal notice, the Boston housing prices dataset has an ethical problem: as investigated in [1], its authors engineered a non-invertible variable "B" assuming that racial self-segregation had a positive impact on house prices [2]. The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning. The notice names the California housing dataset (sklearn.datasets.fetch_california_housing) and the Ames housing dataset (fetch_openml(name="house_prices", as_frame=True)) as alternatives.

[1] M. Carlisle. "Racist data destruction?" https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8

[2] D. Harrison Jr. and D. L. Rubinfeld. "Hedonic housing prices and the demand for clean air." Journal of Environmental Economics and Management 5.1 (1978): 81-102. https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air
Set paths

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"
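
The notebook does not show these paths in use. As one possible sketch, results could be written under the assets directory with the standard library. The helper save_metrics and the file name metrics.json are hypothetical and do not appear in the original.

```python
import json
from pathlib import Path


def save_metrics(metrics, assets_dir, filename="metrics.json"):
    """Write a dict of metric values as JSON inside assets_dir."""
    out_dir = Path(assets_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    out_file = out_dir / filename
    out_file.write_text(json.dumps(metrics, indent=2))
    return out_file
```

Once the metrics have been computed, a call such as save_metrics({"MAE": MAE, "RMSE": RMSE}, assets_path) would write them to ./assets/metrics.json.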

Loading dataset

load_boston (part of sklearn.datasets before scikit-learn 1.2) returns a dictionary-like object that keeps the features and the target in two separate attributes, data and target.

dataset = load_boston()
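
When the data is read from the original source instead, each record in the raw file spans two consecutive rows, which is why the loading snippet slices with [::2] and [1::2]. The sketch below mirrors that recombination on toy data in plain Python; merge_record_rows and the values are made up for illustration.

```python
def merge_record_rows(rows):
    """Recombine records stored as (first row, second row) pairs.

    The first two values of each second row are treated as features
    and the third value as the target.
    """
    first_rows = rows[::2]    # rows 0, 2, 4, ...
    second_rows = rows[1::2]  # rows 1, 3, 5, ...
    features = [a + b[:2] for a, b in zip(first_rows, second_rows)]
    targets = [b[2] for b in second_rows]
    return features, targets


# Toy stand-in for the raw file: two records, each split over two rows
raw_rows = [
    [0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 10.0],
    [1.1, 1.2, 1.3, 1.4], [1.5, 1.6, 20.0],
]
features, targets = merge_record_rows(raw_rows)
print(features)  # [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [1.1, 1.2, 1.3, 1.4, 1.5, 1.6]]
print(targets)   # [10.0, 20.0]
```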

Partitioning and training

# Convert each attribute (data and target) into a Pandas DataFrame
X = pd.DataFrame(dataset.data)
Y = pd.DataFrame(dataset.target)

print("Shape of X: ", X.shape)
print("Shape of Y: ", Y.shape)

Shape of X:  (506, 13)
Shape of Y:  (506, 1)

# Split the data into training and test sets (10% held out for testing)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=0
)

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of Y_train: ", Y_train.shape)
print("Shape of Y_test: ", Y_test.shape)

Shape of X_train:  (455, 13)
Shape of X_test:  (51, 13)
Shape of Y_train:  (455, 1)
Shape of Y_test:  (51, 1)

# Fit a linear regression on the training set and predict on the test set
model = linear_model.LinearRegression()
model = model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
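
The 455 and 51 row counts follow from applying test_size=0.1 to 506 rows. The sketch below reproduces that arithmetic under the assumption that a fractional test size is rounded up and the training set takes the remainder, which is consistent with the shapes printed above. split_sizes is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up and the training set
    takes whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(506, 0.1)
print(n_train, n_test)  # 455 51
```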

Metrics

Mean Absolute Error (MAE)

MAE = mean_absolute_error(Y_test, Y_pred)
print("MAE:", MAE)

MAE: 3.9357920841192966

Root Mean Squared Error (RMSE)

RMSE = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("RMSE:", RMSE)

RMSE: 6.459456343676129
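
On the same set of errors, RMSE is never smaller than MAE, and the two are equal only when every absolute error has the same size. Their ratio can therefore serve as a rough, informal indicator of how uneven the errors are. The sketch below simply reuses the two values reported above; reading the ratio this way is an interpretation, not something the notebook states.

```python
# Values reported by the notebook above
MAE = 3.9357920841192966
RMSE = 6.459456343676129

ratio = RMSE / MAE
print("RMSE / MAE:", round(ratio, 2))  # 1.64
```

A ratio well above 1 suggests that a few test-set predictions miss by much more than the typical error.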