EDA Ames housing set#

Analysis:

The target variable
Variable types (categorical and numerical)
Missing data
Numerical variables
- Discrete
- Continuous
- Distributions
- Transformations
Categorical variables
- Cardinality
- Rare Labels
- Special mappings
Additional Reading Resources

Imports and packages#

# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# for the yeo-johnson transformation
import scipy.stats as stats

# to display all the columns of the dataframe in the notebook
pd.pandas.set_option("display.max_columns", None)

Paths#

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Prepare the data set#

# load dataset
data = pd.read_csv(f"{data_path}/train.csv")

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(1460, 81)

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196.0	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	150	856	GasA	Ex	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	NaN	Attchd	2003.0	RFn	2	548	TA	TA	Y	0	61	0	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0.0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	284	1262	GasA	Ex	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976.0	RFn	2	460	TA	TA	Y	298	0	0	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162.0	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	434	920	GasA	Ex	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001.0	RFn	2	608	TA	TA	Y	0	42	0	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	7	5	1915	1970	Gable	CompShg	Wd Sdng	Wd Shng	None	0.0	TA	TA	BrkTil	TA	Gd	No	ALQ	216	Unf	540	756	GasA	Gd	Y	SBrkr	961	756	1717	1	0	1	0	3	1	Gd	7	Typ	1	Gd	Detchd	1998.0	Unf	3	642	TA	TA	Y	0	35	272	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	8	5	2000	2000	Gable	CompShg	VinylSd	VinylSd	BrkFace	350.0	Gd	TA	PConc	Gd	TA	Av	GLQ	655	Unf	490	1145	GasA	Ex	Y	SBrkr	1145	1053	2198	1	0	2	1	4	1	Gd	9	Typ	1	TA	Attchd	2000.0	RFn	3	836	TA	TA	Y	192	84	0	NaN	NaN	NaN	12	2008	WD	Normal	250000

# drop id, it is just a number given to identify each house
data.drop("Id", axis=1, inplace=True)

data.shape

(1460, 80)

The house price dataset contains 1460 rows, that is, houses, and 80 columns, i.e., variables.

79 are predictive variables and 1 is the target variable: SalePrice

Target#

# histogran to evaluate target distribution

data["SalePrice"].hist(bins=50, density=True)
plt.ylabel("Number of houses")
plt.xlabel("Sale Price")
plt.show()

../../_images/d9aedf394c25aec217bd37e55be9304de51ba88da3a21aed0669d26f2b9f2146.png

The target is continuous, and the distribution is skewed towards the right.

The value spread can be improved with a mathematical transformation.

# Transforming the target using logarithm

np.log(data["SalePrice"]).hist(bins=50, density=True)
plt.ylabel("Number of houses")
plt.xlabel("Log of Sale Price")
plt.show()

../../_images/5909b3a60a7b52422c005f2a6d23a058073e5237bf9fa5a24c1c4efcee16392c.png

Now the distribution looks more Gaussian.

Variable Types#

# Identifying the categorical variables, capturing those of type *object*

cat_vars = [var for var in data.columns if data[var].dtype == "O"]

# MSSubClass is also categorical by definition, despite its numeric values
# (you can find the definitions of the variables in the data_description.txt
# file available on Kaggle, in the same website where you downloaded the data)

# Adding MSSubClass to the list of categorical variables
cat_vars = cat_vars + ["MSSubClass"]

# Number of categorical variables
len(cat_vars)

# Casting categorical variables
data[cat_vars] = data[cat_vars].astype("O")

# Identifying numerical variables
num_vars = [
    var for var in data.columns if var not in cat_vars and var != "SalePrice"
]

# Number of numerical variables
len(num_vars)

Missing values#

# List of the variables that contain missing values
vars_with_na = [var for var in data.columns if data[var].isnull().sum() > 0]

# Percentage of missing values (expressed as decimals)
# and displaying the result ordered by % of missing data

data[vars_with_na].isnull().mean().sort_values(ascending=False)

PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
GarageCond      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrArea      0.005479
MasVnrType      0.005479
Electrical      0.000685
dtype: float64

The dataset contains a few variables with a big proportion of missing values (4 variables at the top). And some other variables with a small percentage of missing observations.

This means that to train a machine learning model with this data set, the missing data in these variables must be imputed.

Visualizing the percentage of missing values in the variables:

# Plotting
data[vars_with_na].isnull().mean().sort_values(ascending=False).plot.bar(
    figsize=(10, 4)
)
plt.ylabel("Percentage of missing data")
plt.axhline(y=0.90, color="r", linestyle="-")
plt.axhline(y=0.80, color="g", linestyle="-")

plt.show()

../../_images/da6fd7425680e0c3a5a5152902761d606610050051ce00c03a9f79d4e842a0e3.png

# Determining which variables, from those with missing data,
# are numerical and which are categorical
cat_na = [var for var in cat_vars if var in vars_with_na]
num_na = [var for var in num_vars if var in vars_with_na]

print("Number of categorical variables with na: ", len(cat_na))
print("Number of numerical variables with na: ", len(num_na))

Number of categorical variables with na:  16
Number of numerical variables with na:  3

1num_na

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

1cat_na

['Alley',
 'MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']

Relationship between missing data and Sale Price#

The price of the house in those observations where the information is missing,for each variable that shows missing data.

def analyse_na_value(df, var):

    # copy of the dataframe, so as not to override the original data
    # see the link for more details about pandas.copy()
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html
    df = df.copy()

    # making an interim variable that indicates 1 if the
    # observation was missing or 0 otherwise
    df[var] = np.where(df[var].isnull(), 1, 0)

    # comparing the median SalePrice in the observations where data is missing
    # vs the observations where data is available

    # determining the median price in the groups 1 and 0,
    # and the standard deviation of the sale price,
    # and capturing the results in a temporary dataset
    temp = df.groupby(var)["SalePrice"].agg(["mean", "std"])

    # plotting into a bar graph
    temp.plot(
        kind="barh",
        y="mean",
        legend=False,
        xerr="std",
        title="Sale Price",
        color="green",
    )

    plt.show()

# Running the function on each variable with missing data
for var in vars_with_na:
    analyse_na_value(data, var)

../../_images/ab2255f0da539293f9d60b763e57cf8a4f8cc8bbce7f6816dbe1c6b50e230ddc.png

../../_images/9dcbd74117b8b32be56d012e49fc8f787c2dbf88a004e97f13de6ab5734c160d.png

../../_images/7b26c2bd154c582c320b662201edce2468a974307d73bd5f135d7e0cfcb1c703.png

../../_images/868245ea1431b44460316e8440a6eeed63ca467a1d1f76805fa8459b9c3286ab.png

../../_images/94ab9afbcb001986000e68cfba0781b816e90f998fcbdfca82ff4cd366771cd3.png

../../_images/a1ffe1fa080842acd6fb4f3bb5059436146c063a2bb507325f7c3ec35f70a5e5.png

../../_images/eb07600bb60dd79bb40c08a123ecd9b223fb65509ffc6d282fa02101a8869e50.png

../../_images/65e242da4c3461b455dca88b45089c3cf12974ead3c56632c6d35f18608451d4.png

../../_images/3c7759e596c70d4c75ace6f998c27acb41a785ce2fd3278be28391cd3a88b8da.png

../../_images/b81e6aedc9fe413554a694fd0276de3328574220cce5662bbb1f0c384f315024.png

../../_images/ae67c1a1c76ce2d511c490f01f00b9052a08b772dbeb5496efc32c470e4f3dc1.png

../../_images/dafe65fe968407507728387ae294a1884d4d902b955a52517c8e5c9288a497c6.png

../../_images/66dd5c97b1c543d829eec6140720e7807ae68bebac7b53118dfdf00ca8782cc9.png

../../_images/9d985e5b34330af9f6710a0e6650c21a80cfc2a393caaff5d3b768ae9e06ffca.png

../../_images/fab7efb1ce5124c92dfdf94c7dd75188dfc3ffd88834148660b9ccc3161febbf.png

../../_images/b92fe589be903311dbdd52d1e0dd89be93a8c28eb5c785e5f44c5c3618aec232.png

../../_images/dbc6a5f63323859a41ed001825e7a2571d4e2bf7e3641d5c3bdda47525a6e11b.png

../../_images/0e24f8fd0a6e1e1c833f650e676db917463099a6cbe4225c7c8e72ddedb883bb.png

../../_images/3305115adac9da6630f2753f163c64b4a8856c42d785c927c49f4de7775b3555.png

In some variables, the average Sale Price in houses where the information is missing, differs from the average Sale Price in houses where information exists. This suggests that data being missing could be a good predictor of Sale Price.

Numerical variables#

print("Number of numerical variables: ", len(num_vars))

# Visualizing the numerical variables
data[num_vars].head()

Number of numerical variables:  35

	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtUnfSF	TotalBsmtSF	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	TotRmsAbvGrd	Fireplaces	GarageYrBlt	GarageCars	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch	MoSold	YrSold
0	65.0	8450	7	5	2003	2003	196.0	706	150	856	856	854	1710	1	0	2	1	3	1	8	0	2003.0	2	548	0	61	0	2	2008
1	80.0	9600	6	8	1976	1976	0.0	978	284	1262	1262	0	1262	0	1	2	0	3	1	6	1	1976.0	2	460	298	0	0	5	2007
2	68.0	11250	7	5	2001	2002	162.0	486	434	920	920	866	1786	1	0	2	1	3	1	6	1	2001.0	2	608	0	42	0	9	2008
3	60.0	9550	7	5	1915	1970	0.0	216	540	756	961	756	1717	1	0	1	0	3	1	7	1	1998.0	3	642	0	35	272	2	2006
4	84.0	14260	8	5	2000	2000	350.0	655	490	1145	1145	1053	2198	1	0	2	1	4	1	9	1	2000.0	3	836	192	84	0	12	2008

Temporal variables#

There are 4 year variables in the dataset:

YearBuilt: year in which the house was built
YearRemodAdd: year in which the house was remodeled
GarageYrBlt: year in which a garage was built
YrSold: year in which the house was sold

Extracting information from date variables in their raw format: For example, capturing the difference in years between the year the house was built and the year the house was sold.

# List of variables that contain year information

year_vars = [var for var in num_vars if "Yr" in var or "Year" in var]

year_vars

['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

# Exploring the values of these temporal variables
for var in year_vars:
    print(var, data[var].unique())
    print()

YearBuilt [2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006
1929 1970 1967 1958 1930 2002 1968 2007 1951 1957 1927 1920 1966
1994 1954 1953 1955 1983 1975 1997 1934 1963 1981 1964 1999 1972
1945 1982 1998 1956 1948 1910 1995 1991 2009 1950 1961 1977 1985
1885 1919 1990 1969 1935 1988 1971 1952 1936 1923 1924 1984 1926
1941 1987 1986 2008 1908 1892 1916 1932 1918 1912 1947 1925 1900
1989 1992 1949 1880 1928 1978 1922 1996 2010 1946 1913 1937 1942
1974 1893 1914 1906 1890 1898 1904 1882 1875 1911 1917 1872 1905]

YearRemodAdd [2003 1976 2002 1970 2000 1995 2005 1973 1950 1965 2006 1962 2007 1960
1967 2004 2008 1997 1959 1990 1955 1983 1980 1966 1963 1987 1964
1996 1998 1989 1953 1956 1968 1981 1992 2009 1982 1961 1993 1999
1979 1977 1969 1958 1991 1971 1952 1975 2010 1984 1986 1994 1988
1957 1951 1978 1974]

GarageYrBlt [2003. 1976. 2001. 1998. 2000. 1993. 2004. 1973. 1931. 1939. 1965. 2005.
2006. 1960. 1991. 1970. 1967. 1958. 1930. 2002. 1968. 2007. 2008.
1920. 1966. 1959. 1995. 1954. 1953.   nan 1983. 1977. 1997. 1985.
1981. 1964. 1999. 1935. 1990. 1945. 1987. 1989. 1915. 1956. 1948.
2009. 1950. 1961. 1921. 1900. 1979. 1951. 1969. 1936. 1975. 1971.
1984. 1926. 1955. 1986. 1988. 1916. 1932. 1972. 1918. 1980. 1924.
1940. 1949. 1994. 1910. 1978. 1982. 1992. 1925. 1941. 2010. 1927.
1937. 1942. 1938. 1952. 1928. 1922. 1934. 1906. 1914. 1946. 1908.
1933.]

YrSold [2008 2007 2006 2009 2010]

As expected, the values are years.

Exploring the evolution of the sale price with the years in which the house was sold:

# plotting median sale price vs year in which it was sold
data.groupby("YrSold")["SalePrice"].median().plot()
plt.ylabel("Median House Price")

Text(0, 0.5, 'Median House Price')

../../_images/6526852d752eaf853ee1f7e79bc8e8e1ffc50d19336434ca104fd50dc96229d5.png

There has been a drop in the value of the houses. That is unusual, in real life, house prices typically go up as years go by.

Plotting the price of sale vs year in which it was built

# plotting median sale price vs year in which it was built

data.groupby("YearBuilt")["SalePrice"].median().plot()
plt.ylabel("Median House Price")

Text(0, 0.5, 'Median House Price')

../../_images/9ed583062d0a1fd0c66380c40bcae4cb882111faa40260e62131d7b7d3995692.png

Newly built / younger houses tend to be more expensive.

Could it be that lately older houses were sold?

Capture the elapsed years between the Year variables and the year in which the house was sold:

def analyse_year_vars(df, var):

    df = df.copy()

    # capturing difference between a year variable and year
    # in which the house was sold
    df[var] = df["YrSold"] - df[var]

    df.groupby("YrSold")[var].median().plot()
    plt.ylabel("Time from " + var)
    plt.show()


for var in year_vars:
    if var != "YrSold":
        analyse_year_vars(data, var)

../../_images/e032a7a799b6ee8fc6347621822249fa265ca52633c7da2467f1e4f5d939dd0b.png

../../_images/74541c937cc5f6c1bd6cb998c8d02aa633731b6cca3837a6ef7de3761d67996b.png

../../_images/f5af84bd074f530af4e5012211b94cee336986193f84de78c023901940b48c11.png

Towards 2010, the houses sold had older garages, and had not been remodelled recently, that might explain why we see cheaper sales prices in recent years, at least in this dataset.

Plotting instead the time since last remodelled, or time since built, and sale price, to see if there is a relationship.

def analyse_year_vars(df, var):

    df = df.copy()

    # capturing difference between a year variable and year
    # in which the house was sold
    df[var] = df["YrSold"] - df[var]

    plt.scatter(df[var], df["SalePrice"])
    plt.ylabel("SalePrice")
    plt.xlabel(var)
    plt.show()


for var in year_vars:
    if var != "YrSold":
        analyse_year_vars(data, var)

../../_images/c741c40c6c3f7a9212491f2e831d515d06f328e0e9e2afc2eb12320167987d30.png

../../_images/07dd8f7cc8647c39702f615d0c85972df320868c04786440b19c8b8ee49ee76d.png

../../_images/279913dbe91f7df08d893ef6421f4b9aa69e3727045e13a7cf78bd68b662438a.png

There seems to be a tendency to a decrease in price, with older houses. In other words, the longer the time between the house was built or remodeled and sale date, the lower the sale Price.

Which makes sense, cause this means that the house will have an older look, and potentially needs repairs.

Discrete variables#

#  List of discrete variables
discrete_vars = [
    var
    for var in num_vars
    if len(data[var].unique()) < 20 and var not in year_vars
]

print("Number of discrete variables: ", len(discrete_vars))

Number of discrete variables:  13

# Visualise the discrete variables

data[discrete_vars].head()

	OverallQual	OverallCond	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	TotRmsAbvGrd	Fireplaces	GarageCars	MoSold
0	7	5	1	0	2	1	3	1	8	0	2	2
1	6	8	0	1	2	0	3	1	6	1	2	5
2	7	5	1	0	2	1	3	1	6	1	2	9
3	7	5	1	0	1	0	3	1	7	1	3	2
4	8	5	1	0	2	1	4	1	9	1	3	12

These discrete variables tend to be qualifications (Qual) or grading scales (Cond), or refer to the number of rooms, or units (FullBath, GarageCars), or indicate the area of the room (KitchenAbvGr).

We expect higher prices, with bigger numbers. Analysing their contribution to the house price.

MoSold is the month in which the house was sold.

for var in discrete_vars:
    # make boxplot with Catplot
    sns.catplot(
        x=var, y="SalePrice", data=data, kind="box", height=4, aspect=1.5
    )
    # add data points to boxplot with stripplot
    sns.stripplot(
        x=var, y="SalePrice", data=data, jitter=0.1, alpha=0.3, color="k"
    )
    plt.show()

../../_images/fb79507ea92fcf68fe3ad39033576549965be590622730b462946f08abf3135c.png

../../_images/b1c4514177769ff1c79780415cb73974c6cef67fcf9844dc8938eec1c49655d6.png

../../_images/6cb4fa0b5a052597d3dd32ab4472705b87ba4f1ac5314a3e4c755622549c7089.png

../../_images/7aac81071e40084adcf3bfa172557042d98ca4f751791a91f6f0da43a7e987c9.png

../../_images/1006500d9e6536bf9f5a6ba0af3b6b6205466491c61a067581ead6973cc387a0.png

../../_images/44e7773d29967fb3a8df632723c8714cb147d6f841557224483eeeb9e4e2fd34.png

../../_images/f5b81ba14fbaf3b9241a696bbd8133de54a2475597176786728b59cf59bc7147.png

../../_images/509d2b7c01dc60333e62ffe3560aee13abe884270624bddcbbc0999601e20f8c.png

../../_images/aa1f7ab6dc309fabe1e08a17568481a10e61fed1e6be7b7569c9ae5cdbed6b0f.png

../../_images/6d792390eeeb576e74853d149c52a889692be3ac87e34b1e1794ea0f7e29f32d.png

../../_images/9f83076787265952018e60697a93d6a58532cd38faa95a60e6c9185175d498bd.png

../../_images/9d9c11e4ff441c20d2ea1c2de0b4f80f3b0f8e3335cd98a3727129cc1ecaf0fd.png

../../_images/236ba4c11a7681d232f84fabd5073e6f90a86e824e191373b8dd328c67b42ae4.png

For most discrete numerical variables, the sale price increases with the quality, or overall condition, or number of rooms, or surface.

For some variables, this tendency can not be seen. Most likely that variable is not a good predictor of sale price.

Continuous variables#

Distribution of the continuous variables.

# List of continuous variables
cont_vars = [var for var in num_vars if var not in discrete_vars + year_vars]

print("Number of continuous variables: ", len(cont_vars))

Number of continuous variables:  18

# Visualising the continuous variables

data[cont_vars].head()

	LotFrontage	LotArea	MasVnrArea	BsmtFinSF1	BsmtUnfSF	TotalBsmtSF	1stFlrSF	2ndFlrSF	GrLivArea	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch
0	65.0	8450	196.0	706	150	856	856	854	1710	548	0	61	0
1	80.0	9600	0.0	978	284	1262	1262	0	1262	460	298	0	0
2	68.0	11250	162.0	486	434	920	920	866	1786	608	0	42	0
3	60.0	9550	0.0	216	540	756	961	756	1717	642	0	35	272
4	84.0	14260	350.0	655	490	1145	1145	1053	2198	836	192	84	0

# Plotting histograms for all continuous variables

data[cont_vars].hist(bins=30, figsize=(15, 15))
plt.show()

../../_images/3c35bef82402de59bbb44560fe240f5a1f914e6d1644c0921d7012f168a76564.png

The variables are not normally distributed. And there are a particular few that are extremely skewed like 3SsnPorch, ScreenPorch and MiscVal.

Sometimes, transforming the variables to improve the value spread, improves the model performance. But it is unlikely that a transformation will help change the distribution of the super skewed variables dramatically.

Applying a Yeo-Johnson transformation to variables like LotFrontage, LotArea, BsmUnfSF, and a binary transformation to variables like 3SsnPorch, ScreenPorch and MiscVal.

# List with the super skewed variables for later

skewed = [
    "BsmtFinSF2",
    "LowQualFinSF",
    "EnclosedPorch",
    "3SsnPorch",
    "ScreenPorch",
    "MiscVal",
]

# Capturing the remaining continuous variables

cont_vars = [
    "LotFrontage",
    "LotArea",
    "MasVnrArea",
    "BsmtFinSF1",
    "BsmtUnfSF",
    "TotalBsmtSF",
    "1stFlrSF",
    "2ndFlrSF",
    "GrLivArea",
    "GarageArea",
    "WoodDeckSF",
    "OpenPorchSF",
]

Yeo-Johnson transformation#

# Analysing the distributions of the variables
# after applying a yeo-johnson transformation

# temporary copy of the data
tmp = data.copy()

for var in cont_vars:

    # transform the variable - yeo-johsnon
    tmp[var], param = stats.yeojohnson(data[var])


# plot the histograms of the transformed variables
tmp[cont_vars].hist(bins=30, figsize=(15, 15))
plt.show()

../../_images/cd2b470b487ebd21823478a0a3b335fb880028ac194640421b7b7ab4344bf126.png

For LotFrontage and MasVnrArea the transformation did not do an amazing job.

For the others, the values seem to be spread more evenly in the range.

Whether this helps improve the predictive power, remains to be seen. To determine if this is the case, train a model with the original values and one with the transformed values, and determine model performance, and feature importance.

Here, we do a quick visual exploration instead:

# Plotting the original or transformed variables
# vs sale price, and see if there is a relationship

for var in cont_vars:

    plt.figure(figsize=(12, 4))

    # plot the original variable vs sale price
    plt.subplot(1, 2, 1)
    plt.scatter(data[var], np.log(data["SalePrice"]))
    plt.ylabel("Sale Price")
    plt.xlabel("Original " + var)

    # plot transformed variable vs sale price
    plt.subplot(1, 2, 2)
    plt.scatter(tmp[var], np.log(tmp["SalePrice"]))
    plt.ylabel("Sale Price")
    plt.xlabel("Transformed " + var)

    plt.show()

../../_images/06a835854652595dcc0a3971b4d238331d21c241fb5c01217d66717bea007bec.png

../../_images/ca736d06fa31a41d3f071f7087ed08d51c374c030a4594d4764bc393149f5119.png

../../_images/8b890f81e9fd1d7d425edbca08d9496fafdea25092e1f393b57688a4f4afe04d.png

../../_images/103b5394cf338b3279fb7239e243f0a6f9f6a5aafc7e4ef6d2eb3facaed6c481.png

../../_images/397978c6d8ec050c76ed518adcf3ced1cf5216811aa7095b4157f6751f112f9c.png

../../_images/44a98d3c2d8e5cf1e2123f9332a6871135a90a7c19aa4363ba26fa6117eb502b.png

../../_images/23edf9ad35429d79b545d4ffd5a99dec4025be07f559bf69dd2ebbdb02172ba2.png

../../_images/43d34dfc63eb3e12192a8c9deb272f9894fd01dec274a8953db3547b18148028.png

../../_images/f213ae8714deb6a76b800cab270f0e996b4c825dcf71c5ef4d573e9edbabb8d5.png

../../_images/2246d8e379abb268a1e4c26d659bea15e83f88ed8799c94d35d3341ca805f63e.png

../../_images/523811cc583aa97364468483e244473564e8533001f7b4db5b67f98b47a2e844.png

../../_images/17be57e8b41326143b38a604c65b554e2205abc7646f2621972f38c98bf91640.png

By eye, the transformations seems to improve the relationship only for LotArea.

A different transformation: Most variables contain the value 0, and thus the logarithmic transformation can not be applied, but it can be applied for the following variables:

[“LotFrontage”, “1stFlrSF”, “GrLivArea”]

Check if that changes the variable distribution and its relationship with the target.

Logarithmic transformation#

# Analysing the distributions of these variables
# after applying a logarithmic transformation

tmp = data.copy()

for var in ["LotFrontage", "1stFlrSF", "GrLivArea"]:

    # transform the variable with logarithm
    tmp[var] = np.log(data[var])

tmp[["LotFrontage", "1stFlrSF", "GrLivArea"]].hist(bins=30)
plt.show()

../../_images/e2e90e500f319ea29f78be155e79f07292454b5b863ed69d66d5dd7bc4a3b464.png

The distribution of the variables are now more “Gaussian” looking.

Evaluating their relationship with the target.

# Plotting the original or transformed variables
# vs sale price, and see if there is a relationship

for var in ["LotFrontage", "1stFlrSF", "GrLivArea"]:

    plt.figure(figsize=(12, 4))

    # plot the original variable vs sale price
    plt.subplot(1, 2, 1)
    plt.scatter(data[var], np.log(data["SalePrice"]))
    plt.ylabel("Sale Price")
    plt.xlabel("Original " + var)

    # plot transformed variable vs sale price
    plt.subplot(1, 2, 2)
    plt.scatter(tmp[var], np.log(tmp["SalePrice"]))
    plt.ylabel("Sale Price")
    plt.xlabel("Transformed " + var)

    plt.show()

../../_images/745d6ec3114b359b08d1bd39b5cd8a630c9152bc49a4b4b68c1c2c1aec39f18d.png

../../_images/48613481556786f1b0310ba52a9590455fcb3859f4f40f7d09ea25e824f477e5.png

../../_images/02e12a02739d1c9963d24c8490c69b05434b88370b152e1dfc6e6aaa1244d768.png

The transformed variables have a better spread of the values, which may in turn, help make better predictions.

Skewed variables#

Transforming skewed vars into binary variables to check how predictive they are:

for var in skewed:

    tmp = data.copy()

    # map the variable values into 0 and 1
    tmp[var] = np.where(data[var] == 0, 0, 1)

    # determine mean sale price in the mapped values
    tmp = tmp.groupby(var)["SalePrice"].agg(["mean", "std"])

    # plot into a bar graph
    tmp.plot(
        kind="barh",
        y="mean",
        legend=False,
        xerr="std",
        title="Sale Price",
        color="green",
    )

    plt.show()

../../_images/88c42bb17ac130a7a0f92f67c2f34948a9c51b30213de2c5e1c4c92c20e3bfe9.png

../../_images/c2d80126220088d8c712663a506b7b283be3f5d0e892613fdb36b777bb888454.png

../../_images/97004c82b611d13770f8b5bcd7d6d82ce4d64d2c2248698da7b54c36f8e6be51.png

../../_images/429d0bda500d780570470cae50f7d2bef1f6d57afcf33cb7e6780d6785e90a49.png

../../_images/318fb98c5e2b7abdd5ff8fd9c89c1e7a011fc7663eaed0ac2dcd64b08c529f9c.png

../../_images/144b101d85c896a060ce98c0346a962f8b37fb86004813bd1155c116c8bbb9ca.png

There seem to be a difference in Sale Price in the mapped values, but the confidence intervals overlap, so most likely this is not significant or predictive.

Categorical variables#

Analysing the categorical variables present in the dataset.

print("Number of categorical variables: ", len(cat_vars))

Number of categorical variables:  44

# Visualising the values of the categorical variables
data[cat_vars].head()

	MSZoning	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinType2	Heating	HeatingQC	CentralAir	Electrical	KitchenQual	Functional	FireplaceQu	GarageType	GarageFinish	GarageQual	GarageCond	PavedDrive	PoolQC	Fence	MiscFeature	SaleType	SaleCondition	MSSubClass
0	RL	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	Gable	CompShg	VinylSd	VinylSd	BrkFace	Gd	TA	PConc	Gd	TA	No	GLQ	Unf	GasA	Ex	Y	SBrkr	Gd	Typ	NaN	Attchd	RFn	TA	TA	Y	NaN	NaN	NaN	WD	Normal	60
1	RL	Pave	NaN	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	Gable	CompShg	MetalSd	MetalSd	None	TA	TA	CBlock	Gd	TA	Gd	ALQ	Unf	GasA	Ex	Y	SBrkr	TA	Typ	TA	Attchd	RFn	TA	TA	Y	NaN	NaN	NaN	WD	Normal	20
2	RL	Pave	NaN	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	Gable	CompShg	VinylSd	VinylSd	BrkFace	Gd	TA	PConc	Gd	TA	Mn	GLQ	Unf	GasA	Ex	Y	SBrkr	Gd	Typ	TA	Attchd	RFn	TA	TA	Y	NaN	NaN	NaN	WD	Normal	60
3	RL	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	Gable	CompShg	Wd Sdng	Wd Shng	None	TA	TA	BrkTil	TA	Gd	No	ALQ	Unf	GasA	Gd	Y	SBrkr	Gd	Typ	Gd	Detchd	Unf	TA	TA	Y	NaN	NaN	NaN	WD	Abnorml	70
4	RL	Pave	NaN	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	Gable	CompShg	VinylSd	VinylSd	BrkFace	Gd	TA	PConc	Gd	TA	Av	GLQ	Unf	GasA	Ex	Y	SBrkr	Gd	Typ	TA	Attchd	RFn	TA	TA	Y	NaN	NaN	NaN	WD	Normal	60

Number of labels: cardinality#

How many categories are present in each of the variables?

# counting unique categories with pandas unique()
# and then plotting them in descending order

data[cat_vars].nunique().sort_values(ascending=False).plot.bar(figsize=(12, 5))

<AxesSubplot: >

../../_images/c669bcb9ac618d2eca8641b053c281f43debcc6857cd875eafd2a1b6d2ec9f68.png

All the categorical variables show low cardinality, this means that they have only few different labels.

Quality variables#

There are a number of variables that refer to the quality of some aspect of the house, for example the garage, or the fence, or the kitchen. Replace these categories by numbers increasing with the quality of the place or room.

The mappings can be obtained from the Kaggle Website. One example:

Ex = Excellent
Gd = Good
TA = Average/Typical
Fa = Fair
Po = Poor

# Re-mapping strings to number, which determine quality

qual_mappings = {
    "Po": 1,
    "Fa": 2,
    "TA": 3,
    "Gd": 4,
    "Ex": 5,
    "Missing": 0,
    "NA": 0,
}

qual_vars = [
    "ExterQual",
    "ExterCond",
    "BsmtQual",
    "BsmtCond",
    "HeatingQC",
    "KitchenQual",
    "FireplaceQu",
    "GarageQual",
    "GarageCond",
]

for var in qual_vars:
    data[var] = data[var].map(qual_mappings)

exposure_mappings = {"No": 1, "Mn": 2, "Av": 3, "Gd": 4, "Missing": 0, "NA": 0}

var = "BsmtExposure"

data[var] = data[var].map(exposure_mappings)

finish_mappings = {
    "Missing": 0,
    "NA": 0,
    "Unf": 1,
    "LwQ": 2,
    "Rec": 3,
    "BLQ": 4,
    "ALQ": 5,
    "GLQ": 6,
}

finish_vars = ["BsmtFinType1", "BsmtFinType2"]

for var in finish_vars:
    data[var] = data[var].map(finish_mappings)

garage_mappings = {"Missing": 0, "NA": 0, "Unf": 1, "RFn": 2, "Fin": 3}

var = "GarageFinish"

data[var] = data[var].map(garage_mappings)

fence_mappings = {
    "Missing": 0,
    "NA": 0,
    "MnWw": 1,
    "GdWo": 2,
    "MnPrv": 3,
    "GdPrv": 4,
}

var = "Fence"

data[var] = data[var].map(fence_mappings)

# capture all quality variables

qual_vars = qual_vars + finish_vars + ["BsmtExposure", "GarageFinish", "Fence"]

# Plotting the house mean sale price based on the quality of the
# various attributes

for var in qual_vars:
    # boxplot with Catplot
    sns.catplot(
        x=var, y="SalePrice", data=data, kind="box", height=4, aspect=1.5
    )
    # data points to boxplot with stripplot
    sns.stripplot(
        x=var, y="SalePrice", data=data, jitter=0.1, alpha=0.3, color="k"
    )
    plt.show()

../../_images/9100db0c07f03e63fafb6706e39e853c3b2e464a2df01bdf72d404e2539fdd20.png

../../_images/794d3666faf36fc7ea4743aa9cd85430ab70598011b737c29a057fe5295637ef.png

../../_images/c9b945bb50288d7eab2f840220cc5d71337f6bb063f7ad819262eb233a9f96f8.png

../../_images/c467cc50ddf5558c160a801323e2e64e1f895671ce3a43257638112bcd2053bb.png

../../_images/8fd1900002d0261123df332770ddf5cae8f8b7bdea92ad8be21c068844f774b8.png

../../_images/c46b7339ef6c719ef7956dfa8fdc05e73730dc3fd5513f3971efe9bdd560a795.png

../../_images/d5a8232a8893b2cebe1927b4cf64eb8c526755b078726762081ed20fc951eb2f.png

../../_images/9cb49947f45b8ff619163e3143a0d7a7881cab67e5824a71bf3fa786f4bbc811.png

../../_images/c3dce994b8e767a97ba768c144ef558083a938d6a3f5039cdac5f7646987eb5e.png

../../_images/46112c2b963de3477570a72ef90f37843d344e682751a125c4e5d8a25964f73d.png

../../_images/7f077ed2c261c3de21ac2807cc799ed0d537e2d2d2cc93061f348843afe49837.png

../../_images/4556839c9de76802bd94a743b91454dbe51f0a29119c588470bd37e86a183192.png

../../_images/de1fc5dca528d424a841ebdea93f8a6f2b18df17a5c9160e1d8345c0b5ff767d.png

../../_images/25d821ee0b6bbcd9b9fe970eab4f1fdef77c0ec14ff2727ec6ee907432877c08.png

For most attributes, the increase in the house price with the value of the variable, is quite clear.

# Capturing the remaining categorical variables
# (those that we did not re-map)
cat_others = [var for var in cat_vars if var not in qual_vars]

len(cat_others)

Rare labels#

Investigating if there are labels that are present only in a small number of houses:

def analyse_rare_labels(df, var, rare_perc):
    df = df.copy()

    # determine the % of observations per category
    temp = df.groupby(var)["SalePrice"].count() / len(df)

    # return categories that are rare
    return temp[temp < rare_perc]


# print categories that are present in less than
# 1 % of the observations

for var in cat_others:
    print(analyse_rare_labels(data, var, 0.01))
    print()

MSZoning
C (all)    0.006849
Name: SalePrice, dtype: float64

Street
Grvl    0.00411
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

LotShape
IR3    0.006849
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

Utilities
NoSeWa    0.000685
Name: SalePrice, dtype: float64

LotConfig
FR3    0.00274
Name: SalePrice, dtype: float64

LandSlope
Sev    0.008904
Name: SalePrice, dtype: float64

Neighborhood
Blueste    0.001370
NPkVill    0.006164
Veenker    0.007534
Name: SalePrice, dtype: float64

Condition1
PosA    0.005479
RRAe    0.007534
RRNe    0.001370
RRNn    0.003425
Name: SalePrice, dtype: float64

Condition2
Artery    0.001370
Feedr     0.004110
PosA      0.000685
PosN      0.001370
RRAe      0.000685
RRAn      0.000685
RRNn      0.001370
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

HouseStyle
1.5Unf    0.009589
2.5Fin    0.005479
2.5Unf    0.007534
Name: SalePrice, dtype: float64

RoofStyle
Flat       0.008904
Gambrel    0.007534
Mansard    0.004795
Shed       0.001370
Name: SalePrice, dtype: float64

RoofMatl
ClyTile    0.000685
Membran    0.000685
Metal      0.000685
Roll       0.000685
Tar&Grv    0.007534
WdShake    0.003425
WdShngl    0.004110
Name: SalePrice, dtype: float64

Exterior1st
AsphShn    0.000685
BrkComm    0.001370
CBlock     0.000685
ImStucc    0.000685
Stone      0.001370
Name: SalePrice, dtype: float64

Exterior2nd
AsphShn    0.002055
Brk Cmn    0.004795
CBlock     0.000685
ImStucc    0.006849
Other      0.000685
Stone      0.003425
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

Foundation
Stone    0.004110
Wood     0.002055
Name: SalePrice, dtype: float64

Heating
Floor    0.000685
Grav     0.004795
OthW     0.001370
Wall     0.002740
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

Electrical
FuseP    0.002055
Mix      0.000685
Name: SalePrice, dtype: float64

Functional
Maj1    0.009589
Maj2    0.003425
Sev     0.000685
Name: SalePrice, dtype: float64

GarageType
2Types     0.004110
CarPort    0.006164
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

PoolQC
Ex    0.001370
Fa    0.001370
Gd    0.002055
Name: SalePrice, dtype: float64

MiscFeature
Gar2    0.001370
Othr    0.001370
TenC    0.000685
Name: SalePrice, dtype: float64

SaleType
CWD      0.002740
Con      0.001370
ConLD    0.006164
ConLI    0.003425
ConLw    0.003425
Oth      0.002055
Name: SalePrice, dtype: float64

SaleCondition
AdjLand    0.002740
Alloca     0.008219
Name: SalePrice, dtype: float64

MSSubClass
40     0.002740
45     0.008219
180    0.006849
Name: SalePrice, dtype: float64

Some categorical variables show multiple labels that are present in less than 1% of the houses.

Labels that are under-represented in the dataset tend to cause over-fitting of machine learning models.

Removing.

Exploring the relationship between the categories of the different variables and the house sale price:

for var in cat_others:
    # make boxplot with Catplot
    sns.catplot(
        x=var, y="SalePrice", data=data, kind="box", height=4, aspect=1.5
    )
    # add data points to boxplot with stripplot
    sns.stripplot(
        x=var, y="SalePrice", data=data, jitter=0.1, alpha=0.3, color="k"
    )
    plt.show()

../../_images/b4df98e8042efb1ce7fcb232a3c4bf1e9b7006e2236fbb224d6cbc48638145bf.png

../../_images/cefd1d35a3927b698cfcdd369209915b42732200747d933500fb465106a58f79.png

../../_images/9bebf453a12d1c6158bd1b613a34872363e7b9fffc091eaaaa4f0f44fa714e04.png

../../_images/45fe94761ecb9023200e4a74e4812cd57614d23b5ef8d9031fd22f10703f27fe.png

../../_images/d8ff9040f8c64c4223e004372ba79ad5b26e7523478eabd6a1fd50c3dd767f24.png

../../_images/c9f929c3834ba51155a86d3b6ac6aad6de10c36984ffc3e440511863fec5b3b9.png

../../_images/1d255175a85b9740d85ab6363a81631d98b46b9ee119472ba8550d98922121d7.png

../../_images/7159f68fe2c402c10946ac55d8ecd6718c6a8b9540b02df380fa87b27b9f68b1.png

../../_images/8e69edada35e8675ccb5e24b8a2d6cc71f7ee4cbf1684471ca4d3c79f651f9b2.png

../../_images/83a72714ac2464fad73261fdc575a41f9644855e340736d840f4050199d2d449.png

../../_images/4d2d23e338bf58df16281a6919e55def0dc6f6532b9b47a3af898ac26d50fb57.png

../../_images/1810183844a40886ab74a8fb5c8b1ad03e5a1ea3f0b9034e4376705fb1c37321.png

../../_images/2a62ae985ffee5cbba4fcdba0e1f3645fce069efeef9c3ba50b6d15ac6abb1a8.png

../../_images/9580b2e33aa11436aad581e0581fa07b13576097eb79aa4280fb76b8fc4b073b.png

../../_images/59a68ec206c3e1081b094af27222fccb99924be0a0b1d7d9dd791449974b364c.png

../../_images/0259c4ef9e2ff7202daa67565bb6454e78833ae21930472b8aeb09420777edc6.png

../../_images/e87f3911edb169986e58d166ccfb281388459f8b1286feebfd019000cc6bb067.png

../../_images/5c739ee30d2f6d17cd38d388f308c5d47c50d8f4549317644fed4b21719d330b.png

../../_images/024a9c190c04a9cbdd46f0496cf20686afbde7c508d925dde8133d2f1666269c.png

../../_images/16c8aafe5980e0e062f53743856cae996f98b2dfd17a6689edb33f5ab9b3d401.png

../../_images/88b976216ee83f19fabda935cba5d936b50d110d150fc89df77b03ca5f9dcc88.png

../../_images/7eba85cb8ab6f9030d42d5e1541bd5a4d061af28604f587c20726003b9b971e5.png

../../_images/76b516522913ce5d4e4bdedde44fb2857feab01ec4edc9e27e96e67ba9269e97.png

../../_images/4cc6c549452fa45d834d7f0bd0ec8350e9865f0ccbd9e7f9cea4b19c2e2d9c38.png

../../_images/a15ea773a6777e6898d70f1a807e75c2dbcce37184bc5207e10e69c4bdf7bc9f.png

../../_images/539ebd96f1acc2e051d82c73054b3f4c11c3875a7e8e632b0a5aabaaecaa1a05.png

../../_images/d9fb994312b226b834c86ee0471f8c05f883739b2db91ae2cd49e920413b63d0.png

../../_images/899ebe338e0f20f0da30da977a0ab72a1f88945367431e8fe007f26cb1186f38.png

../../_images/72dcd2dd5c6d76e79430184eb9c6494735ee6faa52cce864a5eda5042bd2bb6b.png

../../_images/a9d51059de919ee0676bfd68d89739f2229f464b2bf2d72280edfb1820559927.png

Clearly, the categories give information on the SalePrice, as different categories show different median sale prices.

Disclaimer:

There is certainly more that can be done to understand the nature of this data and the relationship of these variables with the target, SalePrice. And also about the distribution of the variables themselves.

Additional Resources#

Feature Engineering for Machine Learning - Online Course
Packt Feature Engineering Cookbook - Book
Predict house price with Feature-engine - Kaggle kernel
Comprehensive data exploration with Python - Kaggle kernel
How I made top 0.3% on a Kaggle competition - Kaggle kernel

Table of Contents

Books