EDA Ames housing set#

Analysis:

  1. The target variable

  2. Variable types (categorical and numerical)

  3. Missing data

  4. Numerical variables

    • Discrete

    • Continuous

    • Distributions

    • Transformations

  5. Categorical variables

    • Cardinality

    • Rare Labels

    • Special mappings

  6. Additional Reading Resources

Imports and packages#

 1# to handle datasets
 2import pandas as pd
 3import numpy as np
 4
 5# for plotting
 6import matplotlib.pyplot as plt
 7import seaborn as sns
 8
 9# for the yeo-johnson transformation
10import scipy.stats as stats
11
12# to display all the columns of the dataframe in the notebook
13pd.pandas.set_option("display.max_columns", None)

Paths#

1# Path to datasets directory
2data_path = "./datasets"
3# Path to assets directory (for saving results to)
4assets_path = "./assets"

Prepare the data set#

1# load dataset
2data = pd.read_csv(f"{data_path}/train.csv")
3
4# rows and columns of the data
5print(data.shape)
6
7# visualise the dataset
8data.head()
(1460, 81)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
1# drop id, it is just a number given to identify each house
2data.drop("Id", axis=1, inplace=True)
3
4data.shape
(1460, 80)

The house price dataset contains 1460 rows, that is, houses, and 80 columns, i.e., variables.

79 are predictive variables and 1 is the target variable: SalePrice

Target#

1# histogran to evaluate target distribution
2
3data["SalePrice"].hist(bins=50, density=True)
4plt.ylabel("Number of houses")
5plt.xlabel("Sale Price")
6plt.show()
../../_images/d9aedf394c25aec217bd37e55be9304de51ba88da3a21aed0669d26f2b9f2146.png

The target is continuous, and the distribution is skewed towards the right.

The value spread can be improved with a mathematical transformation.

1# Transforming the target using logarithm
2
3np.log(data["SalePrice"]).hist(bins=50, density=True)
4plt.ylabel("Number of houses")
5plt.xlabel("Log of Sale Price")
6plt.show()
../../_images/5909b3a60a7b52422c005f2a6d23a058073e5237bf9fa5a24c1c4efcee16392c.png

Now the distribution looks more Gaussian.

Variable Types#

 1# Identifying the categorical variables, capturing those of type *object*
 2
 3cat_vars = [var for var in data.columns if data[var].dtype == "O"]
 4
 5# MSSubClass is also categorical by definition, despite its numeric values
 6# (you can find the definitions of the variables in the data_description.txt
 7# file available on Kaggle, in the same website where you downloaded the data)
 8
 9# Adding MSSubClass to the list of categorical variables
10cat_vars = cat_vars + ["MSSubClass"]
11
12# Number of categorical variables
13len(cat_vars)
44
1# Casting categorical variables
2data[cat_vars] = data[cat_vars].astype("O")
1# Identifying numerical variables
2num_vars = [
3    var for var in data.columns if var not in cat_vars and var != "SalePrice"
4]
5
6# Number of numerical variables
7len(num_vars)
35

Missing values#

1# List of the variables that contain missing values
2vars_with_na = [var for var in data.columns if data[var].isnull().sum() > 0]
3
4# Percentage of missing values (expressed as decimals)
5# and displaying the result ordered by % of missing data
6
7data[vars_with_na].isnull().mean().sort_values(ascending=False)
PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
GarageCond      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrArea      0.005479
MasVnrType      0.005479
Electrical      0.000685
dtype: float64

The dataset contains a few variables with a big proportion of missing values (4 variables at the top). And some other variables with a small percentage of missing observations.

This means that to train a machine learning model with this data set, the missing data in these variables must be imputed.

Visualizing the percentage of missing values in the variables:

1# Plotting
2data[vars_with_na].isnull().mean().sort_values(ascending=False).plot.bar(
3    figsize=(10, 4)
4)
5plt.ylabel("Percentage of missing data")
6plt.axhline(y=0.90, color="r", linestyle="-")
7plt.axhline(y=0.80, color="g", linestyle="-")
8
9plt.show()
../../_images/da6fd7425680e0c3a5a5152902761d606610050051ce00c03a9f79d4e842a0e3.png
1# Determining which variables, from those with missing data,
2# are numerical and which are categorical
3cat_na = [var for var in cat_vars if var in vars_with_na]
4num_na = [var for var in num_vars if var in vars_with_na]
5
6print("Number of categorical variables with na: ", len(cat_na))
7print("Number of numerical variables with na: ", len(num_na))
Number of categorical variables with na:  16
Number of numerical variables with na:  3
1num_na
['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
1cat_na
['Alley',
 'MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']

Relationship between missing data and Sale Price#

The price of the house in those observations where the information is missing,for each variable that shows missing data.

 1def analyse_na_value(df, var):
 2
 3    # copy of the dataframe, so as not to override the original data
 4    # see the link for more details about pandas.copy()
 5    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html
 6    df = df.copy()
 7
 8    # making an interim variable that indicates 1 if the
 9    # observation was missing or 0 otherwise
10    df[var] = np.where(df[var].isnull(), 1, 0)
11
12    # comparing the median SalePrice in the observations where data is missing
13    # vs the observations where data is available
14
15    # determining the median price in the groups 1 and 0,
16    # and the standard deviation of the sale price,
17    # and capturing the results in a temporary dataset
18    temp = df.groupby(var)["SalePrice"].agg(["mean", "std"])
19
20    # plotting into a bar graph
21    temp.plot(
22        kind="barh",
23        y="mean",
24        legend=False,
25        xerr="std",
26        title="Sale Price",
27        color="green",
28    )
29
30    plt.show()
1# Running the function on each variable with missing data
2for var in vars_with_na:
3    analyse_na_value(data, var)
../../_images/ab2255f0da539293f9d60b763e57cf8a4f8cc8bbce7f6816dbe1c6b50e230ddc.png ../../_images/9dcbd74117b8b32be56d012e49fc8f787c2dbf88a004e97f13de6ab5734c160d.png ../../_images/7b26c2bd154c582c320b662201edce2468a974307d73bd5f135d7e0cfcb1c703.png ../../_images/868245ea1431b44460316e8440a6eeed63ca467a1d1f76805fa8459b9c3286ab.png ../../_images/94ab9afbcb001986000e68cfba0781b816e90f998fcbdfca82ff4cd366771cd3.png ../../_images/a1ffe1fa080842acd6fb4f3bb5059436146c063a2bb507325f7c3ec35f70a5e5.png ../../_images/eb07600bb60dd79bb40c08a123ecd9b223fb65509ffc6d282fa02101a8869e50.png ../../_images/65e242da4c3461b455dca88b45089c3cf12974ead3c56632c6d35f18608451d4.png ../../_images/3c7759e596c70d4c75ace6f998c27acb41a785ce2fd3278be28391cd3a88b8da.png ../../_images/b81e6aedc9fe413554a694fd0276de3328574220cce5662bbb1f0c384f315024.png ../../_images/ae67c1a1c76ce2d511c490f01f00b9052a08b772dbeb5496efc32c470e4f3dc1.png ../../_images/dafe65fe968407507728387ae294a1884d4d902b955a52517c8e5c9288a497c6.png ../../_images/66dd5c97b1c543d829eec6140720e7807ae68bebac7b53118dfdf00ca8782cc9.png ../../_images/9d985e5b34330af9f6710a0e6650c21a80cfc2a393caaff5d3b768ae9e06ffca.png ../../_images/fab7efb1ce5124c92dfdf94c7dd75188dfc3ffd88834148660b9ccc3161febbf.png ../../_images/b92fe589be903311dbdd52d1e0dd89be93a8c28eb5c785e5f44c5c3618aec232.png ../../_images/dbc6a5f63323859a41ed001825e7a2571d4e2bf7e3641d5c3bdda47525a6e11b.png ../../_images/0e24f8fd0a6e1e1c833f650e676db917463099a6cbe4225c7c8e72ddedb883bb.png ../../_images/3305115adac9da6630f2753f163c64b4a8856c42d785c927c49f4de7775b3555.png

In some variables, the average Sale Price in houses where the information is missing, differs from the average Sale Price in houses where information exists. This suggests that data being missing could be a good predictor of Sale Price.

Numerical variables#

1print("Number of numerical variables: ", len(num_vars))
2
3# Visualizing the numerical variables
4data[num_vars].head()
Number of numerical variables:  35
LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
0 65.0 8450 7 5 2003 2003 196.0 706 0 150 856 856 854 0 1710 1 0 2 1 3 1 8 0 2003.0 2 548 0 61 0 0 0 0 0 2 2008
1 80.0 9600 6 8 1976 1976 0.0 978 0 284 1262 1262 0 0 1262 0 1 2 0 3 1 6 1 1976.0 2 460 298 0 0 0 0 0 0 5 2007
2 68.0 11250 7 5 2001 2002 162.0 486 0 434 920 920 866 0 1786 1 0 2 1 3 1 6 1 2001.0 2 608 0 42 0 0 0 0 0 9 2008
3 60.0 9550 7 5 1915 1970 0.0 216 0 540 756 961 756 0 1717 1 0 1 0 3 1 7 1 1998.0 3 642 0 35 272 0 0 0 0 2 2006
4 84.0 14260 8 5 2000 2000 350.0 655 0 490 1145 1145 1053 0 2198 1 0 2 1 4 1 9 1 2000.0 3 836 192 84 0 0 0 0 0 12 2008

Temporal variables#

There are 4 year variables in the dataset:

  • YearBuilt: year in which the house was built

  • YearRemodAdd: year in which the house was remodeled

  • GarageYrBlt: year in which a garage was built

  • YrSold: year in which the house was sold

Extracting information from date variables in their raw format: For example, capturing the difference in years between the year the house was built and the year the house was sold.

1# List of variables that contain year information
2
3year_vars = [var for var in num_vars if "Yr" in var or "Year" in var]
4
5year_vars
['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
1# Exploring the values of these temporal variables
2for var in year_vars:
3    print(var, data[var].unique())
4    print()
YearBuilt [2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006
 1960 1929 1970 1967 1958 1930 2002 1968 2007 1951 1957 1927 1920 1966
 1959 1994 1954 1953 1955 1983 1975 1997 1934 1963 1981 1964 1999 1972
 1921 1945 1982 1998 1956 1948 1910 1995 1991 2009 1950 1961 1977 1985
 1979 1885 1919 1990 1969 1935 1988 1971 1952 1936 1923 1924 1984 1926
 1940 1941 1987 1986 2008 1908 1892 1916 1932 1918 1912 1947 1925 1900
 1980 1989 1992 1949 1880 1928 1978 1922 1996 2010 1946 1913 1937 1942
 1938 1974 1893 1914 1906 1890 1898 1904 1882 1875 1911 1917 1872 1905]

YearRemodAdd [2003 1976 2002 1970 2000 1995 2005 1973 1950 1965 2006 1962 2007 1960
 2001 1967 2004 2008 1997 1959 1990 1955 1983 1980 1966 1963 1987 1964
 1972 1996 1998 1989 1953 1956 1968 1981 1992 2009 1982 1961 1993 1999
 1985 1979 1977 1969 1958 1991 1971 1952 1975 2010 1984 1986 1994 1988
 1954 1957 1951 1978 1974]

GarageYrBlt [2003. 1976. 2001. 1998. 2000. 1993. 2004. 1973. 1931. 1939. 1965. 2005.
 1962. 2006. 1960. 1991. 1970. 1967. 1958. 1930. 2002. 1968. 2007. 2008.
 1957. 1920. 1966. 1959. 1995. 1954. 1953.   nan 1983. 1977. 1997. 1985.
 1963. 1981. 1964. 1999. 1935. 1990. 1945. 1987. 1989. 1915. 1956. 1948.
 1974. 2009. 1950. 1961. 1921. 1900. 1979. 1951. 1969. 1936. 1975. 1971.
 1923. 1984. 1926. 1955. 1986. 1988. 1916. 1932. 1972. 1918. 1980. 1924.
 1996. 1940. 1949. 1994. 1910. 1978. 1982. 1992. 1925. 1941. 2010. 1927.
 1947. 1937. 1942. 1938. 1952. 1928. 1922. 1934. 1906. 1914. 1946. 1908.
 1929. 1933.]

YrSold [2008 2007 2006 2009 2010]

As expected, the values are years.

Exploring the evolution of the sale price with the years in which the house was sold:

1# plotting median sale price vs year in which it was sold
2data.groupby("YrSold")["SalePrice"].median().plot()
3plt.ylabel("Median House Price")
Text(0, 0.5, 'Median House Price')
../../_images/6526852d752eaf853ee1f7e79bc8e8e1ffc50d19336434ca104fd50dc96229d5.png

There has been a drop in the value of the houses. That is unusual, in real life, house prices typically go up as years go by.

Plotting the price of sale vs year in which it was built

1# plotting median sale price vs year in which it was built
2
3data.groupby("YearBuilt")["SalePrice"].median().plot()
4plt.ylabel("Median House Price")
Text(0, 0.5, 'Median House Price')
../../_images/9ed583062d0a1fd0c66380c40bcae4cb882111faa40260e62131d7b7d3995692.png

Newly built / younger houses tend to be more expensive.

Could it be that lately older houses were sold?

Capture the elapsed years between the Year variables and the year in which the house was sold:

 1def analyse_year_vars(df, var):
 2
 3    df = df.copy()
 4
 5    # capturing difference between a year variable and year
 6    # in which the house was sold
 7    df[var] = df["YrSold"] - df[var]
 8
 9    df.groupby("YrSold")[var].median().plot()
10    plt.ylabel("Time from " + var)
11    plt.show()
12
13
14for var in year_vars:
15    if var != "YrSold":
16        analyse_year_vars(data, var)
../../_images/e032a7a799b6ee8fc6347621822249fa265ca52633c7da2467f1e4f5d939dd0b.png ../../_images/74541c937cc5f6c1bd6cb998c8d02aa633731b6cca3837a6ef7de3761d67996b.png ../../_images/f5af84bd074f530af4e5012211b94cee336986193f84de78c023901940b48c11.png

Towards 2010, the houses sold had older garages, and had not been remodelled recently, that might explain why we see cheaper sales prices in recent years, at least in this dataset.

Plotting instead the time since last remodelled, or time since built, and sale price, to see if there is a relationship.

 1def analyse_year_vars(df, var):
 2
 3    df = df.copy()
 4
 5    # capturing difference between a year variable and year
 6    # in which the house was sold
 7    df[var] = df["YrSold"] - df[var]
 8
 9    plt.scatter(df[var], df["SalePrice"])
10    plt.ylabel("SalePrice")
11    plt.xlabel(var)
12    plt.show()
13
14
15for var in year_vars:
16    if var != "YrSold":
17        analyse_year_vars(data, var)
../../_images/c741c40c6c3f7a9212491f2e831d515d06f328e0e9e2afc2eb12320167987d30.png ../../_images/07dd8f7cc8647c39702f615d0c85972df320868c04786440b19c8b8ee49ee76d.png ../../_images/279913dbe91f7df08d893ef6421f4b9aa69e3727045e13a7cf78bd68b662438a.png

There seems to be a tendency to a decrease in price, with older houses. In other words, the longer the time between the house was built or remodeled and sale date, the lower the sale Price.

Which makes sense, cause this means that the house will have an older look, and potentially needs repairs.

Discrete variables#

1#  List of discrete variables
2discrete_vars = [
3    var
4    for var in num_vars
5    if len(data[var].unique()) < 20 and var not in year_vars
6]
7
8print("Number of discrete variables: ", len(discrete_vars))
Number of discrete variables:  13
1# Visualise the discrete variables
2
3data[discrete_vars].head()
OverallQual OverallCond BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageCars PoolArea MoSold
0 7 5 1 0 2 1 3 1 8 0 2 0 2
1 6 8 0 1 2 0 3 1 6 1 2 0 5
2 7 5 1 0 2 1 3 1 6 1 2 0 9
3 7 5 1 0 1 0 3 1 7 1 3 0 2
4 8 5 1 0 2 1 4 1 9 1 3 0 12

These discrete variables tend to be qualifications (Qual) or grading scales (Cond), or refer to the number of rooms, or units (FullBath, GarageCars), or indicate the area of the room (KitchenAbvGr).

We expect higher prices, with bigger numbers. Analysing their contribution to the house price.

MoSold is the month in which the house was sold.

 1for var in discrete_vars:
 2    # make boxplot with Catplot
 3    sns.catplot(
 4        x=var, y="SalePrice", data=data, kind="box", height=4, aspect=1.5
 5    )
 6    # add data points to boxplot with stripplot
 7    sns.stripplot(
 8        x=var, y="SalePrice", data=data, jitter=0.1, alpha=0.3, color="k"
 9    )
10    plt.show()
../../_images/fb79507ea92fcf68fe3ad39033576549965be590622730b462946f08abf3135c.png ../../_images/b1c4514177769ff1c79780415cb73974c6cef67fcf9844dc8938eec1c49655d6.png ../../_images/6cb4fa0b5a052597d3dd32ab4472705b87ba4f1ac5314a3e4c755622549c7089.png ../../_images/7aac81071e40084adcf3bfa172557042d98ca4f751791a91f6f0da43a7e987c9.png ../../_images/1006500d9e6536bf9f5a6ba0af3b6b6205466491c61a067581ead6973cc387a0.png ../../_images/44e7773d29967fb3a8df632723c8714cb147d6f841557224483eeeb9e4e2fd34.png ../../_images/f5b81ba14fbaf3b9241a696bbd8133de54a2475597176786728b59cf59bc7147.png ../../_images/509d2b7c01dc60333e62ffe3560aee13abe884270624bddcbbc0999601e20f8c.png ../../_images/aa1f7ab6dc309fabe1e08a17568481a10e61fed1e6be7b7569c9ae5cdbed6b0f.png ../../_images/6d792390eeeb576e74853d149c52a889692be3ac87e34b1e1794ea0f7e29f32d.png ../../_images/9f83076787265952018e60697a93d6a58532cd38faa95a60e6c9185175d498bd.png ../../_images/9d9c11e4ff441c20d2ea1c2de0b4f80f3b0f8e3335cd98a3727129cc1ecaf0fd.png ../../_images/236ba4c11a7681d232f84fabd5073e6f90a86e824e191373b8dd328c67b42ae4.png

For most discrete numerical variables, the sale price increases with the quality, or overall condition, or number of rooms, or surface.

For some variables, this tendency can not be seen. Most likely that variable is not a good predictor of sale price.

Continuous variables#

Distribution of the continuous variables.

1# List of continuous variables
2cont_vars = [var for var in num_vars if var not in discrete_vars + year_vars]
3
4print("Number of continuous variables: ", len(cont_vars))
Number of continuous variables:  18
1# Visualising the continuous variables
2
3data[cont_vars].head()
LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch MiscVal
0 65.0 8450 196.0 706 0 150 856 856 854 0 1710 548 0 61 0 0 0 0
1 80.0 9600 0.0 978 0 284 1262 1262 0 0 1262 460 298 0 0 0 0 0
2 68.0 11250 162.0 486 0 434 920 920 866 0 1786 608 0 42 0 0 0 0
3 60.0 9550 0.0 216 0 540 756 961 756 0 1717 642 0 35 272 0 0 0
4 84.0 14260 350.0 655 0 490 1145 1145 1053 0 2198 836 192 84 0 0 0 0
1# Plotting histograms for all continuous variables
2
3data[cont_vars].hist(bins=30, figsize=(15, 15))
4plt.show()
../../_images/3c35bef82402de59bbb44560fe240f5a1f914e6d1644c0921d7012f168a76564.png

The variables are not normally distributed. And there are a particular few that are extremely skewed like 3SsnPorch, ScreenPorch and MiscVal.

Sometimes, transforming the variables to improve the value spread, improves the model performance. But it is unlikely that a transformation will help change the distribution of the super skewed variables dramatically.

Applying a Yeo-Johnson transformation to variables like LotFrontage, LotArea, BsmUnfSF, and a binary transformation to variables like 3SsnPorch, ScreenPorch and MiscVal.

 1# List with the super skewed variables for later
 2
 3skewed = [
 4    "BsmtFinSF2",
 5    "LowQualFinSF",
 6    "EnclosedPorch",
 7    "3SsnPorch",
 8    "ScreenPorch",
 9    "MiscVal",
10]
 1# Capturing the remaining continuous variables
 2
 3cont_vars = [
 4    "LotFrontage",
 5    "LotArea",
 6    "MasVnrArea",
 7    "BsmtFinSF1",
 8    "BsmtUnfSF",
 9    "TotalBsmtSF",
10    "1stFlrSF",
11    "2ndFlrSF",
12    "GrLivArea",
13    "GarageArea",
14    "WoodDeckSF",
15    "OpenPorchSF",
16]

Yeo-Johnson transformation#

 1# Analysing the distributions of the variables
 2# after applying a yeo-johnson transformation
 3
 4# temporary copy of the data
 5tmp = data.copy()
 6
 7for var in cont_vars:
 8
 9    # transform the variable - yeo-johsnon
10    tmp[var], param = stats.yeojohnson(data[var])
11
12
13# plot the histograms of the transformed variables
14tmp[cont_vars].hist(bins=30, figsize=(15, 15))
15plt.show()
../../_images/cd2b470b487ebd21823478a0a3b335fb880028ac194640421b7b7ab4344bf126.png

For LotFrontage and MasVnrArea the transformation did not do an amazing job.

For the others, the values seem to be spread more evenly in the range.

Whether this helps improve the predictive power, remains to be seen. To determine if this is the case, train a model with the original values and one with the transformed values, and determine model performance, and feature importance.

Here, we do a quick visual exploration instead:

 1# Plotting the original or transformed variables
 2# vs sale price, and see if there is a relationship
 3
 4for var in cont_vars:
 5
 6    plt.figure(figsize=(12, 4))
 7
 8    # plot the original variable vs sale price
 9    plt.subplot(1, 2, 1)
10    plt.scatter(data[var], np.log(data["SalePrice"]))
11    plt.ylabel("Sale Price")
12    plt.xlabel("Original " + var)
13
14    # plot transformed variable vs sale price
15    plt.subplot(1, 2, 2)
16    plt.scatter(tmp[var], np.log(tmp["SalePrice"]))
17    plt.ylabel("Sale Price")
18    plt.xlabel("Transformed " + var)
19
20    plt.show()
../../_images/06a835854652595dcc0a3971b4d238331d21c241fb5c01217d66717bea007bec.png ../../_images/ca736d06fa31a41d3f071f7087ed08d51c374c030a4594d4764bc393149f5119.png ../../_images/8b890f81e9fd1d7d425edbca08d9496fafdea25092e1f393b57688a4f4afe04d.png ../../_images/103b5394cf338b3279fb7239e243f0a6f9f6a5aafc7e4ef6d2eb3facaed6c481.png ../../_images/397978c6d8ec050c76ed518adcf3ced1cf5216811aa7095b4157f6751f112f9c.png ../../_images/44a98d3c2d8e5cf1e2123f9332a6871135a90a7c19aa4363ba26fa6117eb502b.png ../../_images/23edf9ad35429d79b545d4ffd5a99dec4025be07f559bf69dd2ebbdb02172ba2.png ../../_images/43d34dfc63eb3e12192a8c9deb272f9894fd01dec274a8953db3547b18148028.png ../../_images/f213ae8714deb6a76b800cab270f0e996b4c825dcf71c5ef4d573e9edbabb8d5.png ../../_images/2246d8e379abb268a1e4c26d659bea15e83f88ed8799c94d35d3341ca805f63e.png ../../_images/523811cc583aa97364468483e244473564e8533001f7b4db5b67f98b47a2e844.png ../../_images/17be57e8b41326143b38a604c65b554e2205abc7646f2621972f38c98bf91640.png

By eye, the transformations seems to improve the relationship only for LotArea.

A different transformation: Most variables contain the value 0, and thus the logarithmic transformation can not be applied, but it can be applied for the following variables:

[“LotFrontage”, “1stFlrSF”, “GrLivArea”]

Check if that changes the variable distribution and its relationship with the target.

Logarithmic transformation#

 1# Analysing the distributions of these variables
 2# after applying a logarithmic transformation
 3
 4tmp = data.copy()
 5
 6for var in ["LotFrontage", "1stFlrSF", "GrLivArea"]:
 7
 8    # transform the variable with logarithm
 9    tmp[var] = np.log(data[var])
10
11tmp[["LotFrontage", "1stFlrSF", "GrLivArea"]].hist(bins=30)
12plt.show()
../../_images/e2e90e500f319ea29f78be155e79f07292454b5b863ed69d66d5dd7bc4a3b464.png

The distribution of the variables are now more “Gaussian” looking.

Evaluating their relationship with the target.

 1# Plotting the original or transformed variables
 2# vs sale price, and see if there is a relationship
 3
 4for var in ["LotFrontage", "1stFlrSF", "GrLivArea"]:
 5
 6    plt.figure(figsize=(12, 4))
 7
 8    # plot the original variable vs sale price
 9    plt.subplot(1, 2, 1)
10    plt.scatter(data[var], np.log(data["SalePrice"]))
11    plt.ylabel("Sale Price")
12    plt.xlabel("Original " + var)
13
14    # plot transformed variable vs sale price
15    plt.subplot(1, 2, 2)
16    plt.scatter(tmp[var], np.log(tmp["SalePrice"]))
17    plt.ylabel("Sale Price")
18    plt.xlabel("Transformed " + var)
19
20    plt.show()
../../_images/745d6ec3114b359b08d1bd39b5cd8a630c9152bc49a4b4b68c1c2c1aec39f18d.png ../../_images/48613481556786f1b0310ba52a9590455fcb3859f4f40f7d09ea25e824f477e5.png ../../_images/02e12a02739d1c9963d24c8490c69b05434b88370b152e1dfc6e6aaa1244d768.png

The transformed variables have a better spread of the values, which may in turn, help make better predictions.

Skewed variables#

Transforming skewed vars into binary variables to check how predictive they are:

 1for var in skewed:
 2
 3    tmp = data.copy()
 4
 5    # map the variable values into 0 and 1
 6    tmp[var] = np.where(data[var] == 0, 0, 1)
 7
 8    # determine mean sale price in the mapped values
 9    tmp = tmp.groupby(var)["SalePrice"].agg(["mean", "std"])
10
11    # plot into a bar graph
12    tmp.plot(
13        kind="barh",
14        y="mean",
15        legend=False,
16        xerr="std",
17        title="Sale Price",
18        color="green",
19    )
20
21    plt.show()
../../_images/88c42bb17ac130a7a0f92f67c2f34948a9c51b30213de2c5e1c4c92c20e3bfe9.png ../../_images/c2d80126220088d8c712663a506b7b283be3f5d0e892613fdb36b777bb888454.png ../../_images/97004c82b611d13770f8b5bcd7d6d82ce4d64d2c2248698da7b54c36f8e6be51.png ../../_images/429d0bda500d780570470cae50f7d2bef1f6d57afcf33cb7e6780d6785e90a49.png ../../_images/318fb98c5e2b7abdd5ff8fd9c89c1e7a011fc7663eaed0ac2dcd64b08c529f9c.png ../../_images/144b101d85c896a060ce98c0346a962f8b37fb86004813bd1155c116c8bbb9ca.png

There seem to be a difference in Sale Price in the mapped values, but the confidence intervals overlap, so most likely this is not significant or predictive.

Categorical variables#

Analysing the categorical variables present in the dataset.

1print("Number of categorical variables: ", len(cat_vars))
Number of categorical variables:  44
1# Visualising the values of the categorical variables
2data[cat_vars].head()
MSZoning Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Heating HeatingQC CentralAir Electrical KitchenQual Functional FireplaceQu GarageType GarageFinish GarageQual GarageCond PavedDrive PoolQC Fence MiscFeature SaleType SaleCondition MSSubClass
0 RL Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA No GLQ Unf GasA Ex Y SBrkr Gd Typ NaN Attchd RFn TA TA Y NaN NaN NaN WD Normal 60
1 RL Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story Gable CompShg MetalSd MetalSd None TA TA CBlock Gd TA Gd ALQ Unf GasA Ex Y SBrkr TA Typ TA Attchd RFn TA TA Y NaN NaN NaN WD Normal 20
2 RL Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA Mn GLQ Unf GasA Ex Y SBrkr Gd Typ TA Attchd RFn TA TA Y NaN NaN NaN WD Normal 60
3 RL Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story Gable CompShg Wd Sdng Wd Shng None TA TA BrkTil TA Gd No ALQ Unf GasA Gd Y SBrkr Gd Typ Gd Detchd Unf TA TA Y NaN NaN NaN WD Abnorml 70
4 RL Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA Av GLQ Unf GasA Ex Y SBrkr Gd Typ TA Attchd RFn TA TA Y NaN NaN NaN WD Normal 60

Number of labels: cardinality#

How many categories are present in each of the variables?

1# counting unique categories with pandas unique()
2# and then plotting them in descending order
3
4data[cat_vars].nunique().sort_values(ascending=False).plot.bar(figsize=(12, 5))
<AxesSubplot: >
../../_images/c669bcb9ac618d2eca8641b053c281f43debcc6857cd875eafd2a1b6d2ec9f68.png

All the categorical variables show low cardinality, this means that they have only few different labels.

Quality variables#

There are a number of variables that refer to the quality of some aspect of the house, for example the garage, or the fence, or the kitchen. Replace these categories by numbers increasing with the quality of the place or room.

The mappings can be obtained from the Kaggle Website. One example:

  • Ex = Excellent

  • Gd = Good

  • TA = Average/Typical

  • Fa = Fair

  • Po = Poor

 1# Re-mapping strings to number, which determine quality
 2
 3qual_mappings = {
 4    "Po": 1,
 5    "Fa": 2,
 6    "TA": 3,
 7    "Gd": 4,
 8    "Ex": 5,
 9    "Missing": 0,
10    "NA": 0,
11}
12
13qual_vars = [
14    "ExterQual",
15    "ExterCond",
16    "BsmtQual",
17    "BsmtCond",
18    "HeatingQC",
19    "KitchenQual",
20    "FireplaceQu",
21    "GarageQual",
22    "GarageCond",
23]
24
25for var in qual_vars:
26    data[var] = data[var].map(qual_mappings)
1exposure_mappings = {"No": 1, "Mn": 2, "Av": 3, "Gd": 4, "Missing": 0, "NA": 0}
2
3var = "BsmtExposure"
4
5data[var] = data[var].map(exposure_mappings)
 1finish_mappings = {
 2    "Missing": 0,
 3    "NA": 0,
 4    "Unf": 1,
 5    "LwQ": 2,
 6    "Rec": 3,
 7    "BLQ": 4,
 8    "ALQ": 5,
 9    "GLQ": 6,
10}
11
12finish_vars = ["BsmtFinType1", "BsmtFinType2"]
13
14for var in finish_vars:
15    data[var] = data[var].map(finish_mappings)
1garage_mappings = {"Missing": 0, "NA": 0, "Unf": 1, "RFn": 2, "Fin": 3}
2
3var = "GarageFinish"
4
5data[var] = data[var].map(garage_mappings)
 1fence_mappings = {
 2    "Missing": 0,
 3    "NA": 0,
 4    "MnWw": 1,
 5    "GdWo": 2,
 6    "MnPrv": 3,
 7    "GdPrv": 4,
 8}
 9
10var = "Fence"
11
12data[var] = data[var].map(fence_mappings)
1# capture all quality variables
2
3qual_vars = qual_vars + finish_vars + ["BsmtExposure", "GarageFinish", "Fence"]
 1# Plotting the house mean sale price based on the quality of the
 2# various attributes
 3
 4for var in qual_vars:
 5    # boxplot with Catplot
 6    sns.catplot(
 7        x=var, y="SalePrice", data=data, kind="box", height=4, aspect=1.5
 8    )
 9    # data points to boxplot with stripplot
10    sns.stripplot(
11        x=var, y="SalePrice", data=data, jitter=0.1, alpha=0.3, color="k"
12    )
13    plt.show()
../../_images/9100db0c07f03e63fafb6706e39e853c3b2e464a2df01bdf72d404e2539fdd20.png ../../_images/794d3666faf36fc7ea4743aa9cd85430ab70598011b737c29a057fe5295637ef.png ../../_images/c9b945bb50288d7eab2f840220cc5d71337f6bb063f7ad819262eb233a9f96f8.png ../../_images/c467cc50ddf5558c160a801323e2e64e1f895671ce3a43257638112bcd2053bb.png ../../_images/8fd1900002d0261123df332770ddf5cae8f8b7bdea92ad8be21c068844f774b8.png ../../_images/c46b7339ef6c719ef7956dfa8fdc05e73730dc3fd5513f3971efe9bdd560a795.png ../../_images/d5a8232a8893b2cebe1927b4cf64eb8c526755b078726762081ed20fc951eb2f.png ../../_images/9cb49947f45b8ff619163e3143a0d7a7881cab67e5824a71bf3fa786f4bbc811.png ../../_images/c3dce994b8e767a97ba768c144ef558083a938d6a3f5039cdac5f7646987eb5e.png ../../_images/46112c2b963de3477570a72ef90f37843d344e682751a125c4e5d8a25964f73d.png ../../_images/7f077ed2c261c3de21ac2807cc799ed0d537e2d2d2cc93061f348843afe49837.png ../../_images/4556839c9de76802bd94a743b91454dbe51f0a29119c588470bd37e86a183192.png ../../_images/de1fc5dca528d424a841ebdea93f8a6f2b18df17a5c9160e1d8345c0b5ff767d.png ../../_images/25d821ee0b6bbcd9b9fe970eab4f1fdef77c0ec14ff2727ec6ee907432877c08.png

For most attributes, the increase in the house price with the value of the variable, is quite clear.

1# Capturing the remaining categorical variables
2# (those that we did not re-map)
3cat_others = [var for var in cat_vars if var not in qual_vars]
4
5len(cat_others)
30

Rare labels#

Investigating if there are labels that are present only in a small number of houses:

 1def analyse_rare_labels(df, var, rare_perc):
 2    df = df.copy()
 3
 4    # determine the % of observations per category
 5    temp = df.groupby(var)["SalePrice"].count() / len(df)
 6
 7    # return categories that are rare
 8    return temp[temp < rare_perc]
 9
10
11# print categories that are present in less than
12# 1 % of the observations
13
14for var in cat_others:
15    print(analyse_rare_labels(data, var, 0.01))
16    print()
MSZoning
C (all)    0.006849
Name: SalePrice, dtype: float64

Street
Grvl    0.00411
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

LotShape
IR3    0.006849
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

Utilities
NoSeWa    0.000685
Name: SalePrice, dtype: float64

LotConfig
FR3    0.00274
Name: SalePrice, dtype: float64

LandSlope
Sev    0.008904
Name: SalePrice, dtype: float64

Neighborhood
Blueste    0.001370
NPkVill    0.006164
Veenker    0.007534
Name: SalePrice, dtype: float64

Condition1
PosA    0.005479
RRAe    0.007534
RRNe    0.001370
RRNn    0.003425
Name: SalePrice, dtype: float64

Condition2
Artery    0.001370
Feedr     0.004110
PosA      0.000685
PosN      0.001370
RRAe      0.000685
RRAn      0.000685
RRNn      0.001370
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

HouseStyle
1.5Unf    0.009589
2.5Fin    0.005479
2.5Unf    0.007534
Name: SalePrice, dtype: float64

RoofStyle
Flat       0.008904
Gambrel    0.007534
Mansard    0.004795
Shed       0.001370
Name: SalePrice, dtype: float64

RoofMatl
ClyTile    0.000685
Membran    0.000685
Metal      0.000685
Roll       0.000685
Tar&Grv    0.007534
WdShake    0.003425
WdShngl    0.004110
Name: SalePrice, dtype: float64

Exterior1st
AsphShn    0.000685
BrkComm    0.001370
CBlock     0.000685
ImStucc    0.000685
Stone      0.001370
Name: SalePrice, dtype: float64

Exterior2nd
AsphShn    0.002055
Brk Cmn    0.004795
CBlock     0.000685
ImStucc    0.006849
Other      0.000685
Stone      0.003425
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

Foundation
Stone    0.004110
Wood     0.002055
Name: SalePrice, dtype: float64

Heating
Floor    0.000685
Grav     0.004795
OthW     0.001370
Wall     0.002740
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

Electrical
FuseP    0.002055
Mix      0.000685
Name: SalePrice, dtype: float64

Functional
Maj1    0.009589
Maj2    0.003425
Sev     0.000685
Name: SalePrice, dtype: float64

GarageType
2Types     0.004110
CarPort    0.006164
Name: SalePrice, dtype: float64

Series([], Name: SalePrice, dtype: float64)

PoolQC
Ex    0.001370
Fa    0.001370
Gd    0.002055
Name: SalePrice, dtype: float64

MiscFeature
Gar2    0.001370
Othr    0.001370
TenC    0.000685
Name: SalePrice, dtype: float64

SaleType
CWD      0.002740
Con      0.001370
ConLD    0.006164
ConLI    0.003425
ConLw    0.003425
Oth      0.002055
Name: SalePrice, dtype: float64

SaleCondition
AdjLand    0.002740
Alloca     0.008219
Name: SalePrice, dtype: float64

MSSubClass
40     0.002740
45     0.008219
180    0.006849
Name: SalePrice, dtype: float64

Some categorical variables show multiple labels that are present in less than 1% of the houses.

Labels that are under-represented in the dataset tend to cause over-fitting of machine learning models.

Removing.

Exploring the relationship between the categories of the different variables and the house sale price:

 1for var in cat_others:
 2    # make boxplot with Catplot
 3    sns.catplot(
 4        x=var, y="SalePrice", data=data, kind="box", height=4, aspect=1.5
 5    )
 6    # add data points to boxplot with stripplot
 7    sns.stripplot(
 8        x=var, y="SalePrice", data=data, jitter=0.1, alpha=0.3, color="k"
 9    )
10    plt.show()
../../_images/b4df98e8042efb1ce7fcb232a3c4bf1e9b7006e2236fbb224d6cbc48638145bf.png ../../_images/cefd1d35a3927b698cfcdd369209915b42732200747d933500fb465106a58f79.png ../../_images/9bebf453a12d1c6158bd1b613a34872363e7b9fffc091eaaaa4f0f44fa714e04.png ../../_images/45fe94761ecb9023200e4a74e4812cd57614d23b5ef8d9031fd22f10703f27fe.png ../../_images/d8ff9040f8c64c4223e004372ba79ad5b26e7523478eabd6a1fd50c3dd767f24.png ../../_images/c9f929c3834ba51155a86d3b6ac6aad6de10c36984ffc3e440511863fec5b3b9.png ../../_images/1d255175a85b9740d85ab6363a81631d98b46b9ee119472ba8550d98922121d7.png ../../_images/7159f68fe2c402c10946ac55d8ecd6718c6a8b9540b02df380fa87b27b9f68b1.png ../../_images/8e69edada35e8675ccb5e24b8a2d6cc71f7ee4cbf1684471ca4d3c79f651f9b2.png ../../_images/83a72714ac2464fad73261fdc575a41f9644855e340736d840f4050199d2d449.png ../../_images/4d2d23e338bf58df16281a6919e55def0dc6f6532b9b47a3af898ac26d50fb57.png ../../_images/1810183844a40886ab74a8fb5c8b1ad03e5a1ea3f0b9034e4376705fb1c37321.png ../../_images/2a62ae985ffee5cbba4fcdba0e1f3645fce069efeef9c3ba50b6d15ac6abb1a8.png ../../_images/9580b2e33aa11436aad581e0581fa07b13576097eb79aa4280fb76b8fc4b073b.png ../../_images/59a68ec206c3e1081b094af27222fccb99924be0a0b1d7d9dd791449974b364c.png ../../_images/0259c4ef9e2ff7202daa67565bb6454e78833ae21930472b8aeb09420777edc6.png ../../_images/e87f3911edb169986e58d166ccfb281388459f8b1286feebfd019000cc6bb067.png ../../_images/5c739ee30d2f6d17cd38d388f308c5d47c50d8f4549317644fed4b21719d330b.png ../../_images/024a9c190c04a9cbdd46f0496cf20686afbde7c508d925dde8133d2f1666269c.png ../../_images/16c8aafe5980e0e062f53743856cae996f98b2dfd17a6689edb33f5ab9b3d401.png ../../_images/88b976216ee83f19fabda935cba5d936b50d110d150fc89df77b03ca5f9dcc88.png ../../_images/7eba85cb8ab6f9030d42d5e1541bd5a4d061af28604f587c20726003b9b971e5.png ../../_images/76b516522913ce5d4e4bdedde44fb2857feab01ec4edc9e27e96e67ba9269e97.png ../../_images/4cc6c549452fa45d834d7f0bd0ec8350e9865f0ccbd9e7f9cea4b19c2e2d9c38.png ../../_images/a15ea773a6777e6898d70f1a807e75c2dbcce37184bc5207e10e69c4bdf7bc9f.png ../../_images/539ebd96f1acc2e051d82c73054b3f4c11c3875a7e8e632b0a5aabaaecaa1a05.png ../../_images/d9fb994312b226b834c86ee0471f8c05f883739b2db91ae2cd49e920413b63d0.png ../../_images/899ebe338e0f20f0da30da977a0ab72a1f88945367431e8fe007f26cb1186f38.png ../../_images/72dcd2dd5c6d76e79430184eb9c6494735ee6faa52cce864a5eda5042bd2bb6b.png ../../_images/a9d51059de919ee0676bfd68d89739f2229f464b2bf2d72280edfb1820559927.png

Clearly, the categories give information on the SalePrice, as different categories show different median sale prices.

Disclaimer:

There is certainly more that can be done to understand the nature of this data and the relationship of these variables with the target, SalePrice. And also about the distribution of the variables themselves.

Additional Resources#