House Price Prediction - Ames Housing Dataset

Predicting a value from data is a very common data science problem, for example predicting house prices based on neighbourhood or number of acres. A popular dataset of this sort is the Ames Housing dataset, which can be found on Kaggle. In this article, I will walk through, step by step, the code I built for predicting house prices.


As usual in any data science project, import the required libraries.

# Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Import the training and testing datasets as dataframes. The datasets can be found on Kaggle.

# Importing the training and test datasets

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Let's summarise the training dataset so that we can get an overview of the data.

train.info()

The result is shown below:

Data Info

This shows that there are 80 columns in the dataset, not counting the Id column. The goal here is to predict SalePrice for the houses in the test dataset based on the other 79 features. That's a lot of variables to consider!
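
A quick sanity check on the shapes makes that split explicit; this is just a small sketch using the dataframes loaded above:

# Rows and columns of each dataset
print(train.shape)   # 79 features plus Id and SalePrice
print(test.shape)    # 79 features plus Id, with SalePrice to be predicted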

So where do we even begin?
The answer here is Exploratory Data Analysis (EDA). Before modelling the data, it is always best practice to explore the dataset in order to get an understanding of how the different features may correlate with the y-variable, which in this case is SalePrice. It also allows you to build a working hypothesis and helps guide the project in terms of which variables to assess and potentially model.

A very powerful visualisation for this, and one which I would use in any data science project, is a correlation heatmap between all numerical columns in the training dataset and SalePrice.

# Subset of the training dataset with only numerical columns

train_num = train.select_dtypes(exclude='object')

# Correlation of every numerical column with SalePrice, sorted in descending order

corr_sp = train_num.corr()[['SalePrice']].sort_values(by='SalePrice', ascending=False)

# Correlation heatmap for SalePrice

plt.figure(figsize=(16, 14), dpi=500)
chart = sns.heatmap(corr_sp, vmin=-1, vmax=1, cmap='BrBG')
chart.set_title('Sale Price Correlation Heatmap', fontdict={'fontsize': 12}, pad=12)
plt.show()

The result is as shown below:

Correlation Heatmap of SalePrice

Aside from SalePrice itself, this visual shows that the top five numerical features that correlate most strongly with SalePrice are: OverallQual, GrLivArea, GarageCars, GarageArea and TotalBsmtSF. We have narrowed the list of potential features from 79 down to just five with a single visual, which is why this plot is so powerful.
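
To read the same ranking off programmatically rather than from the colour scale, a short snippet like this (reusing corr_sp from the heatmap step) lists the strongest correlations:

# Top five numerical features most correlated with SalePrice
print(corr_sp.drop('SalePrice').head(5))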

As a working hypothesis, one could say that a larger living space means higher house prices. We can explore this relationship further by plotting a scatter graph between SalePrice and GrLivArea with the following code:

plt.scatter(train['GrLivArea'], train['SalePrice'], color = 'red')
plt.title('Sale Price vs GrLivArea - Training Set')
plt.xlabel('GrLivArea')
plt.ylabel('Sale Price')
plt.show()

The result is as shown below:

As shown in the figure above, aside from a few outliers, there is a roughly linear relationship between GrLivArea and SalePrice. This supports our working hypothesis.
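
To put a number on that visual impression, the pairwise correlation can be printed directly:

# Pearson correlation between GrLivArea and SalePrice
print(train['GrLivArea'].corr(train['SalePrice']))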

Now it's time for modelling. For simplicity, I will use just the five features we found so far to train a model.

features = ['OverallQual', 'GrLivArea', 
            'GarageCars', 'GarageArea', 'TotalBsmtSF']

X_train = train[features].copy()
y_train = train['SalePrice'].values
X_test = test[features].copy()

Treat missing values in the test features by imputing with the median of each column.

# Check which feature columns contain missing values in the test set
print(X_test.isnull().sum())

# Impute the missing values with each column's median
X_test['GarageCars'] = X_test['GarageCars'].fillna(X_test['GarageCars'].median())
X_test['GarageArea'] = X_test['GarageArea'].fillna(X_test['GarageArea'].median())
X_test['TotalBsmtSF'] = X_test['TotalBsmtSF'].fillna(X_test['TotalBsmtSF'].median())
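
It is also worth confirming that the same five features are complete in the training set before fitting anything; a check along these lines should come back with zero missing values for each column:

# Confirm there are no missing values in the training features
print(X_train.isnull().sum())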

For my model, I used a Random Forest Regressor.

# Training Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
rfg = RandomForestRegressor()
rfg.fit(X_train, y_train)
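
Before submitting, it can help to estimate performance locally. Here is a minimal sketch (assuming a reasonably recent version of scikit-learn) that cross-validates a fresh RandomForestRegressor on the log of SalePrice, which roughly mirrors the competition's RMSE-on-log-price metric; the random_state and the log transform are illustrative choices rather than part of the pipeline above.

# 5-fold cross-validation on log(SalePrice), approximating the
# competition's evaluation metric (RMSE on the log of the price)
from sklearn.model_selection import cross_val_score

cv_rmse = -cross_val_score(RandomForestRegressor(random_state=42),
                           X_train, np.log1p(y_train),
                           scoring='neg_root_mean_squared_error', cv=5)
print(cv_rmse.mean())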

After training the model, I applied it to X_test to predict values for SalePrice and submitted my results to Kaggle.

# Predicting test set results
y_pred = rfg.predict(X_test)
print(y_pred)
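
The predictions then just need to be written out in the two-column format (Id, SalePrice) used by the competition's sample_submission.csv; something along these lines works:

# Assemble and save the submission file expected by Kaggle
submission = pd.DataFrame({'Id': test['Id'], 'SalePrice': y_pred})
submission.to_csv('submission.csv', index=False)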

My score on Kaggle is shown below:

Kaggle Leaderboard Score

Now, whilst the score isn't the best, there are a few takeaways that could help improve it:

  1. Remove outliers from the data, as they influence the accuracy of the model (see the sketch after this list).
  2. Use a different model like XGBoost.
  3. Explore and incorporate categorical variables into the model.
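
On the first point, here is a minimal sketch of what outlier removal could look like, using the cut-off suggested by the GrLivArea scatter plot (the threshold is illustrative rather than tuned):

# Drop the few very large but unusually cheap houses visible in the
# GrLivArea scatter plot, then rebuild X_train and y_train from train_clean
train_clean = train[~((train['GrLivArea'] > 4000) &
                      (train['SalePrice'] < 300000))]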

Summary

When trying to model data with 70+ variables, it is crucial to conduct EDA in order to reduce the number of variables from many down to just a few. This helps with building an intuitive model that can be easily explained to business users.