Introduction
In this blog post, we will explore how to build a linear regression model in Python to predict home prices using the Boston Housing Dataset. The dataset describes housing in the Boston area through 14 variables: 13 features and the corresponding median home price. Our objective is to develop a predictive model that estimates prices from these features.
Code
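# Imports
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the dataset from OpenML; pinning version=1 avoids ambiguity between dataset versions
boston = fetch_openml(name='boston', version=1, as_frame=True)
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Some columns (e.g. CHAS, RAD) may load as categorical; cast everything to float
df = df.astype(float)
df['PRICE'] = boston.target

# Separate the input features (X) from the target (y)
X = df.drop('PRICE', axis=1)
y = df['PRICE']

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression model on the training set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Convert X_test to a numpy array (optional: lr.predict(X_test) also works
# and avoids a feature-name warning in recent scikit-learn versions)
X_test_array = np.array(X_test)
y_pred = lr.predict(X_test_array)

# Evaluate on the held-out test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('MSE:', mse)
print('R2:', r2)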
Understanding the Boston Housing Dataset
The Boston Housing Dataset is a popular dataset for regression tasks. It consists of 506 samples, each describing a census tract in the Boston area, and contains 13 feature variables such as per-capita crime rate, average number of rooms per dwelling, and accessibility to highways. The target variable, which we aim to predict, is the median value of owner-occupied homes in the tract, recorded in thousands of dollars.
Loading and Preparing the Dataset
To begin, we import the necessary libraries and load the Boston Housing Dataset with the `fetch_openml` function from `sklearn.datasets` (the older `load_boston` helper was removed in scikit-learn 1.2). The data is stored in a pandas DataFrame for further analysis, with the target added as a `PRICE` column. We then separate the input features (X) from the target variable (y) using the `drop` method.
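The feature/target separation can be tried without downloading anything. The sketch below uses a hypothetical three-row frame (the values and the `toy`, `X_toy`, and `y_toy` names are made up for illustration) to show that `drop` returns a new frame and leaves the original intact.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the numbers are invented.
toy = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1],
    "CRIM": [0.02, 0.30, 0.05],
    "PRICE": [24.0, 19.5, 33.1],
})

# drop returns a copy without the target column; toy itself is unchanged.
X_toy = toy.drop(columns="PRICE")
y_toy = toy["PRICE"]

print(X_toy.columns.tolist())
print(y_toy.shape)
```

Here `drop(columns="PRICE")` is equivalent to the `drop('PRICE', axis=1)` form used in the post.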
Splitting the Dataset
To evaluate the performance of our model, we split the dataset into training and testing sets using the `train_test_split` function from `sklearn.model_selection`. We allocate 80% of the data for training and 20% for testing. The split is random, and setting `random_state` makes it reproducible across runs.
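To see what the split does to a dataset of this size, here is a small sketch that splits 506 row indices (a stand-in for the real rows, so no download is needed) twice with the same seed. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 506 row indices standing in for the 506 samples.
rows = np.arange(506)

# Two splits with the same seed select exactly the same rows.
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))       # 404 102
print(np.array_equal(test_a, test_b))  # True
```

Because 20% of 506 is not a whole number, the test set is rounded up to 102 rows, leaving 404 for training.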
Building the Linear Regression Model
We employ linear regression, a widely used regression algorithm, to create our predictive model. We import the `LinearRegression` class from `sklearn.linear_model` and instantiate an object of the class. We then train the model using the training data by calling the `fit` method.
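To make concrete what `fit` learns, the following sketch trains the same class on synthetic, noise-free data generated from a known formula (an assumption made purely for illustration, not the housing data). The fitted `coef_` and `intercept_` attributes should recover the numbers used to generate the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x1 - 2*x2 + 5, with no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature, plus a single intercept term.
print(demo_model.coef_)
print(demo_model.intercept_)
```

On the housing data the same attributes hold one coefficient for each feature column, which is a quick way to inspect how the model weighs each input.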
Making Predictions
Once the model is trained, we make predictions on the test set. We convert the test-set features to a numpy array and pass it to the `predict` method of the linear regression model. The conversion is optional: `predict` also accepts the DataFrame directly. The predicted values are stored in the `y_pred` variable.
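The sketch below, on a tiny made-up dataset, checks that predicting from a DataFrame and from its numpy equivalent gives the same values. When a model was fitted on a DataFrame, recent scikit-learn versions warn about missing feature names if given a bare array, so the example silences warnings around that call. All names and values here are illustrative.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up training set following target = 2*a + b.
train = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0, 0.0]})
target = 2.0 * train["a"] + train["b"]
model = LinearRegression().fit(train, target)

new_rows = pd.DataFrame({"a": [4.0, 5.0], "b": [1.0, 0.0]})

# Predict from the DataFrame directly.
pred_frame = model.predict(new_rows)

# Predict from a bare array; silence the feature-name warning it may trigger.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    pred_array = model.predict(new_rows.to_numpy())

print(np.allclose(pred_frame, pred_array))
```

Either input works as long as the columns are in the same order as during training; the DataFrame route lets scikit-learn verify that for you.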
Evaluating the Model
To assess the performance of our model, we use two commonly used metrics: Mean Squared Error (MSE) and R-squared (R2). The `mean_squared_error` and `r2_score` functions from `sklearn.metrics` are used to calculate these metrics. MSE measures the average squared difference between the predicted and actual values, while R2 represents the proportion of the variance in the target variable that is predictable from the input features.
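Both metrics are short formulas, and computing them by hand can make the definitions easier to follow. This is a minimal sketch with four made-up values, using plain numpy rather than the `sklearn.metrics` functions.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared residuals.
residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual)  # 0.375
print(r2_manual)   # 0.925
```

An R2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and the value can go negative for a model that does worse than that.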
Results and Interpretation
After evaluating our model, we obtain the following results:
- Mean Squared Error (MSE): 24.29
- R-squared (R2): 0.67
The MSE of 24.29 is the average squared difference between the predicted and actual prices on the test set. Because the errors are squared, it is expressed in squared price units, so it is most useful for comparing models: the lower the MSE, the better. The R2 value of 0.67 means that approximately 67% of the variance in test-set house prices is explained by the features in our model. The closer R2 is to 1, the better the model fits the data.
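Taking the square root of the MSE gives the root mean squared error (RMSE), which is in the same units as the target. The sketch below applies this to the reported MSE. It relies on the dataset's target being recorded in thousands of dollars, and the variable names are illustrative.

```python
import math

reported_mse = 24.29  # MSE reported above, in squared thousands of dollars

# RMSE is in the same units as the target (thousands of dollars).
rmse = math.sqrt(reported_mse)

print(f"RMSE: {rmse:.2f} (thousand dollars)")  # RMSE: 4.93 (thousand dollars)
```

Read this way, a typical prediction error is on the order of 4,900 dollars in the dataset's price units.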
Conclusion
In this blog post, we explored how to create a linear regression model to predict home prices using the Boston Housing Dataset. Using the 13 feature variables provided in the dataset, we built a model that estimates house prices with a moderate level of accuracy. The MSE and R2 metrics helped us evaluate the model's performance and understand its predictive power. With further exploration and enhancements, this model can be refined to provide more accurate predictions for housing prices, aiding in real estate decision-making and analysis.