Linear Regression Python Implementation

In my last article, I focused on the theory behind linear regression and how the algorithm works. In this article, I will focus on implementing it in Python, which is an excellent programming language for data science and machine learning. If you have not read my last article, I highly recommend reading it here: https://ahaanpandya.medium.com/linear-regression-explained-868914443188

Note: I will be using Pandas and NumPy, which are Python data science libraries; if you are not comfortable with them, you might want to look up a tutorial on those libraries first. Some knowledge of Matplotlib and scikit-learn will also be useful for visualizing how our model performs and preprocessing our data.

Step 1: Reading the Data

In this example, I will be reading the data from a CSV file, a common and convenient format for storing and reading data.
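
Something like the following handles this step (the file name housing.csv and the column names sqft, bedrooms, and price are stand-ins here; use whatever your dataset actually calls them):

```python
import pandas as pd
import numpy as np

# Read the housing dataset from a CSV file into a Pandas DataFrame
df = pd.read_csv("housing.csv")

# x holds the two features (square footage and number of bedrooms),
# y holds the target (price)
x = df[["sqft", "bedrooms"]]
y = df["price"]

# Convert both to NumPy arrays, which are easier to work with
x = x.to_numpy()
y = y.to_numpy()
```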

In this piece of code, we read data about housing prices from a CSV file into a Pandas DataFrame, and then we separate the dataset into x values, which hold two features (square footage and number of bedrooms), and y values, which hold the price. We then convert both into NumPy arrays because they are much easier to work with.

Step 2: Scaling the Data

Before performing linear regression, you should always scale your x values to the range 0 to 1. Features on very different scales (square footage in the thousands, bedrooms in the single digits) make gradient descent converge slowly, so scaling eases the calculations and helps produce accurate results.
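
A minimal sketch of this step with scikit-learn, continuing from the x array above:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale every feature column into the [0, 1] range
scaler = MinMaxScaler()
x = scaler.fit_transform(x)
```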

Scikit-learn's MinMaxScaler is used to scale our data, so now all the features, including the square footage, lie between 0 and 1. We didn't scale the y values because that doesn't ease the calculations and isn't as beneficial.

Step 3: Declaring Constants

In the last article, I discussed how the step size for gradient descent depends on the learning rate. We will declare the learning rate variable here, along with the number of training examples.
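
A sketch of the declarations (the variable names alpha and m are just my choices):

```python
# Learning rate for gradient descent
alpha = 1

# Number of training examples
m = len(y)
```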

I chose the learning rate to be 1 because I found it to be a good choice for this dataset (a value this large is reasonable here since the features are scaled to [0, 1]).

Step 4: Making the Cost Function

In my previous article, I also talked about the cost (or loss) function, which is used to measure the error in the model's predictions. The preferred cost function for linear regression is mean squared error.
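
Here is one way to write it, following the hypothesis h(x) = theta0 + theta1 * sqft + theta2 * bedrooms:

```python
def cost(x, y, theta):
    # Sum the squared error of every training example
    total = 0
    for i in range(m):
        prediction = theta[0] + theta[1] * x[i][0] + theta[2] * x[i][1]
        total += (prediction - y[i]) ** 2
    # Mean squared error over all m examples
    return total / m
```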

In this code segment, we make the cost function: it loops over all the training examples, sums their squared errors, and calculates the mean. The cost function takes in three parameters: the x values, the actual y values, and the vector theta, which holds the parameters of our hypothesis function. If any of this sounds new to you, you might want to read my previous article again.

Step 5: Making the Derivative Functions

To minimize our cost function, we can use gradient descent, an iterative optimization algorithm that works well for this problem.

The gradient descent algorithm. Image by: https://ahaanpandya.medium.com/linear-regression-explained-868914443188
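
In case the image doesn't render, the rule it shows is the standard gradient descent update, repeated until convergence with every parameter updated simultaneously:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$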

We can make this algorithm simpler if we solve for the partial derivative terms beforehand.

The gradient descent algorithm after evaluating the derivative terms. Image from: https://stackoverflow.com/questions/29583026/implementing-gradient-descent-algorithm-in-matlab
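
Written out, with $x_0^{(i)} = 1$ standing in for the intercept term, the update becomes:

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$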

This new form of gradient descent is much easier to implement in code. Because we have two features, the model has three parameters, so we write one function for each of theta 0, theta 1, and theta 2.
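
A sketch of the three functions (the factor of 2 that appears when differentiating the squared error is conventionally absorbed into the learning rate):

```python
def derivative_theta0(x, y, theta):
    # Partial derivative of the cost with respect to theta 0 (the intercept)
    total = 0
    for i in range(m):
        prediction = theta[0] + theta[1] * x[i][0] + theta[2] * x[i][1]
        total += prediction - y[i]
    return total / m

def derivative_theta1(x, y, theta):
    # Partial derivative with respect to theta 1 (the sqft weight)
    total = 0
    for i in range(m):
        prediction = theta[0] + theta[1] * x[i][0] + theta[2] * x[i][1]
        total += (prediction - y[i]) * x[i][0]
    return total / m

def derivative_theta2(x, y, theta):
    # Partial derivative with respect to theta 2 (the bedrooms weight)
    total = 0
    for i in range(m):
        prediction = theta[0] + theta[1] * x[i][0] + theta[2] * x[i][1]
        total += (prediction - y[i]) * x[i][1]
    return total / m
```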

In the above 3 functions, we converted the partial derivative terms into code which can now be used in gradient descent.

Step 6: Training the Model

Now it's finally time to train our model and perform the linear regression. We will first initialize the parameter vector for our hypothesis function randomly, and then improve the parameters using gradient descent. We can run as many gradient descent iterations as we want, but after a while the decrease in loss becomes very small and we risk overfitting the data. In machine learning, the number of full passes over the training data is called the number of epochs.
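
A sketch of the training loop, using the cost and derivative functions from earlier:

```python
epochs = 150

# Start from random parameters
theta = np.random.rand(3)

loss_history = []
for epoch in range(epochs):
    # Evaluate all partial derivatives at the current parameters first,
    # then update every parameter simultaneously
    d0 = derivative_theta0(x, y, theta)
    d1 = derivative_theta1(x, y, theta)
    d2 = derivative_theta2(x, y, theta)
    theta[0] -= alpha * d0
    theta[1] -= alpha * d1
    theta[2] -= alpha * d2
    # Record the loss so we can plot it later
    loss_history.append(cost(x, y, theta))
```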

This will train the model for 150 epochs, improving our parameters along the way. We can plot the loss to make sure it is going down.
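
For example, with Matplotlib:

```python
import matplotlib.pyplot as plt

plt.plot(range(epochs), loss_history)
plt.xlabel("Epoch")
plt.ylabel("Loss (mean squared error)")
plt.show()
```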

I have already run this code, so you should see a graph roughly like this:

My graph of the loss history over 150 epochs

From this graph, we can see that the loss goes down rapidly over the 150 epochs. Our model can now predict housing prices with an error of about $100k. That is reasonable: we are not using a very big dataset, we are only using 2 of its features rather than the full set, and the houses cost around a million dollars, so some error is expected. Using the full dataset and all of its features would surely reduce the loss even more. We can now save the parameters to a file so that we don't have to retrain the model again and again, because training takes some time.
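
NumPy makes this a one-liner (the file name parameters.npy is just my choice):

```python
# Save the learned parameters; np.load("parameters.npy") reads them back later
np.save("parameters.npy", theta)
```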

Our parameters are now stored in a file, so whenever we need them we just read the file instead of retraining the model.

Step 7: Predicting Using the Model

We can now finally predict prices using our learned parameters!
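
A small helper function along these lines does it, reusing the scaler fitted earlier so new inputs are scaled exactly like the training data:

```python
def predict(sqft, bedrooms):
    # Scale the new house's features with the same scaler used in training
    features = scaler.transform([[sqft, bedrooms]])[0]
    # Evaluate the hypothesis with the learned parameters
    return theta[0] + theta[1] * features[0] + theta[2] * features[1]

# Example: predicted price of a 2000 sqft, 3-bedroom house
print(predict(2000, 3))
```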

We can now use this function to predict the price of any new house from its square footage and number of bedrooms.

Conclusion

Linear regression is a very useful algorithm for predicting continuous values, and this Python implementation shows you exactly how it works. This small example trains quickly on any machine, but Google Colab is a convenient free environment to run it in, and for heavier machine learning workloads its GPUs are well suited to the enormous number of operations involved. Try this out, and try another dataset to practice linear regression.

Thanks for Reading!
