2. Supervised learning: Univariate Linear Regression (Linear Regression with One Variable)

  • 4.6/5
  • 204
  • Mar 23, 2025
Supervised learning is a type of machine learning where the model learns from labeled training data. It maps input variables (X) to an output variable (Y) using a function.

Supervised learning is called 'supervised' because we first provide a dataset with the correct answers. For example, predicting the resale price (output) of a car based on its age (input).

The types of Supervised Learning are Classification (predicting categories) and Regression (predicting continuous values).

A supervised learning model is called a regression model when it predicts numerical outputs, such as resale price in dollars, from a large range of possible values.

It predicts a continuous output (Y) based on one or more input variables (X) by fitting a mathematical function, such as a straight line in Linear Regression or a polynomial curve in Polynomial Regression.

Linear Regression is a supervised learning algorithm used for predicting a continuous output by establishing a relationship between input (x) and output (y).

Read More: 1. Introduction to Machine Learning: Theoretical Foundations

Linear Regression with One Variable (Univariate Linear Regression)

Linear Regression with one variable (also called Univariate Linear Regression) is a machine learning algorithm used to predict a continuous output using a single input feature.

Here is an example of how Linear Regression can be used to predict the resale price of a car based on its age.


The blue dots represent actual data points. The red line is the regression line. The green point marks the predicted resale price for a 50-month-old car, which is $30.51K.

The data table for this graph is given below. The table contains input values, i.e., the age of the car in months, and output values, i.e., the resale price in dollars. This means we have the same number of crosses on the graph as the number of rows in the data table.



The dataset and the data shown above, used to train the model, are called the training set. The notation to denote input in Machine Learning is small x, also called the feature variable, and the notation to denote output is small y, also called the target variable.

Each row in the dataset is a training example, containing one x and one y. The total number of training examples is represented by small m.

For one training example, we represent it using (x, y). So, for the very first training example, we represent it as:

(x, y) = (6, 48.5)

To represent the ith training example in this training set, we use the notation (xi, yi), meaning:

(x1, y1) = (6, 48.5)

In a supervised learning algorithm, we feed the training set (input features and output targets) to the learning algorithm, which then produces a function (f) or hypothesis.

The process can be represented as:

x → f → ŷ (y hat)

Here, ŷ (y hat) is the estimate or prediction of Y, meaning:

input (x) → f (Model) → ŷ (prediction or estimate)

Since ŷ is just an estimate, it may not always be equal to the actual target y. If f is a straight line, the function can be written as:

fw,b(x) = wx + b

Or in a simpler notation:

f(x) = wx + b

So, this model or function represents the linear line drawn in the graph above. More specifically, this model is called Linear Regression.

When there is only one input variable (x), it is called One-Variable Linear Regression or Univariate Linear Regression.

Cost Function

The Cost Function in Linear Regression measures how well the model's predictions match the actual values. It helps us determine how far the predicted values (ŷ) are from the actual values (y) and guides the model in learning the best parameters (w and b).

The equation (model) f(x)=wx+b represents a straight line in a Linear Regression model. It is the function that the model learns to make predictions based on input data.

Here, the model parameters are w and b. Different values of w and b result in different straight lines. Below are three graphs illustrating how changes in w and b affect the line:



- Positive Slope (w > 0, b = 0) → The line moves upward (e.g., f(x) = 2x).
- Negative Slope (w < 0, b = 5) → The line moves downward (e.g., f(x) = -1.5x + 5).
- Zero Slope (w = 0, b = 3) → A horizontal line (e.g., f(x) = 3).

In the equation f(x) = wx + b, both w (slope) and b (intercept) play a crucial role in defining the line that best fits the data. The "w" controls the angle of the line, while "b" shifts the line up or down to better fit the data.

The learning algorithm adjusts w and b to minimize errors and make the best predictions.

How to Choose Values for w and b to Make ŷ (Prediction) Close to y (Target)?

The goal in linear regression is to find the best values of w (slope) and b (intercept) so that the predicted value ŷ (y hat) is as close as possible to the actual target value y.

In other words, the goal of training a machine learning model is to determine the optimal values for w (weight) and b (bias) so that the predicted values are as close as possible to the actual values.

We need to measure how well our model's predictions (y hat) match the actual target values (y). The most commonly used cost function for this is the Squared Error Cost Function, also known as the Mean Squared Error (MSE).

For a single training example (x, y), the cost function is:

J(w, b) = (y - ŷ)² / 2

For m training examples (entire dataset), the cost function is the average of all squared errors:

J(w, b) = (1/2m) * ∑i=1m (yi - ŷi

where:
J(w, b) = cost function (measures the error)
y^i = actual value of the target variable for the i-th example
ŷ^i = predicted value from the model
m = total number of training examples

The best values of w and b are those that minimize J(w, b).

Training Linear Regression (Gradient Descent)

Training a Linear Regression model involves finding the best values for the parameters w (weight) and b (bias) so that the predicted output ŷ is as close as possible to the actual output y. This is done by minimizing a cost function, typically the Mean Squared Error (MSE).

Gradient Descent Algorithm

Gradient Descent is an optimization algorithm used to minimize the cost function by adjusting the parameters w and b. The goal is to find the values of w and b that result in the lowest possible cost.

Gradient Descent Update for 𝑤

The Gradient Descent algorithm follows this iterative update rule:

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏)

where:
𝑤 w is the parameter (or weight).
𝛼 is the learning rate, which controls the step size.
𝐽(𝑤,𝑏) is the cost function, measuring the error.
𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏) is the derivative (gradient) of the cost function with respect to 𝑤.

A derivative in calculus represents the rate of change of a function with respect to its variable. It tells us how a function 𝑓(𝑥) changes as 𝑥 changes.

Role of Learning Rate (α)
In Gradient Descent, α (alpha) is called the learning rate, and it controls the step size in each iteration when updating the model parameters.

Too High (Large α)
The algorithm may overshoot the minimum. It might never converge, leading to oscillations or divergence.

In Gradient Descent, we update the parameter 𝑤 by moving in the direction of the negative gradient to minimize the cost function 𝐽(𝑤).

However, if the learning rate (𝛼) is too large, the steps taken in each iteration are too big, and instead of moving smoothly toward the minimum, the algorithm jumps past it.

When the learning rate is too high, instead of smoothly decreasing, the algorithm keeps bouncing back and forth across the minimum. This is called oscillation.

Example: Imagine rolling a ball down a valley, but instead of smoothly reaching the bottom, it jumps from one side to the other, never settling.

Too Low (Small α)
The algorithm converges very slowly. It might get stuck in a local minimum. If the learning rate (𝛼) is too small, the steps taken are tiny, meaning the algorithm will take a long time to reach the minimum.

In some cases, the cost function is not convex (it has multiple valleys or "local minima"). If the learning rate is too small, the algorithm might get stuck in a local minimum instead of finding the global minimum.

Example: Imagine hiking in a mountainous area. If you take small steps, you might get stuck in a small valley, thinking it's the lowest point, when in reality, there is a deeper valley nearby.

Gradient Descent Update for 𝑏

Just like we compute the gradient with respect to 𝑤, we also compute the gradient with respect to 𝑏:

𝑏 = 𝑏 − α · 𝑑/𝑑𝑏 · 𝐽(𝑤,𝑏)

We keep updating these two parameters:

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏)
𝑏 = 𝑏 − α · 𝑑/𝑑𝑏 · 𝐽(𝑤,𝑏)

to converge to the optimal values of 𝑤 and 𝑏 that minimize the cost function 𝐽(𝑤,𝑏).

Convergence means that after several updates, the changes in 𝑤 and 𝑏 become very small, meaning we've found the best possible values. At convergence, the gradient (slope) of the cost function approaches zero, meaning further updates don't significantly reduce the error.

We must update 𝑤 and 𝑏 simultaneously in each iteration of Gradient Descent to ensure proper convergence.

Correct simultaneous update:

temp_𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏)
temp_𝑏 = 𝑏 − α · 𝑑/𝑑𝑏 · 𝐽(𝑤,𝑏)

𝑤 = temp_𝑤
𝑏 = temp_𝑏

Incorrect update:

temp_𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏)
𝑤 = temp_𝑤

temp_𝑏 = 𝑏 − α · 𝑑/𝑑𝑏 · 𝐽(𝑤,𝑏)
𝑏 = temp_𝑏

Understanding Gradient Descent Algorithm

Gradient Descent is an optimization algorithm used to minimize a function by iteratively updating its parameters. Let's break down the understanding behind its working, focusing on how the derivative term 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏) and the learning rate 𝛼 interact to adjust 𝑤.

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏)

Role of derivative i.e. 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏)

In many machine learning problems, the cost function is written as 𝐽(𝑤,𝑏), where: 𝑤 represents the model's weight(s) and 𝑏 represents the bias (intercept). However, to build an intuitive understanding of gradient descent, we temporarily ignored 𝑏. This allows us to:

- Work with a single-variable function 𝐽(𝑤) instead of a multi-variable function 𝐽(𝑤,𝑏).
- Easily visualize the cost function as a 2D curve (since graphing 𝐽(𝑤,𝑏) requires a 3D plot).

For a function 𝐽(𝑤), the update rule for gradient descent is:

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤)

Here is the 2D graph of the cost function 𝐽(𝑤), where we have temporarily set 𝑏=0.



In gradient descent, we iteratively adjust 𝑤 to minimize 𝐽(𝑤), moving step by step toward the lowest point on the curve.

The horizontal axis (𝑤) represents the parameter we are adjusting in gradient descent. The vertical axis 𝐽(𝑤) represents the cost (or error) associated with each value of 𝑤.

The minimum point (𝑤=2, 𝐽(2)=1) is the optimal value where the cost is lowest. The slope (derivative 𝑑/𝑑𝑤 · 𝐽(𝑤)) tells us in which direction 𝑤 should move to minimize the cost.

Here's a graph illustrating the meaning of 𝑑/𝑑𝑤 · 𝐽(𝑤). The green arrow represents the derivative (gradient) at a specific point, showing how the function's slope determines the direction and magnitude of updates in gradient descent.



If we pick a small segment of the tangent line and form a right triangle, the rise (height change) divided by the run (horizontal change) gives us the slope.

Example: If moving 1 unit to the right causes a 2-unit increase in 𝐽(𝑤), the slope is 2/1 = 2.

If the slope is positive (tangent line points up and to the right), the derivative is greater than zero. If the slope (derivative) is positive, 𝑤 decreases. Using the gradient descent update rule:

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤)

Since 𝑑/𝑑𝑤 · 𝐽(𝑤) is positive, multiplying it by the learning rate (𝛼) still results in a positive number. Subtracting this positive value from 𝑤 causes 𝑤 to decrease. This means the parameter 𝑤 moves left, getting closer to the minimum of 𝐽 (𝑤).

Here's a graph with 𝑤 positioned to the left of the minimum. The green arrow represents the negative gradient (derivative), indicating that the slope is negative at this point.



If the slope is negative (tangent line points down and to the right), the derivative is less than zero. If the slope (derivative) is negative, 𝑤 increases. Using the gradient descent update rule:

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤)

Since 𝑑/𝑑𝑤 · 𝐽(𝑤) is negative, multiplying it by the learning rate (𝛼) still gives a negative number. Subtracting a negative number is the same as adding a positive number. This causes 𝑤 to increase, moving right toward the minimum of 𝐽 (𝑤).

This behavior ensures that gradient descent always moves in the direction that reduces the cost function 𝐽(𝑤)—decreasing 𝑤 when the slope is positive and increasing 𝑤 when the slope is negative.

In conclusion, the derivative 𝑑/𝑑𝑤 · 𝐽(𝑤) represents the slope of the cost function 𝐽 (𝑤) at a given point 𝑤. It tells us whether the function is increasing or decreasing at that point: If the derivative is positive, 𝐽 (𝑤) is increasing, so we need to decrease 𝑤 to move toward the minimum.

If the derivative is negative, 𝐽 (𝑤) is decreasing, so we need to increase 𝑤 to move toward the minimum. This ensures that gradient descent always moves toward the minimum of 𝐽 (𝑤).

Understanding the Impact of Learning Rate (α) in Gradient Descent

Gradient descent is a crucial optimization algorithm used in machine learning to minimize cost functions. The choice of the learning rate, denoted as (α), significantly impacts the efficiency and effectiveness of gradient descent.

Choosing an appropriate (α) ensures that the algorithm converges to a minimum efficiently, whereas a poorly chosen (α) can lead to slow convergence or divergence.

The gradient descent update rule for a single parameter 𝑤 is given by:

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤)

𝛼 is the learning rate. 𝑑/𝑑𝑤 · 𝐽(𝑤) represents the gradient (slope) of the cost function 𝐽 (𝑤).

Case 1: Learning Rate Too Small
If the learning rate is too small (e.g., 𝛼 = 0.000001), the updates to 𝑤 are minimal, leading to slow convergence. The algorithm will eventually reach the minimum, but it requires many iterations.



Case 2: Learning Rate Too Large
If the learning rate is too large (e.g., 𝛼=1), the update steps may overshoot the minimum, causing the algorithm to oscillate or even diverge.



Case 3: Optimal Learning Rate
An appropriately chosen learning rate allows for efficient and smooth convergence to the minimum in a reasonable number of steps.



Effect of Gradient Magnitude on Step Size
As the algorithm nears the minimum, the gradient 𝑑/𝑑𝑤 · 𝐽(𝑤) becomes smaller. Consequently, the step sizes reduce naturally, preventing overshooting even with a fixed 𝛼.

As we get nearer to a local minimum, gradient descent automatically takes smaller steps because the derivative decreases. This ensures that, even with a fixed learning rate 𝛼, the algorithm converges smoothly without overshooting.



What Happens at the Minimum?
If 𝑤 reaches a minimum, the derivative 𝑑/𝑑𝑤 · 𝐽(𝑤) becomes zero. The update rule simplifies to: 𝑤 = 𝑤−𝛼⋅0 = 𝑤 This means that 𝑤 remains unchanged, effectively stopping further updates, which is the desired behavior.

Gradient Descent for Linear Regression

Previously, we explored the linear regression model, the cost function, and the gradient descent algorithm. In this section, we will bring these concepts together and apply the squared error cost function to the linear regression model with gradient descent.

This will allow us to train the model to fit a straight line to the training data. The linear regression model is represented as follows:

fwb(x)= wx + b

To optimize this model, we use the squared error cost function and apply gradient descent, shown below. When we compute the derivatives of the cost function with respect to 𝑤 and 𝑏, we get:

J(w, b) = (1/2m) * ∑i=1m (yi - ŷi

The below gradient descent algorithm updates parameters 𝑤 and 𝑏 based on their derivatives. We repeat these updates until convergence.

Repeat these updates until convergence {

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏)
𝑏 = 𝑏 − α · 𝑑/𝑑𝑏 · 𝐽(𝑤,𝑏)

}

Global vs. Local Minima

One concern with gradient descent is that it can sometimes converge to a local minimum instead of the global minimum. However, when using the squared error cost function for linear regression, the function is convex, meaning it has a single global minimum.

Here is a 3D graph representing a cost function 𝐽(𝑤,𝑏) with multiple local minima. The surface shows how gradient descent might get trapped in a local minimum instead of reaching the global minimum.



Here is the 3D graph of the squared error cost function, which is convex and has a single global minimum.



This bowl-shaped cost function ensures that gradient descent will always reach the global minimum, provided the learning rate is chosen appropriately.

Running Gradient Descent in Linear Regression

Gradient descent is an iterative optimization algorithm used in linear regression to minimize the cost function and find the best-fitting line. Below is a step-by-step explanation:

1) Start with an initial guess for the parameters 𝑤 (weight/slope) and 𝑏 (bias/intercept). A common initialization is 𝑤 = 0 and 𝑏 = 0 (or small random values). The initial model equation is: 𝑓 (𝑥) = 𝑤𝑥 + 𝑏

2) The cost function 𝐽 (𝑤,𝑏) measures how well the model fits the training data. For linear regression, we use the Mean Squared Error (MSE):

J(w, b) = (1/2m) * ∑i=1m (yi - ŷi

A contour plot of the cost function shows a bowl shape. The goal is to reach the global minimum, where the cost is lowest.

3) To find the direction of steepest descent, compute the gradients (partial derivatives of 𝐽 (𝑤,𝑏): 𝑑/𝑑𝑤 and 𝑑/𝑑𝑏. These gradients indicate how much to adjust 𝑤 and 𝑏 to decrease the cost function.

4) Update 𝑤 and 𝑏 using the gradient descent formula:

𝑤 = 𝑤 − α · 𝑑/𝑑𝑤 · 𝐽(𝑤,𝑏)
𝑏 = 𝑏 − α · 𝑑/𝑑𝑏 · 𝐽(𝑤,𝑏)

5) Repeat steps 2-4 for multiple iterations. The cost function keeps decreasing. The values of 𝑤 and 𝑏 get closer to optimal values. Stopping criteria: When the cost function stops decreasing significantly. When changes in 𝑤 and 𝑏 become negligible.

6) The best-fit line is achieved once the process converges.

The graphs illustrate how gradient descent optimizes linear regression:



Left Graph (Linear Regression Progression): The blue dots represent actual data points. The colored lines show the fitted regression line at different iterations. As iterations progress, the line adjusts to minimize the error.

Right Graph (Cost Function Reduction): The red curve represents the cost function decreasing over iterations. As gradient descent progresses, the cost reduces, indicating that the model is improving and approaching the optimal parameters.

7) Make Predictions: Once optimal 𝑤 and 𝑏 are found, use: 𝑓 (𝑥) = 𝑤𝑥 + 𝑏. Example: Predict house price for size 1250 sq. ft: 𝑓 (1250) = (𝑤×1250) + 𝑏. The model can now generalize and predict for new inputs.

The above process is called Batch Gradient Descent, where the gradient is computed using the entire dataset before updating the parameters. It is called "batch" because it processes all training examples at once in each iteration.

There are alternative gradient descent methods:

- Stochastic Gradient Descent (SGD): Updates the parameters using one example per iteration, making it faster but more noisy.
- Mini-Batch Gradient Descent: Uses a small batch of data for each update, balancing speed and accuracy.
Index
1. Introduction to Machine Learning: Theoretical Foundations

18 min

2. Supervised learning: Univariate Linear Regression (Linear Regression with One Variable)

17 min

3. Supervised learning: Multiple features (Linear Regression with Multiple Variable)

13 min