3. Supervised learning: Multiple features (Linear Regression with Multiple Variables)

Multiple Linear Regression is an extension of simple linear regression that allows us to model the relationship between one dependent variable and multiple independent variables (features).

Single-feature regression uses only one factor to predict a value. For example, predicting a car's price based solely on its age:

Car Age (Years)    Predicted Price (in $K)
1                  48.0
5                  40.0
10                 30.0
15                 20.0
20                 10.0
12                 26.81


This table presents the car age and its predicted price. The predicted price for a 12-year-old car based on the simple linear regression model is approximately $26.81K.


As we have seen before in detail, the formula for simple linear regression is: fw,b(x) = wx + b
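As a quick illustration, here is a minimal Python sketch of that single-feature model. The parameter values w and b below are hypothetical placeholders chosen for illustration, not the ones actually fitted to the table above:

# Hypothetical parameters for illustration only (not fitted to the table above)
w = -2.0   # price drops by roughly $2K per year of age
b = 50.0   # baseline price in $K

def predict_price(age_years):
    # Simple linear regression prediction: f_wb(x) = w * x + b
    return w * age_years + b

print(predict_price(12))   # about 26 ($K) with these illustrative parameters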

Multi-feature regression

Multi-feature regression considers multiple factors for better accuracy. For instance, predicting a car's price based on age, mileage, horsepower, and brand score:

Car Age (Years)    Mileage (Miles)    Horsepower (HP)    Brand Score    Predicted Price ($)
1                  10,000             200                9              60,000
2                  20,000             190                8.5            55,000
3                  30,000             180                8.2            52,000
4                  40,000             170                8              50,000
5                  50,000             160                7.5            45,000
We're going to use the variables X1, X2, X3 and X4 to denote the four features. For simplicity, let's introduce a bit more notation. We'll write Xj (read "X sub j") to refer to the j-th feature, where j runs from 1 to n.

Lowercase n represents the total number of features, which is 4 in this case. X^(i) denotes the i-th training example: a list of four values, also known as a vector, containing all the features of that example.

For instance, X^(2) (the second training example) is a vector with four features:

X^(2) = [Car Age, Mileage, Horsepower, Brand Score]

or X^(2) = [2, 20,000, 190, 8.5]

To refer to a specific feature in the i-th training example, the notation Xj^(i) is used. For example, X3^(2) represents the value of the third feature (Horsepower (HP)) in the second training example:

X3^(2) = 190
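To make the indexing concrete, here is a small NumPy sketch using the training table above. Code uses 0-based indexing, so row 1 corresponds to the second example X^(2):

import numpy as np

# Each row is one training example: [age, mileage, horsepower, brand score]
X = np.array([
    [1, 10_000, 200, 9.0],
    [2, 20_000, 190, 8.5],
    [3, 30_000, 180, 8.2],
    [4, 40_000, 170, 8.0],
    [5, 50_000, 160, 7.5],
])

m, n = X.shape      # m = 5 training examples, n = 4 features
x_2 = X[1]          # X^(2), the second training example (row index 1)
x_3_2 = X[1, 2]     # X3^(2), the third feature of the second example -> 190.0

print(m, n, x_2, x_3_2)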

Now that multiple features are included, the model's definition changes. Previously, when using a single feature, 𝑋 was just one number. However, with multiple features, the model is now represented as:

fw,b(x) = w1x1 + w2x2 + w3x3 + w4x4 + b

Here, each Xj represents a different feature, and the corresponding wj is its weight, contributing to the final prediction.

Let's assume the weights and bias are:

w1 = −2000 (each extra year reduces the price by $2,000)
w2 = −0.05 (each extra mile reduces the price by $0.05)
w3 = 150 (each extra HP increases the price by $150)
w4 = 5000 (each brand score point increases the price by $5,000)

b = 30,000 (base price)

Predicting the price of a car with:

Age = 4 years
Mileage = 40,000 miles
Horsepower = 170 HP
Brand Score = 8

fw,b(x) = (−2000 × 4) + (−0.05 × 40,000) + (150 × 170) + (5000 × 8) + 30,000 = 85,500

So, the predicted price of the car is $85,500.
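The same prediction can be computed in a couple of lines of Python. This is just a minimal sketch using the weights and bias assumed above:

import numpy as np

w = np.array([-2000, -0.05, 150, 5000])   # weights for age, mileage, HP, brand score (assumed above)
x = np.array([4, 40_000, 170, 8])         # the car described above
b = 30_000                                # base price

price = np.dot(w, x) + b
print(price)   # 85500.0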

If there are n features, the multiple linear regression formula is:

fw,b(x) = w1x1 + w2x2 + w3x3 + . . .+ wnxn + b

To simplify this, let's introduce vector notation:

W (weight vector) = [w1, w2, w3, ..., wn]

X (feature vector) = [X1, X2, X3, ..., Xn]

b (bias) = a single number

To represent vectors, an arrow can be drawn on top.

Using the dot product from linear algebra, the model is rewritten as:

fw,b(x) = W · X + b

The dot product is calculated as:

W · X = w1x1 + w2x2 + w3x3 + ... + wnxn

Note that the bias b is added separately, outside the dot product.

Vectorization

Vectorization is a crucial concept in machine learning that optimizes computations by leveraging modern numerical linear algebra libraries and parallel processing capabilities. It offers two key benefits:

- Concise code – reduces complex operations to a single line, improving readability.
- Performance boost – uses optimized hardware (CPU/GPU) to accelerate computations.

Given parameters w (weights) and b (bias), and input features x, the prediction is:

fw,b(x) = W · X + b = w1x1 + w2x2 + w3x3 + ... + wnxn + b

1) Non-Vectorized Implementation (Manual Computation)

For n = 3, a naive approach in Python would be:

import numpy as np

w = np.array([0.5, -1.2, 3.0])    # example weights w1, w2, w3
x = np.array([10.0, 20.0, 30.0])  # example features x1, x2, x3
b = 4.0                           # example bias value

f = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b

This is inefficient for large n (e.g., n = 100,000), since every term has to be written out by hand.

2) Non-Vectorized Using a For Loop

A slightly better approach uses a loop:

import numpy as np

w = np.array([0.5, -1.2, 3.0])    # example weights
x = np.array([10.0, 20.0, 30.0])  # example features
b = 4.0                           # example bias value

f = 0
for j in range(len(w)):
    f += w[j] * x[j]
f += b

This is more scalable but still inefficient because of the explicit looping. If j ranges from 0 to 15, the loop runs sequentially: at time t0 it processes index 0, at the next time step index 1, and so on through index 15, computing each operation one after another.

3) Vectorized Implementation (Optimized Using NumPy)

Using NumPy’s dot product, we eliminate the loop:

f = np.dot(w, x) + b

Highly efficient – uses optimized parallel processing behind the scenes. Instead of processing elements sequentially, the computer loads all the values of the vectors w and x and performs the element-wise multiplications in a single step using parallel hardware. It then efficiently sums those 16 values (for the n = 16 example above) with specialized hardware, avoiding sequential addition.

Vectorization is faster because it leverages optimized numerical libraries like NumPy, which is specifically designed for efficient mathematical computations.

Unlike traditional loops, vectorized operations utilize parallelism on modern CPUs and GPUs, allowing multiple computations to be executed simultaneously instead of sequentially.

Additionally, vectorization minimizes function call overhead by replacing explicit loops with optimized low-level implementations, further enhancing performance and efficiency in machine learning and numerical computing.
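To see the difference in practice, a rough timing sketch like the one below can be run. Exact numbers depend on your machine and NumPy build; this is an illustration, not a rigorous benchmark:

import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
w = rng.random(n)
x = rng.random(n)
b = 1.0

# Explicit Python loop
start = time.perf_counter()
f_loop = 0.0
for j in range(n):
    f_loop += w[j] * x[j]
f_loop += b
loop_time = time.perf_counter() - start

# Vectorized dot product
start = time.perf_counter()
f_vec = np.dot(w, x) + b
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.6f}s")
print(np.isclose(f_loop, f_vec))   # same result, very different runtimes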

Implementing Gradient Descent for Multiple Linear Regression

Let's put it all together to implement gradient descent for multiple linear regression using vectorization. Before we dive into the implementation, let’s quickly review the fundamentals of multiple linear regression.

In multiple linear regression, we model the relationship between a dependent variable y and multiple independent variables (features) x1, x2,..., xn. Instead of treating each weight w1, w2, ..., wn as separate parameters, we collect them into a single vector W, making it a vector of length n.

Additionally, we retain the bias term b as a separate scalar.

Using vector notation, we can express the prediction function as:

fw,b(x) = W · X + b

Here, the dot product operation represents the summation of the element-wise product of the weight vector and the feature vector.

Cost Function

The cost function measures the difference between predicted and actual values. For multiple linear regression it can be written as

J(w1, ..., wn, b)

or, more compactly,

J(W, b), where W is the weight vector (of length n).

With m training examples, it is the same squared-error cost as before:

J(W, b) = (1 / 2m) · Σ (fW,b(X^(i)) − y^(i))², summed over i = 1, ..., m
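A minimal vectorized sketch of this cost in Python (the function name compute_cost and the argument layout are my own choices, not from any particular library):

import numpy as np

def compute_cost(X, y, w, b):
    # Squared-error cost for multiple linear regression.
    # X: (m, n) feature matrix, y: (m,) targets, w: (n,) weights, b: scalar bias.
    m = X.shape[0]
    predictions = X @ w + b        # f_wb(X^(i)) for every example at once
    errors = predictions - y
    return np.sum(errors ** 2) / (2 * m)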

Gradient Descent for Multiple Features

Gradient descent minimizes the cost function by iteratively updating the parameters. For each weight wj, the update rule is:

repeat {
    wj = wj − α · ∂J(w1, ..., wn, b)/∂wj
    b = b − α · ∂J(w1, ..., wn, b)/∂b
}

or, in vector notation,

repeat {
    wj = wj − α · ∂J(W, b)/∂wj
    b = b − α · ∂J(W, b)/∂b
}

or, written out for every weight,

repeat {
    w1 = w1 − α · ∂J(W, b)/∂w1
    w2 = w2 − α · ∂J(W, b)/∂w2
    ...
    wn = wn − α · ∂J(W, b)/∂wn

    b = b − α · ∂J(W, b)/∂b
}

We must update wj (for j = 1, ..., n) and b simultaneously in each iteration of gradient descent to ensure proper convergence.
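Putting the pieces together, here is a compact vectorized sketch of the full algorithm. The function and variable names are my own choices, and alpha (the learning rate) and num_iters are hyperparameters you would tune:

import numpy as np

def gradient_descent(X, y, w, b, alpha=0.01, num_iters=1000):
    # Vectorized gradient descent for multiple linear regression.
    # X: (m, n) feature matrix, y: (m,) targets, w: (n,) initial weights, b: initial bias.
    m = X.shape[0]
    for _ in range(num_iters):
        errors = X @ w + b - y        # f_wb(X^(i)) - y^(i) for all i at once
        dj_dw = (X.T @ errors) / m    # partial derivatives with respect to each w_j
        dj_db = np.sum(errors) / m    # partial derivative with respect to b
        w = w - alpha * dj_dw         # simultaneous update of all weights...
        b = b - alpha * dj_db         # ...and the bias
    return w, b

With sensibly scaled features (see the next section) and a reasonable learning rate, w and b converge toward values that minimize J(W, b).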

Feature Scaling

When training a machine learning model using gradient descent, one crucial preprocessing step is feature scaling. Without proper scaling, gradient descent can take longer to converge or even fail to reach the optimal solution. This section explains why feature scaling is necessary using a car price prediction example.

Imagine we are building a model to predict the price of a car based on two features:

- Engine size (in cubic centimeters, cc): Typically ranges from 800 cc to 5000 cc.
- Number of doors: Usually ranges from 2 to 5.

The target variable is the car price in thousands of dollars.

Consider a training example of a car with an engine size of 4000 cc, four doors, and a price of $30,000. Suppose our model follows the equation:

ŷ = w1x1 + w2x2 + b

where:
x1 is the engine size (cc),
x2 is the number of doors,
w1 and w2 are the weights (parameters),
b is the bias term.

Let's examine two different sets of parameter values:

a) Large w1, small w2:

w1 = 50, w2 = 0.1, b = 50

ŷ = (50 × 4000) + (0.1 × 4) + 50
ŷ = 200,000 + 0.4 + 50 = 200,050.4

So the predicted price is 200,050.4 thousand dollars (roughly $200 million), wildly far from the actual $30,000.

b) Small w1, large w2:

w1 = 0.005, w2 = 2, b = 2

ŷ = (0.005 × 4000) + (2 × 4) + 2
ŷ = 20 + 8 + 2 = 30

So the predicted price is $30,000, which matches the actual price.

This example illustrates that when a feature has a large range of values (e.g., engine size), a good model tends to assign it a relatively small weight. Conversely, when a feature has a small range (e.g., number of doors), its weight tends to be relatively large.

When features have vastly different ranges, the contour plot of the cost function becomes elongated, making gradient descent inefficient: the updates to w1 and w2 proceed at very different rates, so the algorithm tends to bounce back and forth and takes many iterations to reach the minimum.

By scaling the features so that both take on similar ranges (e.g., 0 to 1), the cost-function contours become more circular, allowing gradient descent to converge much faster.



The left graph shows the cost function before feature scaling, where the contours are elongated due to the large difference in feature ranges (engine size vs. number of doors). This causes inefficient gradient descent.

The right graph shows the cost function after feature scaling, where the contours become more circular, allowing gradient descent to converge efficiently.

Types of Feature Scaling

To address this issue, we can apply different scaling methods:

1) Scaling by the Maximum

Formula: x' = x / max(x)

Example:
Engine size (800 - 5000 cc) → scaled to roughly 0.16 - 1 by dividing each value by 5000.
Number of doors (2 - 5) → scaled to roughly 0.4 - 1 by dividing each value by 5.

2) Mean Normalization

Formula: x' = (x − μ) / (max(x) − min(x))

Here μ (mu) is the mean (average) of the feature values in the dataset, μ = (1/m) Σ x^(i). Scaled values typically fall roughly between −1 and 1.

Example:
If the mean engine size is 2500 cc, then x'1 = (x − 2500) / (5000 − 800).
If the mean number of doors is 3, then x'2 = (x − 3) / (5 − 2).

3) Z-Score Normalization (Standardization)

Formula: x' = (x − μ) / σ

Here σ (sigma) is the standard deviation, which measures how much the values of a feature vary around the mean.

Example:
If the engine size has a mean of 2500 cc and a standard deviation of 900, then x'1 = (x − 2500) / 900.
If the number of doors has a mean of 3 and a standard deviation of 1, then x'2 = (x − 3) / 1.
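As a quick sketch, the three rescalings above can be applied to a feature column with a few lines of NumPy. The engine-size values below are made-up examples:

import numpy as np

engine_cc = np.array([800.0, 1600.0, 2500.0, 4000.0, 5000.0])   # hypothetical engine sizes

scaled_by_max = engine_cc / engine_cc.max()                                        # divide by the maximum
mean_norm = (engine_cc - engine_cc.mean()) / (engine_cc.max() - engine_cc.min())   # mean normalization
z_score = (engine_cc - engine_cc.mean()) / engine_cc.std()                         # z-score normalization

print(scaled_by_max)
print(mean_norm)
print(z_score)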

Polynomial Regression

Linear regression models relationships using a straight line. However, many real-world datasets exhibit non-linear patterns that cannot be captured effectively by a straight line. Polynomial regression enhances linear regression by introducing polynomial features, allowing the model to fit curves instead of just lines.


In polynomial regression, we transform the original feature x into higher-order terms, such as:

Quadratic model:
ŷ = w1x + w2x^2 + b

Cubic model:
ŷ = w1x + w2x^2 + w3x^3 + b

These additional terms allow the model to capture complex relationships in the data.

Beyond polynomial terms, other transformations like √x can be used, depending on the data. Selecting appropriate features requires experimentation and evaluation using techniques like cross-validation.
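To illustrate, here is a minimal sketch that builds polynomial features by hand and fits them on synthetic data. np.linalg.lstsq is used here purely for brevity; running gradient descent on the expanded features would work the same way:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2 + 0.5 * x + 1.3 * x**2 + rng.normal(0, 1, size=x.shape)   # synthetic curved data

# Expand the single feature x into polynomial features [1, x, x^2]
X_poly = np.column_stack([np.ones_like(x), x, x**2])

# Solve for [b, w1, w2] with ordinary least squares
params, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
b, w1, w2 = params

y_hat = b + w1 * x + w2 * x**2   # quadratic model predictions
print(b, w1, w2)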