3. Supervised learning: Multiple features (Linear Regression with Multiple Variables)
- Mar 29, 2025
Multiple Linear Regression is an extension of simple linear regression that allows us to model the relationship between one dependent variable and multiple independent variables (features).
Single-feature regression uses only one factor to predict a value. For example, predicting a car's price based solely on its age:
| Car Age (Years) | Predicted Price (in $K) |
|---|---|
| 1 | 48.0 |
| 5 | 40.0 |
| 10 | 30.0 |
| 15 | 20.0 |
| 20 | 10.0 |
| 12 | 26.81 |
This table presents the car age and its predicted price. The predicted price for a 12-year-old car based on the simple linear regression model is approximately $26.81K.
As we have seen before in detail, the formula for simple linear regression is: fw,b(x) = wx + b
Multi-feature regression
Multi-feature regression considers multiple factors for better accuracy. For instance, predicting a car's price based on age, mileage, horsepower, and brand score:

| Car Age (Years) | Mileage (Miles) | Horsepower (HP) | Brand Score | Predicted Price ($) |
|---|---|---|---|---|
| 1 | 10,000 | 200 | 9 | 60,000 |
| 2 | 20,000 | 190 | 8.5 | 55,000 |
| 3 | 30,000 | 180 | 8.2 | 52,000 |
| 4 | 40,000 | 170 | 8 | 50,000 |
| 5 | 50,000 | 160 | 7.5 | 45,000 |
Lowercase n represents the total number of features, which is 4 in this case. x^(i) denotes the i-th training example: a set of four values, also known as a vector, containing all the features of that example.
For instance, x^(2) (the second training example) is a vector with four features:
x^(2) = [Car Age, Mileage, Horsepower, Brand Score]
or x^(2) = [2, 20,000, 190, 8.5]
To refer to a specific feature in the i-th training example, the notation x_j^(i) is used. For example, x_3^(2) represents the value of the third feature (Horsepower (HP)) in the second training example:
x_3^(2) = 190
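To make this notation concrete, here is a minimal NumPy sketch (the array name X_train is illustrative) that stores the five training examples from the table above and retrieves x^(2) and x_3^(2). Note that NumPy indexing starts at 0, while the notation above counts from 1.

```python
import numpy as np

# Each row is one training example: [car age, mileage, horsepower, brand score]
X_train = np.array([
    [1, 10_000, 200, 9.0],
    [2, 20_000, 190, 8.5],
    [3, 30_000, 180, 8.2],
    [4, 40_000, 170, 8.0],
    [5, 50_000, 160, 7.5],
])

x_2 = X_train[1]        # the second training example, x^(2): [2, 20000, 190, 8.5]
x_3_2 = X_train[1, 2]   # the third feature of x^(2): 190 (horsepower)
print(x_2, x_3_2)
```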
Now that multiple features are included, the model's definition changes. Previously, with a single feature, the input x was just one number. With multiple features, the model is now represented as:
fw,b(x) = w1x1 + w2x2 + w3x3 + w4x4 + b
Here, each xj represents a different feature, and the corresponding wj is its weight, contributing to the final prediction.
Let's assume the weights and bias are:
w1 = -2000 (each extra year reduces the price by $2,000)
w2 = -0.05 (each extra mile reduces the price by $0.05)
w3 = 150 (each extra HP increases the price by $150)
w4 = 5000 (each brand score point increases the price by $5,000)
b = 30,000 (base price)
Predicting the price of a car with:
Age = 4 years
Mileage = 40,000 miles
Horsepower = 170 HP
Brand Score = 8
fw,b(x) = (-2000 × 4) + (-0.05 × 40,000) + (150 × 170) + (5000 × 8) + 30,000 = 85,500
So, the predicted price of the car is $85,500.
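As a quick sanity check, the same prediction can be written with NumPy; this is only a sketch using the weights and bias assumed above:

```python
import numpy as np

w = np.array([-2000, -0.05, 150, 5000])   # weights assumed above
b = 30_000                                # base price
x = np.array([4, 40_000, 170, 8])         # age, mileage, horsepower, brand score

price = np.dot(w, x) + b
print(price)   # 85500.0
```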
If there are n features, the multiple linear regression formula is:
fw,b(x) = w1x1 + w2x2 + w3x3 + . . .+ wnxn + b
To simplify this, let's introduce vector notation:
W (weight vector) = [w1, w2, w3, ..., wn]
X (feature vector) = [x1, x2, x3, ..., xn]
b (bias) = a single number
To represent vectors, an arrow can be drawn on top.
Using the dot product from linear algebra, the model is rewritten as:
fw,b(x) = W · X + b
The dot product itself is calculated as:
W · X = w1x1 + w2x2 + w3x3 + ... + wnxn
Vectorization
Vectorization is a crucial concept in machine learning that optimizes computations by leveraging modern numerical linear algebra libraries and parallel processing capabilities. It offers two key benefits:
- Concise code: complex operations reduce to a single line, improving readability.
- Performance boost: optimized hardware (CPU/GPU) accelerates the computations.
Given parameters w (weights) and b (bias), and input features x, the prediction formula is:
fw,b(x) = W · X + b = w1x1 + w2x2 + w3x3 + ... + wnxn + b
1) Non-Vectorized Implementation (Manual Computation)
For n = 3, a naive approach in Python would be:

```python
import numpy as np

# w1, w2, w3, x1, x2, x3 and b_value are assumed to be defined elsewhere
w = np.array([w1, w2, w3])
x = np.array([x1, x2, x3])
b = b_value

# Write out every term by hand
f = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b
```

This is inefficient for large n (e.g., n = 100,000).
2) Non-Vectorized Using a For Loop
A slightly better approach uses a loop:

```python
import numpy as np

w = np.array([w1, w2, w3])
x = np.array([x1, x2, x3])
b = b_value

# Accumulate the sum term by term
f = 0
for j in range(len(w)):
    f += w[j] * x[j]
f += b
```

More scalable, but still inefficient because of the explicit loop. If j ranges from 0 to 15, this loop performs its operations sequentially: at time t0 it processes the values at index 0, at the next time step it processes index 1, and so on through index 15, computing each operation one after another.
3) Vectorized Implementation (Optimized Using NumPy)
Using NumPy's dot product, we eliminate the loop:

```python
f = np.dot(w, x) + b
```
Highly efficient: optimized parallel processing is used behind the scenes. Instead of processing elements sequentially, the computer loads all values of the vectors w and x and performs the element-wise multiplications in a single step using parallel processing. It then efficiently sums these 16 values using specialized hardware, avoiding the need for sequential addition.
Vectorization is faster because it leverages optimized numerical libraries like NumPy, which is specifically designed for efficient mathematical computations.
Unlike traditional loops, vectorized operations utilize parallelism on modern CPUs and GPUs, allowing multiple computations to be executed simultaneously instead of sequentially.
Additionally, vectorization minimizes function call overhead by replacing explicit loops with optimized low-level implementations, further enhancing performance and efficiency in machine learning and numerical computing.
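As a rough illustration of the difference, the sketch below times the explicit loop against np.dot on a large vector; the exact numbers will vary by machine, but the vectorized version is typically orders of magnitude faster.

```python
import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
w = rng.random(n)
x = rng.random(n)
b = 1.0

# Explicit loop: one multiply-add per time step
start = time.time()
f_loop = 0.0
for j in range(n):
    f_loop += w[j] * x[j]
f_loop += b
loop_seconds = time.time() - start

# Vectorized: a single optimized dot product
start = time.time()
f_vec = np.dot(w, x) + b
vec_seconds = time.time() - start

print(f"loop: {loop_seconds:.4f}s  vectorized: {vec_seconds:.6f}s")
print(np.isclose(f_loop, f_vec))   # both give the same result
```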
Implementing Gradient Descent for Multiple Linear Regression
Let's put it all together and implement gradient descent for multiple linear regression using vectorization. Before diving into the implementation, let's quickly review the fundamentals.
In multiple linear regression, we model the relationship between a dependent variable y and multiple independent variables (features) x1, x2, ..., xn. Instead of treating each weight w1, w2, ..., wn as a separate parameter, we collect them into a single vector W of length n.
Additionally, we retain the bias term b as a separate scalar.
Using vector notation, we can express the prediction function as:
fw,b(x) = W · X + b
Here, the dot product operation represents the summation of the element-wise product of the weight vector and the feature vector.
Cost Function
The cost function measures the difference between predicted and actual values. For multiple linear regression, the cost function is written as:
J(w1, ..., wn, b)
or
J(W,b), where 'W' is the weight vector (size n).
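Assuming the usual squared-error cost from single-feature linear regression, a minimal vectorized sketch of J(W, b) could look like this (X and y are hypothetical training arrays):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(W, b): X is (m, n), y is (m,), w is (n,), b is a scalar."""
    m = X.shape[0]
    predictions = X @ w + b        # f_wb(x^(i)) for all m examples at once
    errors = predictions - y
    return np.sum(errors ** 2) / (2 * m)
```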
Gradient Descent for Multiple Features
Gradient descent minimizes the cost function by iteratively updating the parameters. For each weight wj and the bias b, the update rule is:
repeat {
wj = wj - α · ∂/∂wj J(w1, ..., wn, b)
b = b - α · ∂/∂b J(w1, ..., wn, b)
}
or
repeat {
wj = wj - α · ∂/∂wj J(W, b)
b = b - α · ∂/∂b J(W, b)
}
or
repeat {
w1 = w1 - α · ∂/∂w1 J(W, b)
w2 = w2 - α · ∂/∂w2 J(W, b)
...
wn = wn - α · ∂/∂wn J(W, b)
b = b - α · ∂/∂b J(W, b)
}
We must update wj (for j = 1, ..., n) and b simultaneously in each iteration of gradient descent to ensure proper convergence.
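Putting these pieces together, here is a minimal vectorized sketch of gradient descent for multiple linear regression (function and variable names are illustrative, and the squared-error cost from the sketch above is assumed). Because the gradients are computed before either parameter changes, the update of W and b is effectively simultaneous.

```python
import numpy as np

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Vectorized gradient descent for multiple linear regression.

    X: (m, n) feature matrix, y: (m,) targets,
    w: (n,) initial weights, b: initial bias,
    alpha: learning rate, num_iters: number of iterations.
    """
    m = X.shape[0]
    for _ in range(num_iters):
        errors = X @ w + b - y         # prediction errors for all m examples
        dj_dw = (X.T @ errors) / m     # partial derivatives with respect to each w_j
        dj_db = np.sum(errors) / m     # partial derivative with respect to b
        w = w - alpha * dj_dw          # simultaneous update: both gradients were
        b = b - alpha * dj_db          # computed before w or b changed
    return w, b
```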
Feature Scaling
When training a machine learning model using gradient descent, one crucial preprocessing step is feature scaling. Without proper scaling, gradient descent can take longer to converge or even fail to reach the optimal solution. This section explains why feature scaling is necessary using a car price prediction example.
Imagine we are building a model to predict the price of a car based on two features:
- Engine size (in cubic centimeters, cc): Typically ranges from 800 cc to 5000 cc.
- Number of doors: Usually ranges from 2 to 5.
The target variable is the car price in dollars.
Consider a training example of a car with an engine size of 4000 cc, four doors, and a price of $30,000. Suppose our model follows the equation:
ŷ = w1x1 + w2x2 + b
where:
x1 is the engine size (cc),
x2 is the number of doors,
w1 and w2 are the parameters,
b is the bias term.
Let's examine two different sets of weight values:
a) Large w1, small w2:
w1 = 500, w2 = 10, b = 50
ŷ = (500 × 4000) + (10 × 4) + 50
ŷ = 2,000,000 + 40 + 50 = 2,000,090
So, the predicted price is $2,000,090, far above the actual price of $30,000.
b) Small w1, large w2:
w1 = 10, w2 = 500, b = 50
ŷ = (10 × 4000) + (500 × 4) + 50
ŷ = 40,000 + 2,000 + 50 = 42,050
So, the predicted price is $42,050, much closer to the actual price of $30,000.
This example illustrates that when a feature has a large range (e.g., engine size), the corresponding weight needs to be smaller. Conversely, when a feature has a small range (e.g., number of doors), the weight needs to be larger.
When features have vastly different ranges, the cost function contour plot becomes elongated, making gradient descent inefficient: the updates to w1 and w2 can bounce back and forth across the narrow contours, taking many steps to reach the minimum.
By scaling the features so that both take on similar ranges (e.g., 0 to 1), the cost function contours become more circular, allowing gradient descent to converge much faster.
The left graph shows the cost function before feature scaling, where the contours are elongated due to the large difference in feature ranges (engine size vs. number of doors). This causes inefficient gradient descent.
The right graph shows the cost function after feature scaling, where the contours become more circular, allowing gradient descent to converge efficiently.
Types of Feature Scaling
To address this issue, we can apply different scaling methods:
1) Min-Max Scaling (Normalization)
Formula: x' = x / max(x)
Example:
Engine size (800 - 5000 cc) → scaled to roughly 0.16 to 1 by dividing each value by 5000 (the maximum).
Number of doors (2 - 5) → scaled to 0.4 to 1 by dividing each value by 5 (the maximum).
2) Mean Normalization
Formula: x' = (x - μ) / (max(x) - min(x)), where μ (mu) is the mean (average) of the feature values in the dataset, i.e. the sum of all values divided by the number of examples.
Example:
If the mean engine size is 2500 cc, then x'1 = (x - 2500) / (5000-800).
If the mean number of doors is 3, then x'2 = (x - 3) / (5 - 2).
3) Z-Score Normalization (Standardization)
Formula: x' = (x - μ) / σ, where σ (sigma) is the standard deviation, which measures how much the values of a feature vary around the mean.
Example:
If the engine size has a mean of 2500 cc and a standard deviation of 900, then x'1 = (x - 2500) / 900.
If the number of doors has a mean of 3 and a standard deviation of 1, then x'2 = (x - 3) / 1.
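The three scaling methods can be sketched in a few lines of NumPy; the engine-size values below are illustrative:

```python
import numpy as np

engine_size = np.array([800, 1600, 2500, 4000, 5000], dtype=float)  # illustrative values

# 1) Scaling by the maximum, as in the min-max example above
scaled_by_max = engine_size / engine_size.max()

# 2) Mean normalization
mu = engine_size.mean()
scaled_mean = (engine_size - mu) / (engine_size.max() - engine_size.min())

# 3) Z-score normalization (standardization)
sigma = engine_size.std()
scaled_z = (engine_size - mu) / sigma

print(scaled_by_max, scaled_mean, scaled_z, sep="\n")
```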
Polynomial Regression
Linear regression models relationships using a straight line. However, many real-world datasets exhibit non-linear patterns that cannot be captured effectively by a straight line. Polynomial regression enhances linear regression by introducing polynomial features, allowing the model to fit curves instead of just lines.
In polynomial regression, we transform the original feature x into higher-order terms, such as:
Quadratic model:
ŷ = w1x + w2x² + b
Cubic model:
ŷ = w1x + w2x² + w3x³ + b
These additional terms allow the model to capture complex relationships in the data.
Beyond polynomial terms, other transformations such as √x can be used, depending on the data. Selecting appropriate features requires experimentation and evaluation using techniques like cross-validation.
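As a sketch of how polynomial features might be created and fit in practice (the data here is synthetic, and np.linalg.lstsq simply solves the same least-squares problem that gradient descent would):

```python
import numpy as np

# Synthetic non-linear data that roughly follows a quadratic curve
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + 0.8 * x**2 + rng.normal(0, 1, size=x.shape)

# Expand the single feature x into polynomial features [x, x^2], plus a column of 1s for b
A = np.column_stack([x, x**2, np.ones_like(x)])

# Ordinary least squares on the expanded features (still a linear model in the new features)
(w1, w2, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w1, w2, b)                  # should land close to 1.5, 0.8, 2.0

y_hat = w1 * x + w2 * x**2 + b    # predictions from the quadratic model
```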