4. Supervised Learning: Understanding Classification with Logistic Regression
- Apr 06, 2025
In previous articles, we explored linear regression—a technique used to predict continuous numerical values. However, not all machine learning problems involve predicting numbers.
In many real-world scenarios, we aim to predict categories instead. For instance, determining whether a customer will churn or stay, if a loan application should be approved or rejected, whether a product review is positive or negative, or if an online transaction is fraudulent.
In this article, we shift our focus to classification, where the output variable 𝑦 can take on only one of a few discrete values, rather than any number on a continuous scale.
What is Classification?
Classification is a type of supervised learning where the goal is to predict a label or category for a given input, based on labeled training data. In a classification task, the algorithm learns from a training dataset in which each example includes:
- Input features (X) – for example, age, transaction amount, email content, etc.
- Output label (Y) – a category such as spam vs. not spam, fraudulent vs. legitimate, or cat vs. dog.
Once trained, the model uses what it has learned to assign labels to new, unseen inputs.
Many classification tasks fall under binary classification, where the outcome is one of two possible categories. These categories can be expressed in different forms—such as yes/no, true/false, or more commonly, 1/0.
Using 1 and 0 is standard in computer science and machine learning, as it aligns well with the binary logic used in algorithms.
We also use the terms positive class and negative class:
- The positive class (1) typically indicates the presence of a condition (e.g., spam, fraud, approved).
- The negative class (0) indicates the absence of that condition.
It's important to note that positive and negative don't mean good or bad—they simply refer to whether the condition we’re predicting is present or not.
Also, the choice of which category is labeled as 1 and which as 0 is arbitrary. Engineers and data scientists can decide based on the problem at hand or based on what makes evaluation easier.
Binary classification involves predicting one of two possible categories. Examples include yes/no decisions, spam vs. not spam, or outcomes represented as 0/1.
Multi-class classification deals with predicting one of more than two categories. For instance, identifying a fruit as an apple, banana, or mango falls under this type.
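To make the 0/1 convention concrete, here is a minimal Python sketch (with made-up label names, purely for illustration) of how binary and multi-class categories are typically encoded as integers before training:

```python
# Minimal sketch: encoding categorical labels as integers for classification.
# The label names below are hypothetical examples, not from a real dataset.

# Binary classification: map the "positive" condition to 1, its absence to 0.
emails = ["spam", "not spam", "spam", "not spam"]
binary_labels = [1 if e == "spam" else 0 for e in emails]
print(binary_labels)  # [1, 0, 1, 0]

# Multi-class classification: each category gets its own integer label.
fruits = ["apple", "banana", "mango", "banana"]
fruit_to_id = {"apple": 0, "banana": 1, "mango": 2}
multiclass_labels = [fruit_to_id[f] for f in fruits]
print(multiclass_labels)  # [0, 1, 2, 1]
```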
Why Not Use Linear Regression for Classification?
Let's consider an example where we're trying to classify credit card transactions as fraudulent (1) or legitimate (0) based on the transaction amount. If we apply a linear regression model to this problem, it will produce a straight line that predicts a continuous output—potentially ranging from negative to positive infinity—not just 0 or 1. To make it work for classification, you might try setting a threshold, such as 0.5:
- If the model predicts a value < 0.5, classify the transaction as legitimate (0)
fw,b(x) < 0.5 → ŷ = 0
- If the model predicts a value ≥ 0.5, classify it as fraudulent (1)
fw,b(x) ≥ 0.5 → ŷ = 1
This might work initially—until you add just one new data point, like a very large transaction that's actually legitimate. That outlier can significantly alter the regression line, shifting the threshold and resulting in poor predictions for the rest of the data.
This happens because linear regression is highly sensitive to outliers and isn't designed for classification tasks. It doesn't handle the concept of bounded output or decision boundaries well.
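The sketch below (Python with NumPy, using made-up transaction amounts) illustrates this sensitivity: we fit a least-squares line to 0/1 labels, then add one large but legitimate transaction and watch the point where the line crosses 0.5 move.

```python
import numpy as np

# Toy data (made up for illustration): transaction amounts and labels
# (1 = fraudulent, 0 = legitimate).
amounts = np.array([100, 200, 300, 2500, 3000, 3500], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1], dtype=float)

def fit_line(x, y):
    """Ordinary least-squares fit of y = w*x + b."""
    w, b = np.polyfit(x, y, deg=1)
    return w, b

w, b = fit_line(amounts, labels)
print(f"Without outlier, line crosses 0.5 at ~${(0.5 - b) / w:,.0f}")

# Add one very large but legitimate transaction (an outlier).
amounts2 = np.append(amounts, 10000.0)
labels2 = np.append(labels, 0.0)
w2, b2 = fit_line(amounts2, labels2)
print(f"With outlier,    line crosses 0.5 at ~${(0.5 - b2) / w2:,.0f}")
# The crossing point shifts far to the right, so the genuinely fraudulent
# transactions around $2,500-$3,500 now fall below the threshold and are
# misclassified as legitimate.
```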
What's the Solution? Logistic Regression
Despite its name, logistic regression is not used for regression tasks. It's a classification algorithm designed specifically for binary outcomes. With logistic regression:
- The model output is always between 0 and 1
- This output is interpreted as the probability that the input belongs to the positive class
- It avoids the problem of a shifting decision boundary caused by outliers
Logistic Regression for Classification
Logistic regression is one of the most widely used algorithms for classification. In the previous section, we saw why linear regression is not a good fit for classification problems—like predicting whether a transaction is fraudulent (1) or not fraudulent (0) based on a feature such as transaction amount.
In a classification problem like this, the x-axis represents the transaction amount, while the y-axis can only take on two values: 0 or 1.
Instead of fitting a straight line (as in linear regression), logistic regression fits an S-shaped curve to the data—called the sigmoid or logistic function.
In this example, the decision boundary falls at a transaction amount of $2,500 (the red dashed line in the plot).
The Sigmoid Function
The sigmoid function is defined as:

g(z) = 1 / (1 + e^(-z))
Where:
- 0 < g(z) < 1
- z is the input to the function, typically computed as z = w ⋅ x + b
- e is the mathematical constant (~2.718)
- w and b are the model parameters (weights and bias)
Below is the graph of the sigmoid function:
Interpretation of the sigmoid function output:
When 𝑧 is very large → 𝑔(𝑧) ≈ 1
- Example: g(100) = 1 / (1 + e^(-100)) ≈ 1 / (1 + a tiny number) ≈ 1 / 1 ≈ 1
When 𝑧 is very small → 𝑔(𝑧) ≈ 0
- Example: g(-100) = 1 / (1 + e^(100)) ≈ 1 / (1 + a huge number) ≈ 1 / (a huge number) ≈ 0
When 𝑧 = 0 → 𝑔(𝑧) = 0.5
- Example: g(0) = 1 / (1 + e^(0)) = 1 / (1 + 1) = 1/2 = 0.5
Notice that this graph has both positive and negative values on the x-axis (representing 𝑧), unlike the transaction amount, which is always positive. That's because 𝑧 is a computed value, not a raw input.
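Here is a minimal sketch of the sigmoid in Python (NumPy assumed), verifying the three cases above numerically:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(100))   # ~1.0  (very large z)
print(sigmoid(-100))  # ~0.0  (very negative z)
print(sigmoid(0))     # 0.5   (z = 0)
```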
Building the Logistic Regression Model
The logistic regression model makes predictions through a two-step process:

1) Compute a linear function (just like in linear regression):
fw,b(x) = 𝑤 ⋅ 𝑥 + 𝑏
Let's store this value in a variable called z:
z = 𝑤 ⋅ 𝑥 + 𝑏
2) Apply the sigmoid function to the result:
Now, let's take this value of z and pass it to the sigmoid function g(z) = 1 / (1 + e^(-z)). This gives us:
fw,b(x) = g(w ⋅ x + b) = 1 / (1 + e^(-(w ⋅ x + b)))
Thus, the logistic regression model becomes:
fw,b(x) = ŷ = 1 / (1 + e^(-(w ⋅ x + b)))
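As a sketch, the two-step computation can be written directly in Python (single-feature case; for several features, z would be np.dot(w, x) + b):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Two-step logistic regression prediction for a single feature x."""
    z = w * x + b      # step 1: the same linear function as linear regression
    return sigmoid(z)  # step 2: map z to a probability in (0, 1)
```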
How It Works in Fraud Detection
Suppose we input a transaction amount of $3,000 into the model. First, the model computes:
𝑧 = 𝑤 ⋅ 3000 + 𝑏
Then, it applies the sigmoid function:
Output = 𝑔 (𝑧) = some value between 0 and 1
Let's say the result is 0.7. We interpret this as: "There’s a 70% chance that this transaction is fraudulent."
If we set a classification threshold of 0.5:
- If output ≥ 0.5 → predict fraud (1)
- If output < 0.5 → predict not fraud (0)
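Putting the two steps together for this example: the parameters below are hypothetical values chosen only for illustration (they place the decision boundary near $2,500 and give roughly a 0.7 probability for a $3,000 transaction), not weights learned from real data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, picked by hand for illustration only.
w, b = 0.0017, -4.25

amount = 3000.0
z = w * amount + b                           # step 1: z = w * x + b
prob_fraud = sigmoid(z)                      # step 2: probability of the positive class
prediction = 1 if prob_fraud >= 0.5 else 0   # apply the 0.5 threshold
print(f"P(fraud) = {prob_fraud:.2f} -> predicted class {prediction}")
```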
The Decision Boundary
Logistic regression computes its prediction in two steps. First, we compute a value z from the input features using the equation:

z = w ⋅ x + b
Next, we compute the output using the Sigmoid (Logistic) function:
fw,b(x) = g(w ⋅ x + b) = 1 / (1 + e^(-(w ⋅ x + b)))
We interpret this as the probability that 𝑦 = 1 given 𝑥, written as:
f(x) = P(y = 1 | x; w, b)
To make a final classification decision, we choose a threshold (known as the decision boundary), typically:
If 𝑓(𝑥) ≥ 0.5 → predict ŷ = 1
If 𝑓(𝑥) < 0.5 → predict ŷ = 0
When does 𝑓(𝑥) = 0.5?
This happens when:
g(z) = 0.5 ⇒ z = 0 ⇒ w ⋅ x + b = 0
The equation w ⋅ x + b = 0 defines the decision boundary—the region where the model is undecided between class 0 and class 1.
When Does the Model Predict 1 or 0?
We want:

f(x) ≥ 0.5
From the sigmoid function's properties:
g(z) ≥ 0.5 ⇔ z ≥ 0
So:
f(x) ≥ 0.5 ⇔ w ⋅ x + b ≥ 0
Thus, the model predicts:
ŷ = 1 when w ⋅ x + b ≥ 0
ŷ = 0 when w ⋅ x + b < 0
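A quick sketch (reusing the hypothetical parameters from the fraud example) confirms that thresholding the probability at 0.5 and thresholding z at 0 always give the same prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.0017, -4.25  # same hypothetical parameters as above

for amount in [1000.0, 2500.0, 4000.0]:
    z = w * amount + b
    pred_via_probability = int(sigmoid(z) >= 0.5)  # threshold f(x) at 0.5
    pred_via_z = int(z >= 0)                       # threshold z at 0
    print(f"amount={amount:>7.0f}  z={z:+.2f}  "
          f"prob-rule={pred_via_probability}  z-rule={pred_via_z}")
# Both rules always agree, because g(z) >= 0.5 exactly when z >= 0.
```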
Visualizing Decision Boundaries in Logistic Regression
Let's explore how logistic regression works with two input features, which helps visualize how decision boundaries emerge. Imagine a training dataset plotted on a 2D plane:
- The x-axis represents feature 𝑥₁
- The y-axis represents feature 𝑥₂
- Red crosses (×) represent positive examples where 𝑦 = 1
- Blue circles (○) represent negative examples where 𝑦 = 0
Logistic regression predicts using:
f(x) = g(z) = g(w₁⋅x₁ + w₂⋅x₂ + b)
Let's set:
w₁ = 1
w₂ = 1
b = −3
So:
z = x₁ + x₂ − 3
Now, let's analyze:
When z > 0 → predict 1
When z < 0 → predict 0
When z = 0 → model is unsure → decision boundary
The decision boundary is:
x₁ + x₂ = 3
This defines a straight line in the 2D feature space:
- Points to the right of this line are classified as 𝑦 = 1
- Points to the left are classified as 𝑦 = 0
This line is the decision boundary—a simple threshold separating the two classes.
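A small sketch of this example in Python: with w₁ = w₂ = 1 and b = −3, each point is classified by which side of the line x₁ + x₂ = 3 it falls on (the sample points below are made up for illustration).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, 1.0])   # w1 and w2 from the example above
b = -3.0

# Made-up sample points (x1, x2); the boundary is the line x1 + x2 = 3.
points = np.array([
    [0.5, 1.0],   # x1 + x2 = 1.5 -> z < 0, predict 0
    [2.0, 2.0],   # x1 + x2 = 4.0 -> z > 0, predict 1
    [1.5, 1.5],   # x1 + x2 = 3.0 -> z = 0, on the boundary (predict 1 by the >= convention)
])

z = points @ w + b
probabilities = sigmoid(z)
predictions = (z >= 0).astype(int)
for point, p, yhat in zip(points, probabilities, predictions):
    print(point, f"P(y=1) = {p:.2f}", f"prediction = {yhat}")
```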
Nonlinear Decision Boundaries in Logistic Regression
Logistic regression isn't limited to straight-line boundaries. Using polynomial features, we can model more complex shapes. The prediction function becomes:
f(x) = g(z) = g(w₁ ⋅ x₁² + w₂ ⋅ x₂² + b)
Let's set:
w₁ = 1
w₂ = 1
b = −1
Then:
z = x₁² + x₂² − 1
Setting z = 0 gives the decision boundary:

x₁² + x₂² = 1
This forms a circle:
- Points on or outside the circle (z ≥ 0) → predict 𝑦 = 1
- Points inside the circle (z < 0) → predict 𝑦 = 0
By adding more polynomial terms like x₁⋅x₂, x₁², x₂², etc., the model can learn even more complex shapes like ellipses or irregular curves:
Without polynomial features, logistic regression can only draw linear decision boundaries. With polynomial features, it captures nonlinear relationships, improving performance on more complex datasets.
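The sketch below (assuming scikit-learn is available) generates synthetic data that follows the circular example above and compares logistic regression with and without degree-2 polynomial features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data matching the example above: points outside the unit circle
# (x1^2 + x2^2 > 1) are labeled 1, points inside are labeled 0.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# With only the raw features, logistic regression can draw a straight line,
# which cannot separate the inside of a circle from the outside.
linear_model = LogisticRegression(max_iter=1000).fit(X, y)

# Adding degree-2 polynomial features (x1^2, x1*x2, x2^2, ...) lets the
# model learn the circular boundary.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000)
).fit(X, y)

print("accuracy, raw features:       ", linear_model.score(X, y))  # noticeably lower
print("accuracy, polynomial features:", poly_model.score(X, y))    # close to 1.0
```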