5. Supervised learning: Cost function and Gradient descent for Logistic Regression

  • Apr 13, 2025
In this article, the main focus is on understanding how to choose a suitable cost function for logistic regression. The cost function plays a crucial role in measuring how well a given set of parameters, typically denoted as w and b, fits the training data.

By evaluating how well these parameters perform across the training set, the cost function gives us a method to iteratively improve them, generally using an optimization algorithm like gradient descent.

Why Squared Error Doesn't Work for Logistic Regression

Squared error (used in linear regression) produces a non-convex cost function when used with logistic regression. A non-convex function has many local minima, which makes gradient descent unreliable.

In linear regression, the model is:

𝑓(π‘₯) = 𝑀⋅π‘₯ + 𝑏

And the squared error cost function is:

J(w, b) = (1/(2m)) * ∑_{i=1}^{m} (f(x^(i)) − y^(i))²

This function is convex, meaning it has a nice bowl shape. So, gradient descent works well: it gradually steps toward the global minimum, no matter where it starts.

In logistic regression, the prediction is:

f(x) = ŷ = 1 / (1 + e^(−(w·x + b)))

This function is non-linear and always gives a value between 0 and 1 (interpreted as a probability of fraud in our case).
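This prediction can be sketched in a few lines of NumPy (the helper names here are illustrative, not from the article):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Model output f(x) = sigmoid(w . x + b), read as P(y = 1 | x)."""
    return sigmoid(np.dot(w, x) + b)

print(sigmoid(0.0))  # 0.5: with a zero argument the model is maximally uncertain
```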

If we plug this nonlinear f(x) into the same squared error formula:

J(w, b) = (1/(2m)) * ∑_{i=1}^{m} (1 / (1 + e^(−(w·x^(i) + b))) − y^(i))²

...then the resulting cost function is no longer convex.


When using squared error as a cost function in logistic regression, the shape of the cost surface becomes non-convex: instead of a smooth bowl shape with a single lowest point (the global minimum), the surface has multiple dips and valleys (local minima).

Gradient descent works by moving in the direction that decreases the cost function the most. If the surface is non-convex, gradient descent can:

- Get stuck in a local minimum (a dip that's not the lowest possible point),
- Fail to reach the best parameters that result in optimal predictions.
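The claim is easy to check numerically. The sketch below (my own construction, not from the article) uses a single training example with x = 1, y = 1, and b fixed at 0, evaluates the squared-error cost along w, and shows that its curvature changes sign, so the curve cannot be convex:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training example with x = 1, y = 1 and b = 0, so the
# squared-error cost along w is J(w) = 0.5 * (sigmoid(w) - 1)^2.
w = np.linspace(-6.0, 6.0, 121)
J = 0.5 * (sigmoid(w) - 1.0) ** 2

# A convex curve has non-negative second differences everywhere;
# here the discrete curvature changes sign, so J is not convex.
second_diff = np.diff(J, n=2)
print(second_diff.min() < 0 < second_diff.max())  # True
```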

Cost Function for Logistic Regression

Let: 𝑓(π‘₯(𝑖)) = prediction for the π‘–π‘‘β„Ž training example (i.e., probability it is fraudulent)
𝑦(𝑖) = true label (1 if fraud, 0 otherwise)
π‘š = number of training examples

Logistic Loss Function for a Single Example:

If y = 1: Loss = −log(f(x))
If y = 0: Loss = −log(1 − f(x))

Loss is a number that tells you how wrong your model's prediction is for a single training example. Loss is for one example; cost is the average loss over all examples. The goal of training is to minimize the loss on each example and thereby minimize the cost.
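In code, the two cases are a single conditional (a sketch; `f` is the predicted probability for one example):

```python
import numpy as np

def log_loss_single(f, y):
    """Logistic loss for one example: -log(f) when y == 1, -log(1 - f) when y == 0."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

# Confident and correct -> tiny loss; confident and wrong -> large loss.
print(log_loss_single(0.99, 1))  # ~0.01
print(log_loss_single(0.99, 0))  # ~4.6
```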

Here's the graph showing both log(x) (in blue) and -log(x) (in red):


The loss function commonly used for logistic regression is binary cross-entropy (or log loss), which measures how far the predicted probability is from the actual class label (0 or 1).

Here's the graph of the loss function −log(f) for y = 1 (true label 1) in logistic regression:


As you can see, the loss decreases as the predicted probability f gets closer to 1, and increases sharply as f approaches 0:

The curve shows that the model's loss is minimal when it is confident and correct (predicting probabilities close to 1), and grows significantly when the model is far from the correct prediction.

Here's the graph for the loss function when y = 0:


As the predicted probability f → 0, the loss goes to 0 (a perfect prediction). As f → 1, the loss increases sharply: the model is confidently wrong.

This curve reflects how logistic regression penalizes incorrect high-confidence predictions when the true label is 0.

The simplified loss function for a single training example is:

Loss(f_{w,b}(x^(i)), y^(i)) = −[ y^(i) · log(f_{w,b}(x^(i))) + (1 − y^(i)) · log(1 − f_{w,b}(x^(i))) ]

Why This Works:

When y = 1:
Loss = −log(f) (just like before)

When y = 0:
Loss = −log(1 − f) (also the same as before)

So this unified expression handles both cases in one line, which is very useful when implementing things like gradient descent in code.

Now, let's say you have m training examples. The cost function (average loss across all examples) is:

J(w, b) = (1/m) * ∑_{i=1}^{m} L(f_{w,b}(x^(i)), y^(i))

or

J(w, b) = −(1/m) * ∑_{i=1}^{m} [ y^(i) · log(f_{w,b}(x^(i))) + (1 − y^(i)) · log(1 − f_{w,b}(x^(i))) ]
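The cost function translates directly into vectorized NumPy (a sketch; the helper names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    """Average binary cross-entropy J(w, b) over m examples.
    X: (m, n) feature matrix, y: (m,) labels in {0, 1}."""
    f = sigmoid(X @ w + b)  # predicted probabilities, one per example
    return -np.mean(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))

# With w = 0 and b = 0 every prediction is 0.5, so the cost is log(2).
X = np.array([[1.0], [2.0]])
y = np.array([0.0, 1.0])
print(cost(X, y, np.zeros(1), 0.0))  # ~0.6931
```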

Gradient Descent for Logistic Regression

In this section, we'll dive into how to implement logistic regression by optimizing its parameters using gradient descent.

To fit a logistic regression model, we aim to find values for the parameters w (weights) and b (bias) that minimize the cost function J(w, b). This cost function quantifies how well the model's predictions align with the actual labels in the training data.

To minimize this cost function, we apply gradient descent, an optimization algorithm that iteratively updates the model parameters in the direction that reduces the cost.

The logistic regression model:

f_{w,b}(x) = ŷ = 1 / (1 + e^(−(w·x + b)))

Once the model has been trained, we can use it to assess new data; for example, a new transaction that includes features like the transaction amount, location, time, and device used. The model can then predict whether the transaction is fraudulent or legitimate by estimating the probability that the label y = 1 (fraud).

Here's how gradient descent works in this context. We update the parameters w and b using the following rule:

repeat {
  w_j := w_j − α · ∂J(w, b)/∂w_j
  b := b − α · ∂J(w, b)/∂b
}

By applying the rules of calculus to the logistic regression cost function, these derivatives evaluate to:

βˆ‚π½/βˆ‚π‘€π‘— ​ 𝐽 (w,b) = (1/m) * βˆ‘i=1m ​ (𝑓(w,b)(x(i)) - 𝑦(i)) ​ xj(i)
βˆ‚π½/βˆ‚π‘ ​ 𝐽 (w,b) = (1/m) * βˆ‘i=1m ​ (𝑓(w,b)(x(i)) - 𝑦(i))

As a quick reminder: when updating the parameters w_j and b, we don't update them one at a time while computing the gradients. Instead, we:
- First compute all the necessary gradient values (i.e., the right-hand side of the update rules),
- Then simultaneously update all the parameters using those values.

Let's plug the gradients we derived earlier into the gradient descent update rules. The update formulas become:

repeat {
  w_j := w_j − α · (1/m) * ∑_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i)) · x_j^(i)
  b := b − α · (1/m) * ∑_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i))
}

These updates form the core of gradient descent for logistic regression.
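Put together, the loop can be sketched as follows (the toy data and hyperparameters are my own choices, not from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Minimize J(w, b) by repeating the update rules above.
    X: (m, n) feature matrix, y: (m,) labels in {0, 1}."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = sigmoid(X @ w + b) - y   # f(x_i) - y_i
        dw = (X.T @ err) / m
        db = err.mean()
        w -= alpha * dw                # dw and db were computed first,
        b -= alpha * db                # so w and b update simultaneously
    return w, b

# Toy 1-D data: label is 1 for the larger feature values.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = gradient_descent(X, y)
print((sigmoid(X @ w + b) > 0.5).astype(float))  # matches y
```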

These equations look really similar to the ones we used for linear regression. Are logistic regression and linear regression actually the same?

The answer is no. While the gradient descent update equations look similar, the fundamental difference lies in the function f(x):
- In linear regression, f(x) = w·x + b
- In logistic regression, f(x) = 1 / (1 + e^(−(w·x + b)))


The sigmoid function squashes the output to lie between 0 and 1, turning it into a probability estimate, perfect for classification tasks like fraud detection. That change makes a huge difference in the behavior and purpose of the model.

So while the mechanics of gradient descent might feel familiar, logistic regression is fundamentally different from linear regression.

In linear regression, we talked about how to monitor gradient descent to ensure it converges, i.e., that the cost function is decreasing with each iteration. The same principle applies here. You can:

- Track the value of the cost function J(w, b) over iterations
- Plot a learning curve
- Stop training once the cost function stabilizes or drops below a desired threshold

This helps confirm that your model is learning correctly.
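One way to do this tracking in code (a self-contained sketch with invented toy data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    f = sigmoid(X @ w + b)
    return -np.mean(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))

# Record J(w, b) every iteration; the list is the learning curve.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b, alpha = np.zeros(1), 0.0, 0.1
history = []
for _ in range(500):
    err = sigmoid(X @ w + b) - y   # computed once, then both updates use it
    w -= alpha * (X.T @ err) / len(y)
    b -= alpha * err.mean()
    history.append(cost(X, y, w, b))

# With a suitable learning rate the curve should be decreasing.
print(history[0] > history[-1])  # True
```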

Another trick we discussed during linear regression is feature scaling, and yes, it's just as valuable here. Feature scaling ensures that all features (e.g., transaction amount, transaction time) lie within a similar range, typically between -1 and 1.

This helps gradient descent converge faster and more reliably, as the optimization landscape becomes smoother and easier to navigate.
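One common way to scale features is z-score standardization, which centers each feature at 0 with unit spread, so most values fall roughly in the -1 to 1 range mentioned above. A sketch with invented transaction data:

```python
import numpy as np

# Invented example: each row is a transaction, columns are
# amount (dollars) and hour of day -- wildly different scales.
X = np.array([[2500.0, 23.0],
              [  40.0,  9.0],
              [ 870.0, 14.0]])

mu = X.mean(axis=0)      # per-feature mean
sigma = X.std(axis=0)    # per-feature standard deviation
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```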