Finding the Bottom of a Valley Blindfolded: Understanding Gradient Descent

ML Fundamentals - This article is part of a series.

Part 1: This Article

Part 2: How Machines Ask Smart Questions: Entropy & Information Gain

Part 3: How Neural Networks Learn from Mistakes: Backpropagation Explained

Imagine you’re blindfolded on a mountain and you need to find the lowest valley. You can’t see anything, but you can feel the ground under your feet. What would you do? You’d feel which direction slopes downward, take a small step that way, and repeat. Congratulations — you just invented gradient descent, the algorithm behind nearly every modern AI system.

Why Should You Care?

Optimization is everywhere. When your GPS finds the fastest route, when Netflix recommends a movie, when your phone recognizes your face, an algorithm is searching for the best possible answer in a sea of possibilities. For systems that learn from data, such as the recommender and the face recognizer, gradient descent is the workhorse that does that search.

The Blindfolded Hiker

Let’s flesh out the analogy with a mapping table:

Hiking (Real World)	Gradient Descent (Math)
Your position on the mountain	Current weights $\mathbf{w}$
Your altitude	Error $E$
The slope of the ground under your feet	Gradient $\nabla E$
Your step size	Learning rate $\eta$
The valley floor	Optimal solution

The hiker’s algorithm is simple:

1. Feel the slope under your feet (compute the gradient)
2. Step downhill in the steepest direction (update the weights)
3. Repeat until the ground feels flat (convergence)
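
The three steps above can be sketched as a short loop. This is a minimal illustration rather than a production optimizer: the one-dimensional error surface `E(w) = (w - 3)^2`, the function name `gradient_descent`, the tolerance, and the step limit are all assumptions chosen for the example.

```python
def gradient_descent(grad, w, eta=0.1, tol=1e-8, max_steps=10_000):
    """Walk downhill from w until the slope is (almost) flat."""
    for _ in range(max_steps):
        slope = grad(w)           # 1. feel the slope
        if abs(slope) < tol:      # 3. stop when the ground feels flat
            break
        w = w - eta * slope       # 2. step downhill
    return w


# Hypothetical error surface E(w) = (w - 3)^2, whose slope is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w=0.0)
print(round(w_min, 4))  # 3.0
```

Starting from `w = 0`, the loop settles at the bottom of the valley, `w = 3`.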

Why Do We Need This?

Here’s the problem machines face. Say you’re building a model to predict whether a student will pass an exam based on how many hours they studied. Your model has adjustable knobs called weights — and you need to find the weight values that make the best predictions.

But you can’t just try every possible value. With even a few weights, the number of combinations is astronomical. Instead, you start somewhere and iteratively improve — just like our blindfolded hiker.

The Error Function: How Wrong Are We?

First, we need a way to measure “how wrong” our model is. The most common choice is squared error:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{(x,y) \in D}(y - a)^2$$

Symbol	Meaning
$y$	The correct answer (target)
$a$	Our model’s prediction (activation)
$\frac{1}{2}$	A convenience factor that makes the derivative cleaner

This function creates an error surface — imagine a landscape where altitude represents how wrong you are. Every point on this landscape corresponds to a different set of weights. Our goal: find the lowest point.
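
As a rough sketch, the formula translates almost directly into code. The linear prediction $a = \sum_i w_i x_i$ is borrowed from the delta rule section further down; the function names and the tiny two-row dataset are made up for illustration.

```python
def predict(w, x):
    """Linear prediction a = sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))


def squared_error(w, data):
    """E(w) = 1/2 * sum over (x, y) in data of (y - a)^2."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in data)


# Hypothetical dataset: each x is [bias, hours studied], y is 1 for a pass.
data = [([1, 2], 1), ([1, 0], 0)]
print(squared_error([0.1, 0.3], data))  # roughly 0.05
```

Each choice of `w` gives one altitude on the error surface, so calling `squared_error` with different weights is the code equivalent of standing at different spots on the mountain.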

The Gradient: Which Way Is Downhill?

The gradient is a vector that points in the direction where the error increases the fastest:

$$\nabla E = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\right]$$

Since we want to decrease error, we walk in the opposite direction of the gradient:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E$$

The minus sign is doing the heavy lifting here — it flips “uphill” into “downhill.”

Common confusion: The gradient tells you the slope, not a destination. It says “error increases fastest in this direction” — it doesn’t say “the minimum is over there.” That’s why we take small steps rather than leaping.
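
One way to convince yourself that the minus sign works is to estimate the gradient numerically and take a single step against it. The sketch below assumes one training example and a linear model (both introduced later in the article). `numerical_gradient` and the step `h` are illustrative names, and central finite differences stand in for the analytic derivative.

```python
def error(w, x=(1, 2), y=1):
    """Squared error for one example with a linear prediction."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (y - a) ** 2


def numerical_gradient(f, w, h=1e-6):
    """Estimate each partial derivative dE/dw_i by central differences."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


w = [0.1, 0.3]
eta = 0.1
grad = numerical_gradient(error, w)                   # points uphill
w_new = [wi - eta * gi for wi, gi in zip(w, grad)]    # step the other way
print(error(w), error(w_new))                         # the error should drop
```

Both gradient components come out negative here, so subtracting them increases both weights, which is the direction that lowers the error.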

The Delta Rule: A Concrete Formula

For a single-neuron model predicting with $a = \sum_i w_i x_i$, applying the chain rule to the error on one training example $(x, y)$ gives a clean update rule:

$$\frac{\partial E}{\partial w_i} = -(y - a) \cdot x_i$$

Since we move against the gradient:

$$\Delta w_i = \eta(y - a)x_i$$

This is called the Delta Rule, and it’s beautifully intuitive:

$(y - a)$: How wrong are we? If the prediction is too low, this is positive → increase the weight
$x_i$: How much did this input contribute? Bigger inputs get bigger adjustments
$\eta$: How big a step should we take? (the learning rate)
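
For readers who want the chain-rule step spelled out, here is the derivation for a single training example, using the same symbols as above. The only assumption is that the sum over the dataset in $E$ is dropped, so one $(x, y)$ pair is considered at a time.

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\,\frac{1}{2}(y - a)^2 = (y - a)\cdot\frac{\partial (y - a)}{\partial w_i} = (y - a)\cdot\left(-\frac{\partial a}{\partial w_i}\right) = -(y - a)\,x_i$$

The factor of 2 from the square cancels the $\frac{1}{2}$, and the last step uses $\frac{\partial a}{\partial w_i} = x_i$, which follows from $a = \sum_j w_j x_j$. Multiplying by $-\eta$ gives $\Delta w_i = \eta(y - a)x_i$.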

A Step-by-Step Example

Problem: Predict if a student passes an exam based on hours studied.

Input: $x = [1, 2]$ (1 is the bias term, 2 hours studied)
Target: $y = 1$ (passed)
Initial weights: $w = [0.1, 0.3]$
Learning rate: $\eta = 0.1$

Iteration 1

Predict: $a = 0.1 \times 1 + 0.3 \times 2 = 0.7$

Error: $y - a = 1 - 0.7 = 0.3$ (we predicted too low)

Update:

$$\Delta w_0 = 0.1 \times 0.3 \times 1 = 0.03$$

$$\Delta w_1 = 0.1 \times 0.3 \times 2 = 0.06$$

New weights: $w = [0.13, 0.36]$
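
The arithmetic for this first iteration can be checked with a few lines of Python. `delta_rule_step` is a hypothetical helper name; the numbers are the ones from the example above.

```python
def delta_rule_step(w, x, y, eta):
    """Apply one delta rule update: w_i <- w_i + eta * (y - a) * x_i."""
    a = sum(wi * xi for wi, xi in zip(w, x))   # prediction
    err = y - a                                # how wrong we are
    return [wi + eta * err * xi for wi, xi in zip(w, x)]


w = delta_rule_step(w=[0.1, 0.3], x=[1, 2], y=1, eta=0.1)
print([round(wi, 3) for wi in w])  # [0.13, 0.36]
```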

Watch It Converge

Iteration	$w_0$	$w_1$	Prediction $a$	Error
0	0.100	0.300	0.700	0.300
1	0.130	0.360	0.850	0.150
2	0.145	0.390	0.925	0.075
3	0.153	0.405	0.963	0.037
…	…	…	…	→ 0

Notice how the error shrinks each iteration — and the updates get smaller too. When you’re close to the answer, you take smaller steps. The algorithm naturally slows down as it approaches the solution.
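
The table can be reproduced with a short loop. This is a sketch over the single training example from above, with `run` as a made-up function name. For this example, each update multiplies the error by $1 - \eta \sum_i x_i^2 = 1 - 0.1 \times 5 = 0.5$, which is why every row's error is half the previous one.

```python
def run(w, x, y, eta, steps):
    """Repeat the delta rule on one example, recording each iteration."""
    history = []
    for step in range(steps):
        a = sum(wi * xi for wi, xi in zip(w, x))
        err = y - a
        history.append((step, list(w), a, err))
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return history


for step, w, a, err in run([0.1, 0.3], x=[1, 2], y=1, eta=0.1, steps=4):
    print(step, [round(wi, 4) for wi in w], round(a, 4), round(err, 4))
```

The learning rate matters here: with `eta=0.5` the same factor becomes $1 - 0.5 \times 5 = -1.5$, so the error flips sign and grows with every step instead of shrinking.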

Batch vs. Stochastic: Two Flavors of Descent

There are two main ways to apply gradient descent, and a restaurant analogy helps explain the difference:

Batch Gradient Descent

Batch GD looks at all the training data before making a single update.

Analogy: You’re a chef. You ask every single customer what they thought of the meal, compile all the feedback, then make one careful adjustment to the recipe.

Pro: Stable, smooth path toward the minimum
Con: Slow — you have to process everything before each step

Stochastic Gradient Descent (SGD)

SGD updates after each individual training example.

Analogy: You ask one customer what they thought and immediately tweak the recipe. Then the next customer, tweak again. It’s chaotic but fast.

Pro: Much cheaper per update, and the noise can help the search escape poor local minima
Con: Noisy, zigzag path

graph LR
    subgraph Batch["Batch GD"]
        B1["Smooth, direct path to minimum"] --> B2["● Minimum"]
    end
    subgraph SGD["Stochastic GD"]
        S1["Zigzag, noisy path to minimum"] --> S2["● Minimum"]
    end

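In practice, most systems use mini-batch gradient descent, a middle ground where you look at a small batch (say 32 or 64 examples) at a time. Each update is far cheaper than a full pass over the data, and averaging over the batch smooths out much of the noise of single-example updates.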
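
The three variants differ only in how many examples feed each update, which a single hypothetical `run_epoch` function can show. This sketch assumes the linear model and delta rule from earlier. The dataset is made up, and real training code would usually also shuffle the data each epoch.

```python
def gradient_step(w, batch, eta):
    """One update from the summed delta rule terms of a batch."""
    delta = [0.0] * len(w)
    for x, y in batch:
        a = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            delta[i] += eta * (y - a) * xi
    return [wi + di for wi, di in zip(w, delta)]


def run_epoch(w, data, eta, batch_size):
    """One pass over the data, updating once per batch."""
    for start in range(0, len(data), batch_size):
        w = gradient_step(w, data[start:start + batch_size], eta)
    return w


# Hypothetical data: x = [bias, hours studied], y = 1 for a pass.
data = [([1, 0], 0), ([1, 1], 0), ([1, 3], 1), ([1, 4], 1)]
w0 = [0.0, 0.0]
print(run_epoch(w0, data, eta=0.05, batch_size=len(data)))  # batch: 1 update
print(run_epoch(w0, data, eta=0.05, batch_size=1))          # SGD: 4 updates
print(run_epoch(w0, data, eta=0.05, batch_size=2))          # mini-batch: 2 updates
```

After one pass the three runs end at different weights, because SGD and mini-batch recompute the prediction with already-updated weights partway through the data.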

The Local Optima Trap

Here’s the catch: gradient descent finds a local minimum, not necessarily the global minimum.

Think about it — our blindfolded hiker can only feel the ground directly underfoot. If they walk into a small ditch, the ground slopes up in every direction, so they stop. But there might be a much deeper valley a mile away that they’ll never find.

graph LR
    A["Start"] --> B["Walk downhill..."]
    B --> C["Stuck in local minimum!"]
    C -.->|"Can't see this"| D["Global minimum (the real answer)"]

Solutions to Getting Stuck

Strategy	How It Works
Random restarts	Try hiking from multiple random starting points and keep the best result
Momentum	Keep some “inertia” so you can roll through small ditches
Simulated annealing	Occasionally allow uphill steps early on, then settle down over time
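
Of these, random restarts is the simplest to sketch. The example below assumes a made-up one-dimensional error curve `E(w) = w^4 - 3w^2 + w`, which has a shallow valley on the right and a deeper one on the left. The function names, step count, and learning rate are illustrative choices.

```python
import random


def E(w):
    """Hypothetical bumpy error curve with two valleys."""
    return w**4 - 3 * w**2 + w


def dE(w):
    """Derivative of E."""
    return 4 * w**3 - 6 * w + 1


def descend(w, eta=0.01, steps=1000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        w -= eta * dE(w)
    return w


def random_restarts(starts):
    """Run descent from every start and keep the lowest-error result."""
    return min((descend(w) for w in starts), key=E)


rng = random.Random(0)                       # seeded for repeatability
starts = [rng.uniform(-2, 2) for _ in range(10)]
best = random_restarts(starts)
print(round(best, 3), round(E(best), 3))
```

A single hike that starts on the right, for example `descend(2.0)`, stops in the shallow valley near `w = 1.13`. Only a start on the left side finds the deeper valley near `w = -1.30`, which is why trying several starting points helps.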

The good news: in modern deep learning with millions of parameters, the error landscape is so high-dimensional that true local minima are thought to be rare. Most flat spots are saddle points, which still slope downward in at least one dimension, so there is usually an escape route.

Why This Powers Everything

Gradient descent isn’t just a classroom algorithm. It is how nearly every modern neural network learns:

ChatGPT learned to write by gradient descent over billions of text examples
Self-driving cars use it to tune perception models on driving data
Drug discovery uses it to optimize molecular property predictions
Your phone’s keyboard prediction was trained with it

Every time you hear “the model was trained on data,” gradient descent (or a close variant like Adam) is doing the actual training under the hood.

Key Takeaways

Gradient descent finds good solutions by repeatedly taking small steps in the direction that reduces error
The gradient tells you which direction is “uphill” — so you go the opposite way
The learning rate controls step size: too big and you overshoot, too small and you’ll take forever
Batch gradient descent is stable but slow; stochastic is fast but noisy
You might get stuck in a local minimum, but there are tricks to escape

Study Notes: This post is based on my personal notes from studying Machine Learning (CS7641) at Georgia Institute of Technology. The concepts are explained in my own words for learning purposes.
