Skip to main content
  1. Posts/

Finding the Bottom of a Valley Blindfolded: Understanding Gradient Descent

Nick Liu
Author
Nick Liu
Building infrastructure for Facebook Feed Ranking at Meta. Previously at Walmart, Twitter, AWS, and eBay. MS in Computer Science at Georgia Tech.
Table of Contents
ML Fundamentals - This article is part of a series.
Part 1: This Article
Imagine you’re blindfolded on a mountain and you need to find the lowest valley. You can’t see anything, but you can feel the ground under your feet. What would you do? You’d feel which direction slopes downward, take a small step that way, and repeat. Congratulations — you just invented gradient descent, the algorithm behind nearly every modern AI system.

Why Should You Care?
#

Optimization is everywhere. When your GPS finds the fastest route, when Netflix recommends a movie, when your phone recognizes your face — behind all of these is an algorithm trying to find the best possible answer from a sea of possibilities. Gradient descent is the workhorse algorithm that makes this happen.

The Blindfolded Hiker
#

Let’s flesh out the analogy with a mapping table:

Hiking (Real World) Gradient Descent (Math)
Your position on the mountain Current weights \(\mathbf{w}\)
Your altitude Error \(E\)
The slope of the ground under your feet Gradient \(\nabla E\)
Your step size Learning rate \(\eta\)
The valley floor Optimal solution

The hiker’s algorithm is simple:

  1. Feel the slope under your feet (compute the gradient)
  2. Step downhill in the steepest direction (update the weights)
  3. Repeat until the ground feels flat (convergence)

Why Do We Need This?
#

Here’s the problem machines face. Say you’re building a model to predict whether a student will pass an exam based on how many hours they studied. Your model has adjustable knobs called weights — and you need to find the weight values that make the best predictions.

But you can’t just try every possible value. With even a few weights, the number of combinations is astronomical. Instead, you start somewhere and iteratively improve — just like our blindfolded hiker.

The Error Function: How Wrong Are We?
#

First, we need a way to measure “how wrong” our model is. The most common choice is squared error:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{(x,y) \in D}(y - a)^2$$

Symbol Meaning
\(y\) The correct answer (target)
\(a\) Our model’s prediction (activation)
\(\frac{1}{2}\) A convenience factor that makes the derivative cleaner

This function creates an error surface — imagine a landscape where altitude represents how wrong you are. Every point on this landscape corresponds to a different set of weights. Our goal: find the lowest point.

The Gradient: Which Way Is Downhill?
#

The gradient is a vector that points in the direction where the error increases the fastest:

$$\nabla E = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\right]$$

Since we want to decrease error, we walk in the opposite direction of the gradient:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E$$

The minus sign is doing the heavy lifting here — it flips “uphill” into “downhill.”

Common confusion: The gradient tells you the slope, not a destination. It says “error increases fastest in this direction” — it doesn’t say “the minimum is over there.” That’s why we take small steps rather than leaping.

The Delta Rule: A Concrete Formula
#

For a single-neuron model predicting with \(a = \sum_i w_i x_i\), we can derive a clean update rule using the chain rule:

$$\frac{\partial E}{\partial w_i} = -(y - a) \cdot x_i$$

Since we move against the gradient:

$$\Delta w_i = \eta(y - a)x_i$$

This is called the Delta Rule, and it’s beautifully intuitive:

  • \((y - a)\): How wrong are we? If the prediction is too low, this is positive → increase the weight
  • \(x_i\): How much did this input contribute? Bigger inputs get bigger adjustments
  • \(\eta\): How big a step should we take? (the learning rate)

A Step-by-Step Example
#

Problem: Predict if a student passes an exam based on hours studied.

  • Input: \(x = [1, 2]\) (1 is the bias term, 2 hours studied)
  • Target: \(y = 1\) (passed)
  • Initial weights: \(w = [0.1, 0.3]\)
  • Learning rate: \(\eta = 0.1\)

Iteration 1
#

Predict: \(a = 0.1 \times 1 + 0.3 \times 2 = 0.7\)

Error: \(y - a = 1 - 0.7 = 0.3\) (we predicted too low)

Update: $$\Delta w_0 = 0.1 \times 0.3 \times 1 = 0.03$$ $$\Delta w_1 = 0.1 \times 0.3 \times 2 = 0.06$$

New weights: \(w = [0.13, 0.36]\)

Watch It Converge
#

Iteration \(w_0\) \(w_1\) Prediction \(a\) Error
0 0.100 0.300 0.700 0.300
1 0.130 0.360 0.850 0.150
2 0.145 0.390 0.925 0.075
3 0.153 0.405 0.963 0.037
→ 0

Notice how the error shrinks each iteration — and the updates get smaller too. When you’re close to the answer, you take smaller steps. The algorithm naturally slows down as it approaches the solution.

Batch vs. Stochastic: Two Flavors of Descent
#

There are two main ways to apply gradient descent, and a restaurant analogy helps explain the difference:

Batch Gradient Descent
#

Batch GD looks at all the training data before making a single update.

Analogy: You’re a chef. You ask every single customer what they thought of the meal, compile all the feedback, then make one careful adjustment to the recipe.

  • Pro: Stable, smooth path toward the minimum
  • Con: Slow — you have to process everything before each step

Stochastic Gradient Descent (SGD)
#

SGD updates after each individual training example.

Analogy: You ask one customer what they thought and immediately tweak the recipe. Then the next customer, tweak again. It’s chaotic but fast.

  • Pro: Much faster, and the randomness can help escape bad solutions
  • Con: Noisy, zigzag path
graph LR
    subgraph Batch["Batch GD"]
        B1["Smooth, direct
path to minimum"] --> B2["●
Minimum"] end subgraph SGD["Stochastic GD"] S1["Zigzag, noisy
path to minimum"] --> S2["●
Minimum"] end

In practice, most systems use mini-batch gradient descent — a middle ground where you look at a small batch (say 32 or 64 examples) at a time. It gets the best of both worlds.

The Local Optima Trap
#

Here’s the catch: gradient descent finds a local minimum, not necessarily the global minimum.

Think about it — our blindfolded hiker can only feel the ground directly underfoot. If they walk into a small ditch, the ground slopes up in every direction, so they stop. But there might be a much deeper valley a mile away that they’ll never find.

graph LR
    A["Start"] --> B["Walk downhill..."]
    B --> C["Stuck in
local minimum!"] C -.->|"Can't see this"| D["Global minimum
(the real answer)"]

Solutions to Getting Stuck
#

Strategy How It Works
Random restarts Try hiking from multiple random starting points and keep the best result
Momentum Keep some “inertia” so you can roll through small ditches
Simulated annealing Occasionally allow uphill steps early on, then settle down over time

The good news: in modern deep learning with millions of parameters, the error landscape is so high-dimensional that true local minima are actually quite rare. Most “valleys” have an escape route in some dimension.

Why This Powers Everything
#

Gradient descent isn’t just a classroom algorithm — it’s literally how every modern neural network learns:

  • ChatGPT learned to write by gradient descent over billions of text examples
  • Self-driving cars use it to tune perception models on driving data
  • Drug discovery uses it to optimize molecular property predictions
  • Your phone’s keyboard prediction was trained with it

Every time you hear “the model was trained on data,” gradient descent (or a close variant like Adam) is doing the actual training under the hood.

Key Takeaways
#

  1. Gradient descent finds good solutions by repeatedly taking small steps in the direction that reduces error
  2. The gradient tells you which direction is “uphill” — so you go the opposite way
  3. The learning rate controls step size: too big and you overshoot, too small and you’ll take forever
  4. Batch gradient descent is stable but slow; stochastic is fast but noisy
  5. You might get stuck in a local minimum, but there are tricks to escape
Study Notes: This post is based on my personal notes from studying Machine Learning (CS7641) at Georgia Institute of Technology. The concepts are explained in my own words for learning purposes.
ML Fundamentals - This article is part of a series.
Part 1: This Article

Related

How Machines Ask Smart Questions: Entropy & Information Gain

Imagine you’re playing 20 Questions. You’re trying to guess what animal your friend is thinking of. Would you start with “Is it a golden retriever?” or “Does it live in water?” The second question is obviously smarter — it eliminates roughly half the possibilities in one shot. Decision trees in machine learning work exactly the same way, and they use entropy and information gain to figure out what the smartest question is. What’s the Big Idea? # When a machine learning algorithm builds a Decision Tree , it needs to decide which question to ask first. Should it split the data by color? By size? By temperature? The answer comes from a beautifully simple concept: ask the question that reduces uncertainty the most.

How Neural Networks Learn from Mistakes: Backpropagation Explained

When a factory produces a defective product, how do you trace the problem back through the assembly line to find which worker made the mistake? Neural networks face the exact same challenge. They have layers of “workers” (neurons), and when the final output is wrong, they need to figure out who’s responsible — and by how much. The algorithm that solves this is called backpropagation, and it’s the reason deep learning works at all. Neural Networks Are Everywhere # Before we dive into how neural networks learn, let’s appreciate what they do. The phone in your pocket uses neural networks for face recognition, voice transcription, photo enhancement, and text prediction. Self-driving cars, medical image analysis, language translation — all neural networks.