Skip to main content
  1. Posts/

How Machines Ask Smart Questions: Entropy & Information Gain

Nick Liu
Author
Nick Liu
Building infrastructure for Facebook Feed Ranking at Meta. Previously at Walmart, Twitter, AWS, and eBay. MS in Computer Science at Georgia Tech.
Table of Contents
ML Fundamentals - This article is part of a series.
Part 2: This Article
Imagine you’re playing 20 Questions. You’re trying to guess what animal your friend is thinking of. Would you start with “Is it a golden retriever?” or “Does it live in water?” The second question is obviously smarter — it eliminates roughly half the possibilities in one shot. Decision trees in machine learning work exactly the same way, and they use entropy and information gain to figure out what the smartest question is.

What’s the Big Idea?
#

When a machine learning algorithm builds a Decision Tree , it needs to decide which question to ask first. Should it split the data by color? By size? By temperature? The answer comes from a beautifully simple concept: ask the question that reduces uncertainty the most.

But to measure “uncertainty,” we need a number for it. That number is called entropy.

Entropy: Measuring Chaos
#

The Messy Room Analogy
#

Think of entropy as a messiness score:

Situation Entropy Why
All your clothes are folded in the closet Low You know exactly where everything is
Clothes are scattered everywhere High You have no idea where anything is

In machine learning, “messiness” means how mixed up the categories are in your data.

The Marble Bag
#

Imagine reaching into a bag of marbles without looking:

  • Bag A: All red marbles → Entropy = 0 (you know what you’ll grab)
  • Bag B: Half red, half blue → Entropy = 1 (maximum uncertainty — it’s a coin flip)
  • Bag C: All blue marbles → Entropy = 0 (certain again)

The more mixed up the bag is, the higher the entropy. When everything is the same, entropy is zero — there’s no surprise.

The Math (It’s Simpler Than It Looks)
#

For a dataset \(S\) with multiple classes, entropy is:

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

Where \(p_i\) is the proportion of class \(i\) in the dataset, and \(c\) is the number of classes.

For the simple case of two classes (yes/no, spam/not-spam):

$$Entropy(S) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$$

Let’s plug in numbers for the marble bags:

  • Bag A (all red): \(Entropy = -1 \times \log_2(1) - 0 \times \log_2(0) = 0\)
  • Bag B (half and half): \(Entropy = -0.5 \times \log_2(0.5) - 0.5 \times \log_2(0.5) = 1\)

The entropy curve looks like a hill — it peaks at 0.5 (maximum confusion) and drops to zero at the extremes (total certainty).

Why log base 2? Because entropy is measured in bits — the same bits as in computer science. An entropy of 1 bit means you need exactly one yes/no question to figure out the answer. This connects directly to the “20 Questions” game!

Information Gain: Picking the Best Question
#

Now for the punchline. Information gain measures how much a particular question reduces entropy:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v)$$

In plain English: Information Gain = Entropy before asking − Entropy after asking

A bigger information gain means a better question.

Worked Example: Should I Play Tennis?
#

Let’s say you have 8 days of data about whether you played tennis:

  • 5 days you played (✅)
  • 3 days you didn’t (❌)

Starting entropy:

$$Entropy = -\frac{5}{8}\log_2\frac{5}{8} - \frac{3}{8}\log_2\frac{3}{8} \approx 0.954$$

Now let’s compare two possible questions:

Question 1: “What’s the weather?”
#

graph TD
    A["All Data
5✅ 3❌
Entropy = 0.954"] -->|Sunny| B["2✅ 1❌
Entropy = 0.918"] A -->|Overcast| C["2✅ 0❌
Entropy = 0"] A -->|Rainy| D["1✅ 2❌
Entropy = 0.918"]

Weighted entropy after splitting by weather:

$$\frac{3}{8} \times 0.918 + \frac{2}{8} \times 0 + \frac{3}{8} \times 0.918 = 0.689$$

Information Gain = 0.954 − 0.689 = 0.265

Question 2: “Is it windy?”
#

graph TD
    A["All Data
5✅ 3❌
Entropy = 0.954"] -->|Light Wind| B["4✅ 1❌
Entropy = 0.722"] A -->|Strong Wind| C["1✅ 2❌
Entropy = 0.918"]

Weighted entropy after splitting by wind:

$$\frac{5}{8} \times 0.722 + \frac{3}{8} \times 0.918 = 0.795$$

Information Gain = 0.954 − 0.795 = 0.159

The Verdict
#

Attribute Information Gain Winner?
Weather 0.265 ✅ Pick this one!
Wind 0.159

Weather gives more information gain, so the decision tree puts it at the top. It’s the smarter first question — just like in 20 Questions.

What a Good Split Looks Like
#

graph LR
    A["Mixed Data
High Entropy"] -->|Good Split| B["Mostly ✅
Low Entropy"] A -->|Good Split| C["Mostly ❌
Low Entropy"]

A good split takes a messy group and separates it into purer groups. A bad split leaves you with groups that are still mixed up — you haven’t learned much.

Real-World Applications
#

This isn’t just a classroom exercise. Decision trees powered by entropy and information gain are used everywhere:

  • Spam filtering: “Does the email contain the word ’lottery’?” splits your inbox into much purer groups than “Was it sent on a Tuesday?”
  • Medical diagnosis: A doctor’s diagnostic flowchart is essentially a decision tree — “Does the patient have a fever?” is a high-information-gain question for many conditions
  • Recommendation systems: Streaming services split users into groups based on features that best predict what they’ll watch next
  • Credit scoring: Banks use decision trees to determine which factors best separate reliable borrowers from risky ones

Key Takeaways
#

  1. Entropy measures how mixed up (uncertain) a dataset is — from 0 (pure) to 1 (maximum chaos for two classes)
  2. Information gain tells you how much a question reduces that chaos
  3. Decision trees are greedy — they always pick the question with the highest information gain first
  4. This is exactly the strategy of a smart 20 Questions player: ask the question that eliminates the most possibilities
Study Notes: This post is based on my personal notes from studying Machine Learning (CS7641) at Georgia Institute of Technology. The concepts are explained in my own words for learning purposes.
ML Fundamentals - This article is part of a series.
Part 2: This Article

Related

Finding the Bottom of a Valley Blindfolded: Understanding Gradient Descent

Imagine you’re blindfolded on a mountain and you need to find the lowest valley. You can’t see anything, but you can feel the ground under your feet. What would you do? You’d feel which direction slopes downward, take a small step that way, and repeat. Congratulations — you just invented gradient descent, the algorithm behind nearly every modern AI system. Why Should You Care? # Optimization is everywhere. When your GPS finds the fastest route, when Netflix recommends a movie, when your phone recognizes your face — behind all of these is an algorithm trying to find the best possible answer from a sea of possibilities. Gradient descent is the workhorse algorithm that makes this happen.

How Neural Networks Learn from Mistakes: Backpropagation Explained

When a factory produces a defective product, how do you trace the problem back through the assembly line to find which worker made the mistake? Neural networks face the exact same challenge. They have layers of “workers” (neurons), and when the final output is wrong, they need to figure out who’s responsible — and by how much. The algorithm that solves this is called backpropagation, and it’s the reason deep learning works at all. Neural Networks Are Everywhere # Before we dive into how neural networks learn, let’s appreciate what they do. The phone in your pocket uses neural networks for face recognition, voice transcription, photo enhancement, and text prediction. Self-driving cars, medical image analysis, language translation — all neural networks.