Skip to main content
  1. Posts/

Case Study: Building AWS Billing's Unbilled Usage Auditor

Nick Liu
Author
Nick Liu
Building infrastructure for Facebook Feed Ranking at Meta. Previously at Walmart, Twitter, AWS, and eBay. MS in Computer Science at Georgia Tech.
Table of Contents
I spent five years on the AWS Billing team. The hardest problem I tackled was detecting when customers used AWS services but weren’t charged correctly. This post walks through how I designed a system that reduced charge discrepancies by 300x and eliminated 230 million monthly false positives.

The Problem
#

AWS billing is trickier than it looks. When a customer launches an EC2 instance, writes to S3, or queries DynamoDB, each action generates a usage record. These records flow through a pipeline that calculates charges based on the customer’s pricing plan, region, and service tier.

Sometimes, usage records don’t make it through the pipeline correctly. A record might get dropped, delayed, or processed with the wrong pricing. The customer gets under-charged (AWS loses revenue) or over-charged (customer trust erodes). At AWS’s scale, even a tiny error rate across millions of customers adds up fast.

The existing detection system flagged potential discrepancies, but it generated 230 million false positive alerts per month. That volume made the alerts useless in practice. Engineers would triage a few, find nothing, and ignore the rest. Real discrepancies were buried in noise.

The Challenge
#

The core tension was precision vs. recall:

  • Too aggressive: Flag everything, drown in false positives (the status quo)
  • Too conservative: Miss real discrepancies, lose money and customer trust
  • Just right: Flag only genuine issues, with enough context to act quickly

Additionally, the system had to:

  • Process billions of usage records across every AWS service
  • Handle the different pricing models, discount programs, and billing cycles
  • Run continuously without impacting the billing pipeline’s latency
  • Produce actionable alerts, not just “something looks wrong”

The Approach
#

I designed the Unbilled Usage Auditor as a distributed system with three components:

flowchart LR
    A["Smart Sampling\n& Aggregation"] --> B["Multi-Signal\nValidation"]
    B --> C["Automated\nResolution"]
    C --> D{Real issue?}
    D -- Yes --> E["Alert with\ndiagnosis"]
    D -- No --> F["Auto-resolve"]

1. Smart Sampling and Aggregation
#

Instead of checking every individual usage record (which generated the 230M false positives), I aggregated usage at the service-account-period level. This reduced the comparison space by orders of magnitude while preserving the ability to detect genuine discrepancies.

The aggregation was built on DynamoDB for its consistent low-latency reads at any scale. Each aggregated record stored the expected charge (from usage records) and the actual charge (from the billing output), along with metadata about the pricing plan and discount programs applied.

2. Multi-Signal Validation
#

A single mismatch between expected and actual charges doesn’t necessarily indicate a problem. Pricing changes, retroactive discounts, and billing cycle boundaries all create legitimate temporary discrepancies.

I built a validation pipeline using AWS Lambda that checked multiple signals before escalating:

  • Temporal correlation: Is this a timing issue that will self-correct in the next billing cycle?
  • Pricing context: Did a pricing change or discount activation explain the difference?
  • Historical pattern: Has this account/service combination shown similar patterns before?
  • Magnitude thresholds: Is the discrepancy large enough to warrant investigation?

Only records that failed all validation checks were escalated as genuine alerts.

3. Automated Resolution Pipeline
#

For common discrepancy patterns, the system could trigger automated remediation: re-processing usage records, applying missing discounts, or flagging records for manual review with specific context about what went wrong.

So when an engineer did receive an alert, it came with a diagnosis, not just a symptom.

The Results
#

After rolling out the Unbilled Usage Auditor:

Metric Before After Change
Monthly false positives 230,000,000 < 1,000 -99.99%
Charge discrepancies $125,000 $432 -99.65% (300x)
Time to resolve alerts Days Hours ~10x faster
Alert actionability ~0% (noise) ~95% (genuine) Usable

The 300x reduction in charge discrepancies (from $125,000 to $432) came from catching real issues that were previously buried under false positives. The system didn’t just reduce noise. It uncovered signal that had always been there.

What I Learned
#

Correctness Is a Spectrum
#

In billing systems, “correct” isn’t binary. There’s the charge that’s mathematically right given current pricing, the charge that’s right given the customer’s expectation, and the charge that’s right given the business rules (discounts, credits, negotiations). The system had to reason about all three.

False Positive Reduction Is Its Own Feature
#

Reducing false positives wasn’t just an optimization. It was a prerequisite for the system being useful at all. A detection system that cries wolf 230 million times a month is a noise generator, not a detection system. The biggest impact came not from finding new problems, but from making existing problems visible.

Scale Changes the Problem
#

At AWS scale, approaches that work for thousands of records completely break at billions. The shift from per-record checking to aggregated analysis wasn’t just an optimization; it required rethinking the problem from scratch. When your scale changes by 1000x, your architecture probably needs to change, not just your hardware.

The hardest part of this project wasn’t the distributed systems engineering. It was understanding the domain deeply enough to tell a real discrepancy from expected behavior. Technical skill got me the system; domain knowledge got me the 300x improvement.

This project shaped how I think about system design: start with correctness guarantees, then optimize for signal-to-noise ratio, and always build systems that explain their decisions.

For more about my career journey and other projects, see my experience page or projects page.

Related

Case Study: Fleet-Scale Kernel Automation at Twitter

At Twitter, I was responsible for kernel updates across 5,000+ production servers. Updating a kernel is risky on one machine. Doing it across a fleet, without downtime, without data loss, and without breaking the services that millions of people depend on, is a different problem entirely. The Problem # Twitter’s production infrastructure ran on thousands of bare-metal servers across multiple data centers. Each server ran a Linux kernel that needed regular updates for security patches, performance improvements, and hardware compatibility.

Finding the Bottom of a Valley Blindfolded: Understanding Gradient Descent

Imagine you’re blindfolded on a mountain and you need to find the lowest valley. You can’t see anything, but you can feel the ground under your feet. What would you do? You’d feel which direction slopes downward, take a small step that way, and repeat. Congratulations — you just invented gradient descent, the algorithm behind nearly every modern AI system. Why Should You Care? # Optimization is everywhere. When your GPS finds the fastest route, when Netflix recommends a movie, when your phone recognizes your face — behind all of these is an algorithm trying to find the best possible answer from a sea of possibilities. Gradient descent is the workhorse algorithm that makes this happen.