
AWS Billing Unbilled Usage Auditor

Nick Liu
Author
Building infrastructure for Facebook Feed Ranking at Meta. Previously at Walmart, Twitter, AWS, and eBay. MS in Computer Science at Georgia Tech.
Designed and built a distributed system that detects unbilled usage across all AWS services — reducing charge discrepancies by 300x and eliminating 230 million monthly false positives.

Key Metrics

  • 300x reduction in charge discrepancies: $125,000 → $432
  • 230M monthly false positives eliminated
  • ~95% alert actionability

Architecture

```mermaid
flowchart LR
    A["Usage Records\n(Billions/day)"] --> B["Smart Sampling\n& Aggregation"]
    B --> C["Multi-Signal\nValidation"]
    C --> D["Automated\nResolution"]
    D --> E{Real issue?}
    E -- Yes --> F["Alert with\nDiagnosis"]
    E -- No --> G["Auto-resolve\n& Log"]
    style B fill:#6366f1,color:#fff
    style C fill:#6366f1,color:#fff
    style D fill:#6366f1,color:#fff
```

Technical Deep Dive

Aggregation over Brute-Force

Instead of checking every individual usage record (the approach that generated 230M monthly false positives), the system aggregates at the service-account-period level.

  • Built on DynamoDB for consistent low-latency reads at any scale
  • Each record stores expected charge, actual charge, pricing plan, and discount metadata
  • Reduced comparison space by orders of magnitude while preserving detection capability
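The aggregation step can be sketched roughly as follows. This is a minimal in-memory illustration, not the production DynamoDB-backed pipeline; all type and field names (`UsageRecord`, `AggKey`, the tolerance parameter) are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical line item of usage (illustrative, not the real schema).
record UsageRecord(String service, String account, String period,
                   double expectedCharge, double actualCharge) {}

// Comparison key at the service-account-period level.
record AggKey(String service, String account, String period) {}

// Running totals for one aggregate.
class AggTotals {
    double expected;
    double actual;
}

class Aggregator {
    // Collapse many records into one comparison per (service, account, period).
    static Map<AggKey, AggTotals> aggregate(List<UsageRecord> records) {
        Map<AggKey, AggTotals> out = new HashMap<>();
        for (UsageRecord r : records) {
            AggKey key = new AggKey(r.service(), r.account(), r.period());
            AggTotals t = out.computeIfAbsent(key, k -> new AggTotals());
            t.expected += r.expectedCharge();
            t.actual += r.actualCharge();
        }
        return out;
    }

    // Only aggregates whose totals diverge beyond a tolerance move on to validation;
    // per-record mismatches that cancel out at this level never surface at all.
    static List<AggKey> candidates(Map<AggKey, AggTotals> aggs, double toleranceUsd) {
        List<AggKey> out = new ArrayList<>();
        for (Map.Entry<AggKey, AggTotals> e : aggs.entrySet()) {
            if (Math.abs(e.getValue().expected - e.getValue().actual) > toleranceUsd) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

Note how two records that mis-book a charge in opposite directions within the same period cancel in the aggregate, which is exactly why this cut eliminates most per-record false positives.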

Beyond Simple Mismatch Detection

A single charge mismatch doesn’t necessarily indicate a real problem. The validation pipeline (built on AWS Lambda) checks multiple signals:

  • Temporal correlation — Is this a timing issue that self-corrects?
  • Pricing context — Did a pricing change or discount explain the difference?
  • Historical pattern — Has this account shown similar patterns before?
  • Magnitude thresholds — Is the discrepancy large enough to investigate?

Only records failing all validation checks are escalated.
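The escalation rule can be sketched as a chain of benign-explanation predicates, assuming each signal reduces to a boolean on the candidate. The `Discrepancy` fields and the $1 threshold here are illustrative assumptions, not values from the actual system.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical validation candidate (fields are illustrative).
record Discrepancy(double amountUsd, boolean withinSettlementWindow,
                   boolean pricingChangeInPeriod, boolean matchesHistoricalPattern) {}

class Validator {
    static final double MIN_ACTIONABLE_USD = 1.0; // illustrative magnitude threshold

    // Each check returns true when it EXPLAINS the mismatch (i.e., it is benign).
    static final List<Predicate<Discrepancy>> BENIGN_EXPLANATIONS = List.of(
        Discrepancy::withinSettlementWindow,   // temporal: likely self-corrects
        Discrepancy::pricingChangeInPeriod,    // pricing change or discount
        Discrepancy::matchesHistoricalPattern, // known recurring account pattern
        d -> d.amountUsd() < MIN_ACTIONABLE_USD); // too small to investigate

    // Escalate only when no benign explanation applies, i.e. the record
    // fails every validation check.
    static boolean shouldEscalate(Discrepancy d) {
        return BENIGN_EXPLANATIONS.stream().noneMatch(p -> p.test(d));
    }
}
```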

From Symptom to Diagnosis

Common discrepancy patterns trigger automated remediation:

  • Re-processing dropped usage records
  • Applying missing discounts retroactively
  • Flagging records for manual review with specific context about what went wrong

Engineers receive alerts with a diagnosis, not just a symptom.
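The pattern-to-remediation dispatch might look something like this sketch, where known causes are auto-resolved and everything else is escalated with a diagnosis attached. The `Cause` taxonomy and message strings are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical cause labels for common discrepancy patterns.
enum Cause { DROPPED_RECORD, MISSING_DISCOUNT, UNKNOWN }

class Remediator {
    // Playbook: known pattern -> automated fix.
    private static final Map<Cause, String> PLAYBOOK = Map.of(
        Cause.DROPPED_RECORD, "re-processed dropped usage record",
        Cause.MISSING_DISCOUNT, "retroactively applied missing discount");

    // Known causes are remediated automatically; anything else is flagged for
    // manual review with the diagnosed cause, so the alert says WHAT went
    // wrong rather than just that totals differ.
    static String handle(Cause cause, String recordId) {
        return Optional.ofNullable(PLAYBOOK.get(cause))
            .map(action -> "auto-resolved " + recordId + ": " + action)
            .orElse("manual review " + recordId + ": cause=" + cause);
    }
}
```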


Tech Stack

Java
DynamoDB
AWS Lambda
Distributed Systems
Billing Pipeline

