
Twitter Fleet-Scale Kernel Automation

Nick Liu
Building infrastructure for Facebook Feed Ranking at Meta. Previously at Walmart, Twitter, AWS, and eBay. MS in Computer Science at Georgia Tech.
Built the automation and validation tooling to manage kernel updates across 5,000+ production servers at Twitter — with zero-downtime progressive rollouts and automated canary validation.

Key Metrics

5,000+ Production Hosts
Weeks → Days Rollout Time
140+ Tickets in One On-Call Week
Zero-Downtime Updates

Architecture

```mermaid
flowchart LR
    A["New Kernel\nVersion"] --> B["Canary\nValidation"]
    B --> C["Wave 1\n1% Fleet"]
    C --> D["Wave 2\n5% Fleet"]
    D --> E["Wave 3\n25% Fleet"]
    E --> F["Full Fleet\nRollout"]
    C -- anomaly --> G["Pause &\nAuto-Alert"]
    D -- anomaly --> G
    E -- anomaly --> G
    style B fill:#6366f1,color:#fff
    style C fill:#6366f1,color:#fff
    style D fill:#6366f1,color:#fff
    style E fill:#6366f1,color:#fff
    style F fill:#6366f1,color:#fff
```

Technical Deep Dive

Validate Before You Roll

A Python library that validates kernel safety before fleet-wide rollout:

  • Provisions canary hosts from each hardware/workload combination
  • Applies kernel update and reboots canary hosts
  • Runs validation suites — system stability, performance benchmarks, application health
  • Compares baselines — CPU, memory, I/O latency, network throughput vs. production

Only after passing canary validation on every host type is a kernel approved for rollout.
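The baseline-comparison step above can be sketched as a simple tolerance check. This is a hypothetical illustration, not the actual library: the metric names and thresholds are invented for the example.

```python
# Hypothetical sketch of canary baseline comparison.
# Metric names and tolerances are illustrative, not Twitter's actual values.

BASELINE_TOLERANCE = {
    "cpu_util_pct": 1.10,          # allow up to a 10% regression
    "mem_used_pct": 1.10,
    "io_latency_ms": 1.15,
    "net_throughput_mbps": 0.90,   # throughput must stay above 90% of baseline
}

def validate_canary(baseline: dict, canary: dict) -> list:
    """Return the metrics that regressed beyond tolerance (empty list = pass)."""
    failures = []
    for metric, limit in BASELINE_TOLERANCE.items():
        base, observed = baseline[metric], canary[metric]
        if metric == "net_throughput_mbps":
            # Higher is better: fail if the canary drops below limit * baseline.
            if observed < base * limit:
                failures.append(metric)
        else:
            # Lower is better: fail if the canary exceeds limit * baseline.
            if observed > base * limit:
                failures.append(metric)
    return failures
```

A kernel would be approved only when `validate_canary` returns an empty list for every hardware/workload canary.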

Wave-Based Deployment

Automated deployment in progressive waves with anomaly detection between each:

| Wave   | Coverage | Purpose                    |
|--------|----------|----------------------------|
| Wave 1 | 1%       | Smoke test in production   |
| Wave 2 | 5%       | Expand to more host types  |
| Wave 3 | 25%      | Majority coverage          |
| Wave 4 | 100%     | Full fleet completion      |

Between each wave: automated monitoring for unexpected reboots, performance regression, and application errors. Any signal crossing a threshold pauses the rollout automatically.
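The wave loop with its anomaly gate can be sketched as follows. This is a minimal illustration of the control flow only; `apply_kernel`, `anomalies_detected`, and `alert` are hypothetical callbacks standing in for the real provisioning and monitoring systems.

```python
# Hypothetical sketch of wave-based rollout with anomaly gating between waves.

WAVES = [0.01, 0.05, 0.25, 1.00]  # cumulative fraction of the fleet per wave

def rollout(fleet: list, apply_kernel, anomalies_detected, alert) -> bool:
    """Roll out wave by wave; pause and alert on any anomaly signal."""
    done = 0
    for fraction in WAVES:
        target = int(len(fleet) * fraction)
        for host in fleet[done:target]:
            apply_kernel(host)       # update + reboot this host
        done = target
        if anomalies_detected():     # reboots, regressions, app errors
            alert(f"Rollout paused after {done}/{len(fleet)} hosts")
            return False             # paused for human review
    return True                      # full fleet completed
```

Pausing rather than rolling back keeps the blast radius bounded to the completed waves while a human decides whether the signal is real.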

Eliminating Drift

Configuration drift was the root cause of most fleet management pain. Built tooling to:

  • Audit every host against its declared state
  • Detect drift from intended configuration
  • Auto-remediate safe divergences, flag risky ones for human review

This was a prerequisite for safe automation — you can’t automate kernel updates on hosts whose configuration you don’t fully understand.
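The audit/remediate split above can be sketched as a per-host diff against declared state. The config keys and the "safe" list here are invented for illustration; the real tooling's policy was Twitter-specific.

```python
# Hypothetical sketch of a drift audit. Divergences on "safe" keys are
# auto-remediated; anything else is flagged for human review.

SAFE_TO_REMEDIATE = {"sysctl.vm.swappiness", "ntp.server"}  # illustrative

def audit_host(declared: dict, actual: dict):
    """Split drifted keys into auto-remediable and human-review buckets."""
    remediate, review = [], []
    for key, want in declared.items():
        if actual.get(key) != want:
            (remediate if key in SAFE_TO_REMEDIATE else review).append(key)
    return remediate, review
```

Anything in the `review` bucket (e.g. an unexpected kernel version) blocks that host from automated rollout until a human signs off.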

Runtime Cache Control

A custom commands system, written in Go, for Twitter’s Redis-based cache services:

  • Inspect and modify cache behavior at runtime
  • Debug production issues without service restarts
  • Zero customer impact during investigation
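The production system was written in Go; the sketch below shows the same command-registry idea in Python for consistency with the rest of this page. The class, command names, and config keys are all hypothetical: named runtime commands that inspect or tweak live cache behavior without a restart.

```python
# Hypothetical Python sketch of a runtime command registry for a cache
# service. The real system was Go; names here are illustrative only.

class RuntimeCommands:
    def __init__(self, cache_config: dict):
        self.config = cache_config
        self.commands = {
            "get-config": lambda key: self.config.get(key),
            "set-ttl": self._set_ttl,
        }

    def _set_ttl(self, seconds: str) -> int:
        self.config["ttl_seconds"] = int(seconds)  # applied live, no restart
        return self.config["ttl_seconds"]

    def run(self, name: str, arg: str):
        """Dispatch a named runtime command against the live service."""
        if name not in self.commands:
            raise ValueError(f"unknown command: {name}")
        return self.commands[name](arg)
```

Because commands operate on live state, an on-call engineer can inspect and adjust cache behavior mid-incident without bouncing the process.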

Impact

Tech Stack

Python
Go
Redis
Linux Kernel
Fleet Management
Bare Metal

