launchctl unload returned 0. The daemon was still running. KeepAlive raced.

Nick Liu
Author
Building infrastructure for Facebook Feed Ranking at Meta. Previously at Walmart, Twitter, AWS, and eBay. MS in Computer Science at Georgia Tech.
`launchctl unload ~/Library/LaunchAgents/com.nickboy.claudedeck.plist` exited 0. Then `pgrep -f claudedeck-daemon` printed a fresh PID. Three seconds after the "unload succeeded" line. Spoiler: KeepAlive is a polling supervisor, not an event-driven one, and when you tell launchd to tear a job down, there is a window where the supervisor has already noticed the previous PID is gone and started a replacement.

(One-paragraph grounding if launchd isn’t your daily driver: launchd is macOS’s init system, the equivalent of systemd on Linux or Windows Services on Windows. It runs as PID 1, brings up daemons, and restarts them when they crash. A LaunchAgent is a per-user launchd job, defined by an XML plist (property list) at ~/Library/LaunchAgents/<name>.plist. KeepAlive is one of the plist keys; set it to true and launchd will respawn the job whenever it exits. launchctl is the CLI you use to load, unload, and inspect those jobs. The Linux mental model: think systemctl driving systemd unit files. The Stream Deck plugin and its daemon are described in the TCC cdhash trap post if you want the project context.)

I’m building a Stream Deck plugin called ClaudeDeck. Its daemon runs under a user LaunchAgent so it comes up at login and stays up — RunAtLoad: true, KeepAlive: true, the obvious shape. Writing the plist took ten minutes. Getting the daemon to actually go away when I wanted it to took a weekend.

The story below is one specific bug — the daemon refusing to die. I’ll skip the bootstrap-domain confusion I hit earlier (the gui/$UID vs system thing — launchd has multiple “domains” you can load a job into, one per logged-in GUI user and one system-wide, and pointing launchctl at the wrong one is the most common first-time confusion; well-covered elsewhere) and stick to the part that surprised me.

The symptom

I had a disable.sh script for pausing the daemon when I wanted to debug unrelated AppleEvent issues. The first version was three lines:

launchctl unload "$PLIST"
mv "$PLIST" "$PLIST.disabled"
echo "claudedeck: disabled."

Worked on paper. Failed in practice. The verification step I added later — a sanity pgrep — caught it:

claudedeck: launchctl unload...
claudedeck: plist parked at ...com.nickboy.claudedeck.plist.disabled
claudedeck: verification:
  ✅ launchctl no longer lists com.nickboy.claudedeck
  ⚠️  claudedeck-daemon still running — try: pkill -9 -f claudedeck-daemon

Two things were true at once:

  1. launchctl list | grep claudedeck was empty. The job was unloaded.
  2. pgrep -f claudedeck-daemon printed a PID. The process was alive.

If you’ve never seen this combination, it looks like a lie. launchd is the only thing that should have a handle on this daemon, and launchd just told me it doesn’t.

What I thought was happening

My mental model was: launchctl unload is synchronous. By the time the command returns, the job is torn down — supervisor gone, child SIGTERM’d (sent the “please shut down gracefully” signal — number 15, the default for kill), exit reaped (the parent called wait() on the child’s exit status so the kernel can release its PID slot). That’s how systemctl stop behaves on Linux. That’s how I assumed launchd behaves too.

Wrong, but in a subtle way. launchctl unload is synchronous about the unload itself. What it’s not synchronous about is whatever the supervisor did in the milliseconds before you called it.

What I tried first

Standard debugging moves. None of them were the answer, but the dead ends are part of the shape of the bug.

Try one: maybe KeepAlive: true is too aggressive. I swapped it for KeepAlive: { SuccessfulExit: false }, the variant that only restarts on crash. No change. The straggler PID still appeared.

Try two: maybe launchctl unload is the legacy verb and bootout is the synchronous one. Apple deprecated load/unload years ago in favour of bootstrap/bootout (the newer verbs take an explicit domain: gui/$(id -u) for “this logged-in user’s GUI session,” system for the LaunchDaemon scope). I rewrote the script:

launchctl bootout "gui/$(id -u)/com.nickboy.claudedeck"

Same result. The job left launchctl list. The process didn’t leave pgrep.

Try three: Console.app. (Console.app is macOS’s built-in system-log viewer — the GUI front-end for log show. You filter by subsystem in the search bar, and com.apple.xpc.launchd is launchd’s own subsystem identifier; XPC is the macOS inter-process-comms layer launchd is built on top of.) This was the click. I filtered for com.apple.xpc.launchd and ran disable again. The relevant lines:

launchd: Service exited: com.nickboy.claudedeck — Killed: 15
launchd: Service instance exited cleanly
launchd: Service spawned with PID 67421
launchd: Service exited: com.nickboy.claudedeck — Killed: 15

Read that twice. The job got SIGTERM. It exited. Then launchd spawned a new instance (with a new PID, 67421) and then sent that one SIGTERM too. One respawn and two SIGTERMs inside a single bootout.
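
The same filter can be run from a terminal with log show, which is convenient when the disable script and the log query should run back to back. A minimal sketch, assuming the subsystem name above; the helper names are hypothetical, and narrowing by job label in the message text is my assumption about what is useful rather than something the original workflow did:

```shell
#!/usr/bin/env bash
# Build a `log` predicate that narrows launchd's subsystem to one job label.
# launchd_log_predicate and show_launchd_log are hypothetical helper names.
launchd_log_predicate() {
  local label="$1"
  printf 'subsystem == "com.apple.xpc.launchd" AND eventMessage CONTAINS "%s"' "$label"
}

# Show the last N minutes of launchd activity for a label (macOS only).
show_launchd_log() {
  local label="$1" minutes="${2:-5}"
  log show --last "${minutes}m" --style compact \
    --predicate "$(launchd_log_predicate "$label")"
}

# Usage, right after running the disable script:
#   show_launchd_log com.nickboy.claudedeck 2
```

Keeping the predicate in its own function makes it reusable with log stream for watching a teardown live.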

Root cause

KeepAlive is a polling supervisor, not an event-driven one. (This is the most useful sentence in this post; the rest of it is just unpacking that sentence.) Concretely:

  • launchd keeps a per-job “should this be running?” predicate. With KeepAlive: true it’s always yes.
  • A separate path watches for SIGCHLD on the supervised process. (SIGCHLD is the Unix signal a process receives when one of its children exits — the kernel’s way of saying “you have a corpse to reap.” launchd’s supervisor handler responds by checking the predicate.) When a child exits, launchd evaluates the predicate. If the answer is “yes”, it spawns a replacement.
  • bootout flips the predicate to no, then sends SIGTERM, then reaps.

The race lives in the gap between the child exiting and bootout flipping the predicate. If SIGCHLD lands first — and on a multi-core machine under light load, it usually does — the supervisor sees “child died, predicate still true, restart” before bootout gets to flip the predicate. So you get a fresh PID milliseconds before the unload finishes. From the outside, launchctl list is empty (the job’s gone) but a daemon PID is still in the process table (because nothing told the replacement to die in a way the orphan reaper — the system process that adopts and eventually wait()s on parentless children — would notice quickly).
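
The ordering argument can be replayed with a toy model. This is only a sketch of the sequence described above, not launchd's implementation: flip stands for bootout setting the predicate to no, exit stands for the child exiting and the SIGCHLD path checking the predicate, and simulate is a hypothetical name.

```shell
#!/usr/bin/env bash
# Toy model of the race: replay events in order and count respawns.
#   flip = teardown sets the "should this be running?" predicate to no
#   exit = the supervised child exits and the SIGCHLD path checks the predicate
simulate() {
  local should_run=yes respawns=0 event
  for event in "$@"; do
    case "$event" in
      flip) should_run=no ;;
      exit)
        if [ "$should_run" = yes ]; then
          respawns=$((respawns + 1))   # predicate still true: replacement spawned
        fi
        ;;
    esac
  done
  echo "respawns=$respawns"
}

simulate flip exit   # teardown wins the race
simulate exit flip   # SIGCHLD wins the race
```

The first call prints respawns=0 and the second prints respawns=1: the same two events, in a different order, are the whole bug.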

In my case I was making it worse. The daemon I was disabling had a graceful-shutdown handler:

process.on("SIGTERM", () => void shutdown("SIGTERM"));

shutdown() flushes session state, closes WebSocket connections, awaits in-flight HTTP responses. That takes a few hundred milliseconds. During those few hundred milliseconds, KeepAlive’s poller is happily watching the original PID, which is still alive but draining. The moment that PID exits, the poller fires a respawn — which races bootout’s predicate flip and sometimes wins.

The cleaner the shutdown, the wider the race window.

The fix

The fix isn’t on the launchd side. There is no bootout --really-synchronous flag. There’s no KeepAlive: { NoRaces: true }. The fix is to accept that launchctl has told you the truth about the job and then independently verify the process is gone:

# 3. Stop the running daemon. `launchctl unload` is idempotent enough
#    that we ignore its exit code — if it was never loaded we still
#    want to proceed to renaming the plist.
launchctl unload "$PLIST" 2>/dev/null || true

# 4. Belt-and-suspenders: kill any straggler the daemon left behind
#    (e.g. KeepAlive raced our unload, or `bun build --compile` is
#    still running an older copy from a different path).
if pgrep -f claudedeck-daemon >/dev/null 2>&1; then
  pkill -9 -f claudedeck-daemon || true
  echo "claudedeck: killed leftover daemon process"
fi

That’s scripts/disable.sh lines 47–59. SIGKILL (-9, the signal the kernel doesn’t let a process catch or ignore — instant termination, no shutdown handler) is intentional, not lazy: the straggler is by definition the second instance, the one that didn’t get the graceful-shutdown SIGTERM. Sending it SIGTERM would just restart the cycle. The job is already unloaded, so SIGKILL on an orphaned child doesn’t trigger another respawn.
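
The script above kills the straggler but does not check that the kill landed. A possible follow-up, sketched under the assumption that the PIDs come from the same pgrep pattern; kill_and_confirm is a hypothetical helper and is not part of scripts/disable.sh:

```shell
#!/usr/bin/env bash
# kill_and_confirm PID...
# SIGKILL each PID, then poll until all of them have left the process table.
# Returns 0 if everything is gone within roughly 2 seconds, 1 otherwise.
kill_and_confirm() {
  local pid attempt=0 remaining
  for pid in "$@"; do
    kill -9 "$pid" 2>/dev/null || true   # "already gone" is fine
  done
  while [ "$attempt" -lt 20 ]; do
    remaining=0
    for pid in "$@"; do
      # kill -0 sends no signal; it only asks whether the PID still exists.
      if kill -0 "$pid" 2>/dev/null; then
        remaining=$((remaining + 1))
      fi
    done
    if [ "$remaining" -eq 0 ]; then
      return 0
    fi
    sleep 0.1
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage in a disable script (macOS), with the pattern from this post:
#   pids=$(pgrep -f claudedeck-daemon || true)
#   if [ -n "$pids" ] && ! kill_and_confirm $pids; then
#     echo "claudedeck: daemon still present after SIGKILL" >&2
#   fi
```

Working on explicit PIDs rather than re-running the pattern match means the confirmation step checks exactly the processes that were killed, not whatever happens to match a moment later.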

I also renamed the plist to <name>.plist.disabled after the kill — because if I left the plist in place and the kill triggered a SIGCHLD before launchd had fully removed the job from its tables, I’d be back to square one. Rename means “even if a respawn slips through, it’ll fail to find a plist on the next iteration.” Belt and suspenders and a third belt.

Why I didn’t see this immediately

Two things hid the bug for a while.

First, the install path uses launchctl bootstrap gui/$UID <plist> — see cli/src/install.ts line 366. The error message on failure is:

launchctl bootstrap exited with 5; continuing

Errno 5 (EIO — “I/O error,” the POSIX errno that launchd reuses as its catch-all for “I refuse”) from bootstrap usually means “service already loaded with this label.” I hit that constantly during development because the daemon was still loaded from the previous install. I assumed it was a stale bootstrap and re-ran. That worked. So the install side never made me look hard at the unload side.

Second, the doctor check (cli/src/doctor.ts line 120) only queries launchctl list — it doesn’t pgrep for the process. So during a respawn window, the doctor reported red (no entry in launchctl list), the install retry succeeded (because the entry showed up), and the orphan PID sat there hogging port 9127. Which presented as “the daemon’s running but not responding to my new install’s HTTP calls”. I blamed the port, twice, before I blamed the supervisor.
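
A doctor check that would have caught this compares the PID launchd reports for the label against every PID matching the daemon name. The sketch below is in shell rather than the TypeScript of the real doctor.ts, the function names are hypothetical, and it assumes the daemon does not fork helpers that would also match the pattern:

```shell
#!/usr/bin/env bash
# Sketch: find daemon processes that launchd does not own.
# Label and pattern come from this post; function names are hypothetical.
LABEL="com.nickboy.claudedeck"
PATTERN="claudedeck-daemon"

# PID launchd reports for LABEL; empty if the job is not loaded or not running.
# `launchctl list` prints three columns: PID, Status, Label ("-" when no PID).
supervised_pid() {
  launchctl list 2>/dev/null |
    awk -v label="$LABEL" '$3 == label && $1 != "-" { print $1 }'
}

# orphan_pids SUPERVISED_PID PID...
# Print every PID in the list that is not the supervised one.
orphan_pids() {
  local supervised="$1" pid
  shift
  for pid in "$@"; do
    if [ "$pid" != "$supervised" ]; then
      echo "$pid"
    fi
  done
}

# Usage (macOS):
#   orphan_pids "$(supervised_pid)" $(pgrep -f "$PATTERN")
```

Any output means a process is alive that the supervisor knows nothing about, which is exactly the state that held the port while the doctor looked only at the job.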

Lessons

  • KeepAlive is a polling supervisor, not an event-driven one. Restart races are inherent. Any teardown script that doesn’t independently verify the process is gone is one SIGCHLD-timing accident away from leaking a daemon.
  • launchctl list and pgrep are not redundant. The job and the process are different objects. A health check that asks only the supervisor whether a daemon exists will miss orphaned children.
  • Console.app filtered to com.apple.xpc.launchd is the actual debugger. man launchctl documents the verbs, not the lifecycle. The lifecycle shows up in the log — the SIGCHLD → respawn → SIGTERM sequence is what tells you the race exists.
  • Graceful shutdown widens the race window. Anything you do in your SIGTERM handler is time during which KeepAlive’s poller can decide your child has misbehaved and queue a replacement. Make the handler fast, or accept that your teardown script needs a SIGKILL backstop.

References

  • launchctl(1) and launchd.plist(5) man pages: verb syntax and KeepAlive sub-keys. The man pages are accurate but say nothing about lifecycle ordering; Console.app (filtered to com.apple.xpc.launchd) is where the actual cause shows up.
  • ClaudeDeck install path: cli/src/install.ts (the launchctl bootstrap call, with the silent “continuing” on non-zero exit)
  • ClaudeDeck plist generator: hooks/launchd.ts (RunAtLoad: true, KeepAlive: true)
  • ClaudeDeck doctor’s launchd check: cli/src/doctor.ts (the launchctl list | grep <label> probe that misses orphan processes)
  • ClaudeDeck disable script: scripts/disable.sh (the pkill -9 belt-and-suspenders)
  • Related post on this site: TCC pins your Accessibility grant to a cdhash. Every rebuild breaks it. — same daemon, different macOS-internals trap.
