(One-paragraph grounding if launchd isn’t your daily driver: launchd is macOS’s init system, the equivalent of systemd on Linux or Windows Services on Windows. It runs as PID 1, brings up daemons, and restarts them when they crash. A LaunchAgent is a per-user launchd job, defined by an XML plist (property list) at ~/Library/LaunchAgents/<name>.plist. KeepAlive is one of the plist keys; set it to true and launchd will respawn the job whenever it exits. launchctl is the CLI you use to load, unload, and inspect those jobs. The Linux mental model: think systemctl driving systemd unit files. The Stream Deck plugin and its daemon are described in the TCC cdhash trap post if you want the project context.)
I’m building a Stream Deck plugin called ClaudeDeck. Its daemon runs under a user LaunchAgent so it comes up at login and stays up — RunAtLoad: true, KeepAlive: true, the obvious shape. Writing the plist took ten minutes. Getting the daemon to actually go away when I wanted it to took a weekend.
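For readers who have not written one, here is a minimal sketch of that shape, printed from a shell function so it could sit in an install script. Only the label and the two keys come from this post; the daemon path and the `render_plist` helper are placeholders made up for illustration, and the real plist may carry more keys.

```shell
#!/bin/sh
# Sketch: print a minimal LaunchAgent plist with RunAtLoad and KeepAlive set.
# The program path passed in is a hypothetical placeholder, not ClaudeDeck's real path.
render_plist() {
    label="$1"      # job label, e.g. com.nickboy.claudedeck
    program="$2"    # absolute path to the daemon binary (assumed)
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>${label}</string>
    <key>ProgramArguments</key>
    <array>
        <string>${program}</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
EOF
}

# Usage (the binary path is illustrative):
#   render_plist com.nickboy.claudedeck /usr/local/bin/claudedeck-daemon \
#       > ~/Library/LaunchAgents/com.nickboy.claudedeck.plist
```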
The story below is one specific bug — the daemon refusing to die. I’ll skip the bootstrap-domain confusion I hit earlier (the gui/$UID vs system thing — launchd has multiple “domains” you can load a job into, one per logged-in GUI user and one system-wide, and pointing launchctl at the wrong one is the most common first-time confusion; well-covered elsewhere) and stick to the part that surprised me.
The symptom
I had a disable.sh script for pausing the daemon when I wanted to debug unrelated AppleEvent issues. The first version was three lines:
launchctl unload "$PLIST"
mv "$PLIST" "$PLIST.disabled"
echo "claudedeck: disabled."
Worked on paper. Failed in practice. The verification step I added later, a sanity pgrep, caught it:
claudedeck: launchctl unload...
claudedeck: plist parked at ...com.nickboy.claudedeck.plist.disabled
claudedeck: verification:
✅ launchctl no longer lists com.nickboy.claudedeck
⚠️ claudedeck-daemon still running — try: pkill -9 -f claudedeck-daemon
Two things were true at once:
- launchctl list | grep claudedeck was empty. The job was unloaded.
- pgrep -f claudedeck-daemon printed a PID. The process was alive.
If you’ve never seen this combination, it looks like a lie. launchd is the only thing that should have a handle on this daemon, and launchd just told me it doesn’t.
What I thought was happening
My mental model was: launchctl unload is synchronous. By the time the command returns, the job is torn down — supervisor gone, child SIGTERM’d (sent the “please shut down gracefully” signal — number 15, the default for kill), exit reaped (the parent called wait() on the child’s exit status so the kernel can release its PID slot). That’s how systemctl stop behaves on Linux. That’s how I assumed launchd behaves too.
Wrong, but in a subtle way. launchctl unload is synchronous about the unload itself. What it’s not synchronous about is whatever the supervisor did in the milliseconds before you called it.
What I tried first
Standard debugging moves. None of them were the answer, but the dead ends are part of the shape of the bug.
Try one: maybe KeepAlive: true is too aggressive. I swapped it for KeepAlive: { SuccessfulExit: false }, the variant that only restarts on crash. No change. The straggler PID still appeared.
Try two: maybe launchctl unload is the legacy verb and bootout is the synchronous one. Apple deprecated load/unload years ago in favour of bootstrap/bootout (the newer verbs take an explicit domain — gui/$(id -u) for “this logged-in user’s GUI session,” system for the LaunchDaemon scope). I rewrote the script:
launchctl bootout "gui/$(id -u)/com.nickboy.claudedeck"
Same result. The job left launchctl list. The process didn’t leave pgrep.
Try three: Console.app. (Console.app is macOS’s built-in system-log viewer — the GUI front-end for log show. You filter by subsystem in the search bar, and com.apple.xpc.launchd is launchd’s own subsystem identifier; XPC is the macOS inter-process-comms layer launchd is built on top of.) This was the click. I filtered for com.apple.xpc.launchd and ran disable again. The relevant lines:
launchd: Service exited: com.nickboy.claudedeck — Killed: 15
launchd: Service instance exited cleanly
launchd: Service spawned with PID 67421
launchd: Service exited: com.nickboy.claudedeck — Killed: 15
Read that twice. The job got SIGTERM. It exited. Then launchd spawned a new instance, with a new PID, 67421, and then sent that one SIGTERM too. One respawn and two SIGTERMs inside a single bootout.
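The same filter can be run from a terminal with the `log` CLI instead of the Console.app search bar. The sketch below assumes the subsystem identifier used in this post; the extra filter on the job label and both helper names are additions for illustration. Depending on the macOS version, the messages may be worded differently or logged at a level that needs `--debug`.

```shell
#!/bin/sh
# Sketch: build the predicate for launchd's log lines about one job.
# The subsystem name is the one used in this post; the label filter is an addition.
launchd_predicate() {
    label="$1"
    printf 'subsystem == "com.apple.xpc.launchd" AND eventMessage CONTAINS "%s"' "$label"
}

# Show the last few minutes of matching entries (macOS only).
# /usr/bin/log is spelled out because some shells ship their own `log` builtin.
show_launchd_log() {
    label="$1"
    window="${2:-5m}"
    /usr/bin/log show --last "$window" --info --predicate "$(launchd_predicate "$label")"
}

# Usage, right after running the disable script:
#   show_launchd_log com.nickboy.claudedeck 2m
```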
Root cause
KeepAlive is a polling supervisor, not an event-driven one. (This is the most useful sentence in this post; the rest of it is just unpacking that sentence.) Concretely:
- launchd keeps a per-job “should this be running?” predicate. With KeepAlive: true it’s always yes.
- A separate path watches for SIGCHLD on the supervised process. (SIGCHLD is the Unix signal a process receives when one of its children exits: the kernel’s way of saying “you have a corpse to reap.” launchd’s supervisor handler responds by checking the predicate.) When a child exits, launchd evaluates the predicate. If the answer is “yes”, it spawns a replacement.
- bootout flips the predicate to no, then sends SIGTERM, then reaps.
The race lives in the gap between the child exiting and bootout flipping the predicate. If SIGCHLD lands first — and on a multi-core machine under light load, it usually does — the supervisor sees “child died, predicate still true, restart” before bootout gets to flip the predicate. So you get a fresh PID milliseconds before the unload finishes. From the outside, launchctl list is empty (the job’s gone) but a daemon PID is still in the process table (because nothing told the replacement to die in a way the orphan reaper — the system process that adopts and eventually wait()s on parentless children — would notice quickly).
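The ordering argument can be replayed with a toy model. This is only a sketch of the sequence described above, not launchd's implementation: events are processed in the order given, `should_run` stands in for the per-job predicate, and the event names `exit` and `flip` are invented for the example.

```shell
#!/bin/sh
# Toy model of the race described above; NOT how launchd is implemented.
# Events: "exit" = the supervised child's exit is handled,
#         "flip" = bootout sets the predicate to no.
replay() {
    should_run=yes   # KeepAlive: true
    running=1        # one instance alive at the start
    spawned=0        # replacements started during teardown
    for event in "$@"; do
        case "$event" in
            flip) should_run=no ;;
            exit)
                running=$((running - 1))
                if [ "$should_run" = yes ]; then
                    # Predicate still says yes, so a replacement is started.
                    running=$((running + 1))
                    spawned=$((spawned + 1))
                fi
                ;;
        esac
    done
    echo "running=$running spawned=$spawned"
}

replay flip exit   # predicate flipped before the child exit is handled
replay exit flip   # child exit handled first
```

The first call prints `running=0 spawned=0`; the second prints `running=1 spawned=1`, which is the leftover PID from the symptom section.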
In my case I was making it worse. The daemon I was disabling had a graceful-shutdown handler:
process.on("SIGTERM", () => void shutdown("SIGTERM"));
shutdown() flushes session state, closes WebSocket connections, awaits in-flight HTTP responses. That takes a few hundred milliseconds. During those few hundred milliseconds, KeepAlive’s poller is happily watching the original PID, which is still alive but draining. The moment that PID exits, the poller fires a respawn, which races bootout’s predicate flip and sometimes wins.
The cleaner the shutdown, the wider the race window.
The fix
The fix isn’t on the launchd side. There is no bootout --really-synchronous flag. There’s no KeepAlive: { NoRaces: true }. The fix is to accept that launchctl has told you the truth about the job and then independently verify the process is gone:
# 3. Stop the running daemon. `launchctl unload` is idempotent enough
#    that we ignore its exit code — if it was never loaded we still
#    want to proceed to renaming the plist.
launchctl unload "$PLIST" 2>/dev/null || true
# 4. Belt-and-suspenders: kill any straggler the daemon left behind
#    (e.g. KeepAlive raced our unload, or `bun build --compile` is
#    still running an older copy from a different path).
if pgrep -f claudedeck-daemon >/dev/null 2>&1; then
  pkill -9 -f claudedeck-daemon || true
  echo "claudedeck: killed leftover daemon process"
fi
That’s scripts/disable.sh lines 47–59. SIGKILL (-9, the signal the kernel doesn’t let a process catch or ignore; instant termination, no shutdown handler) is intentional, not lazy: the straggler is by definition the second instance, the one that didn’t get the graceful-shutdown SIGTERM. Sending it SIGTERM would just restart the cycle. The job is already unloaded, so SIGKILL on an orphaned child doesn’t trigger another respawn.
I also renamed the plist to <name>.plist.disabled after the kill — because if I left the plist in place and the kill triggered a SIGCHLD before launchd had fully removed the job from its tables, I’d be back to square one. Rename means “even if a respawn slips through, it’ll fail to find a plist on the next iteration.” Belt and suspenders and a third belt.
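One way to make the "independently verify" step explicit is to poll after the kill instead of assuming it took effect. The sketch below is not what scripts/disable.sh does; `wait_until_fails`, the attempt count, and the delay are hypothetical choices for illustration.

```shell
#!/bin/sh
# Sketch: poll a probe command until it fails, i.e. until the thing it looks for is gone.
# Returns 0 once the probe fails, 1 if it still succeeds after all attempts.
wait_until_fails() {
    attempts="$1"   # how many times to probe
    delay="$2"      # seconds to sleep between probes
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if ! "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    return 1
}

# Usage in a teardown script (macOS; process name taken from this post):
#   pkill -9 -f claudedeck-daemon || true
#   if wait_until_fails 20 0.1 pgrep -f claudedeck-daemon; then
#       echo "claudedeck: no daemon process left"
#   else
#       echo "claudedeck: daemon still running after SIGKILL" >&2
#   fi
```

Passing the probe as arguments keeps the loop independent of pgrep, so the same helper could wrap a launchctl query as well.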
Why I didn’t see this immediately
Two things hid the bug for a while.
First, the install path uses launchctl bootstrap gui/$UID <plist> — see cli/src/install.ts line 366. The error message on failure is:
launchctl bootstrap exited with 5; continuing
Errno 5 (EIO, “I/O error,” the POSIX errno that launchd reuses as its catch-all for “I refuse”) from bootstrap usually means “service already loaded with this label.” I hit that constantly during development because the daemon was still loaded from the previous install. I assumed it was a stale bootstrap and re-ran. That worked. So the install side never made me look hard at the unload side.
Second, the doctor check (cli/src/doctor.ts line 120) only queries launchctl list — it doesn’t pgrep for the process. So during a respawn window, the doctor reported red (no entry in launchctl list), the install retry succeeded (because the entry showed up), and the orphan PID sat there hogging port 9127. Which presented as “the daemon’s running but not responding to my new install’s HTTP calls”. I blamed the port, twice, before I blamed the supervisor.
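A health check that asks both questions could look roughly like this. It is a shell sketch of the idea, not the code in cli/src/doctor.ts; the function names and state labels are invented for the example.

```shell
#!/bin/sh
# Sketch: classify daemon health from two independent probes.
# Arguments are "yes"/"no": is the job known to launchd, and is a process alive?
classify_daemon() {
    job_loaded="$1"
    process_alive="$2"
    case "$job_loaded/$process_alive" in
        yes/yes) echo "ok" ;;
        yes/no)  echo "loaded-not-running" ;;
        no/yes)  echo "orphan-process" ;;
        no/no)   echo "stopped" ;;
        *)       echo "unknown" ;;
    esac
}

# Probes (macOS only). They use the same commands as the rest of this post.
probe_job()     { launchctl list | grep -q "$1" && echo yes || echo no; }
probe_process() { pgrep -f "$1" >/dev/null 2>&1 && echo yes || echo no; }

# Usage:
#   classify_daemon "$(probe_job com.nickboy.claudedeck)" \
#                   "$(probe_process claudedeck-daemon)"
```

The `orphan-process` state is the one a list-only check cannot report. A stricter version would also compare the PID column of `launchctl list` with the PID that pgrep prints, since `ok` here does not prove the live process belongs to the loaded job.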
Lessons
- KeepAlive is a polling supervisor, not an event-driven one. Restart races are inherent. Any teardown script that doesn’t independently verify the process is gone is one SIGCHLD-timing accident away from leaking a daemon.
- launchctl list and pgrep are not redundant. The job and the process are different objects. A health check that asks only the supervisor whether a daemon exists will miss orphaned children.
- Console.app filtered to com.apple.xpc.launchd is the actual debugger. man launchctl documents the verbs, not the lifecycle. The lifecycle shows up in the log: the SIGCHLD → respawn → SIGTERM sequence is what tells you the race exists.
- Graceful shutdown widens the race window. Anything you do in your SIGTERM handler is time during which KeepAlive’s poller can decide your child has misbehaved and queue a replacement. Make the handler fast, or accept that your teardown script needs a SIGKILL backstop.
References
- launchctl(1) and launchd.plist(5) man pages: verb syntax and KeepAlive sub-keys. The man pages are accurate but say nothing about lifecycle ordering; Console.app (filtered to com.apple.xpc.launchd) is where the actual cause shows up.
- ClaudeDeck install path: cli/src/install.ts (the launchctl bootstrap call, with the silent “continuing” on non-zero exit)
- ClaudeDeck plist generator: hooks/launchd.ts (RunAtLoad: true, KeepAlive: true)
- ClaudeDeck doctor’s launchd check: cli/src/doctor.ts (the launchctl list | grep <label> probe that misses orphan processes)
- ClaudeDeck disable script: scripts/disable.sh (the pkill -9 belt-and-suspenders)
- Related post on this site: TCC pins your Accessibility grant to a cdhash. Every rebuild breaks it. (Same daemon, different macOS-internals trap.)