
Event-driven DAG daemon mode (converge serve) #5

Merged
TsekNet merged 58 commits into main from dag-daemon on Mar 16, 2026

@TsekNet TsekNet commented Mar 16, 2026

Summary

  • Replace run-once execution with event-driven DAG-based daemon
  • converge serve <blueprint> runs as a persistent service
  • Resources execute in topological layer order (DAG, not flat list)
  • Auto-edges detect implicit dependencies (Service->Package, File->parent Dir)
  • OS-level watchers detect drift instantly (inotify for File on Linux)
  • Exponential retry/backoff with noncompliance reporting after N failures
  • Event coalescing + per-resource rate limiting prevent thundering herd
  • Centralized exit codes in internal/exit/
  • --once flag for CI/Packer (converge once, exit)

New packages

| Package | Purpose |
| --- | --- |
| internal/graph/ | DAG with topological layer computation (heimdalr/dag) |
| internal/graph/autoedge/ | Implicit dependency detection |
| internal/daemon/ | Event-driven convergence loop with retry/backoff |
| internal/exit/ | Centralized exit codes |
| internal/daemon/coalesce.go | Event coalescing + rate limiting |
| extensions/file/watch_linux.go | inotify watcher (reference impl) |

New dependencies

  • github.com/heimdalr/dag (thread-safe DAG)
  • golang.org/x/time (rate limiter)

Test plan

  • go test ./... passes (34 test files, including 11 new graph tests, 7 daemon tests, 7 auto-edge tests, 3 coalescer tests, 3 inotify watcher tests)
  • go vet ./... clean
  • go build ./... compiles
  • Manual: converge serve workstation starts daemon, touch managed file, observe re-convergence
  • Manual: converge serve workstation --once exits after initial convergence

Ref #4

TsekNet added 11 commits March 15, 2026 22:06
Replace flat resource slice with a directed acyclic graph (heimdalr/dag).
Resources execute in topological layer order: dependencies complete before
dependents start, resources within the same layer run concurrently.

Add DependsOn field to all Opts structs for explicit dependency declaration.
Add RunPlanDAG/RunApplyDAG engine functions that iterate by topological layer.
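
As a rough illustration of layer-by-layer execution, the sketch below runs every resource in a layer concurrently and waits for the layer to finish before starting the next. It is not the RunApplyDAG implementation: `applyLayers`, `completionOrder`, and the resource IDs are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// applyLayers runs every resource in a layer concurrently and waits for the
// whole layer to finish before starting the next one. (Hypothetical helper.)
func applyLayers(layers [][]string, apply func(id string)) {
	for _, layer := range layers {
		var wg sync.WaitGroup
		for _, id := range layer {
			wg.Add(1)
			go func(id string) {
				defer wg.Done()
				apply(id)
			}(id)
		}
		wg.Wait() // dependencies complete before dependents start
	}
}

// completionOrder records the order in which resources finish applying.
func completionOrder(layers [][]string) []string {
	var (
		mu   sync.Mutex
		done []string
	)
	applyLayers(layers, func(id string) {
		mu.Lock()
		defer mu.Unlock()
		done = append(done, id)
	})
	return done
}

func main() {
	// Illustrative resource IDs, not taken from a real blueprint.
	layers := [][]string{
		{"package:nginx"},
		{"file:/etc/nginx/nginx.conf", "file:/var/www"},
		{"service:nginx"},
	}
	order := completionOrder(layers)
	fmt.Println("first:", order[0], "last:", order[len(order)-1])
}
```

Resources inside the middle layer may finish in either order; only the ordering between layers is guaranteed.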

Ref #4
Add Watcher and Poller optional interfaces to extensions.Extension.
Resources implementing Watcher get OS-level event notifications,
others fall back to polling at configurable intervals.

New daemon package (internal/daemon) runs initial convergence then
starts per-resource Watch/poll goroutines with a central event loop.
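
A minimal sketch of how optional interfaces like these can be detected with type assertions. The method signatures, the `Event` shape, `monitorMode`, and the 30s default are assumptions made for illustration; the real `extensions` package may differ.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Event, Extension, Watcher, and Poller are hypothetical shapes for this sketch.
type Event struct{ ResourceID string }

type Extension interface{ ID() string }

type Watcher interface {
	Watch(ctx context.Context, events chan<- Event) error
}

type Poller interface {
	PollInterval() time.Duration
}

// defaultPoll is an assumed fallback interval.
const defaultPoll = 30 * time.Second

// monitorMode prefers OS-level watching, then a resource-specific poll
// interval, then the default poll interval.
func monitorMode(ext Extension) (string, time.Duration) {
	if _, ok := ext.(Watcher); ok {
		return "watch", 0
	}
	if p, ok := ext.(Poller); ok {
		return "poll", p.PollInterval()
	}
	return "poll", defaultPoll
}

type fileExt struct{}

func (fileExt) ID() string { return "file:/etc/motd" }
func (fileExt) Watch(ctx context.Context, events chan<- Event) error {
	<-ctx.Done() // a real watcher would forward OS events here
	return ctx.Err()
}

type packageExt struct{}

func (packageExt) ID() string                  { return "package:git" }
func (packageExt) PollInterval() time.Duration { return 5 * time.Minute }

type execExt struct{}

func (execExt) ID() string { return "exec:hostname" }

func main() {
	for _, ext := range []Extension{fileExt{}, packageExt{}, execExt{}} {
		mode, every := monitorMode(ext)
		fmt.Println(ext.ID(), mode, every)
	}
}
```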

New CLI: converge watch <blueprint> [--once]

Ref #4
Automatically detect dependency relationships between resources:
- service:X depends on package:X (name match)
- file:/a/b/c depends on file:/a/b (parent directory)
- service:X depends on file paths containing the service name

Edges that would create cycles are silently skipped.
Wire auto-edges into BuildGraph, RunPlan, and RunApply paths.
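
The first two rules above can be sketched as follows. This is an illustration only: `Edge`, `autoEdges`, and the `kind:name` ID format are assumptions, and the sketch omits the third rule and the cycle check.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Edge means Dependent must wait for Dependency. (Hypothetical type.)
type Edge struct{ Dependency, Dependent string }

// autoEdges derives implicit edges from resource IDs of the form "kind:name".
func autoEdges(ids []string) []Edge {
	known := make(map[string]bool, len(ids))
	for _, id := range ids {
		known[id] = true
	}
	var edges []Edge
	for _, id := range ids {
		switch {
		case strings.HasPrefix(id, "service:"):
			// service:X depends on package:X when both are declared.
			pkg := "package:" + strings.TrimPrefix(id, "service:")
			if known[pkg] {
				edges = append(edges, Edge{pkg, id})
			}
		case strings.HasPrefix(id, "file:"):
			// file:/a/b/c depends on file:/a/b when the parent is declared.
			parent := "file:" + path.Dir(strings.TrimPrefix(id, "file:"))
			if parent != id && known[parent] {
				edges = append(edges, Edge{parent, id})
			}
		}
	}
	return edges
}

func main() {
	ids := []string{
		"package:nginx",
		"service:nginx",
		"file:/etc/nginx",
		"file:/etc/nginx/nginx.conf",
	}
	for _, e := range autoEdges(ids) {
		fmt.Printf("%s -> %s\n", e.Dependency, e.Dependent)
	}
}
```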

Ref #4
Resources that fail to converge are retried with exponential backoff
(baseDelay * 2^retryCount, capped at 5 minutes). After --max-retries
(default 3), the resource is marked noncompliant and logged as a
warning. Watching continues: new external events reset the retry count.
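
The delay formula above, baseDelay * 2^retryCount capped at 5 minutes, could be computed as in this sketch. The 1s base delay and the function name `backoff` are assumptions; the commit does not state the base value.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = time.Second     // assumed base; not stated in the commit
	maxDelay  = 5 * time.Minute // cap taken from the commit message
)

// backoff returns base * 2^retryCount, capped at maxDelay. Doubling in a loop
// and returning early at the cap avoids overflowing time.Duration.
func backoff(base time.Duration, retryCount int) time.Duration {
	d := base
	for i := 0; i < retryCount; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	for retry := 0; retry <= 9; retry++ {
		fmt.Printf("retry %d: wait %s\n", retry, backoff(baseDelay, retry))
	}
}
```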

Add Daemon.Status() for querying per-resource compliance state.
Add --max-retries flag to converge watch command.

Ref #4
Coalescer collapses multiple rapid events for the same resource into a
single CheckApply after a configurable window (default 500ms).
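
One possible shape for such a coalescer, using only the standard library: the first event for a resource opens a window, and further events for the same resource are dropped until the window closes and the callback fires once. `Coalescer`, `Notify`, `countFlushes`, and the 20ms demo window are hypothetical, and the real coalescer may use different semantics (for example, resetting the window on each event).

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// demoWindow is a short window for the demo; the commit's default is 500ms.
const demoWindow = 20 * time.Millisecond

// Coalescer collapses bursts of events per resource into one flush call.
type Coalescer struct {
	mu      sync.Mutex
	window  time.Duration
	pending map[string]bool
	flush   func(id string)
}

func NewCoalescer(window time.Duration, flush func(id string)) *Coalescer {
	return &Coalescer{window: window, pending: make(map[string]bool), flush: flush}
}

// Notify schedules a flush for id unless one is already pending.
func (c *Coalescer) Notify(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.pending[id] {
		return // already scheduled inside the current window
	}
	c.pending[id] = true
	time.AfterFunc(c.window, func() {
		c.mu.Lock()
		delete(c.pending, id)
		c.mu.Unlock()
		c.flush(id)
	})
}

// countFlushes sends a burst of events for one resource and reports how many
// flushes resulted after the window has had time to close.
func countFlushes(events int, window time.Duration) int {
	var flushed atomic.Int32
	c := NewCoalescer(window, func(string) { flushed.Add(1) })
	for i := 0; i < events; i++ {
		c.Notify("file:/etc/motd")
	}
	time.Sleep(3 * window)
	return int(flushed.Load())
}

func main() {
	fmt.Println("flushes:", countFlushes(5, demoWindow))
}
```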

Per-resource rate limiter (golang.org/x/time/rate) prevents flapping
resources from consuming excessive CPU.

Ref #4
converge serve is cleaner and more descriptive for a persistent
daemon mode than converge watch.

Ref #4
Implement extensions.Watcher for the File resource on Linux using
inotify via golang.org/x/sys/unix. Watches both the file and its
parent directory to detect creation, modification, deletion, and
attribute changes. Uses epoll for interruptible reads.

This is the reference implementation for all platform-specific
watchers. Other resources fall back to polling until their native
watchers are implemented.

Ref #4
Move all magic exit code numbers to a single package.
Follows Puppet/Chef convention: 0=ok, 1=error, 2=changed,
3=partial fail, 4=all failed, 5=pending.
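
A centralized exit-code package along these lines might look like the sketch below. The numeric values come from the commit message; the constant names and the `fromCounts` mapping are assumptions for illustration.

```go
package main

import "fmt"

// Exit codes as listed in the commit message. Names are hypothetical.
const (
	OK          = 0
	Error       = 1
	Changed     = 2
	PartialFail = 3
	AllFailed   = 4
	Pending     = 5
)

// fromCounts maps a run summary to an exit code. This mapping is an assumed
// policy for the sketch, not the project's actual logic.
func fromCounts(total, changed, failed int) int {
	switch {
	case total > 0 && failed == total:
		return AllFailed
	case failed > 0:
		return PartialFail
	case changed > 0:
		return Changed
	default:
		return OK
	}
}

func main() {
	fmt.Println(fromCounts(3, 0, 0), fromCounts(3, 2, 0), fromCounts(3, 1, 1), fromCounts(3, 0, 3))
}
```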

Ref #4
Update all documentation to reflect the new event-driven DAG daemon:
- README: converge serve replaces converge apply as primary command
- design.md: DAG engine, auto-edges, daemon mode, and retry/backoff
  mermaid diagrams for DAG layers, daemon lifecycle, and plan flow
- cli.md: converge serve command, --once, --max-retries flags,
  centralized exit code reference

Ref #4
Native OS event watchers:
- File: kqueue (macOS), ReadDirectoryChangesW (Windows)
- Service: godbus/dbus PropertiesChanged (Linux), NotifyServiceStatusChange (Windows), poll (macOS)
- Registry: RegNotifyChangeKeyValue (Windows)
- Sysctl: inotify on /proc/sys/ (Linux)
- Plist: kqueue on plist file (macOS)

Poll-only watchers (no native OS events):
- Package: 5m, Exec: 30s, User: 60s, Firewall: 30s
- AuditPolicy: 60s, SecurityPolicy: 60s

Remove dead code:
- converge apply CLI command (replaced by converge serve)
- RunApply, CheckDuplicates flat-list engine functions
- isRoot from dsl/app.go

Update goreleaser to produce .deb packages via nfpms.
Rewrite engine tests to use DAG functions.
Add BuildGraph auto-edge test.

Ref #4
@TsekNet TsekNet changed the base branch from dev to main March 16, 2026 02:38
TsekNet added 10 commits March 15, 2026 22:41
CRITICAL:
- Remove stale converge apply from cli.md, examples.md, terminal output
- Fix handleFailure goroutine leak (add ctx.Done select)

HIGH:
- Restore root privilege check in serve command
- Make event loop concurrent with per-resource processing lock
- Wire event reason constants (reasonPoll, reasonRetry)
- Fix inotify watch re-establishment after IN_DELETE_SELF
- Fix kqueue watch re-establishment after NOTE_DELETE/RENAME
- Fix sysctl path traversal (validate key, filepath.Clean)
- Add bounds checking to inotify/ReadDirectoryChangesW unsafe parsing
- Replace Windows service SCM watcher with polling (APC incompatible with Go)
- Fix D-Bus AddMatch error check, filter PropertiesChanged by ActiveState
- Fix autoedge false positives for short service names (min 3 chars, path component match)
- Update Security Model table, README features table, DependsOn docs

MEDIUM:
- Fix ReadDirectoryChangesW overlapped re-issue
- Add parent dir fallback to plist watcher
- Skip polling for noncompliant resources
- Only recover string panics in runBlueprint
- Propagate initial convergence error from Run
- Validate MaxRetries > 0
- Add Graph.Flatten() to deduplicate layer flattening
- Add Watcher/Poller docs to extending.md
- Fix "No implicit behavior" to "No implicit mutations"

LOW:
- Remove unused AllExtensions()
- Remove unused serviceNotify constants (Windows service watcher removed)
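
One way the sysctl path traversal fix (validate key, filepath.Clean) could look is sketched below. The function name `sysctlPath`, the allowed character set, and the rejection of empty segments are assumptions; the sketch targets Linux paths and is stricter than real sysctl naming, which can include dots inside interface names.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

const sysctlRoot = "/proc/sys"

// segment is an assumed whitelist for one dot-separated part of a key.
var segment = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

// sysctlPath converts a key such as "net.ipv4.ip_forward" into a path under
// /proc/sys, rejecting anything that could escape that directory.
func sysctlPath(key string) (string, error) {
	parts := strings.Split(key, ".")
	for _, p := range parts {
		if !segment.MatchString(p) {
			return "", fmt.Errorf("invalid sysctl key %q", key)
		}
	}
	full := filepath.Clean(filepath.Join(append([]string{sysctlRoot}, parts...)...))
	// Defense in depth: the cleaned path must still live under the root.
	if !strings.HasPrefix(full, sysctlRoot+"/") {
		return "", fmt.Errorf("sysctl key %q escapes %s", key, sysctlRoot)
	}
	return full, nil
}

func main() {
	for _, key := range []string{"net.ipv4.ip_forward", "../etc/passwd", ".."} {
		p, err := sysctlPath(key)
		fmt.Printf("%q -> %q, err=%v\n", key, p, err)
	}
}
```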

Ref #4
Linux: systemd unit (converge.service)
macOS: launchd plist (com.tseknet.converge.plist)
Windows: MSI ServiceInstall/ServiceControl in WiX

.deb postinst enables and starts the systemd service.
.pkg postinstall bootstraps the launchd daemon.
MSI registers and starts the Windows service via SCM.

Packages handle upgrades: stop old service, install new binary, restart.

Ref #4
"baseline" is the standard term in CIS/STIG and config management
for the minimum configuration every managed host must have.

Rename across: blueprint function, registration, all docs, service
files, VHS demo, and CLI examples.

Ref #4
1. Wire coalescer + rate limiter into daemon event loop
   Coalescer deduplicates burst events per resource (500ms window).
   Per-resource rate limiter (x/time/rate) throttles watch/poll events.
   Retry events bypass both (they have their own backoff).

2. Extract retry state machine into internal/daemon/retry.go
   retryManager owns all per-resource state, shouldProcess, reset,
   recordFailure, isNoncompliant. Daemon is now the event loop
   coordinator, not a god object.

3. Replace panics with error accumulation in DSL
   Run.err records the first error. Blueprint functions no longer
   panic on duplicate resources, missing dependencies, or empty fields.
   BuildGraph checks run.Err() after execution. Stack traces preserved
   for genuine runtime panics.

4. Typed events (extensions.EventKind) replace string routing
   EventWatch, EventPoll, EventRetry are compile-time-checked constants.
   Event.Reason -> Event.Kind (typed) + Event.Detail (human-readable).
   No more string comparisons for event routing in the daemon.

5. Auto-edge serviceToConfigFile path component matching (from review)
   Already applied in previous commit.

6. CoalesceWindow configurable via Options for testing
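
Items 1 and 4 can be illustrated together: with a typed event kind, the daemon can decide whether an event goes through the coalescer and rate limiter without comparing strings. The constant names come from the commit message; the underlying type, the `String` values, the `ResourceID` field, and `needsThrottle` are assumptions.

```go
package main

import "fmt"

// EventKind is a typed event discriminator (underlying type assumed).
type EventKind int

const (
	EventWatch EventKind = iota
	EventPoll
	EventRetry
)

func (k EventKind) String() string {
	switch k {
	case EventWatch:
		return "watch"
	case EventPoll:
		return "poll"
	case EventRetry:
		return "retry"
	default:
		return "unknown"
	}
}

// Event carries a typed Kind plus a human-readable Detail.
type Event struct {
	ResourceID string // hypothetical field for this sketch
	Kind       EventKind
	Detail     string
}

// needsThrottle reports whether an event should pass through the coalescer
// and rate limiter. Retry events bypass both because they carry their own
// backoff.
func needsThrottle(e Event) bool {
	return e.Kind != EventRetry
}

func main() {
	events := []Event{
		{ResourceID: "file:/etc/motd", Kind: EventWatch, Detail: "modified"},
		{ResourceID: "package:git", Kind: EventPoll, Detail: "interval elapsed"},
		{ResourceID: "service:ssh", Kind: EventRetry, Detail: "attempt 2"},
	}
	for _, e := range events {
		fmt.Printf("%s kind=%s throttle=%t\n", e.ResourceID, e.Kind, needsThrottle(e))
	}
}
```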

Ref #4
Replace O(V^2) GetParents-per-node query with incremental in-degree
tracking during AddEdge. TopologicalLayers now runs in O(V+E) using
pre-computed adjacency lists.

Benchmarks at 2000 nodes:
- Linear chain (worst case): 0.48ms, 364KB, 4021 allocs
- Wide (10 layers x 200):    0.36ms, 261KB, 103 allocs
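
The in-degree approach described above is essentially Kahn's algorithm grouped by layer. The sketch below tracks in-degrees incrementally in `AddEdge` and computes layers in O(V+E); the type and method bodies are an illustration under assumed names, not the project's code.

```go
package main

import (
	"errors"
	"fmt"
)

// Graph keeps insertion order, adjacency lists, and incremental in-degrees.
type Graph struct {
	order    []string
	children map[string][]string
	inDegree map[string]int
}

func NewGraph() *Graph {
	return &Graph{children: map[string][]string{}, inDegree: map[string]int{}}
}

func (g *Graph) AddNode(id string) {
	if _, ok := g.inDegree[id]; ok {
		return
	}
	g.inDegree[id] = 0
	g.order = append(g.order, id)
}

// AddEdge records that from must complete before to.
func (g *Graph) AddEdge(from, to string) {
	g.AddNode(from)
	g.AddNode(to)
	g.children[from] = append(g.children[from], to)
	g.inDegree[to]++
}

// TopologicalLayers visits each node and edge once.
func (g *Graph) TopologicalLayers() ([][]string, error) {
	remaining := make(map[string]int, len(g.inDegree))
	var current []string
	for _, id := range g.order {
		remaining[id] = g.inDegree[id]
		if remaining[id] == 0 {
			current = append(current, id)
		}
	}
	var layers [][]string
	placed := 0
	for len(current) > 0 {
		layers = append(layers, current)
		placed += len(current)
		var next []string
		for _, id := range current {
			for _, child := range g.children[id] {
				remaining[child]--
				if remaining[child] == 0 {
					next = append(next, child)
				}
			}
		}
		current = next
	}
	if placed != len(g.order) {
		return nil, errors.New("graph contains a cycle")
	}
	return layers, nil
}

func main() {
	g := NewGraph()
	g.AddEdge("package:nginx", "service:nginx")
	g.AddEdge("file:/etc/nginx", "file:/etc/nginx/nginx.conf")
	g.AddEdge("file:/etc/nginx/nginx.conf", "service:nginx")
	layers, err := g.TopologicalLayers()
	fmt.Println(layers, err)
}
```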

Ref #4
Replace the heimdalr/dag wrapper with a self-contained DAG using:
- Incremental in-degree + adjacency lists (O(V+E) topological sort)
- DFS cycle detection on AddEdge via transitive reachability check
- Insertion-order tracking for deterministic iteration

Removes 3 transitive dependencies (heimdalr/dag, emirpasic/gods,
google/uuid). Same benchmark performance, simpler code, no wrapper.
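
Cycle detection on AddEdge via a reachability check can be sketched as follows: before adding from -> to, search from `to` and reject the edge if `from` is reachable. The `DAG` type and the iterative DFS are illustrative assumptions, not the project's implementation.

```go
package main

import "fmt"

// DAG stores adjacency lists only; enough to demonstrate the cycle check.
type DAG struct{ children map[string][]string }

func NewDAG() *DAG { return &DAG{children: map[string][]string{}} }

// reachable reports whether to can be reached from from (iterative DFS).
func (g *DAG) reachable(from, to string) bool {
	seen := map[string]bool{}
	stack := []string{from}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if n == to {
			return true
		}
		if seen[n] {
			continue
		}
		seen[n] = true
		stack = append(stack, g.children[n]...)
	}
	return false
}

// AddEdge rejects an edge if its target can already reach its source,
// because adding it would close a cycle. Self-loops are rejected too.
func (g *DAG) AddEdge(from, to string) error {
	if g.reachable(to, from) {
		return fmt.Errorf("edge %s -> %s would create a cycle", from, to)
	}
	g.children[from] = append(g.children[from], to)
	return nil
}

func main() {
	g := NewDAG()
	fmt.Println(g.AddEdge("a", "b"))
	fmt.Println(g.AddEdge("b", "c"))
	fmt.Println(g.AddEdge("c", "a"))
}
```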

Ref #4
Convert all tests to table-driven with t.Parallel() on every subtest.

Removed fluff tests that just verify constants or stdlib behavior:
- TestResourceState, TestServiceState (string constants)
- TestDefaultOptions (struct literal)
- TestWithTimeout (context.WithTimeout)
- TestIsCritical (type assertion)
- TestNodes, TestOrderedExtensions (map length)
- TestApp_Version (version != "")

Consolidated related tests into table-driven groups:
- graph: TestAddNode (2 cases), TestAddEdge (4 cases),
  TestTopologicalLayers (4 cases)
- autoedge: TestAddAutoEdges (7 cases)
- dsl: TestRun_Include (2 cases), TestRun_Firewall (3 cases)
- app: TestApp_RunPlan (3 cases), TestApp_BuildGraph (2 cases)

Added t.Helper() to all test helper functions.

Ref #4
Replace bare bool fields with atomic.Bool in mockExt and
mockTransientFailExt. The inSync field is read by Check() in
daemon goroutines and written by test goroutines, causing races
under -race.

Ref #4
CRITICAL (#1):
- Fix kqueue fd leak in darwin file/plist watchers: explicit fd management
  instead of defer capturing stale fd

HIGH (#2-3): Shared watcher multiplexer
- New internal/watch/inotify_linux.go: single inotify+epoll fd for all file
  and sysctl watchers. Prevents hitting inotify_max_user_instances (128) at
  2000+ resources. 5 tests.
- File and sysctl watchers refactored to use shared multiplexer

HIGH (#4-6): Graph scaling
- AddEdge is now O(1) with lazy cycle detection via TopologicalLayers
- Duplicate edges silently deduplicated via edge set
- Auto-edge serviceToConfigFile uses exact config extension matching
- WouldCycle() BFS for auto-edge cycle avoidance

HIGH (#7-8): Daemon correctness
- Default Timeout to 5m when unset (prevents instant context expiry)
- Nil checks in retryManager for unknown resource IDs

HIGH (#9): DSL simplification
- Extract r.require() helper, cutting ~50 lines of boilerplate

HIGH (#10): Watcher dedup (via shared multiplexer above)

HIGH (#11): Unsafe pointer bounds
- Use unsafe.Offsetof for Windows FILE_NOTIFY_INFORMATION headerSize

HIGH (#12): DAG-aware drift remediation
- After successful Apply, schedule Check for dependent resources via Children()

MEDIUM (#13-20): Simplification + security
- Remove dead Nodes() allocation
- Systemd: NoNewPrivileges=yes, remove ProtectSystem=full
- eventMeta stores EventKind not full Event
- Remove retryManager.mu (states map is write-once)
- ResourceMeta struct embedded in all Opts (DependsOn+Critical consolidated)
- Error accumulation: []error with errors.Join, not single error
- Move isRoot() to internal/platform/root.go
- Registry watcher: re-register before sending event

LOW (#21-25): Tests, docs, minor
- Cycle detection test via TopologicalLayers
- Log dropped coalescer events
- Document Event struct and EventKind in extending.md
- Document default blueprint in Service Installation
- Rename coal -> coalescer
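
The error-accumulation change ([]error with errors.Join) and the r.require() helper might combine roughly as below. The `Run` fields, the `require` signature, and the error wording are assumptions; only `errors.Join` (Go 1.20+) is a known standard-library call.

```go
package main

import (
	"errors"
	"fmt"
)

// Run collects validation errors instead of panicking. (Hypothetical shape.)
type Run struct{ errs []error }

// require records an error when a mandatory field is empty and reports
// whether the value was present.
func (r *Run) require(field, value string) bool {
	if value == "" {
		r.errs = append(r.errs, fmt.Errorf("%s is required", field))
		return false
	}
	return true
}

// Err returns all accumulated errors joined together, or nil if none.
func (r *Run) Err() error { return errors.Join(r.errs...) }

func main() {
	r := &Run{}
	r.require("file.path", "/etc/motd")
	r.require("file.content", "")
	r.require("service.name", "")
	fmt.Println(r.Err())
}
```

Joining all errors lets a blueprint author see every missing field in one run rather than fixing them one at a time.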

Ref #4
TsekNet added 7 commits March 16, 2026 09:31
All platforms now use shared watcher multiplexers:

Linux:
- internal/watch/inotify_linux.go: one inotify fd for all file+sysctl watchers
- internal/watch/dbus_linux.go: one dbus connection for all service watchers

macOS:
- internal/watch/kqueue_darwin.go: one kqueue fd for all file+plist watchers

Windows:
- ReadDirectoryChangesW is already directory-scoped (one handle per dir)
- Service watcher uses polling (SCM notify incompatible with Go scheduler)

This prevents hitting OS limits at 2000+ resources:
- Linux: inotify_max_user_instances (128), dbus max-connections (256)
- macOS: per-process fd limits

Ref #4
README: lead with "Event-driven DAG daemon" tagline. Comparison table
adds drift detection latency (<1s vs ~30min cron). Features table
reordered with DAG and daemon first. Cross-platform quick start.

design.md: new "DAG + Event-Driven Difference" section. OS event
mechanism table per platform. DAG-aware re-convergence explained.
Updated Lessons from Chef with blind spot and propagation rows.

examples.md: cross-platform blueprint examples (Linux, macOS, Windows).
DependsOn section with three-layer DAG. Daemon mode usage with --once.

Rename extending.md -> extensions.md for clarity.

Ref #4
…eout

AutoGrouping:
- Batch package installs into single transaction (apt install git curl neovim)
- All 9 package managers implement BatchInstaller (InstallBatch/RemoveBatch)
- PackageGroup in internal/engine/autogroup.go replaces individual packages
  in each topological layer where they share manager + state
- AutoGroup=false in ResourceMeta disables grouping per resource

Per-resource meta overrides (NodeMeta on graph nodes):
- Noop: skip Apply, only Check (per-resource dry-run)
- Retry: per-resource max retries (overrides daemon default)
- Limit: per-resource rate limit (0 = use daemon default)
- AutoEdge: disable auto-edges for specific resources
- AutoGroup: disable auto-grouping for specific resources

Watcher restart on failure:
- Watchers that fail (e.g., inotify max watches) now restart with
  exponential backoff (1s, 2s, 4s... capped at 5m) instead of dying
  permanently

Converged timeout (--converged-timeout):
- Exit after system is stable for N seconds (e.g., --converged-timeout 60s)
- Useful for Packer image builds and CI idempotency validation
- Tracks last change timestamp, exits when no Apply changes for the duration
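
The converged-timeout idea, tracking the last change and exiting once nothing has changed for the configured duration, can be sketched like this. The `stability` type and its methods are hypothetical, and treating a zero timeout as "never exit" is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"time"
)

// stableFor mirrors the example value in the commit (--converged-timeout 60s).
const stableFor = 60 * time.Second

// start is a fixed reference time so the demo is deterministic.
var start = time.Unix(0, 0)

// stability tracks when the last Apply change happened.
type stability struct {
	timeout    time.Duration
	lastChange time.Time
}

func newStability(timeout time.Duration, now time.Time) *stability {
	return &stability{timeout: timeout, lastChange: now}
}

// recordChange is called whenever an Apply actually changed something.
func (s *stability) recordChange(now time.Time) { s.lastChange = now }

// converged reports whether the system has been stable for the full timeout.
// A zero timeout disables the check (assumed to mean "run forever").
func (s *stability) converged(now time.Time) bool {
	return s.timeout > 0 && now.Sub(s.lastChange) >= s.timeout
}

func main() {
	s := newStability(stableFor, start)
	fmt.Println(s.converged(start.Add(30 * time.Second)))
	s.recordChange(start.Add(45 * time.Second)) // drift fixed, clock restarts
	fmt.Println(s.converged(start.Add(90 * time.Second)))
	fmt.Println(s.converged(start.Add(105 * time.Second)))
}
```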

Ref #4
…ource-timeout

--timeout on serve means "exit after stable for N seconds" (the intuitive
meaning for a daemon). --resource-timeout is the per-resource Check/Apply
deadline. Updated all docs.

Ref #4
--timeout 1s replaces --once (converge and exit after 1s of stability).
--timeout 0 (default) runs forever. One flag, one concept.

Remove Once field from daemon Options. Update all docs and tests.

Ref #4
TsekNet added 29 commits March 16, 2026 10:33
Winget Install/Remove/IsInstalled now use --id instead of positional
name argument, preventing "multiple packages found" errors (e.g.,
git matching Git.Git and Git.Git.PreRelease).

Baseline blueprint uses winget IDs on Windows:
Git.Git, cURL.cURL, Neovim.Neovim.

Ref #4
- Add divider line after banner (before resources start)
- Stream apply output as each resource completes (no buffering)
- Show field-level diffs in apply mode (content: old → new, mode: 0644)
- Fix spinner indentation to align with result checkmarks
- Carry Check() state.Changes through to Result for display

Ref #4
All three output formats (terminal, serial, JSON) now:
- Show only nonzero counts in summary (no "0 ok")
- Show field-level diffs in apply mode (content: old -> new)
- Use consistent 2-space indentation for resources
- Include divider after banner

JSON: omitempty on zero summary counts, Changes in apply results.
Serial: streaming diffs, nonzero-only summary, aligned indentation.

Ref #4
Demo now shows two commands:
1. converge plan baseline: field-level diffs with +/~ symbols
2. converge serve baseline --timeout 1s: streaming apply with diffs and timing

Regenerate GIF: vhs assets/demo.tape

Ref #4
In daemon mode (no --timeout), the initial convergence summary was
confusing: it looked like the daemon was done. Now it shows
"WATCHING  drift detection active" instead.

With --timeout, the normal APPLY summary still prints on exit.

Ref #4
COM vtable approach crashed (access violation on INetFwRule property
setters due to vtable offset mismatch). Reverted to registry-based
approach with improved notification:

1. Try SERVICE_CONTROL_PARAMCHANGE (works on most Windows versions)
2. Fallback: stop/start mpssvc service to force full registry reload
3. Rules persist in registry regardless, take effect on next boot

Ref #4
…restart

The rule had PrimaryStatus=Error because the registry format was missing
the Profile field. Added Profile=Public, Private, Domain.

Reverted the stop/start mpssvc approach (destructive). PARAMCHANGE
is sufficient when the rule format is correct.

Ref #4
Replace direct registry writes with the Windows Firewall COM API
(HNetCfg.FwPolicy2 / HNetCfg.FWRule) via go-ole IDispatch.

Rules take effect immediately, no service notification needed.
Proper COM lifecycle: CoInitializeEx, LockOSThread, Release.

Check reads rule properties via GetProperty for drift detection.
Apply creates via CreateObject("HNetCfg.FWRule") + Rules.Add.

New dependency: github.com/go-ole/go-ole v1.3.0

Ref #4
Replace 5s polling with WMI __InstanceModificationEvent subscription
for Win32_Service. Detects service state changes in ~1 second via
ExecNotificationQuery/NextEvent COM calls.

Falls back to 5s polling if WMI is unavailable (e.g., restricted
environments).

Now go-ole is used for both firewall (HNetCfg.FwPolicy2) and service
(WbemScripting.SWbemLocator) on Windows.

Ref #4
feat: Add w32time service to Windows baseline

Winget exit codes are inconsistent across versions. Now checks output
for the package ID string and "No installed package found" regardless
of exit code. Fixes false drift detection on installed packages.
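
The output-based check described above could be as simple as the sketch below. The function name `isInstalled` and the sample output strings are illustrative assumptions; only the "No installed package found" marker is taken from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// notFound is the marker string named in the commit message.
const notFound = "No installed package found"

// isInstalled decides from command output alone, ignoring the exit code:
// the not-found marker wins, otherwise the package ID must appear.
func isInstalled(output, id string) bool {
	if strings.Contains(output, notFound) {
		return false
	}
	return strings.Contains(output, id)
}

func main() {
	// Sample outputs are made up for the demo, not captured from winget.
	installed := "Name  Id       Version\nGit   Git.Git  2.44.0"
	missing := "No installed package found matching input criteria."
	fmt.Println(isInstalled(installed, "Git.Git"), isInstalled(missing, "Git.Git"))
}
```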

Added Windows Time service (w32time) to baseline for testing service
management on Windows.

Ref #4
Replace exec.Command("net user/localgroup") with native Win32 API:
- NetUserAdd (netapi32.dll) for account creation
- NetLocalGroupAddMembers for group membership

Replace 60s user poll with WMI __InstanceModificationEvent on
Win32_UserAccount for instant account change detection.

Falls back to 60s polling if WMI unavailable.

Ref #4
Baseline uses "ssh" on Ubuntu/Debian, "sshd" on RHEL/Fedora.
Test script detects service name, skips service/firewall tests
gracefully when systemd or nftables are unavailable (WSL, containers).

Ref #4
@TsekNet TsekNet merged commit c00f4b6 into main Mar 16, 2026
2 checks passed