Skip to content

refactor(go/plugin/framework/functions): redesign manager#21850

Merged
ilyam8 merged 24 commits intonetdata:masterfrom
ilyam8:codex/framework-functions-redesign
Feb 28, 2026
Merged

refactor(go/plugin/framework/functions): redesign manager#21850
ilyam8 merged 24 commits intonetdata:masterfrom
ilyam8:codex/framework-functions-redesign

Conversation

@ilyam8
Copy link
Copy Markdown
Member

@ilyam8 ilyam8 commented Feb 28, 2026

Summary
Test Plan
Additional Information
For users: How does this change affect me?

Summary by cubic

Redesigned the Go functions manager with a route-key scheduler, worker pool, single-winner finalization, and runtime metrics so runs are serialized per route, cancellable, observable, and emit one terminal result. Hardened shutdown and registration, fixed payload-mode QUIT/control handling, and wired bounded dyncfg handoffs with manager-backed finalization in Job Manager and Service Discovery; runtime metrics are exported via the runtime service and chartengine metrics were re-namespaced.

  • New Features

    • Route-key scheduler: same-route requests serialize; different routes run in parallel (default workers=1) with a bounded queue.
    • Parser and cancellation: supports FUNCTION, FUNCTION_PAYLOAD, FUNCTION_CANCEL, FUNCTION_PROGRESS (no-op), QUIT; fixes payload-mode control handling; queued cancels return 499; running cancels get a fallback 499 if no terminal arrives.
    • Single-winner finalization: manager-bound TerminalFinalizer drops late terminals; awaiting-result warnings aid debugging; shared JSON payload builder ensures consistent responses.
    • Dyncfg: bounded downstream handoff (deadline-capped) for queue sends; responder uses the manager’s finalizer; wiring added to Job Manager and Service Discovery.
    • Runtime metrics: instrumented the functions manager, registered it with the runtime service from the agent, and exposed counters/gauges for queue state and outcomes; chartengine runtime metrics prefix renamed to netdata.go.plugin.framework.chartengine.
  • Bug Fixes

    • Scheduler: skip invalid queued entries and fix pending counters to avoid stalls/leaks.
    • Duplicates: ignore active or recently finalized UIDs without emitting terminals; lanes stay serialized.
    • Shutdown and stdin close: extracted shutdown FSM; stop admission and drain up to 8s, then cancel and force-finalize unresolved (including awaiting_result) with 499; input channel is receive-only and closes on EOF.
    • Handoff mapping: 503 for full/stopping, 500 for invalid requests; clearer “busy/shutting down” messages.
    • Registration: reject overlapping prefixes at registration to prevent ambiguous routing.

Written for commit 915a135. Summary will update on new commits.

@github-actions github-actions bot added area/docs area/collectors Everything related to data collection collectors/go.d area/go labels Feb 28, 2026
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 17 files

Confidence score: 3/5

  • Most significant risk: scheduler.go returns errSchedulerQueueFull for invalid/nil inputs, which can mislead callers and mask real validation failures.
  • Potential behavior bug: in scheduler.go the guarded nil branch in complete() can leak s.pending and drop queued items, which could cause incorrect queue state over time.
  • Score reflects moderate risk from correctness issues in scheduler/queue handling, though not necessarily merge-blocking if edge paths are rare.
  • Pay close attention to src/go/plugin/framework/functions/scheduler.go and src/go/plugin/framework/functions/manager.go - error handling and pending-count state consistency.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/go/plugin/framework/functions/scheduler.go">

<violation number="1" location="src/go/plugin/framework/functions/scheduler.go:48">
P2: Wrong error returned for invalid/nil request: `errSchedulerQueueFull` is returned when the request fails input validation (nil req, nil fn, empty UID, or empty scheduleKey). This misrepresents the actual problem — the queue isn't full, the input is invalid. Callers that branch on error identity (e.g., to emit HTTP-like 503 vs 400 responses) will produce misleading rejection reasons.</violation>

<violation number="2" location="src/go/plugin/framework/functions/scheduler.go:168">
P2: If the defensively-guarded nil branch is hit in `complete()`, the popped queue item is silently discarded without decrementing `s.pending`, causing a permanent pending-count leak. Additionally, any remaining items behind it in the lane queue become stranded until a new request for the same key arrives. Consider looping to find the next valid item, or at minimum decrementing `pending` for the skipped entry.</violation>
</file>

<file name="src/go/plugin/framework/functions/README.md">

<violation number="1" location="src/go/plugin/framework/functions/README.md:159">
P3: Typo: extra spaces in `pre-admission`</violation>

<violation number="2" location="src/go/plugin/framework/functions/README.md:175">
P3: Typo: extra spaces in `ctx.Done()`</violation>
</file>

<file name="src/go/plugin/framework/functions/manager.go">

<violation number="1" location="src/go/plugin/framework/functions/manager.go:235">
P2: Error from `scheduler.enqueue` is discarded — the response always says "function queue is full" even when the scheduler is stopping. Consider inspecting the error to return a more accurate message (e.g., 503 "functions manager is stopping" for `errSchedulerStopping`).</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Downstream as SD / Job Manager
    participant Input as Stdin
    participant Parser as Input Parser
    participant Mgr as Manager (Dispatcher)
    participant Sched as Keyed Scheduler
    participant Worker as Worker Pool
    participant Finalizer as Terminal Finalizer
    participant Stdout as Stdout (Netdata API)

    Note over Downstream, Mgr: NEW: Bounded Handoff Flow
    Downstream->>Downstream: NEW: BoundedSend(ctx, chan, cap)
    alt Channel Busy or Shutdown
        Downstream-->>Stdout: NEW: 503 Busy/Shutdown (Direct)
    else Success
        Downstream->>Mgr: Push function to dyncfg channel
    end

    Note over Input, Mgr: Command Processing
    Input->>Parser: Command line (FUNCTION, CANCEL, QUIT)
    Parser->>Parser: CHANGED: Parse to Event model
    Parser-->>Mgr: inputEvent (Call, Cancel, Progress, Quit)

    alt Event: Call
        Mgr->>Mgr: Admission Check (UID, State, Queue)
        alt Duplicate UID
            Mgr->>Finalizer: NEW: FinalizeTerminal(UID, 409)
        else Accepted
            Mgr->>Sched: NEW: Enqueue by Route Key
            Sched->>Sched: Buffer in Lane (Serialized per Key)
        end
    else Event: Cancel
        Mgr->>Mgr: Identify Transaction state
        alt Queued
            Mgr->>Sched: NEW: Remove from Lane
            Mgr->>Finalizer: NEW: FinalizeTerminal(UID, 499)
        else Running
            Mgr->>Mgr: NEW: Start Fallback Timer (5s)
            Mgr->>Worker: Signal Context Cancel
            opt Fallback Timer Fires
                Mgr->>Finalizer: NEW: FinalizeTerminal(UID, 499)
            end
        end
    end

    Note over Sched, Worker: Execution Flow
    Worker->>Sched: next() (wait for ready)
    Sched-->>Worker: invocationRequest (Key Free)
    Worker->>Worker: CHANGED: Execute Handler
    alt Handler Panic
        Worker->>Finalizer: NEW: FinalizeTerminal(UID, 500)
    else Handler Success
        Worker->>Finalizer: FinalizeTerminal(UID, Result)
    end

    Note over Finalizer, Stdout: Single-Winner Finalization
    Finalizer->>Finalizer: NEW: Check if UID tombstoned
    alt First Terminal Result
        Finalizer->>Finalizer: Set Tombstone (60s)
        Finalizer->>Stdout: FUNCRESULT (stdout)
        Finalizer->>Sched: NEW: Release Key (Promote Next)
    else Late/Duplicate Result
        Finalizer->>Finalizer: Drop result (silently)
    end
Loading

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@ilyam8 ilyam8 changed the title refacotr(go/plugin/framework/functions): redesign manager refactor(go/plugin/framework/functions): redesign manager Feb 28, 2026
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 15 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/go/plugin/framework/functions/manager_flow_test.go">

<violation number="1" location="src/go/plugin/framework/functions/manager_flow_test.go:373">
P2: Goroutine leak: `block` channel is never closed</violation>
</file>

<file name="src/go/plugin/framework/functions/parser.go">

<violation number="1" location="src/go/plugin/framework/functions/parser.go:143">
P2: The narrowed condition correctly prevents payload data like "FUNCTIONAL" from incorrectly aborting payloads, but it swallows `QUIT` and unknown `FUNCTION_...` commands.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is a comprehensive redesign of the Go plugin framework's functions manager. It introduces a route-keyed scheduler with a worker pool, single-winner terminal finalization (tombstone-guarded), and cancellation semantics (queued/running/awaiting states). It also hardens shutdown with a bounded drain path, adds overlapping prefix rejection, and wires the manager-backed finalizer into dyncfg responders (Job Manager, Service Discovery).

Changes:

  • New functions manager architecture: keyed scheduler (scheduler.go), invocation state FSM, terminal finalizer with tombstone deduplication, and worker pool with cancel/panic handling
  • Parser extended with FUNCTION_CANCEL, FUNCTION_PROGRESS, QUIT event types and pre-admission cancel detection in payload mode
  • Dyncfg responders now use a TerminalFinalizer interface and bounded downstream handoff (BoundedSend)

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
functions/manager.go Core redesign: invocation state FSM, tryFinalize, shutdown drain, scheduler wiring
functions/manager_worker.go New: worker pool goroutine with panic recovery, cancel skip, awaiting-result transition
functions/scheduler.go New: key-serialized scheduler with pending budget, lane promotion, and stop/drain semantics
functions/scheduler_test.go New: unit tests for lane advancement, pending counters, stop paths, and queue recovery
functions/parser.go Extended: cancel/progress/quit event parsing; pre-admission cancel in payload mode
functions/parser_test.go New: comprehensive parser event tests including payload-mode cancel and progress
functions/response_payload.go New: shared BuildJSONPayload helper extracted from duplicated code
functions/response_payload_test.go New: tests for JSON payload structure
functions/input.go Input interface narrowed to <-chan string; EOF closes channel
functions/finalizer.go New: TerminalFinalizer type and DirectTerminalFinalizer bypass
functions/ext.go Overlapping prefix rejection at registration time
functions/manager_test.go Updated: unique UIDs per function, multi-worker profiles, thread-safe mock
functions/manager_flow_test.go New: integration flow tests for cancellation, shutdown, concurrency, and duplicates
functions/manager_snapshot_test.go Updated: consolidated and extended snapshot tests including overlap rejection
functions/README.md New: architecture documentation with flow diagram
dyncfg/responder.go Wired TerminalFinalizer into all response emit paths
dyncfg/handoff.go New: generic BoundedSend with deadline-capped timeout
dyncfg/handoff_test.go New: tests for BoundedSend edge cases
jobmgr/manager.go Finalizer propagation from FnReg and SetDyncfgResponder
jobmgr/dyncfg_handoff.go New: bounded enqueue with shutting-down vs busy distinction
jobmgr/dyncfg_vnode.go Replaced unbounded select with enqueueDyncfgFunction
jobmgr/dyncfg_collector.go Replaced unbounded select with enqueueDyncfgFunction
sd/sd.go Finalizer propagation and SetDyncfgResponder update
sd/dyncfg_handoff.go New: bounded enqueue with shutting-down vs busy distinction
sd/dyncfg.go Replaced unbounded select with enqueueDyncfgFunction

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ilyam8 ilyam8 marked this pull request as ready for review February 28, 2026 21:18
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ilyam8 ilyam8 merged commit 49313a2 into netdata:master Feb 28, 2026
142 of 143 checks passed
@ilyam8 ilyam8 deleted the codex/framework-functions-redesign branch February 28, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants