LPS Network Endpoint Enhancements and Signaling#5904
milan-zededa wants to merge 6 commits into lf-edge:master
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5904      +/-   ##
==========================================
+ Coverage   21.62%   22.01%   +0.38%
==========================================
  Files         464      475      +11
  Lines       83994    85923    +1929
==========================================
+ Hits        18166    18912     +746
- Misses      64300    65304    +1004
- Partials     1528     1707     +179
==========================================
```
eriknordmark
left a comment
Run tests
But some yetus annotations to fix (however, yetus shows as passing which is new to me!)
If you look at the artifacts, Yetus is 100% passing because it checks only the patch (what was changed), so if you go to the final result files, they are all blank (no errors). However, the annotation step runs over the whole files, not just the changed lines. It's important to point out that any error introduced by the PR will make Yetus fail, so the check is working.
I think this approach is quite annoying for the reviewer. Furthermore, these are not "issues", but suggestions. We may not agree with everything pointed out by Yetus, so we will intentionally ignore some, but they will keep popping up.
This is at best confusing (because those annotations show up in the diffs the same way as when they cause failures) and a bit annoying. Can we limit the annotations to the changed lines, as before?
See: lf-edge/eve-api#144 Signed-off-by: Milan Lenco <milan@zededa.com>
Mirror the controller-provisioned SystemAdapter.allow_local_modifications flag through dpcToProto so LPS can distinguish ports that accept local configuration from those managed exclusively by the controller, without having to infer it by trial-and-error via error_message. Signed-off-by: Milan Lenco <milan@zededa.com>
Open a long-lived GET /api/v1/signal stream to the Local Profile Server and, upon each incoming Signal message, immediately trigger the listed endpoints' pollers -- bypassing the ~1-minute periodic cadence while preserving it as the correctness fallback. This removes the minute-scale delay that operators previously saw between entering a config change in the LPS UI and EVE picking it up. The Signal handler runs as an additional LocalCmdAgent goroutine. Connection open is guarded by the existing startTask/runInterruptible/ endTask pattern used by the other pollers; the long body read runs without the task lock so it cannot block pause(). On URL change, UpdateLpsConfig cancels the in-flight stream and wakes the goroutine, which reconnects against the current LPS address. Dispatches are rate-limited (1 signal / 3s, burst 3). LPS 404 throttles reconnect attempts to once per hour. No watchdog is registered -- a legitimately long blocking Read must not trigger a device reboot. A new controllerconn.Client.OpenLocalStream helper provides the streaming HTTP client (reuses DialerWithResolverCache, adds TCP keepalive for dead-peer detection, disables HTTP keep-alive for clean connection teardown, and drops the per-request timeout that SendLocal applies). The existing triggerProfileGET is exported as TriggerProfileGET for symmetry with the other Trigger*POST helpers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Milan Lenco <milan@zededa.com>
Fire an immediate LPS Network-endpoint POST whenever a substantive change is detected in either the device port config list (handleDPCLImpl) or the device network status (handleDNSImpl), so the local operator sees the effect of a network config change right away instead of waiting for the next periodic post. A burst of updates during, e.g., DPC verification is naturally coalesced by the networkTicker's size-1 buffered channel: TickNow is a non-blocking send, so excess kicks arriving while a POST is already in flight or pending are dropped. Signed-off-by: Milan Lenco <milan@zededa.com>
Populate the new NetworkInfo.port_status field from DeviceNetworkStatus when EVE posts to the LPS Network endpoint, giving the local operator a view of the kernel-observed state of each network port (link up/down, MAC address, currently-assigned IP addresses, active default routers, effective DNS servers and search domain, NTP servers in use, and applied MTU) alongside the existing declarative config views. CIDR-formatted IP addresses require the interface's subnet mask, which DeviceNetworkStatus previously did not carry in AddrInfoList. Extend types.AddrInfo with a Mask field and populate it from the netlink address entry in DpcManager.updateDNS. Signed-off-by: Milan Lenco <milan@zededa.com>
Add timer.lps.<task>.interval config properties (profile, radio, appinfo, devinfo, network, appbootinfo) with defaults matching current hard-coded values, min 3s, max 1h. LocalCmdAgent initializes globalConfig with DefaultConfigItemValueMap() so all tasks use the correct default before the first real config arrives. Interval changes take effect immediately on the next UpdateGlobalConfig call without resetting throttle state. Signed-off-by: Milan Lenco <milan@zededa.com>
Force-pushed from 1e5b138 to d22cfef.
Rebased on top of the latest master.
| Name | Type | Default | Min | Max | Description |
|------|------|---------|-----|-----|-------------|
| timer.lps.profile.interval | integer in seconds | 60 (1 minute) | 3 | 3600 (1 hour) | how frequently EVE fetches the local profile from the Local Profile Server (LPS) |
| timer.lps.radio.interval | integer in seconds | 5 | 3 | 3600 (1 hour) | how frequently EVE POSTs radio status to LPS and fetches radio silence configuration |
| timer.lps.appinfo.interval | integer in seconds | 60 (1 minute) | 3 | 3600 (1 hour) | how frequently EVE POSTs application info to LPS and fetches application commands |
| timer.lps.devinfo.interval | integer in seconds | 60 (1 minute) | 3 | 3600 (1 hour) | how frequently EVE POSTs device info to LPS and fetches device commands |
| timer.lps.network.interval | integer in seconds | 60 (1 minute) | 3 | 3600 (1 hour) | how frequently EVE POSTs network configuration to LPS and fetches locally-made network configuration |
| timer.lps.appbootinfo.interval | integer in seconds | 60 (1 minute) | 3 | 3600 (1 hour) | how frequently EVE POSTs application boot info to LPS and fetches boot configuration |
Does it make sense to add a general note about all these below the table that if the device gets a 404(??) it will back off the timer a lot?
eriknordmark
left a comment
If/when the long-lived connection dies and is re-established is the re-establishment done automatically by EVE?
Who is responsible for ensuring that the LPS can track the current state after the re-establishment?
Should the LPS notice that there is a new TCP/HTTPS connection and use that to trigger fetching/updating the current state? If so, would it be useful if EVE sent a signal over the long-lived connection immediately after it has re-established the connection, so the LPS doesn't need to detect new connections at the lower layers?
This signal endpoint is really just an optimization to minimize latency. If we lose some signal from LPS for whatever reason, the only consequence is longer latency (as it is today without this enhancement).
A few comments:

1. No unit tests for signal.go. A new ~300-line state machine with five outcomes, an exported public API (`OpenLocalStream`), and intricate cancel/restart semantics ships without a single test. At minimum, `handleSignal` (rate-limit + dispatch table) and `readSigHandlerStream` (NDJSON framing + malformed-line tolerance) are easy to unit-test with an `io.Pipe`. Worth adding before merge given the long-lived-connection complexity.

2. `sigHandlerLimiter` config inconsistency. Burst is 3 over a 3 s refill, so a single legitimate signal listing all 6 endpoints will not exceed the limit (it dispatches synchronously inside `handleSignal` and consumes 1 token, not 6 — `Allow()` is called once per signal, not per endpoint). That is fine for normal use, but the comment says "periodic polling is the correctness guarantee; dropped signals are safe" while the limiter is actually quite generous. Consider documenting that the limiter is per-message, not per-endpoint, so future maintainers don't tighten it incorrectly.

3. TCP keepalive is key for robustness, but if we fail to set it there are no logs at all. It makes sense to add such logs (controllerconn/send.go:111-115):

```go
if tcpConn, ok := conn.(*net.TCPConn); ok {
	_ = tcpConn.SetKeepAlive(true)
	_ = tcpConn.SetKeepAlivePeriod(keepAlive)
}
```

4. If `DialerWithResolverCache` ever wraps the connection (e.g., for proxying), keepalive — the only dead-peer detection mechanism per the design — is silently disabled. At least log at Trace level when the assertion fails so this doesn't become a silent debugging problem later.

5. Empty mask → no /N in CIDR string. `dnsPortsToProto` builds `&net.IPNet{IP: addr.Addr, Mask: addr.Mask}` and calls `String()`. If `addr.Mask` is nil/empty (possible on a freshly persisted DPC before `updateDNS` has run, or if a code path forgets to populate it), `IPNet.String()` returns just the IP — violating the proto comment that promises CIDR format. Either guarantee Mask presence by construction, or fall back to /32 or /128 based on `len(addr.Addr)` so the wire format is always CIDR.

6. Log line could be large (signal.go:864): `Warnf(... line=%q)` logs up to 64 KiB of attacker-controlled bytes per malformed line. Truncate to ~256 chars in the format string.

7. Unbounded pendingChanges slice. A malicious LPS can send a Signal whose `pendingChanges` is packed with entries (within the 64 KiB line cap, that's roughly 2k–5k enum repetitions). Each is iterated in `handleSignal`. Iteration is O(n) with cheap per-element work (one switch + one tickNow), so a 5k-entry signal completes in microseconds. Not exploitable, but a `len(pendingChanges) > 32` early-return would cost nothing.

8. Add a one-line `// see eve-api PROFILE.md ### Signal — auth model is intentional` comment near `OpenLocalStream`.

For #3 and #4 a suggestion is: push keepalive into the dialer itself. `net.Dialer` has a `KeepAlive time.Duration` field that handles this internally. Plumb a `KeepAlive` field through `DialerWithResolverCache` so the inner `net.Dialer` is constructed with it:

```go
stdDialer := net.Dialer{
	Resolver:  resolver.getNetResolver(),
	LocalAddr: &net.TCPAddr{IP: d.localIP},
	Timeout:   d.timeout,
	KeepAlive: d.keepAlive, // <-- new
}
```
Description
Implements the EVE side of the API additions introduced in
lf-edge/eve-api#144.
Three API enhancements plus two ancillary improvements:
1. Signal endpoint — low-latency LPS config notifications (`GET /api/v1/signal`)

EVE now maintains a persistent long-lived HTTP GET connection to the optional `/api/v1/signal` endpoint on the Local Profile Server. LPS can push a `Signal` proto message (NDJSON-framed) listing which endpoints have a pending configuration change. On receipt, EVE immediately polls the listed endpoints instead of waiting for the next scheduled tick.

Implementation details:
- `pkg/pillar/localcommand/signal.go` handles the stream lifecycle independently of the watchdog, as required by the spec.
- LPS 404 responses throttle reconnect attempts to one per hour so a non-signaling LPS does not generate excess traffic.
- Signal dispatches are rate-limited to prevent the LPS overwhelming EVE with spurious triggers. Periodic polling is the correctness guarantee; dropped signals are safe.
- On URL change, the in-flight stream is cancelled and reconnected (`UpdateLpsConfig`). Unknown `ConfigEndpoint` enum values are silently ignored for forward compatibility.
2. Per-port runtime status (`NetworkInfo.port_status`)

EVE now populates the new `NetworkPortStatus` repeated field in every `POST /api/v1/network` request. For each network port it reports the kernel-observed runtime state: link up/down, MAC address, assigned IP addresses (CIDR), active default gateways, DNS servers, DNS search domain, NTP servers, and MTU.

3. `local_modifications_allowed` per port in `NetworkInfo.latest_config`

EVE sets `NetworkPortConfig.local_modifications_allowed` when reporting `latest_config` in `NetworkInfo`, mirroring the controller-provisioned `SystemAdapter.allow_local_modifications` flag. LPS can use this to know upfront which ports accept locally submitted configuration, rather than discovering it via a trial-and-error error message.

4. Reactive LPS network POST on config/status change

EVE now triggers an immediate `POST /api/v1/network` when either:
- `DevicePortConfigList` changes (new network configuration applied), or
- `DeviceNetworkStatus` changes with a meaningful state update (link state, IP assignment, etc.).

Previously the network endpoint was only driven by the periodic ticker.

5. Configurable LPS polling intervals

All six LPS polling intervals are now tunable via controller config properties (previously they were compile-time constants):
- `timer.lps.profile.interval`
- `timer.lps.radio.interval`
- `timer.lps.appinfo.interval`
- `timer.lps.devinfo.interval`
- `timer.lps.network.interval`
- `timer.lps.appbootinfo.interval`

Defaults match the previous hard-coded values, so behaviour is unchanged without explicit configuration.
How to test and validate this PR
Prerequisites
You need a running EVE device with a deployed Local Profile Server application.
Consider using my LPS implementation.
1. Validate `local_modifications_allowed` in NetworkInfo

LPS side: In the handler for `POST /api/v1/network`, inspect each entry in `network_info.latest_config.ports`. Ports provisioned by the controller with `allow_local_modifications = true` must arrive with `local_modifications_allowed = true`; those provisioned without it must arrive with `false`.

Expected behaviour: LPS can display a read-only indicator next to ports that it is not permitted to reconfigure, and can pre-validate a `LocalNetworkConfig` submission before sending it.

2. Validate per-port runtime status (`port_status`)

LPS side: In the same `POST /api/v1/network` handler, inspect `network_info.port_status`. For each network port you should see:
- `logical_label` — matches the port label from `latest_config`
- `interface_name` — kernel interface name (e.g., `eth0`)
- `link_up` — `true` when the port has carrier
- `mac_address` — colon-separated hex (e.g., `aa:bb:cc:dd:ee:ff`)
- `ip_addresses` — CIDR notation (e.g., `192.168.1.10/24`, `fd00::1/64`)
- `gateways` — active default routers
- `dns_servers` / `dns_domain` — effective resolver configuration
- `ntp_servers` — NTP peers in use
- `mtu` — effective MTU in bytes

To verify reactivity (enhancement 4): Change a network configuration on the device (e.g., add a static route, trigger a DHCP renewal). LPS should receive an updated `POST /api/v1/network` within a few seconds, not after the next 60-second tick.
3. Validate the Signal endpoint — basic trigger

LPS side: Implement `GET /api/v1/signal` as a streaming endpoint that, upon a local configuration change (e.g., a new profile submitted via `PUT /api/v1/local_profile`), immediately writes a single NDJSON line.

EVE side: After receiving the signal, EVE must immediately fetch `GET /api/v1/local_profile` and apply the new profile — without waiting for the next 60-second tick.

Verification: Measure the latency between submitting the config change on LPS and EVE applying it. With signaling it should be under 2 seconds; without signaling it would be up to 60 seconds.
4. Validate the Signal endpoint — multi-endpoint coalescing

Send a signal listing multiple endpoints simultaneously. EVE must trigger both the network POST and the app info POST immediately.
5. Validate configurable polling intervals

Using the controller, set `timer.lps.profile.interval` to a low value such as 10 seconds. Observe in EVE logs that the local profile is now fetched every 10 seconds instead of every 60 seconds. Restore the default when done.
Changelog notes
Low-latency LPS configuration updates. EVE can now receive near-instant
notifications from the Local Profile Server when new configuration is ready,
reducing the delay from up to 60 seconds to under 2 seconds. LPS
applications must implement the optional `GET /api/v1/signal` streaming endpoint to take advantage of this; existing LPS deployments without the endpoint continue to work exactly as before.
Richer network status in LPS. EVE now reports the live kernel-observed
state of every network port to LPS on each network POST: link up/down,
assigned IP addresses (with subnet mask), default gateways, DNS and NTP
servers, and MTU. LPS applications can display this information to operators
without needing a separate management channel.
Per-port modification permission flag. LPS now knows which network ports
it is allowed to reconfigure (as provisioned by the controller), so it can
give operators clear feedback instead of a generic error when they attempt to
modify a controller-managed port.
Reactive LPS network updates. EVE now pushes an updated network POST to
LPS immediately when the device's network configuration or link state changes,
rather than waiting for the next periodic tick.
Tunable LPS polling intervals. All six LPS polling intervals (local
profile, radio, app info, device info, network, app boot info) can now be
adjusted via controller config properties. The defaults are unchanged.