daemon: log file freezes silently at 20 MB cap; tcpPortUsed.check ETIMEDOUT becomes unhandledRejection #37

@Rinse12

Description

Three related bugs were diagnosed on a long-running production daemon (new-plebbit host, bitsocial container, image ghcr.io/bitsocialnet/bitsocial-cli:latest v0.19.49, up 11 days, healthy). Reported symptom: bitsocial logs is stuck — no new output even though the daemon is clearly alive (RPC reachable, healthcheck green).

Bug 1 — silent 20 MB cap permanently disables file logging

src/cli/commands/daemon.ts lines 121–134:

const logFile = fs.createWriteStream(logFilePath, { flags: "a" });
const isLogFileOverLimit = () => logFile.bytesWritten > 20000000; // 20mb

const writeTimestampedLine = (text: string, stream: "stdout" | "stderr") => {
    if (isLogFileOverLimit()) return;   // ← drops all further writes for the rest of the daemon's life
    ...
    logFile.write(timestamped);
};

Once logFile.bytesWritten exceeds 20,000,000, every subsequent write is silently dropped for the lifetime of the daemon process. There is no in-process rotation (_getNewLogfileByEvacuatingOldLogsIfNeeded only runs at startup).

On the affected container this happened on 2026-04-25 14:38:48 after ~28 hours of operation, ~10 days ago. The log file is exactly 20,022,081 bytes, daemon PID 7 still holds fd 20 open (pos: 20022081, flags: 02102001). bitsocial logs -f is correctly tailing a file that no longer changes.

What filled it up: a flood of pkc-js:base-client-manager:resolveAuthorNamesInBackground:error debug entries (~1.5 KB each) for failing .eth resolutions.

Fix

Replace the silent hard cap with front-truncate rotation: when bytesWritten exceeds 20 MB, drop the oldest portion of the file (keeping the most recent ~15 MB), reopen the stream in append mode (which resets bytesWritten), and keep writing. The file then never grows much past 20 MB.
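A minimal sketch of that rotation, reusing logFilePath and the writeTimestampedLine shape from the excerpt above. LOG_CAP, KEEP_BYTES, and rotateIfNeeded are illustrative names, not the actual implementation, and a real version would also need to handle writes that arrive while the file is being swapped:

import fs from "node:fs";

const LOG_CAP = 20_000_000;    // rotate once bytesWritten passes 20 MB
const KEEP_BYTES = 15_000_000; // keep the most recent ~15 MB

let logFile = fs.createWriteStream(logFilePath, { flags: "a" });

const rotateIfNeeded = () => {
    if (logFile.bytesWritten <= LOG_CAP) return;
    logFile.end(); // retire the old stream
    // keep only the newest KEEP_BYTES bytes (may cut the oldest surviving
    // line in half, which is acceptable for logs)
    const { size } = fs.statSync(logFilePath);
    const keep = Math.min(size, KEEP_BYTES);
    const fd = fs.openSync(logFilePath, "r");
    const tail = Buffer.alloc(keep);
    fs.readSync(fd, tail, 0, keep, size - keep);
    fs.closeSync(fd);
    fs.writeFileSync(logFilePath, tail);
    // reopen in append mode; bytesWritten starts again from 0 on the new stream
    logFile = fs.createWriteStream(logFilePath, { flags: "a" });
};

const writeTimestampedLine = (text: string, stream: "stdout" | "stderr") => {
    rotateIfNeeded();
    logFile.write(`${new Date().toISOString()} [${stream}] ${text}\n`);
};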

Bug 2 — stderr override silently swallows debug output when file logger fails

Same file, lines 157–164:

process.stderr.write = (...args) => {
    // Only write stderr to the log file, not to the terminal.
    writeTimestampedLine(asString(args[0]).trimStart(), "stderr");
    return true;
};

process.stderr.write was overridden to ONLY route to the file logger. After Bug 1 trips, all stderr output from JS code disappears entirely (not in file, not in docker logs) — only errors hand-routed via writeErrorToTerminal (uncaughtException/unhandledRejection handlers) survive because they call the captured original stderrWrite.

Fix

If file write fails or the stream is closed, fall back to writing to the original stderr. With Bug 1 fixed (file rotation always writes), the fallback is mostly defensive, but it ensures we never silently lose error output.
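A minimal sketch, assuming stderrWrite is the captured original process.stderr.write already used by writeErrorToTerminal, and that asString, writeTimestampedLine, and logFile are the existing helpers:

const stderrWrite = process.stderr.write.bind(process.stderr);

process.stderr.write = ((chunk: any, ...rest: any[]): boolean => {
    // if the log stream is unusable, fall back to the real stderr
    // rather than dropping the output
    if (logFile.destroyed || logFile.closed) return stderrWrite(chunk, ...rest);
    try {
        writeTimestampedLine(asString(chunk).trimStart(), "stderr");
        return true;
    } catch {
        return stderrWrite(chunk, ...rest);
    }
}) as typeof process.stderr.write;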

Bug 3 — tcpPortUsed.check rejection in keepKuboUpInterval becomes unhandledRejection

Same file, lines 533–545:

keepKuboUpInterval = setInterval(async () => {
    if (mainProcessExited) return;
    const isRpcPortTaken = await tcpPortUsed.check(Number(pkcRpcUrl.port), pkcRpcUrl.hostname);
    try {
        ...
    } catch (error) {
        log.trace(`keepKuboUp error (will retry): ${...}`);
    }
    await createOrConnectRpc();
}, 5000);

The await tcpPortUsed.check(...) sits outside the try/catch, and tcp-port-used rejects its promise on any socket error other than ECONNREFUSED (e.g. ETIMEDOUT). So when a connect attempt hits a transient network hiccup and the kernel exhausts its SYN retries (~127 s), the async setInterval callback rejects → unhandledRejection. The daemon's handler then prints the bare error to stderr/docker logs:

Error: connect ETIMEDOUT 127.0.0.1:9138
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1637:16)

Observed on the affected container: 18 occurrences over ~5 hours on 2026-05-04, then it stopped (transient host networking issue). No daemon crash because the runtime tolerates these — but they pollute logs and indicate a real bug.

Fix

Move await tcpPortUsed.check(...) inside the try/catch and treat its rejection as "port unknown — retry next tick", the same way keepKuboUp errors are already handled.
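A minimal sketch of the change, reusing the identifiers from the excerpt above (the elided body is unchanged):

keepKuboUpInterval = setInterval(async () => {
    if (mainProcessExited) return;
    try {
        // now inside the try/catch: tcp-port-used resolves false on
        // ECONNREFUSED but rejects on anything else (e.g. ETIMEDOUT)
        const isRpcPortTaken = await tcpPortUsed.check(Number(pkcRpcUrl.port), pkcRpcUrl.hostname);
        ...
    } catch (error) {
        // port state unknown this tick; log and let the next tick retry
        log.trace(`keepKuboUp error (will retry): ${error}`);
    }
    await createOrConnectRpc();
}, 5000);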

Diagnostics summary (affected daemon)

  • Container: bitsocial, started 2026-04-24T10:20:58Z, PID namespaces verified
  • Log file: /logs/bitsocial/bitsocial_cli_daemon_2026-04-24T10-21-02.253Z.log — 20,022,081 bytes, last write 2026-04-25T14:38:48Z
  • RPC: listening on [::]:9138 (IPv6 dual-stack, also reachable via 127.0.0.1)
  • Process tree: PID 1 tini, PID 7 daemon (still alive), PID 51 kubo, PID 45489 stuck bitsocial logs -f (correctly tailing a frozen file)
  • Healthcheck: passing (uses IPv6-mapped localhost connect)

Plan

  1. Add regression tests (unit-level, exercising the file-logger module in isolation and the keepKuboUp tick body; see the test sketch after this list)
  2. Fix Bug 1 (rotation)
  3. Fix Bug 2 (stderr fallback)
  4. Fix Bug 3 (try/catch around tcpPortUsed.check)
  5. Build + run full test suite
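For step 1, the rotation piece can be exercised in isolation against a temp file. A sketch using Node's built-in test runner; truncateFront is a hypothetical helper mirroring the Bug 1 sketch, not the actual module API:

import test from "node:test";
import assert from "node:assert";
import fs from "node:fs";
import os from "node:os";
import path from "node:path";

// hypothetical helper matching the Bug 1 sketch: keep only the newest bytes
function truncateFront(filePath: string, keepBytes: number): void {
    const { size } = fs.statSync(filePath);
    if (size <= keepBytes) return;
    const fd = fs.openSync(filePath, "r");
    const tail = Buffer.alloc(keepBytes);
    fs.readSync(fd, tail, 0, keepBytes, size - keepBytes);
    fs.closeSync(fd);
    fs.writeFileSync(filePath, tail);
}

test("front-truncate keeps only the newest bytes", () => {
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), "log-rotate-"));
    const file = path.join(dir, "daemon.log");
    fs.writeFileSync(file, "OLD".repeat(1000) + "NEWEST");
    truncateFront(file, 6);
    assert.strictEqual(fs.readFileSync(file, "utf8"), "NEWEST");
});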
