daemon: log file freezes silently at 20 MB cap; tcpPortUsed.check ETIMEDOUT becomes unhandledRejection #37

@Rinse12

Description

Three related bugs were diagnosed on a long-running production daemon (new-plebbit host, bitsocial container, image ghcr.io/bitsocialnet/bitsocial-cli:latest v0.19.49, up 11 days, healthy). Reported symptom: bitsocial logs is stuck — no new output even though the daemon is clearly alive (RPC reachable, healthcheck green).

Bug 1 — silent 20 MB cap permanently disables file logging

src/cli/commands/daemon.ts lines 121–134:

const logFile = fs.createWriteStream(logFilePath, { flags: "a" });
const isLogFileOverLimit = () => logFile.bytesWritten > 20000000; // 20mb

const writeTimestampedLine = (text: string, stream: "stdout" | "stderr") => {
    if (isLogFileOverLimit()) return;   // ← drops all further writes for the rest of the daemon's life
    ...
    logFile.write(timestamped);
};

Once logFile.bytesWritten exceeds 20,000,000, every subsequent write is silently dropped for the lifetime of the daemon process. There is no in-process rotation (_getNewLogfileByEvacuatingOldLogsIfNeeded only runs at startup).

On the affected container this happened on 2026-04-25 14:38:48 after ~28 hours of operation, ~10 days ago. The log file is exactly 20,022,081 bytes, daemon PID 7 still holds fd 20 open (pos: 20022081, flags: 02102001). bitsocial logs -f is correctly tailing a file that no longer changes.

What filled it up: a flood of pkc-js:base-client-manager:resolveAuthorNamesInBackground:error debug entries (~1.5 KB each) for failing .eth resolutions.

Fix

Replace the silent hard cap with front-truncate rotation: when bytesWritten exceeds 20 MB, drop the oldest portion of the file (keeping the most recent ~15 MB), reopen the stream in append mode (which resets bytesWritten), and keep writing. The file then never grows much past 20 MB.
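A minimal sketch of that rotation, reusing logFilePath and the writeTimestampedLine shape from the excerpt above. LOG_CAP, KEEP_BYTES, and rotateIfNeeded are illustrative names, not the actual implementation, and a real version would also need to handle writes that arrive while the file is being swapped:

import fs from "node:fs";

const LOG_CAP = 20_000_000;    // rotate once bytesWritten passes 20 MB
const KEEP_BYTES = 15_000_000; // keep the most recent ~15 MB

let logFile = fs.createWriteStream(logFilePath, { flags: "a" });

const rotateIfNeeded = () => {
    if (logFile.bytesWritten <= LOG_CAP) return;
    logFile.end(); // retire the old stream
    // keep only the newest KEEP_BYTES bytes (may cut the oldest surviving
    // line in half, which is acceptable for logs)
    const { size } = fs.statSync(logFilePath);
    const keep = Math.min(size, KEEP_BYTES);
    const fd = fs.openSync(logFilePath, "r");
    const tail = Buffer.alloc(keep);
    fs.readSync(fd, tail, 0, keep, size - keep);
    fs.closeSync(fd);
    fs.writeFileSync(logFilePath, tail);
    // reopen in append mode; bytesWritten starts again from 0 on the new stream
    logFile = fs.createWriteStream(logFilePath, { flags: "a" });
};

const writeTimestampedLine = (text: string, stream: "stdout" | "stderr") => {
    rotateIfNeeded();
    logFile.write(`${new Date().toISOString()} [${stream}] ${text}\n`);
};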

Bug 2 — stderr override silently swallows debug output when file logger fails

Same file, lines 157–164:

process.stderr.write = (...args) => {
    // Only write stderr to the log file, not to the terminal.
    writeTimestampedLine(asString(args[0]).trimStart(), "stderr");
    return true;
};

process.stderr.write was overridden to ONLY route to the file logger. After Bug 1 trips, all stderr output from JS code disappears entirely (not in file, not in docker logs) — only errors hand-routed via writeErrorToTerminal (uncaughtException/unhandledRejection handlers) survive because they call the captured original stderrWrite.

Fix

If file write fails or the stream is closed, fall back to writing to the original stderr. With Bug 1 fixed (file rotation always writes), the fallback is mostly defensive, but it ensures we never silently lose error output.
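A minimal sketch, assuming stderrWrite is the captured original process.stderr.write already used by writeErrorToTerminal, and that asString, writeTimestampedLine, and logFile are the existing helpers:

const stderrWrite = process.stderr.write.bind(process.stderr);

process.stderr.write = ((chunk: any, ...rest: any[]): boolean => {
    // if the log stream is unusable, fall back to the real stderr
    // rather than dropping the output
    if (logFile.destroyed || logFile.closed) return stderrWrite(chunk, ...rest);
    try {
        writeTimestampedLine(asString(chunk).trimStart(), "stderr");
        return true;
    } catch {
        return stderrWrite(chunk, ...rest);
    }
}) as typeof process.stderr.write;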

Bug 3 — tcpPortUsed.check rejection in keepKuboUpInterval becomes unhandledRejection

Same file, lines 533–545:

keepKuboUpInterval = setInterval(async () => {
    if (mainProcessExited) return;
    const isRpcPortTaken = await tcpPortUsed.check(Number(pkcRpcUrl.port), pkcRpcUrl.hostname);
    try {
        ...
    } catch (error) {
        log.trace(`keepKuboUp error (will retry): ${...}`);
    }
    await createOrConnectRpc();
}, 5000);

The await tcpPortUsed.check(...) sits outside the try/catch, and tcp-port-used rejects its promise on any socket error other than ECONNREFUSED (e.g. ETIMEDOUT). So when a connect attempt hits a transient network hiccup and the kernel exhausts its SYN retries (~127 s), the async setInterval callback rejects → unhandledRejection. The daemon's handler then prints the bare error to stderr/docker logs:

Error: connect ETIMEDOUT 127.0.0.1:9138
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1637:16)

Observed on the affected container: 18 occurrences over ~5 hours on 2026-05-04, then it stopped (transient host networking issue). No daemon crash because the runtime tolerates these — but they pollute logs and indicate a real bug.

Fix

Move await tcpPortUsed.check(...) inside the try/catch and treat its rejection as "port unknown — retry next tick", the same way keepKuboUp errors are already handled.
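A minimal sketch of the change, reusing the identifiers from the excerpt above (the elided body is unchanged):

keepKuboUpInterval = setInterval(async () => {
    if (mainProcessExited) return;
    try {
        // now inside the try/catch: tcp-port-used resolves false on
        // ECONNREFUSED but rejects on anything else (e.g. ETIMEDOUT)
        const isRpcPortTaken = await tcpPortUsed.check(Number(pkcRpcUrl.port), pkcRpcUrl.hostname);
        ...
    } catch (error) {
        // port state unknown this tick; log and let the next tick retry
        log.trace(`keepKuboUp error (will retry): ${error}`);
    }
    await createOrConnectRpc();
}, 5000);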

Diagnostics summary (affected daemon)

  • Container: bitsocial, started 2026-04-24T10:20:58Z, PID namespaces verified
  • Log file: /logs/bitsocial/bitsocial_cli_daemon_2026-04-24T10-21-02.253Z.log — 20,022,081 bytes, last write 2026-04-25T14:38:48Z
  • RPC: listening on [::]:9138 (IPv6 dual-stack, also reachable via 127.0.0.1)
  • Process tree: PID 1 tini, PID 7 daemon (still alive), PID 51 kubo, PID 45489 stuck bitsocial logs -f (correctly tailing a frozen file)
  • Healthcheck: passing (uses IPv6-mapped localhost connect)

Plan

  1. Add regression tests (unit-level, exercising the file-logger module in isolation and the keepKuboUp tick body; see the test sketch after this list)
  2. Fix Bug 1 (rotation)
  3. Fix Bug 2 (stderr fallback)
  4. Fix Bug 3 (try/catch around tcpPortUsed.check)
  5. Build + run full test suite
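For step 1, the rotation piece can be exercised in isolation against a temp file. A sketch using Node's built-in test runner; truncateFront is a hypothetical helper mirroring the Bug 1 sketch, not the actual module API:

import test from "node:test";
import assert from "node:assert";
import fs from "node:fs";
import os from "node:os";
import path from "node:path";

// hypothetical helper matching the Bug 1 sketch: keep only the newest bytes
function truncateFront(filePath: string, keepBytes: number): void {
    const { size } = fs.statSync(filePath);
    if (size <= keepBytes) return;
    const fd = fs.openSync(filePath, "r");
    const tail = Buffer.alloc(keepBytes);
    fs.readSync(fd, tail, 0, keepBytes, size - keepBytes);
    fs.closeSync(fd);
    fs.writeFileSync(filePath, tail);
}

test("front-truncate keeps only the newest bytes", () => {
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), "log-rotate-"));
    const file = path.join(dir, "daemon.log");
    fs.writeFileSync(file, "OLD".repeat(1000) + "NEWEST");
    truncateFront(file, 6);
    assert.strictEqual(fs.readFileSync(file, "utf8"), "NEWEST");
});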
