feat: Add container monitoring and observability (coi monitor) #112

@mensfeld

Overview

Add comprehensive monitoring and observability capabilities to COI, allowing users to monitor container network activity, I/O operations, resource usage, and security events from outside the container.

Motivation

Security & Audit:

  • Detect data exfiltration attempts in real time
  • Log all network connections for compliance/audit trails
  • Alert on suspicious behavior (unexpected connections, high bandwidth, etc.)
  • Understand what AI agents are actually doing

Debugging:

  • Troubleshoot network isolation issues (see what's being blocked)
  • Identify performance bottlenecks
  • Verify firewall rules are working correctly

Cost Control:

  • Monitor API usage if AI is making external calls
  • Track bandwidth consumption
  • Enforce rate limits

Forensics:

  • Post-session analysis of what went wrong
  • Replay session activity
  • Generate audit reports

Proposed Command: coi monitor

Basic Usage

```bash
# Live monitoring dashboard (TUI)
coi monitor <container>

# JSON output for scripting/integration
coi monitor <container> --json

# Monitor all COI containers
coi monitor --all

# Auto-detect container from current workspace
coi monitor
```

Monitoring Modes

```bash
# Specific monitoring types
coi monitor --network      # Network connections only
coi monitor --io           # Disk I/O only
coi monitor --resources    # CPU/memory/cgroup stats
coi monitor --firewall     # Firewall events only

# Combined
coi monitor --network --io # Multiple modes
```

Alert Thresholds

```bash
# Alert on events
coi monitor --alert-on-new-connections
coi monitor --alert-on-firewall-block

# Threshold alerts
coi monitor --bandwidth-threshold 100MB/min
coi monitor --io-threshold 1000iops
coi monitor --cpu-threshold 80%
```
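The threshold values above mix three shapes: a data rate, an IOPS count, and a percentage. A minimal sketch of how such strings might be normalized before comparison follows. The `Threshold` type, the function names, the binary size multipliers, and the accepted period suffixes are all assumptions made for illustration; the proposal does not define a grammar for these flags.

```python
import re
from dataclasses import dataclass

# Assumed multipliers: binary units for sizes, seconds for periods.
_SIZE = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
_PERIOD_SECONDS = {"s": 1, "sec": 1, "min": 60, "h": 3600}


@dataclass(frozen=True)
class Threshold:
    kind: str     # "bandwidth" (bytes per second), "iops", or "percent"
    limit: float  # normalized numeric limit


def parse_threshold(text: str) -> Threshold:
    """Parse strings such as '100MB/min', '1000iops', or '80%'."""
    text = text.strip()
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(KB|MB|GB)/(s|sec|min|h)", text)
    if m:
        value, unit, period = m.groups()
        per_second = float(value) * _SIZE[unit] / _PERIOD_SECONDS[period]
        return Threshold("bandwidth", per_second)
    m = re.fullmatch(r"(\d+)iops", text)
    if m:
        return Threshold("iops", float(m.group(1)))
    m = re.fullmatch(r"(\d+(?:\.\d+)?)%", text)
    if m:
        return Threshold("percent", float(m.group(1)))
    raise ValueError(f"unrecognized threshold: {text!r}")


def exceeded(threshold: Threshold, observed: float) -> bool:
    """Return True when the observed value is strictly above the limit."""
    return observed > threshold.limit
```

Normalizing bandwidth to bytes per second keeps the comparison independent of the period the user typed, so `100MB/min` and a measured per-second rate can be compared directly.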

Output & Integration

```bash
# Logging
coi monitor --log-file /tmp/coi-monitor.log

# Prometheus metrics export
coi monitor --export-prometheus :9090

# Output formats
coi monitor --format json  # One of: json, table, dashboard
```

Audit & Forensics

```bash
# Full audit mode (syscall + network tracing)
coi monitor --audit

# Record session for later replay
coi monitor --record-session

# Network packet capture
coi monitor --pcap /tmp/traffic.pcap

# Replay recorded session
coi monitor replay <session-id>

# Show session statistics
coi monitor stats <session-id>
```

Example Output (TUI Dashboard Mode)

```
Container: coi-abc12345-1
Uptime: 15m 32s
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

NETWORK ACTIVITY
Active Connections: 3
├─ 52.84.142.12:443 (HTTPS) - api.anthropic.com ✓ ALLOWED
├─ 8.8.8.8:53 (DNS) ✓ ALLOWED
└─ 192.168.1.1:80 (HTTP) ✗ BLOCKED (RFC1918)

Bandwidth: ↓ 1.2 MB/s ↑ 45 KB/s
Total: ↓ 18.5 MB ↑ 2.1 MB
Firewall Blocks: 5 attempts

DISK I/O
Read: 125 KB/s (1.8 GB total)
Write: 45 KB/s (456 MB total)
IOPS: 45 read, 12 write

RESOURCES
CPU: 15% (2 cores)
Memory: 512 MB / 2 GB (25%)
Processes: 12

RECENT EVENTS
[15:32:45] ✓ Connected to api.anthropic.com:443
[15:32:42] ✗ Blocked connection to 192.168.1.1:80 (RFC1918)
[15:32:40] ✓ DNS query: api.anthropic.com -> 52.84.142.12
[15:32:38] ℹ File write: /workspace/output.txt (1.2 KB)
```

Implementation Approaches

1. Network Monitoring

Option A: Connection Tracking (conntrack)

  • Use conntrack to monitor active connections
  • Filter by container IP address
  • Pros: Low overhead, real-time
  • Cons: Only shows active connections, no historical data
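As a rough illustration of Option A, the sketch below parses text in the format printed by `conntrack -L` and keeps the entries whose original source address is the container's IP. The function names, the sample addresses, and the sample lines are assumptions for illustration; how COI would actually invoke conntrack is not specified here.

```python
import re
from typing import Dict, List

_PAIR = re.compile(r"(\w+)=(\S+)")


def parse_conntrack_line(line: str) -> Dict[str, str]:
    """Parse one line of `conntrack -L` output into a flat dict.

    Each entry lists the original direction first and the reply direction
    second, so only the first occurrence of src/dst/sport/dport is kept.
    """
    fields = line.split()
    entry: Dict[str, str] = {"proto": fields[0]}
    # For TCP the fourth column is the connection state (e.g. ESTABLISHED).
    if len(fields) > 3 and "=" not in fields[3]:
        entry["state"] = fields[3]
    for key, value in _PAIR.findall(line):
        entry.setdefault(key, value)
    return entry


def connections_for(container_ip: str, conntrack_output: str) -> List[Dict[str, str]]:
    """Return entries whose original-direction source is the container IP."""
    entries = (
        parse_conntrack_line(line)
        for line in conntrack_output.splitlines()
        if line.strip()
    )
    return [e for e in entries if e.get("src") == container_ip]


# Hypothetical sample; 10.47.0.2 stands in for the container address.
SAMPLE = """\
tcp      6 431999 ESTABLISHED src=10.47.0.2 dst=52.84.142.12 sport=44321 dport=443 src=52.84.142.12 dst=10.47.0.2 sport=443 dport=44321 [ASSURED] mark=0 use=1
udp      17 25 src=10.47.0.2 dst=8.8.8.8 sport=51000 dport=53 src=8.8.8.8 dst=10.47.0.2 sport=53 dport=51000 mark=0 use=1
tcp      6 86400 ESTABLISHED src=10.47.0.9 dst=1.1.1.1 sport=40000 dport=443 src=1.1.1.1 dst=10.47.0.9 sport=443 dport=40000 [ASSURED] mark=0 use=1
"""

if __name__ == "__main__":
    for conn in connections_for("10.47.0.2", SAMPLE):
        print(conn["proto"], conn["dst"], conn["dport"], conn.get("state", "-"))
```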

Option B: eBPF (bpftrace/bcc-tools)

  • Trace network syscalls at kernel level
  • Can capture all connection attempts (even failed ones)
  • Pros: Very detailed; tracing runs in the host kernel, so processes inside the container cannot disable it
  • Cons: Requires BPF support, more complex

Option C: Firewalld Logs

  • Parse firewalld logs for block/allow events
  • Already available with network isolation
  • Pros: Easy, already logged
  • Cons: Only shows firewall decisions, not all traffic
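Denied packets logged through the kernel's netfilter LOG target appear as `KEY=value` fields (`SRC=`, `DST=`, `PROTO=`, `DPT=`, and so on). A minimal sketch of extracting a block event from such a line is shown below. The `BlockEvent` type, the function name, and the sample line (including its `FINAL_REJECT:` prefix, which depends on the firewalld backend and LogDenied setting) are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

_KV = re.compile(r"\b([A-Z]+)=(\S*)")


@dataclass(frozen=True)
class BlockEvent:
    src: str
    dst: str
    proto: str
    dport: Optional[int]  # None for protocols without ports (e.g. ICMP)


def parse_block_event(line: str) -> Optional[BlockEvent]:
    """Extract source, destination, protocol and port from a netfilter log line."""
    fields = dict(_KV.findall(line))
    if not fields.get("SRC") or not fields.get("DST"):
        return None  # Not a packet log line.
    dport = fields.get("DPT")
    return BlockEvent(
        src=fields["SRC"],
        dst=fields["DST"],
        proto=fields.get("PROTO", ""),
        dport=int(dport) if dport else None,
    )


# Hypothetical journal line; the prefix and interface names are illustrative.
SAMPLE_LINE = (
    "Jan 10 15:32:42 host kernel: FINAL_REJECT: IN=incusbr0 OUT=eth0 "
    "MAC=aa:bb:cc:dd:ee:ff:11:22:33:44:55:66:08:00 SRC=10.47.0.2 "
    "DST=192.168.1.1 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=54321 DF "
    "PROTO=TCP SPT=44322 DPT=80 WINDOW=64240 RES=0x00 SYN URGP=0"
)

if __name__ == "__main__":
    print(parse_block_event(SAMPLE_LINE))
```

Filtering the resulting events by `src` against the container IP would tie each block back to a specific container.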

Recommended: Hybrid approach

  • Use conntrack for active connections
  • Parse firewalld logs for blocks
  • Optional eBPF mode for deep inspection (--audit)

2. I/O Monitoring

Option A: Cgroup Stats

  • Read from /sys/fs/cgroup/.../io.stat
  • Incus already uses cgroups
  • Pros: Built-in, accurate, low overhead
  • Cons: Aggregated stats only

Option B: eBPF (biosnoop/biolatency)

  • Trace block I/O at kernel level
  • Per-request and per-process granularity (per-file attribution would need VFS-level tracing, e.g. filetop)
  • Pros: Very detailed
  • Cons: Higher overhead

Recommended: Cgroup stats by default, eBPF for --audit mode
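For the cgroup default, a minimal sketch of reading cgroup v2 `io.stat` and turning two samples into rates is given below. Each line of `io.stat` starts with a `major:minor` device number followed by `key=value` counters such as `rbytes`, `wbytes`, `rios`, and `wios`. The function names and the sample contents are illustrative assumptions; the actual cgroup path for a container is left out because it is elided above.

```python
from typing import Dict

COUNTERS = ("rbytes", "wbytes", "rios", "wios")


def parse_io_stat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse cgroup v2 io.stat into {device: {counter: value}}."""
    devices: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        counters: Dict[str, int] = {}
        for pair in parts[1:]:
            key, _, value = pair.partition("=")
            if value.isdigit():  # Skip any non-integer fields.
                counters[key] = int(value)
        devices[parts[0]] = counters
    return devices


def totals(stat: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum the main counters across all block devices."""
    return {c: sum(dev.get(c, 0) for dev in stat.values()) for c in COUNTERS}


def rates(prev: Dict[str, int], curr: Dict[str, int], interval_s: float) -> Dict[str, float]:
    """Convert two cumulative samples into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return {c: (curr[c] - prev[c]) / interval_s for c in COUNTERS}


# Hypothetical io.stat contents for two devices.
SAMPLE = """\
8:0 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:16 rbytes=100 wbytes=200 rios=1 wios=2 dbytes=0 dios=0
"""

if __name__ == "__main__":
    print(totals(parse_io_stat(SAMPLE)))
```

Because the counters are cumulative, the dashboard's KB/s and IOPS figures would come from `rates` over consecutive samples, while the "total" figures come from `totals` directly.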

3. Resource Monitoring

Use Incus API + Cgroups:

  • incus info <container> provides basic stats
  • Cgroup stats for detailed metrics
  • /proc/<pid>/ for process-level data
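The cgroup part of this can be sketched as follows: cgroup v2 `cpu.stat` is a list of `key value` lines including a cumulative `usage_usec`, and `memory.current` / `memory.max` each hold a single value (`memory.max` may be the literal `max` when no limit is set). The function names and sample values below are assumptions for illustration.

```python
from typing import Dict, Optional


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse 'key value' lines, as found in cgroup v2 cpu.stat."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cpu_percent(prev_usage_usec: int, curr_usage_usec: int, interval_s: float) -> float:
    """CPU usage between two samples; 100.0 means one fully busy core."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (curr_usage_usec - prev_usage_usec) * 100 / (interval_s * 1_000_000)


def memory_percent(current_text: str, max_text: str) -> Optional[float]:
    """Memory usage as a percentage of the limit, or None if unlimited."""
    limit = max_text.strip()
    if limit == "max":
        return None
    return int(current_text.strip()) * 100 / int(limit)


# Hypothetical cpu.stat contents.
SAMPLE_CPU_STAT = """\
usage_usec 1300000
user_usec 900000
system_usec 400000
"""

if __name__ == "__main__":
    usage = parse_flat_keyed(SAMPLE_CPU_STAT)["usage_usec"]
    print(cpu_percent(1_000_000, usage, 2.0))
    print(memory_percent("536870912\n", "2147483648\n"))
```

Note that `cpu_percent` is relative to a single core, so a container using two cores fully would report 200.0 unless the value is divided by the core count.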

4. Event Correlation

Challenge: Map host-level events back to containers

  • Network: Match by container IP (get from Incus)
  • I/O: Match by cgroup path
  • Processes: Match by PID namespace
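One way to map a host PID back to a container is through `/proc/<pid>/cgroup`, whose cgroup v2 entry has the form `0::<path>`. The sketch below extracts that path and looks for a path component carrying the container name. The `lxc.payload.` prefix is an assumption about how the cgroup directory is named and should be verified against a running Incus host; the function names are hypothetical.

```python
from typing import Optional


def cgroup_v2_path(proc_cgroup_text: str) -> Optional[str]:
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        # The unified (v2) hierarchy has ID 0 and an empty controller list.
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    return None


def container_from_cgroup(path: str, prefix: str = "lxc.payload.") -> Optional[str]:
    """Return the container name embedded in a cgroup path, if any.

    The default prefix is an assumption, not confirmed by the proposal.
    """
    for component in path.strip("/").split("/"):
        if component.startswith(prefix):
            return component[len(prefix):]
    return None


if __name__ == "__main__":
    sample = "0::/lxc.payload.coi-abc12345-1/init.scope\n"
    path = cgroup_v2_path(sample)
    print(path, container_from_cgroup(path or ""))
```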

Data Sources

| Metric | Source | Method |
|---|---|---|
| Active connections | /proc/net/tcp, conntrack | Parse /proc or use conntrack CLI |
| Bandwidth | Interface counters (/proc/net/dev), iftop | Read interface byte counters or parse iftop (cgroup io.stat covers block I/O only) |
| Firewall events | firewalld logs | Parse journalctl output |
| Disk I/O | cgroup io.stat | Read from sysfs |
| CPU usage | cgroup cpu.stat | Read from sysfs |
| Memory | cgroup memory.current | Read from sysfs |
| DNS queries | eBPF (optional) | Trace UDP port 53 |
| Syscalls | eBPF (optional) | Trace syscall entry points |
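To make the `/proc/net/tcp` row concrete, the sketch below decodes its hex-encoded addresses and state codes. Two caveats are worth stating: the address is printed in host byte order, so the byte reversal here assumes a little-endian machine, and the host's `/proc/net/tcp` only covers the host's network namespace, so for a container the file would likely need to be read as `/proc/<pid>/net/tcp` for a PID inside the container. The function names and sample rows are illustrative.

```python
import socket
from typing import Dict, List, Tuple

TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_addr(hex_addr: str) -> Tuple[str, int]:
    """Decode 'AABBCCDD:PPPP' into (dotted IPv4, port).

    Assumes a little-endian host, where the address bytes appear reversed.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])
    return ip, int(port_hex, 16)


def parse_proc_net_tcp(text: str) -> List[Dict[str, object]]:
    """Parse the contents of a /proc/net/tcp file (IPv4 only)."""
    sockets: List[Dict[str, object]] = []
    for line in text.splitlines()[1:]:  # Skip the header row.
        fields = line.split()
        if len(fields) < 4:
            continue
        sockets.append({
            "local": decode_addr(fields[1]),
            "remote": decode_addr(fields[2]),
            "state": TCP_STATES.get(fields[3], fields[3]),
        })
    return sockets


# Hypothetical file contents: one listener and one established connection.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0
   1: 02002F0A:AD21 0C8E5434:01BB 01 00000000:00000000 00:00000000 00000000  1000        0 23456 1 0000000000000000 20 4 30 10 -1
"""

if __name__ == "__main__":
    for sock in parse_proc_net_tcp(SAMPLE):
        print(sock)
```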

Implementation Phases

Phase 1: Basic Monitoring (MVP)

  • coi monitor <container> - simple table output
  • Network: active connections from /proc/net/tcp + conntrack
  • I/O: basic stats from cgroup
  • Resources: CPU/memory from cgroup
  • Parse firewalld logs for blocks
  • JSON output mode (--json)
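For the `--json` mode, a possible shape of a single snapshot is sketched below. The schema is not defined in this proposal, so every field name here (`container`, `cpu_percent`, `memory_bytes`, `connections`, `verdict`) is a hypothetical placeholder meant only to show how the Phase 1 data could be serialized.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Connection:
    remote: str
    port: int
    verdict: str  # Hypothetical values: "allowed" or "blocked"


@dataclass
class Snapshot:
    container: str
    cpu_percent: float
    memory_bytes: int
    connections: List[Connection] = field(default_factory=list)


def to_json(snapshot: Snapshot) -> str:
    """Serialize one snapshot as a single JSON line with stable key order."""
    return json.dumps(asdict(snapshot), sort_keys=True)


if __name__ == "__main__":
    snap = Snapshot(
        container="coi-abc12345-1",
        cpu_percent=15.0,
        memory_bytes=512 * 1024 * 1024,
        connections=[
            Connection("52.84.142.12", 443, "allowed"),
            Connection("192.168.1.1", 80, "blocked"),
        ],
    )
    print(to_json(snap))
```

Emitting one JSON object per line would let scripts consume the stream incrementally with standard line-oriented tools.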

Phase 2: Enhanced Output

  • TUI dashboard mode (using bubbletea or similar)
  • Real-time updates
  • Color-coded output (green=allowed, red=blocked)
  • Bandwidth calculation (bytes/sec)
  • Event timeline
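The bandwidth calculation can be sketched as the difference between two samples of cumulative byte counters divided by the sampling interval. The example below reads counters in the `/proc/net/dev` format, which is one possible source; the interface name `vethabc` and the function names are assumptions. If the host-side veth of a container is sampled, the directions are mirrored: bytes the container sends show up as received on the host side.

```python
from typing import Dict, Tuple


def parse_net_dev(text: str) -> Dict[str, Tuple[int, int]]:
    """Parse /proc/net/dev contents into {interface: (rx_bytes, tx_bytes)}."""
    counters: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines()[2:]:  # Skip the two header rows.
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Column 0 is received bytes, column 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates(prev: Tuple[int, int], curr: Tuple[int, int], interval_s: float) -> Tuple[float, float]:
    """Return (rx, tx) in bytes per second between two samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return ((curr[0] - prev[0]) / interval_s, (curr[1] - prev[1]) / interval_s)


def format_rate(bytes_per_s: float) -> str:
    """Format a rate with binary units, e.g. '1.2 MB/s'."""
    value = float(bytes_per_s)
    for unit in ("B", "KB", "MB", "GB"):
        if value < 1024 or unit == "GB":
            return f"{value:.1f} {unit}/s"
        value /= 1024
    return f"{value:.1f} GB/s"  # Not reached; keeps the return type total.


# Hypothetical sample with a single host-side veth interface.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
vethabc: 1000 10 0 0 0 0 0 0 5000 20 0 0 0 0 0 0
"""

if __name__ == "__main__":
    before = parse_net_dev(SAMPLE)["vethabc"]
    after = (before[0] + 92160, before[1] + 2457600)
    rx, tx = rates(before, after, 2.0)
    print(format_rate(rx), format_rate(tx))
```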

Phase 3: Alerts & Thresholds

  • Alert on new connections (--alert-on-new-connections)
  • Bandwidth thresholds (--bandwidth-threshold)
  • Firewall block alerts
  • Webhook notifications

Phase 4: Audit & Forensics

  • Session recording (--record-session)
  • eBPF integration for deep inspection (--audit)
  • Packet capture (--pcap)
  • Session replay (coi monitor replay)
  • Summary statistics (coi monitor stats)

Phase 5: Integration

  • Prometheus metrics export
  • SIEM webhooks
  • Grafana dashboards
  • Audit log export (JSON/CSV)
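For the Prometheus export, metrics are served as plain text in the exposition format: a `# HELP` line, a `# TYPE` line, then one sample per line with optional labels. A minimal sketch of rendering a gauge is shown below. The metric name `coi_container_cpu_percent` and the `container` label are hypothetical; the proposal does not name any metrics.

```python
from typing import Dict, Iterable, Tuple

Sample = Tuple[Dict[str, str], float]


def _escape(value: str) -> str:
    """Escape a label value: backslash, double quote and newline."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: Iterable[Sample]) -> str:
    """Render one gauge metric family in the text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{_escape(val)}"' for key, val in sorted(labels.items())
        )
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gauge(
        "coi_container_cpu_percent",  # Hypothetical metric name.
        "CPU usage of a COI container.",
        [({"container": "coi-abc12345-1"}, 15.0)],
    ), end="")
```

With `--all`, one sample per container under the same metric name, distinguished by the label, would let a single scrape cover every container.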

Technical Considerations

Privileges:

  • Network monitoring: Requires access to /proc/net/, conntrack (may need sudo)
  • Cgroup stats: Read-only access to /sys/fs/cgroup/ (usually accessible)
  • eBPF: Requires CAP_BPF or CAP_SYS_ADMIN (sudo)
  • Firewalld logs: Requires journalctl access (incus-admin group should have this)

Performance:

  • Cgroup stats: Negligible overhead
  • Conntrack: Very low overhead
  • eBPF: Low overhead but depends on trace frequency
  • Packet capture: Can be heavy with high traffic

Container Identification:

  • Get container IP from Incus API
  • Match network events by source IP
  • Match I/O by cgroup path (Incus sets this)
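As an illustration of the first step, the sketch below pulls a container's IPv4 address out of JSON shaped like the output of `incus list --format json`. The nesting used here (`state` → `network` → interface → `addresses`, with `family` and `scope` keys) is an assumption about that output and should be checked against the installed Incus version; the function name and sample data are hypothetical.

```python
import json
from typing import Optional


def container_ipv4(incus_list_json: str, name: str) -> Optional[str]:
    """Return the first global IPv4 address of the named instance, if any."""
    for instance in json.loads(incus_list_json):
        if instance.get("name") != name:
            continue
        networks = (instance.get("state") or {}).get("network") or {}
        for iface, details in networks.items():
            if iface == "lo":
                continue  # Ignore loopback.
            for addr in details.get("addresses", []):
                if addr.get("family") == "inet" and addr.get("scope") == "global":
                    return addr["address"]
    return None


# Hypothetical, heavily trimmed sample of the assumed JSON structure.
SAMPLE = json.dumps([
    {
        "name": "coi-abc12345-1",
        "state": {
            "network": {
                "lo": {"addresses": [
                    {"family": "inet", "address": "127.0.0.1", "scope": "local"},
                ]},
                "eth0": {"addresses": [
                    {"family": "inet6", "address": "fe80::1", "scope": "link"},
                    {"family": "inet", "address": "10.47.0.2", "scope": "global"},
                ]},
            }
        },
    },
    {"name": "stopped-one", "state": None},
])

if __name__ == "__main__":
    print(container_ipv4(SAMPLE, "coi-abc12345-1"))
```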

Security & Privacy

Concerns:

  • Packet capture could expose sensitive data
  • Full syscall tracing reveals all container activity
  • Logs could contain API keys or credentials

Mitigations:

  • Audit mode (--audit, --pcap) requires explicit opt-in
  • Warn users about sensitive data in logs
  • Option to filter/redact credentials in output
  • Secure storage for recorded sessions
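The redaction mitigation could start as simple pattern-based masking applied to each line before it is written out. The sketch below is only a starting point: the patterns (bearer tokens and a few common `key=value` names) are illustrative assumptions and would miss many credential formats, so it should not be treated as a complete filter.

```python
import re

# Illustrative patterns only; real logs will need a broader, tested set.
_PATTERNS = [
    # "Authorization: Bearer <token>"
    (re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"), r"\1[REDACTED]"),
    # "api_key=<value>", "token: <value>", "password=<value>", ...
    (
        re.compile(r"(?i)\b(api[_-]?key|token|password|secret)(\s*[=:]\s*)[^\s&]+"),
        r"\1\2[REDACTED]",
    ),
]


def redact(line: str) -> str:
    """Mask credential-like values in a single log line."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line


if __name__ == "__main__":
    print(redact("Authorization: Bearer sk-example-123"))
    print(redact("GET /v1/items?api_key=abc123&page=2"))
```

Redacting at write time, rather than at display time, keeps secrets out of the log file and recorded sessions entirely.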

Use Cases

  1. Security Auditor: "I need to verify the AI didn't exfiltrate data"

    coi monitor <container> --log-file audit.log
    # Review all connections after session
  2. Developer: "Why is network isolation blocking my package install?"

    coi monitor <container> --firewall
    # See exactly what's being blocked
  3. Compliance Officer: "Generate audit report for AI agent session"

    coi monitor <container> --record-session
    coi monitor stats <session-id> --export report.pdf
  4. DevOps: "Integrate COI metrics into our monitoring stack"

    coi monitor --all --export-prometheus :9090
    # Scrape with Prometheus, visualize in Grafana

Open Questions

  1. Should monitoring be opt-in or always-on?

    • Proposal: Basic stats always available via coi monitor, audit mode opt-in
  2. How long to keep recorded sessions?

    • Proposal: Configurable retention policy, default 7 days
  3. Should we auto-start monitoring when --audit flag is used with coi shell?

    • Proposal: Yes, coi shell --audit automatically enables monitoring
  4. Privacy concerns with packet capture?

    • Proposal: Explicit warning, require --i-understand-the-risks flag
