Description
Overview
Add comprehensive monitoring and observability capabilities to COI, allowing users to monitor container network activity, I/O operations, resource usage, and security events from outside the container.
Motivation
Security & Audit:
- Detect data exfiltration attempts in real-time
- Log all network connections for compliance/audit trails
- Alert on suspicious behavior (unexpected connections, high bandwidth, etc.)
- Understand what AI agents are actually doing
Debugging:
- Troubleshoot network isolation issues (see what's being blocked)
- Identify performance bottlenecks
- Verify firewall rules are working correctly
Cost Control:
- Monitor API usage if AI is making external calls
- Track bandwidth consumption
- Enforce rate limits
Forensics:
- Post-session analysis of what went wrong
- Replay session activity
- Generate audit reports
Proposed Command: coi monitor
Basic Usage
```bash
# Live monitoring dashboard (TUI)
coi monitor <container>
# JSON output for scripting/integration
coi monitor --json
# Monitor all COI containers
coi monitor --all
# Auto-detect container from current workspace
coi monitor
```
Monitoring Modes
```bash
# Specific monitoring types
coi monitor --network # Network connections only
coi monitor --io # Disk I/O only
coi monitor --resources # CPU/memory/cgroup stats
coi monitor --firewall # Firewall events only
# Combined
coi monitor --network --io # Multiple modes
```
Alert Thresholds
```bash
# Alert on events
coi monitor --alert-on-new-connections
coi monitor --alert-on-firewall-block
# Threshold alerts
coi monitor --bandwidth-threshold 100MB/min
coi monitor --io-threshold 1000iops
coi monitor --cpu-threshold 80%
```
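The threshold flags above take human-readable values such as `100MB/min` and `80%`. A minimal sketch of how such values might be normalized before comparison, assuming decimal units (1 MB = 10^6 bytes) and a small set of accepted periods; the function names and the accepted unit list are hypothetical, not part of the proposal:

```python
import re

# Assumption: decimal units and these period names; the proposal does not specify them.
_UNITS = {"B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3}
_PERIODS = {"s": 1, "sec": 1, "min": 60, "h": 3600}
_BANDWIDTH = re.compile(r"^(\d+(?:\.\d+)?)\s*([KMG]?B)/(s|sec|min|h)$", re.IGNORECASE)


def parse_bandwidth_threshold(text: str) -> float:
    """Convert a string such as "100MB/min" into bytes per second."""
    match = _BANDWIDTH.match(text.strip())
    if not match:
        raise ValueError(f"unrecognized bandwidth threshold: {text!r}")
    amount, unit, period = match.groups()
    return float(amount) * _UNITS[unit.upper()] / _PERIODS[period.lower()]


def parse_percent_threshold(text: str) -> float:
    """Convert a string such as "80%" into a fraction between 0 and 1."""
    value = float(text.strip().rstrip("%"))
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {text!r}")
    return value / 100.0
```

Normalizing to bytes per second up front would let the alert check compare directly against the sampled bandwidth rate.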
Output & Integration
```bash
# Logging
coi monitor --log-file /tmp/coi-monitor.log
# Prometheus metrics export
coi monitor --export-prometheus :9090
# Output formats (one of: json, table, dashboard)
coi monitor --format json
```
Audit & Forensics
```bash
# Full audit mode (syscall + network tracing)
coi monitor --audit
# Record session for later replay
coi monitor --record-session
# Network packet capture
coi monitor --pcap /tmp/traffic.pcap
# Replay recorded session
coi monitor replay <session-id>
# Show session statistics
coi monitor stats <session-id>
```
Example Output (TUI Dashboard Mode)
```
Container: coi-abc12345-1
Uptime: 15m 32s
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
NETWORK ACTIVITY
Active Connections: 3
├─ 52.84.142.12:443 (HTTPS) - api.anthropic.com ✓ ALLOWED
├─ 8.8.8.8:53 (DNS) ✓ ALLOWED
└─ 192.168.1.1:80 (HTTP) ✗ BLOCKED (RFC1918)
Bandwidth: ↓ 1.2 MB/s ↑ 45 KB/s
Total: ↓ 18.5 MB ↑ 2.1 MB
Firewall Blocks: 5 attempts
DISK I/O
Read: 125 KB/s (1.8 GB total)
Write: 45 KB/s (456 MB total)
IOPS: 45 read, 12 write
RESOURCES
CPU: 15% (2 cores)
Memory: 512 MB / 2 GB (25%)
Processes: 12
RECENT EVENTS
[15:32:45] ✓ Connected to api.anthropic.com:443
[15:32:42] ✗ Blocked connection to 192.168.1.1:80 (RFC1918)
[15:32:40] ✓ DNS query: api.anthropic.com -> 52.84.142.12
[15:32:38] ℹ File write: /workspace/output.txt (1.2 KB)
```
Implementation Approaches
1. Network Monitoring
Option A: Connection Tracking (conntrack)
- Use `conntrack` to monitor active connections
- Filter by container IP address
- Pros: Low overhead, real-time
- Cons: Only shows active connections, no historical data
Option B: eBPF (bpftrace/bcc-tools)
- Trace network syscalls at kernel level
- Can capture all connection attempts (even failed ones)
- Pros: Very detailed, can't be bypassed
- Cons: Requires BPF support, more complex
Option C: Firewalld Logs
- Parse firewalld logs for block/allow events
- Already available with network isolation
- Pros: Easy, already logged
- Cons: Only shows firewall decisions, not all traffic
Recommended: Hybrid approach
- Use conntrack for active connections
- Parse firewalld logs for blocks
- Optional eBPF mode for deep inspection (`--audit`)
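The conntrack half of the hybrid approach could be sketched as follows. This is an illustration only: it assumes the text output of `conntrack -L` has already been captured as a string, and the helper names are hypothetical. Only the original-direction tuple (the first `src`/`dst`/`sport`/`dport` on each line) is kept, and entries are filtered by the container's IP.

```python
import re
from typing import Optional

# Matches key=value tokens such as src=10.0.0.5 or dport=443.
_KV = re.compile(r"(\w+)=(\S+)")


def parse_conntrack_line(line: str) -> Optional[dict]:
    """Parse one line of `conntrack -L` output into a small dict."""
    fields = line.split()
    if len(fields) < 4:
        return None
    # TCP entries carry a state name (e.g. ESTABLISHED) before the first
    # key=value token; UDP entries do not.
    state = fields[3] if "=" not in fields[3] else None
    entry = {"proto": fields[0], "state": state}
    for key, value in _KV.findall(line):
        # setdefault keeps the first occurrence, i.e. the original direction.
        entry.setdefault(key, value)
    if "src" not in entry or "dst" not in entry:
        return None
    return entry


def connections_for(container_ip: str, conntrack_output: str) -> list:
    """Return entries whose original source address is the container's IP."""
    parsed = (parse_conntrack_line(l) for l in conntrack_output.splitlines())
    return [e for e in parsed if e and e["src"] == container_ip]
```

Because conntrack only lists tracked flows, blocked attempts would still have to come from the firewalld logs, as the recommendation says.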
2. I/O Monitoring
Option A: Cgroup Stats
- Read from `/sys/fs/cgroup/.../io.stat`
- Incus already uses cgroups
- Pros: Built-in, accurate, low overhead
- Cons: Aggregated stats only
Option B: eBPF (biosnoop/biolatency)
- Trace block I/O at kernel level
- Per-file granularity
- Pros: Very detailed
- Cons: Higher overhead
Recommended: Cgroup stats by default, eBPF for --audit mode
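For the default cgroup path, a minimal sketch of reading `io.stat`, assuming cgroup v2, where each line is a device number followed by `key=value` counters (`rbytes`, `wbytes`, `rios`, `wios`, and so on). The function names are hypothetical, and the file content is passed in as a string so the cgroup path itself stays out of the example.

```python
def parse_io_stat(text: str) -> dict:
    """Parse cgroup v2 io.stat content into {device: {counter: int}}."""
    devices = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        device, counters = parts[0], parts[1:]
        devices[device] = {
            key: int(value)
            for key, value in (c.split("=", 1) for c in counters)
        }
    return devices


def total_io(devices: dict) -> dict:
    """Sum read/write bytes and operations across all devices."""
    totals = {"rbytes": 0, "wbytes": 0, "rios": 0, "wios": 0}
    for counters in devices.values():
        for key in totals:
            totals[key] += counters.get(key, 0)
    return totals
```

These are cumulative counters, so the per-second figures in the dashboard would come from the difference between two samples.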
3. Resource Monitoring
Use Incus API + Cgroups:
- `incus info <container>` provides basic stats
- Cgroup stats for detailed metrics
- `/proc/<pid>/` for process-level data
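As an illustration of the cgroup route, CPU usage can be derived from two samples of `cpu.stat`. The sketch below assumes cgroup v2, where `usage_usec` is the cumulative CPU time in microseconds; the function names are hypothetical.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse cgroup v2 cpu.stat ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def cpu_percent(prev_usec: int, curr_usec: int, interval_sec: float) -> float:
    """CPU usage over the interval, where 100.0 means one fully busy core."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    return (curr_usec - prev_usec) / (interval_sec * 1_000_000) * 100.0
```

With this convention a container using two full cores reports 200.0; dividing by the core count would give a 0 to 100 scale instead.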
4. Event Correlation
Challenge: Map host-level events back to containers
- Network: Match by container IP (get from Incus)
- I/O: Match by cgroup path
- Processes: Match by PID namespace
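One way to map a host-level PID back to a container is to read `/proc/<pid>/cgroup` and look for the container's cgroup. The sketch below assumes the container payload lives under a path segment of the form `lxc.payload.<name>`; that prefix is an assumption about how Incus/LXC names cgroups and should be verified on the target host. The function name is hypothetical.

```python
from typing import Optional

# Assumption: container processes live under a cgroup path segment named
# "lxc.payload.<name>"; check this on the target host before relying on it.
PAYLOAD_PREFIX = "lxc.payload."


def container_from_cgroup(proc_cgroup_text: str) -> Optional[str]:
    """Extract a container name from the content of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        for segment in parts[2].split("/"):
            if segment.startswith(PAYLOAD_PREFIX):
                return segment[len(PAYLOAD_PREFIX):]
    return None
```

A process that returns `None` here belongs to the host or to something other than a container matching the assumed naming.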
Data Sources
| Metric | Source | Method |
|---|---|---|
| Active connections | `/proc/net/tcp`, conntrack | Parse `/proc` or use conntrack CLI |
| Bandwidth | cgroup `io.stat`, iftop | Read cgroup stats or parse iftop |
| Firewall events | firewalld logs | Parse journalctl output |
| Disk I/O | cgroup `io.stat` | Read from sysfs |
| CPU usage | cgroup `cpu.stat` | Read from sysfs |
| Memory | cgroup `memory.current` | Read from sysfs |
| DNS queries | eBPF (optional) | Trace UDP port 53 |
| Syscalls | eBPF (optional) | Trace syscall entry points |
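To illustrate the "Parse `/proc`" method for active connections, here is a sketch of decoding `/proc/net/tcp` rows. It assumes IPv4 and a little-endian host (x86, most ARM), where the kernel prints addresses byte-reversed in hex; ports and states are plain hex. The function names are hypothetical, and the file content is passed in as a string.

```python
import socket
import struct

# State codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV", "04": "FIN_WAIT1",
    "05": "FIN_WAIT2", "06": "TIME_WAIT", "07": "CLOSE", "08": "CLOSE_WAIT",
    "09": "LAST_ACK", "0A": "LISTEN", "0B": "CLOSING",
}


def _decode_addr(hex_addr: str) -> tuple:
    """Turn "0100007F:1F90" into ("127.0.0.1", 8080) on a little-endian host."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_proc_net_tcp(text: str) -> list:
    """Parse the content of /proc/net/tcp into a list of connection dicts."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 4:
            continue
        local_ip, local_port = _decode_addr(cols[1])
        remote_ip, remote_port = _decode_addr(cols[2])
        rows.append({
            "local": (local_ip, local_port),
            "remote": (remote_ip, remote_port),
            "state": TCP_STATES.get(cols[3], cols[3]),
        })
    return rows
```

Note that `/proc/net/tcp` reflects the network namespace of the reading process, so a monitor on the host would likely need the file as seen by a process inside the container.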
Implementation Phases
Phase 1: Basic Monitoring (MVP)
- `coi monitor <container>` - simple table output
- Network: active connections from `/proc/net/tcp` + conntrack
- I/O: basic stats from cgroup
- Resources: CPU/memory from cgroup
- Parse firewalld logs for blocks
- JSON output mode (`--json`)
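The proposal does not define a schema for `--json`, so the following is only a hypothetical shape for one snapshot, emitted as a single JSON line so that it can be piped into other tools. Every field name here is an assumption chosen to mirror the dashboard example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Snapshot:
    """One point-in-time reading for a container (hypothetical schema)."""
    container: str
    active_connections: int
    firewall_blocks: int
    cpu_percent: float
    memory_bytes: int
    events: list = field(default_factory=list)


def to_json_line(snapshot: Snapshot) -> str:
    """Serialize a snapshot as one line of JSON with stable key order."""
    return json.dumps(asdict(snapshot), sort_keys=True)
```

One object per line keeps the output easy to consume with line-oriented tools while the monitor keeps running.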
Phase 2: Enhanced Output
- TUI dashboard mode (using bubbletea or similar)
- Real-time updates
- Color-coded output (green=allowed, red=blocked)
- Bandwidth calculation (bytes/sec)
- Event timeline
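The bandwidth calculation in this phase amounts to differencing two cumulative byte counters. A minimal sketch, assuming decimal units for display to match the dashboard example; the function names are hypothetical:

```python
def rate_per_second(prev_total: int, curr_total: int, interval_sec: float) -> float:
    """Bytes per second between two cumulative counter readings."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    # A counter that went backwards (e.g. after a restart) is treated as zero traffic.
    return max(curr_total - prev_total, 0) / interval_sec


def format_rate(bytes_per_sec: float) -> str:
    """Render a rate with a decimal unit, e.g. 1200000 as "1.2 MB/s"."""
    value = float(bytes_per_sec)
    for unit in ("B/s", "KB/s", "MB/s"):
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} GB/s"
```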
Phase 3: Alerts & Thresholds
- Alert on new connections (`--alert-on-new-connections`)
- Bandwidth thresholds (`--bandwidth-threshold`)
- Firewall block alerts
- Webhook notifications
Phase 4: Audit & Forensics
- Session recording (`--record-session`)
- eBPF integration for deep inspection (`--audit`)
- Packet capture (`--pcap`)
- Session replay (`coi monitor replay`)
- Summary statistics (`coi monitor stats`)
Phase 5: Integration
- Prometheus metrics export
- SIEM webhooks
- Grafana dashboards
- Audit log export (JSON/CSV)
Technical Considerations
Privileges:
- Network monitoring: Requires access to `/proc/net/` and conntrack (may need sudo)
- Cgroup stats: Read-only access to `/sys/fs/cgroup/` (usually accessible)
- eBPF: Requires CAP_BPF or CAP_SYS_ADMIN (sudo)
- Firewalld logs: Requires journalctl access (incus-admin group should have this)
Performance:
- Cgroup stats: Negligible overhead
- Conntrack: Very low overhead
- eBPF: Low overhead but depends on trace frequency
- Packet capture: Can be heavy with high traffic
Container Identification:
- Get container IP from Incus API
- Match network events by source IP
- Match I/O by cgroup path (Incus sets this)
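Matching by source IP applies to firewall events as well. The sketch below assumes blocked packets show up in the journal as kernel netfilter LOG lines with `SRC=`, `DST=`, `PROTO=`, and `DPT=` fields; whether such lines exist, and what prefix they carry, depends on the firewalld logging configuration, so the sample format is an assumption. The function name is hypothetical.

```python
import re
from typing import Optional

# Netfilter LOG lines are a sequence of UPPERCASE=value tokens (value may be empty).
_FIELD = re.compile(r"\b([A-Z]+)=(\S*)")


def parse_block_event(line: str, container_ip: str) -> Optional[dict]:
    """Return the blocked destination if the log line originates from the container."""
    fields = dict(_FIELD.findall(line))
    if fields.get("SRC") != container_ip:
        return None
    dport = fields.get("DPT", "")
    return {
        "dst": fields.get("DST"),
        "proto": fields.get("PROTO"),
        "dport": int(dport) if dport.isdigit() else None,
    }
```

Lines from other containers or from the host itself return `None`, so the same journal stream can be shared by several monitors.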
Security & Privacy
Concerns:
- Packet capture could expose sensitive data
- Full syscall tracing reveals all container activity
- Logs could contain API keys or credentials
Mitigations:
- Audit mode (`--audit`, `--pcap`) requires explicit opt-in
- Warn users about sensitive data in logs
- Option to filter/redact credentials in output
- Secure storage for recorded sessions
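The redaction option could start as a simple pattern-based filter applied to each log line before it is written. This is a sketch under stated assumptions: the two patterns below (Authorization headers and credential-like `key=value` pairs) are illustrative, not an exhaustive or authoritative list, and the function name is hypothetical.

```python
import re

_PATTERNS = [
    # Authorization headers: keep the scheme, hide the credential.
    (re.compile(r"(?i)(authorization:\s*(?:bearer|basic)\s+)\S+"), r"\1[REDACTED]"),
    # key=value or key: value pairs with credential-like key names.
    (re.compile(r"(?i)\b((?:api[_-]?key|token|secret|password)\s*[=:]\s*)[^\s&\"']+"),
     r"\1[REDACTED]"),
]


def redact(line: str) -> str:
    """Replace credential-like values in a log line with a placeholder."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

Pattern matching of this kind will miss credentials in unexpected formats, so it complements rather than replaces the opt-in and secure-storage mitigations.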
Use Cases
- Security Auditor: "I need to verify the AI didn't exfiltrate data"
  `coi monitor <container> --log-file audit.log` (review all connections after the session)
- Developer: "Why is network isolation blocking my package install?"
  `coi monitor <container> --firewall` (see exactly what's being blocked)
- Compliance Officer: "Generate audit report for AI agent session"
  `coi monitor <container> --record-session`, then `coi monitor stats <session-id> --export report.pdf`
- DevOps: "Integrate COI metrics into our monitoring stack"
  `coi monitor --all --export-prometheus :9090` (scrape with Prometheus, visualize in Grafana)
Open Questions
- Should monitoring be opt-in or always-on?
  - Proposal: Basic stats always available via `coi monitor`, audit mode opt-in
- How long to keep recorded sessions?
  - Proposal: Configurable retention policy, default 7 days
- Should we auto-start monitoring when the `--audit` flag is used with `coi shell`?
  - Proposal: Yes, `coi shell --audit` automatically enables monitoring
- Privacy concerns with packet capture?
  - Proposal: Explicit warning, require `--i-understand-the-risks` flag
Related Issues
- Network isolation (#XX - if exists)
- Session management (#XX - if exists)
- Resource limits (feat: Add resource and time limits for containers #99)