Skip to content

Get Task File Sizes From S2#88

Open
cooper-grc wants to merge 4 commits intomainfrom
cg/tasks-files-3
Open

Get Task File Sizes From S2#88
cooper-grc wants to merge 4 commits intomainfrom
cg/tasks-files-3

Conversation

@cooper-grc
Copy link
Copy Markdown
Contributor

@cooper-grc cooper-grc commented Feb 24, 2026

Summary by cubic

Return accurate sizes for task files in /tasks using a 10MB sentinel until the real size is known, then cache the actual size after the first read for terminal tasks. This fixes size reporting in listings and helps FUSE/NFS issue full reads.

  • New Features
    • Added a size cache to TasksVNode; populated after Read for terminal tasks.
    • Getattr and Readdir now use the cached size or a 10MB sentinel via taskFileSize.

Written for commit 3c6d422. Summary will update on new commits.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 24, 2026

✅ Integration Tests PASSED

Gateway: https://api.airstore.ai
Query path: /sources/gmail/unread-emails
Timestamp: 2026-02-25T02:24:10Z

I/O Smoke Tests

Test Status Latency Bytes
list_sources PASS 146ms -
list_gmail PASS 327ms -
list_query_path PASS 533ms -
read_file PASS 371ms 1,724
stat_file PASS 145ms -

Compression A/B Tests

Metric Value
Files tested 8
Total raw bytes 73,033
Total strip bytes 46,330
Total raw tokens 41,729
Total strip tokens 27,530
Tokens saved 14,199
Byte reduction 36%
Token reduction 34%
Min threshold 10%
Cache consistent Yes
Per-file results
File Raw B Strip B Raw Tok Strip Tok Tok Saved Red% Raw ms Strip ms
2024-10-15_Harris-Walz_Organizing_Tea... 2,337 2,305 976 967 9 1% 291ms 167ms
2024-10-15_Stephen_King_via_KamalaHar... 2,904 2,866 1,116 1,103 13 1% 261ms 153ms
2025-07-17_KAYAK_How_to_have_everythi... 1,724 1,670 1,220 1,204 16 3% 143ms 153ms
2025-07-17_Alamo_Drafthouse_Cinema_Th... 19,047 18,224 12,228 11,704 524 4% 305ms 184ms
2025-07-17_Alamo_Drafthouse_Cinema_Th... 18,978 18,155 12,105 11,578 527 4% 307ms 180ms
2024-10-15_Knut_at_Sanity_Join_us_tom... 1,272 726 504 209 295 42% 277ms 144ms
2025-07-17_Lindsay_Stanley_TODAY_Conf... 1,876 1,049 657 268 389 44% 304ms 144ms
2025-07-17_NORDSTROM_RACK_FLASH_Travi... 24,895 1,335 12,923 497 12,426 94% 344ms 169ms

Charts

Overall Compression
Raw vs Compressed Bytes
Token Comparison
Reduction by File
Read Latency
I/O Latency

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/filesystem/vnode/tasks.go">

<violation number="1" location="pkg/filesystem/vnode/tasks.go:20">
P2: This still advertises a fixed sentinel size when the real task log size is unknown. Clients that rely on stat size can still stop reads early; the feedback calls out using the actual S2 content size instead of a sentinel. Consider fetching/logging the true object size (e.g., S2 metadata) for Getattr/Readdir rather than returning the sentinel.

(Based on your team's feedback about advertising actual S2-backed file sizes rather than using a sentinel size.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Copy Markdown
Contributor

@luke-lombardi luke-lombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cooper-grc I think this branch may need to get cleaned up / integrated into the new agent task model

@luke-lombardi luke-lombardi mentioned this pull request Feb 27, 2026
Copy link
Copy Markdown
Contributor

@luke-lombardi luke-lombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall: Clean, minimal alternative to the #49/#50 approach. The lazy size-caching strategy — sentinel until a task is terminal, then cache the real size — is well-reasoned: running tasks' sizes change constantly so a 5s content cache (PR #50 approach) adds complexity without much benefit, while terminal tasks are stable forever. This feels like the right tradeoff for a size-only cache.

Relationship to #49 and #50:
This PR targets main directly and takes a different approach than the #49#50 chain. Clarification on which branch will land is needed — are these competing approaches or is #88 meant to be the replacement? The naming (cg/tasks-files-3) suggests iteration rather than competition.

Running tasks still show 10MB sentinel in listings:
This is the deliberate tradeoff. If a user does ls -la /tasks and sees 10MB for all in-progress tasks, that's misleading but technically acceptable since they need to cat the file to see actual content anyway. A comment on taskFileSentinelSize documenting this behavior for reviewers/maintainers would help.

Race in sizeCacheMu vs cacheMu:
These are two separate locks — one for the task list cache and one for the size cache. This is correct, but make sure no code holds both locks simultaneously (a quick scan shows they don't — each critical section only acquires one, which is fine).

task.IsTerminal() is a domain boundary:
The IsTerminal() call in Read is the right semantic gate for caching permanence. Worth confirming that IsTerminal() covers all final states (complete, failed, cancelled) and doesn't omit any edge cases like timeout or preempted states.

No tests:
This PR doesn't include tests. Adding a test analogous to TestTasksVNode_GetattrUsesPlaceholderUntilContentCached from PR #50 — verifying that a terminal task's size is cached after Read and returned by subsequent Getattr without re-fetching — would strengthen confidence.

Copy link
Copy Markdown
Contributor

@luke-lombardi luke-lombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Get Task File Sizes From S2

Summary: Solid improvement over PR #49. Rather than permanently lying about file size, this caches the real rendered size after the first read and uses the sentinel only as a temporary placeholder. Correct behavior for FUSE/NFS clients.

Positives:

  • taskFileSize is clean: double-checked read lock → sentinel fallback pattern is idiomatic Go.
  • Caching only on terminal tasks (IsTerminal()) is the right scope — non-terminal tasks can still change.
  • Both Getattr and Readdir use the same helper consistently.
  • sizeCache initialized in both constructors (NewTasksVNode and NewTasksVNodeGRPC) — no nil map panics.

Observations / Suggestions:

  1. Cache grows unboundedsizeCache is never pruned. For long-running processes with many distinct task IDs this could leak memory. A simple TTL-based eviction or bounding by max entries would help.

  2. Lock orderingRead acquires sizeCacheMu.Lock inside an existing write lock on data (depending on how Read is structured upstream). Worth a quick audit to ensure no lock inversion with cacheMu.

  3. Write under write lock is fine, but consider sync.Map — since entries are written once and read many times, sync.Map could reduce contention in high-read scenarios. Not critical.

  4. No eviction when task is deleted — if tasks can be removed from the backend, stale size entries will persist in the cache. Low risk since the cache is per-process-lifetime, but worth noting.

  5. Relationship to PR #49 — this PR duplicates the sentinel constant from #49. If both are open, one should be closed or they should be merged in the right order to avoid a conflict.

Verdict: Approved with minor caveats. The unbounded cache is the main concern for long-running deployments — a simple max-size cap would address it.

Copy link
Copy Markdown
Contributor

@luke-lombardi luke-lombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #88: Get Task File Sizes From S2

Improves on the pure-sentinel approach from #49 by caching the actual rendered size after the first Read for terminal tasks, then serving that cached value from Getattr and Readdir going forward.

What looks good:

  • The two-phase approach (sentinel → real size after read) is pragmatic: it avoids a blocking fetch on every Getattr/Readdir while still converging to the correct size once the file has been read once.
  • Restricting size caching to terminal tasks (task.IsTerminal()) is the right call — in-progress task content changes, so caching its size would produce stale listings.
  • Separate sizeCacheMu from the task-list cacheMu is clean; avoids lock contention between list operations and size lookups.
  • Constructor initialization of sizeCache in both NewTasksVNode and NewTasksVNodeGRPC is consistent.

Observations / Nits:

  • Non-terminal tasks will always report the 10MB sentinel on every Getattr/Readdir call until they complete. If a tool does a stat loop watching an in-progress task, it will always see 10MB. This is probably acceptable, but worth documenting in the function-level comment on taskFileSize.
  • taskFileSize uses early-return on cache hit and falls through to return the sentinel — readable, but a single-expression ternary or named return might make the intent clearer. Minor style point.
  • No tests added in this PR. PR #50 adds a test file for this layer; if #50 is the intended follow-up, that's fine — but if this PR targets main independently, consider adding at least one test covering the terminal-task caching path.

No blocking issues.

Copy link
Copy Markdown
Contributor

@luke-lombardi luke-lombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall: Good incremental improvement over PR #49. Caching the real rendered size after first read removes the permanent sentinel inaccuracy for terminal tasks, which matters for clients that rely on reported size (e.g., for progress bars or pre-allocation).

Issues / observations:

  1. Unbounded sizeCache growthsizeCache is a plain map[string]int64 with no eviction. For a long-running gateway that processes many tasks over time, every terminal task ID will accumulate in memory indefinitely. Consider either capping the cache size (LRU with a fixed bound) or piggybacking on the existing tasksCacheTTL/cacheExpiry mechanism to clear stale entries.

  2. sizeCache not initialized in NewTasksVNode's GRPCless path — Wait, looking again: NewTasksVNode does initialize sizeCache: make(map[string]int64) — this is fine. But NewTasksVNodeGRPC also does it correctly. No issue here.

  3. Supersedes PR #49 — Both PRs define taskFileSentinelSize and modify the same lines in tasks.go. They will conflict at merge time. PR #49 should be closed once this one is merged.

  4. Non-terminal tasks always report sentinel size — This is correct and expected. However, if a non-terminal task's content changes between reads (e.g., live streaming output), the sentinel will remain 10MB even though the actual content may be smaller or larger. This is an inherent limitation of the sentinel approach and is acceptable given the current design.

  5. RLock → check → RUnlock pattern — The double-lock pattern in taskFileSize (RLock, check, RUnlock) is correct. No TOCTOU issue here since the write path uses a separate Lock(). Good.

Looks solid overall. Resolve the potential cache growth concern and close PR #49.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants