feat: improve eval criteria, clarity in improvement process#1035
feat: improve eval criteria, clarity in improvement process#1035Henry-811 merged 6 commits intodev/v0.1.70from
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedUse the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR introduces a tool renaming ( Changes
Sequence Diagram(s)sequenceDiagram
participant Agent
participant Orchestrator
participant WebServer
participant Client as Client/WebUI
participant User
Agent->>Orchestrator: submit_answer()
Orchestrator->>WebServer: show_final_answer_modal()
WebServer->>Client: emit review_request event + display modal
Client->>User: show answer in review modal
User->>Client: approve/reject with comments
Client->>WebServer: POST /api/sessions/{id}/review-response
WebServer->>Orchestrator: resolve_review(result_data)
Orchestrator->>Orchestrator: decide continuation based on approval
Orchestrator->>Agent: proceed to next round or finalize
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai pause |
There was a problem hiding this comment.
Pull request overview
Adds stronger, more explicit evaluation/improvement semantics (new criterion tiers + draft_approach), plus new WebUI flows for (1) pre-collaboration phase visibility and (2) a git-diff review gate before applying write-mode changes.
Changes:
- Introduces WebUI pre-collab phase tracking (sidebar + results panel) with subagent event polling, and adds a full-screen review modal driven by new WebSocket events.
- Updates checklist-gated evaluation flow across backend/tests/docs (criterion categories, rename
propose_improvements→draft_approach, new anti-pattern/aspiration shape, fast-iteration option). - Improves operational robustness: toolArgs normalization for Copilot hooks/PPM, workspace routing/cleanup, review-pending status surfaced in
status.json.
Reviewed changes
Copilot reviewed 54 out of 191 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| webui/src/types/index.ts | Adds WS event types + review event payload types for the WebUI review modal. |
| webui/src/stores/v2/statusStore.ts | Updates eval criterion category comment to reflect new tiering scheme. |
| webui/src/stores/v2/reviewStore.ts | New Zustand store for parsing diffs, per-file approval selection, and sending review decisions. |
| webui/src/stores/v2/preCollabStore.ts | New Zustand store modeling pre-collab lifecycle + storing generated personas/criteria/prompt artifacts. |
| webui/src/stores/v2/preCollabStore.test.ts | Unit tests validating pre-collab event handling, results storage, and panel state. |
| webui/src/stores/v2/messageStore.ts | Routes specific structured events to preCollabStore via dynamic import. |
| webui/src/stores/agentStore.ts | Notes that pre-collab events are handled in messageStore → preCollabStore path. |
| webui/src/hooks/useWebSocket.ts | Dispatches review WS events to reviewStore and registers a send function for responses. |
| webui/src/hooks/useSubagentEvents.ts | New polling hook to fetch subagent events and feed them into the message store. |
| webui/src/components/v2/tiles/SubagentTile.tsx | Enables subagent event polling when viewing a running pre-collab phase tile. |
| webui/src/components/v2/tiles/PromptBanner.tsx | Adds colors for new criterion categories with legacy fallbacks. |
| webui/src/components/v2/sidebar/Sidebar.tsx | Adds PreCollabSection above channels when phases exist. |
| webui/src/components/v2/sidebar/PreCollabSection.tsx | New sidebar section rendering phase status and linking to tiles/results. |
| webui/src/components/v2/layout/PreCollabResultsPanel.tsx | New modal panel showing personas/criteria/prompt outputs by tab. |
| webui/src/components/v2/layout/AppShell.tsx | Mounts PreCollabResultsPanel and ReviewModal; suppresses LaunchIndicator during active pre-collab. |
| massgen/token_manager/token_manager.py | Clarifies semantics for context usage percentage as a per-round cost proxy. |
| massgen/tests/unit/test_orchestrator_unit.py | Updates expected prompt injection text to reference draft_approach. |
| massgen/tests/test_task_decomposer.py | Updates expected decomposition criteria categories to new scheme. |
| massgen/tests/test_system_prompt_sections.py | Updates expectations for new tool naming + evaluator availability flag usage. |
| massgen/tests/test_subagent_mcp_server.py | Adjusts tests for deferred config loading behavior. |
| massgen/tests/test_standalone_mcp_servers.py | Renames propose-improvements tests to draft-approach equivalents. |
| massgen/tests/test_specialized_subagents.py | Updates prompt assertions to reference draft_approach tool name. |
| massgen/tests/test_round_evaluator_loop.py | Updates expected injected block wording to draft_approach. |
| massgen/tests/test_read_media_analysis.py | Adjusts expected wording (“foundation” → “fundamental”). |
| massgen/tests/test_planning_injection.py | Renames propose-improvements references to draft-approach in planning injection tests/docs. |
| massgen/tests/test_novelty_injection.py | Updates novelty messaging expectations (“CONVERGENCE SIGNAL”) and added guidance assertions. |
| massgen/tests/test_injection_checklist_guidance.py | Updates injection guidance expectations to reference draft_approach. |
| massgen/tests/test_gepa_evaluation_flow.py | Updates default criteria count/category expectations and parsing return shape. |
| massgen/tests/test_enforcement_observability.py | Ensures orchestrator mocks include _review_pending. |
| massgen/tests/test_decomposition_mode.py | Updates _get_active_criteria return tuple shape and category expectations. |
| massgen/tests/test_convergence_novelty.py | Updates changedoc category expectations and revised wording assertions. |
| massgen/tests/test_checklist_criteria_presets.py | Updates valid categories assertions and related preset expectations. |
| massgen/tests/test_changedoc_system_prompt.py | Updates changedoc system prompt expectations and evaluator availability setup. |
| massgen/tests/backend/test_copilot_permission_integration.py | Adds regression test ensuring PPM hook tolerates non-dict tool args. |
| massgen/tests/backend/test_copilot_hooks.py | Expands hook adapter tests to cover toolArgs being JSON strings vs dicts. |
| massgen/task_decomposer.py | Updates decomposition schema example and maps legacy criterion categories to new tiers; threads fast-iteration flag. |
| massgen/system_message_builder.py | Threads anti-patterns, evaluator availability, and fast-iteration flags into system prompt sections. |
| massgen/subagent_types/execution_trace_analyzer/SUBAGENT.md | Refactors subagent spec to DO/DON’T oriented guidance format. |
| massgen/subagent/launch_watcher.py | Adds API to allow additional workspace roots. |
| massgen/skills/massgen/scripts/review_watcher.sh | New watcher script that emits structured markers when review becomes pending/resolved. |
| massgen/skills/massgen/scripts/massgen_run.sh | Adds --web-review wrapper behavior to run MassGen + review watcher together. |
| massgen/skills/massgen/SKILL.md | Updates skill guidance for criteria format, cwd-context usage, and web review workflow. |
| massgen/skills/massgen-log-analyzer/SKILL.md | Updates log analyzer guidance/tool names from propose-improvements to draft-approach. |
| massgen/prompt_improver.py | Threads fast-iteration flag into pre-collab prompt improvement coordination. |
| massgen/persona_generator.py | Threads fast-iteration flag into pre-collab persona generation coordination. |
| massgen/message_templates.py | Tightens final presentation system message to require fully polished output. |
| massgen/mcp_tools/standalone/quality_server.py | Renames propose_improvements → draft_approach, adjusts defaults, and expands payload. |
| massgen/mcp_tools/planning/_planning_mcp_server.py | Renames injection references to draft-approach in comments/logging/help text. |
| massgen/mcp_tools/native_hook_adapters/copilot_adapter.py | Prevents double-encoding when toolArgs arrives as a JSON string. |
| massgen/frontend/web/static/index.html | Updates bundled asset filenames for rebuilt WebUI static output. |
| massgen/frontend/web/static/assets/stateDiagram-v2-4FDKWEC3-CSg-juUb.js | Removes old built asset. |
| massgen/frontend/web/static/assets/stateDiagram-v2-4FDKWEC3-BZytH76p.js.map | Updates source map for rebuilt asset. |
| massgen/frontend/web/static/assets/stateDiagram-v2-4FDKWEC3-BZytH76p.js | Adds rebuilt asset. |
| massgen/frontend/web/static/assets/pieDiagram-ADFJNKIX-nSgxIlGJ.js | Updates rebuilt asset to reference new bundle/chunk names. |
| massgen/frontend/web/static/assets/min-D6HVG0bB.js | Updates rebuilt asset. |
| massgen/frontend/web/static/assets/infoDiagram-WHAUD3N6-CRJbwCDr.js.map | Updates rebuilt asset map. |
| massgen/frontend/web/static/assets/infoDiagram-WHAUD3N6-CRJbwCDr.js | Updates rebuilt asset. |
| massgen/frontend/web/static/assets/flowDiagram-NV44I4VS-DvfftS_T.js | Updates rebuilt asset. |
| massgen/frontend/web/static/assets/diagram-S2PKOQOG-BxWqGtM2.js | Updates rebuilt asset. |
| massgen/frontend/web/static/assets/clone-Dk4sOD41.js.map | Updates rebuilt asset map. |
| massgen/frontend/web/static/assets/clone-Dk4sOD41.js | Adds rebuilt asset. |
| massgen/frontend/web/static/assets/clone-BbJZWq-S.js | Removes old built asset. |
| massgen/frontend/web/static/assets/classDiagram-v2-WZHVMYZB-cXQvqT0Z.js.map | Updates rebuilt asset map. |
| massgen/frontend/web/static/assets/classDiagram-v2-WZHVMYZB-cXQvqT0Z.js | Adds rebuilt asset. |
| massgen/frontend/web/static/assets/classDiagram-v2-WZHVMYZB-DWh8I7Vs.js | Removes old built asset. |
| massgen/frontend/web/static/assets/classDiagram-2ON5EDUG-cXQvqT0Z.js.map | Updates rebuilt asset map. |
| massgen/frontend/web/static/assets/classDiagram-2ON5EDUG-cXQvqT0Z.js | Adds rebuilt asset. |
| massgen/frontend/web/static/assets/classDiagram-2ON5EDUG-DWh8I7Vs.js | Removes old built asset. |
| massgen/frontend/web/static/assets/chunk-TZMSLE5B-CNqpE2vn.js | Updates rebuilt chunk to reference new bundle. |
| massgen/frontend/web/static/assets/chunk-QZHKN3VN-T8lBmuOM.js.map | Updates rebuilt chunk map. |
| massgen/frontend/web/static/assets/chunk-QZHKN3VN-T8lBmuOM.js | Updates rebuilt chunk to reference new bundle. |
| massgen/frontend/web/static/assets/chunk-QN33PNHL-BfOaT5Ka.js.map | Updates rebuilt chunk map. |
| massgen/frontend/web/static/assets/chunk-QN33PNHL-BfOaT5Ka.js | Updates rebuilt chunk to reference new bundle. |
| massgen/frontend/web/static/assets/chunk-FMBD7UC4-DUvoIIV_.js.map | Updates rebuilt chunk map. |
| massgen/frontend/web/static/assets/chunk-FMBD7UC4-DUvoIIV_.js | Updates rebuilt chunk to reference new bundle. |
| massgen/frontend/web/static/assets/chunk-55IACEB6-BcS3BSyB.js.map | Updates rebuilt chunk map. |
| massgen/frontend/web/static/assets/chunk-55IACEB6-BcS3BSyB.js | Updates rebuilt chunk to reference new bundle. |
| massgen/frontend/web/static/assets/chunk-4BX2VUAB-B2PxL9r5.js.map | Updates rebuilt chunk map. |
| massgen/frontend/web/static/assets/chunk-4BX2VUAB-B2PxL9r5.js | Updates rebuilt chunk to reference new bundle. |
| massgen/frontend/web/static/assets/channel-JJR5B7js.js | Removes old built asset. |
| massgen/frontend/web/static/assets/channel-0bNEMcha.js.map | Adds rebuilt asset map. |
| massgen/frontend/web/static/assets/channel-0bNEMcha.js | Adds rebuilt asset. |
| massgen/frontend/web/static/assets/base-80a1f760-DCEq9MnE.js.map | Updates rebuilt asset map. |
| massgen/frontend/web/static/assets/base-80a1f760-DCEq9MnE.js | Updates rebuilt asset. |
| massgen/frontend/web/static/assets/arc-gU3KohN6.js | Updates rebuilt asset. |
| massgen/frontend/displays/web_display.py | Adds review modal lifecycle support (WS events + REST resolution + status markers/files). |
| massgen/frontend/displays/textual/widgets/modals/content_modals.py | Adds new category badges and defaults for criteria modal (with legacy fallbacks). |
| massgen/frontend/displays/rich_terminal_display.py | Improves context-usage explanation and warning text. |
| massgen/filesystem_manager/_path_permission_manager.py | Hardens tool-args parsing to avoid crashes when args aren’t dict-shaped. |
| massgen/filesystem_manager/_isolation_context_manager.py | Clears non-git workspace files between rounds in workspace mode. |
| massgen/filesystem_manager/_filesystem_manager.py | Prunes stale workspaces and optionally cleans up main workspace under .massgen/workspaces/. |
| massgen/coordination_tracker.py | Defaults criteria category to standard and adds review_pending/waiting state to status output. |
| massgen/configs/features/fast_iteration.yaml | New example config enabling fast-iteration mode + docker/code-mode defaults. |
| massgen/config_validator.py | Allows new category values and validates fast_iteration_mode. |
| massgen/config_builder.py | Centralizes docker backend defaults into a shared constant and uses it for quickstart config generation. |
| massgen/cli.py | Adds workspace routing under .massgen/workspaces/, propagates --web-review, and updates criteria parsing defaults. |
| massgen/backend/base.py | Computes context_usage_pct based on per-round input tokens rather than cumulative. |
| massgen/api_params_handler/_api_params_handler_base.py | Updates excluded param comment to reference draft_approach flow. |
| massgen/agent_config.py | Adds auto_trace_analysis, web_review, and fast_iteration_mode to coordination config. |
| docs/source/user_guide/concepts.rst | Adds checklist-gated evaluation section and updates tool naming (draft_approach). |
| docs/source/reference/yaml_schema.rst | Documents fast_iteration_mode under coordination config schema. |
| docs/modules/subagents.md | Updates subagent documentation for tool naming changes. |
| docs/modules/injection.md | Updates injection docs to reference draft_approach. |
| docs/modules/coordination_workflow.md | Updates workflow docs and describes fast-iteration mode behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| request_json = { | ||
| "review_pending": True, | ||
| "url": "http://localhost:8000/?v=2", | ||
| "api_url": f"http://localhost:8000/api/sessions/{self.session_id}/review-response", |
There was a problem hiding this comment.
The review URL/API URL written to review_request.json is hard-coded to http://localhost:8000, which will be wrong when the WebUI is served on a different host/port (e.g., --web-port, remote server). Consider deriving this from the server’s configured base URL (passed into WebDisplay via kwargs) so external watchers and logs point to the correct address.
| request_json = { | |
| "review_pending": True, | |
| "url": "http://localhost:8000/?v=2", | |
| "api_url": f"http://localhost:8000/api/sessions/{self.session_id}/review-response", | |
| base_url = ( | |
| getattr(self, "web_base_url", None) | |
| or getattr(self, "base_url", None) | |
| or getattr(self, "server_base_url", None) | |
| or "http://localhost:8000" | |
| ) | |
| base_url = base_url.rstrip("/") | |
| request_json = { | |
| "review_pending": True, | |
| "url": f"{base_url}/?v=2", | |
| "api_url": f"{base_url}/api/sessions/{self.session_id}/review-response", |
| # Wait briefly for LOG_DIR to be printed, then extract it from status.json | ||
| sleep 3 | ||
| # Find the latest log dir from ~/.massgen/massgen_logs | ||
| LOG_DIR=$(ls -td ~/.massgen/massgen_logs/*/ 2>/dev/null | head -1) | ||
| if [[ -n "$LOG_DIR" ]]; then | ||
| "$SKILL_DIR/scripts/review_watcher.sh" "$LOG_DIR" & | ||
| WATCHER_PID=$! | ||
| fi |
There was a problem hiding this comment.
When --web-review is enabled, the script picks LOG_DIR via ls -td ~/.massgen/massgen_logs/* | head -1. If multiple runs are active (or another run finishes later), the watcher can attach to the wrong session. Prefer parsing the actual LOG_DIR from MassGen’s stdout (it prints [WebUI] LOG_DIR: ...) or accept it explicitly as an argument/output from the MassGen process you just spawned.
| # Usage: | ||
| # review_watcher.sh <LOG_DIR> [--poll-interval SECONDS] | ||
| # |
There was a problem hiding this comment.
The usage text advertises review_watcher.sh <LOG_DIR> [--poll-interval SECONDS], but the script actually treats $2 as the poll interval and does not parse a --poll-interval flag. Either implement flag parsing (recommended) or update the usage/docs to match the positional-arg behavior.
| export const useReviewStore = create<ReviewState & ReviewActions>( | ||
| (set, get) => ({ | ||
| // Internal: WebSocket send function (set via setSendFn) | ||
| _sendFn: null as ((data: Record<string, unknown>) => void) | null, | ||
|
|
There was a problem hiding this comment.
useReviewStore is typed as ReviewState & ReviewActions, but the initializer adds _sendFn, which isn’t declared in either interface. This typically triggers an excess-property type error and forces the later any casts. Consider modeling _sendFn in the store type (e.g., a separate internal interface) or keeping the send function in a module-level variable/closure so the store remains fully typed without any casts.
| // Match "diff --git a/path b/path" or "+++ b/path" | ||
| const diffMatch = line.match(/^diff --git a\/(.+?) b\/(.+)/); | ||
| if (diffMatch) { |
There was a problem hiding this comment.
The comment says it matches both diff --git ... and +++ b/path, but the implementation only matches ^diff --git .... Either update the comment to reflect the actual behavior, or extend the parser to handle diffs that start at +++/--- (some patch formats omit the diff --git line).
✅ Actions performedReviews paused. |
PR Title Format
Your PR title must follow the format:
<type>: <brief description>Valid types:
fix:- Bug fixesfeat:- New featuresbreaking:- Breaking changesdocs:- Documentation updatesrefactor:- Code refactoringtest:- Test additions/modificationschore:- Maintenance tasksperf:- Performance improvementsstyle:- Code style changesci:- CI/CD configuration changesExamples:
fix: resolve memory leak in data processingfeat: add export to CSV functionalitybreaking: change API response formatdocs: update installation guideDescription
Brief description of the changes in this PR
Type of change
fix:) - Non-breaking change which fixes an issuefeat:) - Non-breaking change which adds functionalitybreaking:) - Fix or feature that would cause existing functionality to not work as expecteddocs:) - Documentation updatesrefactor:) - Code changes that neither fix a bug nor add a featuretest:) - Adding missing tests or correcting existing testschore:) - Maintenance tasks, dependency updates, etc.perf:) - Code changes that improve performancestyle:) - Changes that do not affect the meaning of the code (formatting, missing semi-colons, etc.)ci:) - Changes to CI/CD configuration files and scriptsChecklist
Pre-commit status
How to Test
Add test method for this PR.
Test CLI Command
Write down the test bash command. If there is pre-requests, please emphasize.
Expected Results
Description/screenshots of expected results.
Additional context
Add any other context about the PR here.
Summary by CodeRabbit
New Features
Improvements
Documentation