Replace Playwright spec tests with QUnit-based card testing (CS-10599)#4337
Overhaul the software factory's testing infrastructure to use QUnit .test.gts files that render cards in a real browser DOM, replacing Playwright .spec.ts files that only did API round-trips.

Key changes:
- Test files are co-located with card definitions (hello.test.gts next to hello.gts), not in a separate Tests/ folder
- Test executor serves a custom QUnit page that loads the host app's test assets and uses the live-test infrastructure (PR 4191) for module discovery
- No test artifacts realm needed: QUnit tests use in-memory browser realms
- Rename SpecResultData -> TestModuleResultData, specResults -> moduleResults (purge "spec" in Playwright context; Catalog Spec unchanged)
- Self-hosted test page server serves host dist assets directly, rewriting Ember config meta tag to point resolvedBaseRealmURL at the actual realm server

Infrastructure:
- test-run-execution.ts: custom QUnit HTML page builder, local HTTP server for host assets, Playwright browser navigation with result collection
- test-run-parsing.ts: parseQunitResults() replaces Playwright JSON parsing
- test-run-types.ts: QunitTestResult, QunitRunSummary, QunitResults types
- realm/test-results.gts: TestModuleResult replaces SpecResult
- fixtures.ts: hostAppUrl on StartedFactoryRealm

Updated: skills, prompts, docs, smoke tests, all unit tests (385/385 pass)

Known issue: .test.gts module imports fail silently in the hermetic Playwright harness (live-test discovers modules but can't import them). The QUnit page infrastructure works end-to-end. Debugging the import chain is next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
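The page builder this commit mentions has two mechanical pieces that the PR description spells out: rewriting root-relative asset URLs to absolute ones, and passing liveTest and realmURL as query params. A minimal sketch of both follows. The function names rewriteAssetUrls and buildTestPageUrl are hypothetical, and serving the page at the server root is an assumption; only the rewrite rule and the two query params come from the PR text.

```typescript
// Hypothetical helpers for illustration; not the code in test-run-execution.ts.

/** Rewrite root-relative src/href values (e.g. "/assets/vendor.js") to absolute URLs on `origin`. */
function rewriteAssetUrls(html: string, origin: string): string {
  const base = origin.replace(/\/+$/, '');
  // Matches src="/..." or href="/...", but skips protocol-relative "//..." URLs.
  return html.replace(
    /\b(src|href)=(["'])\/(?!\/)([^"']*)\2/g,
    (_match: string, attr: string, quote: string, path: string) =>
      `${attr}=${quote}${base}/${path}${quote}`,
  );
}

/** Build the page URL that switches the host's test helper into live-test mode. */
function buildTestPageUrl(origin: string, realmURL: string): string {
  // Assumption: the QUnit page is served at the root of the test page server.
  const url = new URL('/', origin);
  url.searchParams.set('liveTest', 'true');
  url.searchParams.set('realmURL', realmURL);
  return url.toString();
}
```

Keeping both steps as pure string functions means they can be unit tested without starting a server or a browser.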
The realm server's getFileWithFallbacks() in stream.ts checked if a path contained a dot and skipped extension fallbacks if so. This meant a request for "hello.test" (without .gts) would never find "hello.test.gts": the dot in "hello.test" was treated as a file extension.

Fix: only skip fallbacks when the path already has a known executable extension (.gts, .ts, .js, .gjs), using hasExecutableExtension() instead of a generic dot check.

Applied the same fix to:
- runtime-common/stream.ts (getFileWithFallbacks)
- runtime-common/realm.ts (fallbackHandle)
- runtime-common/dependency-tracker.ts (hasPathExtension)
- runtime-common/index-runner/dependency-normalization.ts (isExtensionlessPath)

Added realm-server test: GET /hello.test resolves to hello.test.gts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
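To make the difference between the two checks concrete, here is a small sketch. The extension list is taken from the commit message; the function bodies, the candidatePaths helper, and the fallback order are assumptions for illustration and are not the runtime-common implementation.

```typescript
// Extension list from the commit message; everything else is an illustrative stand-in.
const EXECUTABLE_EXTENSIONS = ['.gts', '.ts', '.js', '.gjs'];

// New check: true only when the path already ends in a known executable extension.
function hasExecutableExtension(path: string): boolean {
  return EXECUTABLE_EXTENSIONS.some((ext) => path.endsWith(ext));
}

// Old check (as described): any dot in the last path segment suppressed fallbacks.
function hasAnyDot(path: string): boolean {
  const lastSegment = path.split('/').pop() ?? '';
  return lastSegment.includes('.');
}

// Hypothetical helper: the paths a resolver would try, in an assumed order.
function candidatePaths(path: string): string[] {
  if (hasExecutableExtension(path)) {
    return [path];
  }
  return [path, ...EXECUTABLE_EXTENSIONS.map((ext) => `${path}${ext}`)];
}
```

With the old check, "hello.test" counts as already having an extension, so no fallback is tried. With the new check it does not, so "hello.test.gts" is among the candidates.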
…e changes These operate in a different context and need a separate, more considered fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pec.ts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test page server only served /assets/* paths. SQLite WASM lives at the dist root (e.g., c29fc2dacfd64764a6ad.wasm) and fonts at various paths. Serve all dist files for any non-root URL request. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
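Serving every file under the dist directory makes path containment important, and a later commit in this PR adds that validation (normalize, reject '..', verify the resolved path stays within hostDistDir). The sketch below shows one way such a check can look. The function name resolveAssetPath is hypothetical, and it uses Node's POSIX path helpers so the behaviour is the same on every platform; the real server code may differ.

```typescript
import { posix as path } from 'node:path';

// Hypothetical helper: map a request path to a file under hostDistDir, or null if unsafe.
function resolveAssetPath(hostDistDir: string, requestPath: string): string | null {
  let decoded: string;
  try {
    // Drop any query string, then decode percent-escapes such as %2e%2e.
    decoded = decodeURIComponent(requestPath.split('?')[0]);
  } catch {
    return null; // malformed percent-encoding
  }
  // Make the path relative before normalizing so leading ".." segments survive.
  const relative = path.normalize(decoded.replace(/^\/+/, ''));
  if (relative.split('/').includes('..')) {
    return null;
  }
  const root = path.resolve(hostDistDir);
  const resolved = path.resolve(root, relative);
  // Final containment check: the result must be the root itself or live below it.
  if (resolved !== root && !resolved.startsWith(`${root}/`)) {
    return null;
  }
  return resolved;
}
```

The '..' check and the prefix check overlap on purpose: either one alone would catch the simple cases, and together they guard against mistakes in the other.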
Keep only essential log lines (server URL, completion stats). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR migrates software-factory card verification from factory-generated Playwright .spec.ts files (and a test-artifacts realm) to QUnit-based .test.gts files that run in a real browser DOM via the host’s live-test infrastructure, with results persisted back to TestRun cards.
Changes:
- Replace Playwright spec-based test execution with a Playwright-driven QUnit live-test page that discovers and runs co-located .test.gts files.
- Rename TestRun result structures from spec-oriented naming (SpecResultData, specResults, specRef) to module-oriented naming (TestModuleResultData, moduleResults, moduleRef).
- Fix dotted-filename resolution by only skipping fallbacks when an executable extension is already present.
Reviewed changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/software-factory/tests/fixtures.ts | Add hostAppUrl to realm fixture metadata for QUnit test runs. |
| packages/software-factory/tests/factory-tool-executor.spec.ts | Remove testRealmUrl usage from tool-building in tests. |
| packages/software-factory/tests/factory-tool-builder.test.ts | Update tool-builder tests for single-realm targeting + new run_tests contract. |
| packages/software-factory/tests/factory-test-realm.test.ts | Replace Playwright report parsing tests with QUnit results parsing tests + moduleResults renames. |
| packages/software-factory/tests/factory-test-realm.spec.ts | Update e2e to write/run .test.gts files and assert persisted moduleResults. |
| packages/software-factory/tests/factory-prompt-loader.test.ts | Adjust prompt assertions to reflect updated system prompt content checks. |
| packages/software-factory/tests/factory-implement.test.ts | Update expectations around derived test realm URL exposure in agent context. |
| packages/software-factory/tests/factory-agent.test.ts | Adjust system message assertions (now checks for read_file). |
| packages/software-factory/test-fixtures/test-realm-runner/hello.test.gts | Add QUnit test fixture co-located with card definition. |
| packages/software-factory/src/harness/support-services.ts | Stop rejecting Ember test builds in host dist validation. |
| packages/software-factory/src/factory-entrypoint.ts | Remove testRealmUrl from implement summary output. |
| packages/software-factory/src/cli/smoke-test-realm.ts | Update smoke test to generate .test.gts files and invoke new test runner. |
| packages/software-factory/scripts/smoke-tests/factory-tools-smoke.ts | Remove testRealmUrl from smoke tool config. |
| packages/software-factory/scripts/lib/test-run-types.ts | Introduce QUnit result types and switch TestRun attributes to moduleResults/moduleRef. |
| packages/software-factory/scripts/lib/test-run-parsing.ts | Implement parseQunitResults and remove Playwright/run-realm-tests parsing logic. |
| packages/software-factory/scripts/lib/test-run-execution.ts | Replace pull-and-run Playwright specs flow with self-hosted QUnit page + Playwright browser collection. |
| packages/software-factory/scripts/lib/test-run-cards.ts | Persist moduleResults instead of specResults in TestRun card lifecycle. |
| packages/software-factory/scripts/lib/factory-tool-builder.ts | Remove test-realm targeting and update run_tests tool to QUnit mode. |
| packages/software-factory/scripts/lib/factory-test-realm.ts | Re-export new QUnit parsing/types and drop test-artifacts realm helpers. |
| packages/software-factory/scripts/lib/factory-skill-loader.ts | Switch always-loaded testing reference from Playwright to QUnit. |
| packages/software-factory/scripts/lib/factory-implement.ts | Update test runner discovery/execution logic for .test.gts and new runner options. |
| packages/software-factory/realm/test-results.gts | Rename SpecResult → TestModuleResult and specResults → moduleResults in TestRun schema/UI. |
| packages/software-factory/prompts/ticket-test.md | Update agent instruction from Playwright specs to QUnit .test.gts files. |
| packages/software-factory/prompts/ticket-implement.md | Update implementation checklist to produce co-located QUnit tests. |
| packages/software-factory/prompts/system.md | Update global rule to require .test.gts tests. |
| packages/software-factory/docs/testing-strategy.md | Update testing strategy docs to remove test-artifacts realm and describe QUnit live-test flow. |
| packages/software-factory/docs/phase-1-plan.md | Update phase plan docs to reflect new QUnit-based execution model. |
| packages/software-factory/.agents/skills/software-factory-operations/SKILL.md | Update skill docs to describe QUnit test file creation/execution patterns. |
| packages/software-factory/.agents/skills/boxel-development/references/dev-qunit-testing.md | Add QUnit card testing reference doc for agents. |
| packages/software-factory/.agents/skills/boxel-development/references/dev-playwright-testing.md | Remove Playwright testing reference doc. |
| packages/runtime-common/stream.ts | Fix fallback behavior for dotted filenames using hasExecutableExtension. |
| packages/runtime-common/realm.ts | Fix server-side fallback handling for dotted filenames in fallbackHandle. |
| packages/realm-server/tests/cards/hello.test.gts | Add fixture card module to validate dotted filename resolution. |
| packages/realm-server/tests/card-source-endpoints-test.ts | Add test asserting /hello.test resolves to hello.test.gts. |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…atus

- Validate asset server paths to prevent directory traversal (normalize, reject '..', verify resolved path stays within hostDistDir)
- Poll for QUnit availability instead of relying on window 'load' event to avoid race where QUnit starts before hooks are attached
- Map QUnit skipped/todo to 'passed' instead of 'pending' so they're terminal states that don't confuse resume logic or isComplete checks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
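The status mapping in the last bullet is small enough to sketch. The type names QunitTestResult and QunitRunSummary appear in this PR, but the field shapes below are assumptions for illustration, as are the helper names toTerminalStatus and summarize. The four input statuses are the ones QUnit reports on its testEnd event.

```typescript
// Field shapes are assumed; only the type names come from this PR.
type QunitStatus = 'passed' | 'failed' | 'skipped' | 'todo';

interface QunitTestResult {
  name: string;
  status: QunitStatus;
}

interface QunitRunSummary {
  passed: number;
  failed: number;
  total: number;
}

// skipped and todo collapse to 'passed' so every recorded test is in a terminal state.
function toTerminalStatus(status: QunitStatus): 'passed' | 'failed' {
  return status === 'failed' ? 'failed' : 'passed';
}

// Hypothetical summary builder over the collected per-test results.
function summarize(results: QunitTestResult[]): QunitRunSummary {
  let failed = 0;
  for (const result of results) {
    if (toTerminalStatus(result.status) === 'failed') {
      failed++;
    }
  }
  return { passed: results.length - failed, failed, total: results.length };
}
```

Because no result is ever left as 'pending', a completeness check can treat "all results are terminal" as "the run finished".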
The test fixture directory now includes hello.test.gts, so the directory GET response test needs to expect it in the listing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
live-test.js fetches _mtimes without auth headers, which fails on private realms (401 Unauthorized). Use page.route() to intercept requests to the realm origin and inject the Authorization header at the network level. Also includes diagnostic console forwarding for live-test and error messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
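A sketch of the interception described here, written against minimal structural interfaces so it can be exercised without a browser. The factory name makeAuthInjector and the RouteLike/RequestLike interfaces are hypothetical; they mirror the subset of Playwright's Route and Request that the handler needs (route.request(), request.url(), request.headers(), route.continue({ headers })). The exact header value the PR injects is not shown in the source, so it is passed in as a parameter.

```typescript
// Minimal structural stand-ins for the parts of Playwright's Request/Route used below.
interface RequestLike {
  url(): string;
  headers(): Record<string, string>;
}
interface RouteLike {
  request(): RequestLike;
  continue(options?: { headers?: Record<string, string> }): Promise<void>;
}

// Hypothetical factory: returns a route handler that adds Authorization
// only to requests whose origin matches the realm origin.
function makeAuthInjector(realmOrigin: string, authorization: string) {
  return async (route: RouteLike): Promise<void> => {
    const request = route.request();
    if (new URL(request.url()).origin !== realmOrigin) {
      await route.continue(); // other origins pass through unchanged
      return;
    }
    await route.continue({
      headers: { ...request.headers(), authorization },
    });
  };
}

// Assumed usage with a Playwright Page (token is a placeholder):
//   await page.route(`${realmOrigin}/**`, makeAuthInjector(realmOrigin, token));
```

Checking the origin inside the handler is redundant when the route pattern is already scoped to the realm, but it keeps the handler safe if it is ever registered with a broader pattern.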
Add debug option to ExecuteTestRunOptions. Browser console is only forwarded to stderr when debug is enabled, reducing noise in normal runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each factory loop iteration should produce its own TestRun card, not overwrite the previous one. Without forceNew, resolveTestRun found the existing 'running' TestRun from the prior iteration and resumed it, resulting in a single TestRun that only showed the final iteration's results. Add forceNew: true to both buildTestRunner() and the run_tests tool. Add regression test verifying consecutive forceNew calls create separate TestRuns with incrementing sequence numbers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
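The resume-versus-create decision can be illustrated with an in-memory model. This is a sketch only: resolveTestRunSketch and TestRunRecord are hypothetical stand-ins, and the real resolveTestRun works against TestRun cards in a realm rather than an array.

```typescript
// Hypothetical in-memory stand-in for a TestRun card.
interface TestRunRecord {
  sequenceNumber: number;
  status: 'running' | 'passed' | 'failed';
}

// Resume an existing 'running' record unless forceNew is set;
// otherwise create one with the next sequence number.
function resolveTestRunSketch(
  existing: TestRunRecord[],
  options: { forceNew?: boolean } = {},
): { run: TestRunRecord; created: boolean } {
  const running = existing.find((r) => r.status === 'running');
  if (running && !options.forceNew) {
    return { run: running, created: false };
  }
  const nextSequence =
    existing.reduce((max, r) => Math.max(max, r.sequenceNumber), 0) + 1;
  const run: TestRunRecord = { sequenceNumber: nextSequence, status: 'running' };
  existing.push(run);
  return { run, created: true };
}
```

Without forceNew, the second call returns the first record again, which is the overwrite behaviour this commit describes; with forceNew, each call yields a new record with an incremented sequence number.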
These will be deleted after the PR description references are updated to use GitHub-hosted URLs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Screenshots are now referenced by commit hash in the PR description and no longer needed in the working tree. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document QUnit test page architecture, test artifacts realm removal, private realm auth, dotted filename fix, forceNew per iteration, skipped test handling, and production build limitation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3eebb50484
- Forward hostAppUrl from ImplementConfig into ToolBuilderConfig so the run_tests tool uses the browser-accessible compat proxy URL, not the internal realm server port (which the browser can't reach in the harness)
- Wait for written .test.gts files to be accessible in the realm before launching QUnit to avoid flaky failures from indexing delay

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Replace Playwright .spec.ts tests with QUnit .test.gts files that render cards in a real browser DOM
- Test files are co-located with card definitions (hello.test.gts next to hello.gts), not in a separate Tests/ folder
- Rename SpecResultData → TestModuleResultData, specResults → moduleResults, specRef → moduleRef (Catalog Spec unchanged)
- … hello.test.gts)

How the QUnit test page works
What the software factory needs
When factory:go runs the implement→test→iterate loop, it needs to execute QUnit tests that live as .test.gts files in the target realm. The test executor (executeTestRunFromRealm) must:
- … setupCardTest, renderCard, @ember/test-helpers)
- … .test.gts files via _mtimes and import them through the realm loader

In mise run dev-all, the Ember dev server at localhost:4200 does serve /tests/ with the QUnit page, but the factory can't rely on that because:
- … vendor.js doesn't include QUnit; it's only available in test-support.js, which ships with test/development builds. The host app at app.boxel.ai serves a production build, so there are no test assets available at all: no tests/index.html, no test-support.js, no test helper chunks.
- … dist/ via serve --single, but that SPA fallback catches /tests and serves the app root instead of tests/index.html.

What our Playwright test harness needs
The software-factory's Playwright tests run in a hermetic environment: an isolated realm server on random ports with its own postgres and synapse. There's no Ember dev server running at all. The test harness needs the exact same QUnit page capability, but fully self-contained, with no external host app dependency and no network access to anything outside the test process. Additionally, the Ember config meta tag has hardcoded resolvedBaseRealmURL, realmServerURL, etc. from build time that won't match the harness's realm server on its random ports.

The solution: a self-hosted test page server
Rather than depending on a running Ember dev server or the realm-server-hosted Boxel app's routing, executeTestRunFromRealm starts its own minimal HTTP server. This serves both the software factory's factory:go flow and the hermetic Playwright test harness with identical code:
- Reads host/dist/tests/index.html at runtime to extract its <script>, <link>, and <meta> tags, including the chunk hashes that change with every Ember build. This is how we get the correct asset references without hardcoding them.
- Rewrites asset URLs from root-relative (/assets/vendor.js) to absolute (http://127.0.0.1:<port>/assets/vendor.js) pointing at our server.
- Rewrites the Ember config meta tag: replaces resolvedBaseRealmURL, realmServerURL, etc. with the browser-accessible realm proxy URL. This is needed for the hermetic test harness where the realm server is on random ports that don't match the build-time URLs. In production factory:go the URLs already match, but the rewrite is harmless.
- Serves all files from host/dist/ (JS chunks, CSS, WASM (SQLite), fonts, images) with correct MIME types. This includes test-support.js (which contains QUnit, @ember/test-helpers, qunit-dom) and all the webpack chunks that contain the test helper code.
- Injects QUnit result collection hooks: QUnit.on('testEnd') and QUnit.on('runEnd') callbacks that store structured results on window.__qunitResults, which Playwright reads after QUnit completes.
- Passes ?liveTest=true&realmURL=<targetRealm> as URL query params so the host's test-helper.js activates live-test mode, which discovers .test.gts files via _mtimes and imports them through the realm loader.

This approach is fully hermetic: no external host app is needed. The only requirement is a built host/dist/ directory. The same code path serves production factory:go, the smoke:test-realm CLI, and the Playwright test harness.

Known limitation: production host builds (CS-10650)
The test page server requires a development or test host build. The Ember production build (ember build -prod) strips all test assets (tests/index.html, test-support.js, test helper chunks). This means the software factory cannot run QUnit card tests in a production deployment where the host was built in production mode. This works today because mise run dev-all uses a development build.

This is not just a software factory limitation; it's a deeper live-test limitation. Running card tests in Code Mode within the Boxel app (the end goal of the live-test infrastructure from PR #4191) will face the same problem: the production Boxel app has no QUnit or test helpers available. Solving this for one solves it for both.
Options are tracked in CS-10650.
Dotted filename resolution bug fix
Also fixes a bug in runtime-common/stream.ts where getFileWithFallbacks() checked for any dot in the filename to skip extension fallbacks. This meant hello.test (from hello.test.gts with .gts stripped by the live-test module discovery) was treated as already having an extension (.test), so the function never tried appending .gts to find the actual file. The fix: only skip fallbacks when the path has a known executable extension (.gts, .ts, .js, .gjs). The same fix was applied to realm.ts's fallbackHandle. A separate bug was filed for the same pattern in dependency-tracker.ts and dependency-normalization.ts (CS-10649).

Try it out: Smoke Test
Prerequisites
- mise run dev-all running

Run the smoke test
The smoke test simulates the full factory workflow: the LLM implementation phase followed by QUnit-based testing.

Phase 1: Simulate LLM implementation output. The smoke test creates a realm and writes what the LLM would have produced during the implementation phase:
- … (hello.gts)
- … (Spec/hello-card.json) pointing to the HelloCard definition
- … (hello.test.gts), co-located with the card definition
- … (hello-fail.test.gts)

Phase 2: Run QUnit tests via Playwright. The smoke test calls executeTestRunFromRealm, which:
- … status: running in the target realm's Test Runs/ folder
- … ?liveTest=true&realmURL=<targetRealmUrl>
- … .test.gts files in the target realm via _mtimes, imports them through the realm loader, and runs any that export runTests()
- … testEnd/runEnd callbacks

cd packages/software-factory
MATRIX_URL=http://localhost:8008 \
MATRIX_USERNAME=your-username \
MATRIX_PASSWORD=your-password \
pnpm smoke:test-realm -- \
  --target-realm-url http://localhost:4201/your-username/smoke-test-realm/

What to expect on the command line:
What to expect in the Boxel app:
- … smoke-test-realm workspace
- … (hello.gts) with its co-located test (hello.test.gts), the Catalog Spec card (Spec/hello-card), and the sample instance
- … Test Runs/ you'll find hello-smoke-1, the TestRun card produced by the testing phase

Try it out: Full Factory E2E
Prerequisites
- mise run dev-all running
- … http://localhost:4201/software-factory/Wiki/sticky-note)

Run the factory
cd packages/software-factory
MATRIX_URL=http://localhost:8008/ \
MATRIX_USERNAME=your-username \
MATRIX_PASSWORD=your-password \
OPENROUTER_API_KEY=sk-or-v1-your-key-here \
pnpm factory:go -- \
  --brief-url http://localhost:4201/software-factory/Wiki/sticky-note \
  --target-realm-url http://localhost:4201/your-username/my-test-realm/ \
  --debug

What to expect on the command line
What to expect in the Boxel host app (target realm)
- Projects/
- Tickets/ … done
- Knowledge Articles/
- *.gts
- *.test.gts
- StickyNote/ (or similar)
- Spec/
- Test Runs/

E2E screenshots
TestRun card: all 10 QUnit tests passing (Code Mode, showing the TestRun JSON and rendered card):

Co-located .test.gts file (Code Mode, showing the LLM-generated QUnit tests alongside the card definition). Note: the co-located test encounters a fetch error for @cardstack/host/tests/helpers when viewed in Code Mode. This is a known issue related to the test helpers not being available in the realm-server-hosted Boxel app's production build (CS-10650):

StickyNote card definition and preview (Code Mode, showing the .gts source and rendered card preview):

Linear tickets
Test plan
- … (pnpm test:node)
- … (pnpm test:playwright)
- … (pnpm lint:js)
- … (pnpm lint:types)
- … (pnpm lint:format)
- … (hello.test → hello.test.gts)
- … pnpm smoke:test-realm against live app
- … factory:go with QUnit test generation

🤖 Generated with Claude Code