
Replace Playwright spec tests with QUnit-based card testing (CS-10599)#4337

Merged
habdelra merged 20 commits into main from cs-10599-qunit-card-testing
Apr 7, 2026

@habdelra habdelra commented Apr 6, 2026

Summary

  • Replace factory-generated Playwright .spec.ts tests with QUnit .test.gts files that render cards in a real browser DOM
  • Test files are co-located with card definitions (hello.test.gts next to hello.gts), not in a separate Tests/ folder
  • Eliminate test artifacts realm — QUnit tests use in-memory browser realms, test instances never leave the browser
  • Purge "spec" in Playwright context: SpecResultData → TestModuleResultData, specResults → moduleResults, specRef → moduleRef (Catalog Spec unchanged)
  • Fix realm module resolution bug for dotted filenames (hello.test.gts)

How the QUnit test page works

What the software factory needs

When factory:go runs the implement→test→iterate loop, it needs to execute QUnit tests that live as .test.gts files in the target realm. The test executor (executeTestRunFromRealm) must:

  1. Serve a browser page that has the full Ember/card runtime (for rendering cards)
  2. Load QUnit and the host's test helpers (setupCardTest, renderCard, @ember/test-helpers)
  3. Use the live-test infrastructure (PR #4191, introduce-realm-enabled-tests) to discover .test.gts files via _mtimes and import them through the realm loader
  4. Collect structured QUnit results and write them to a TestRun card

In mise run dev-all, the Ember dev server at localhost:4200 does serve /tests/ with the QUnit page — but the factory can't rely on that because:

  • In production, the Ember production build has no QUnit (CS-10650). QUnit ships only in test-support.js, which is part of test and development builds, not in the production vendor.js. The host app at app.boxel.ai serves a production build, so no test assets are available at all: no tests/index.html, no test-support.js, no test helper chunks.
  • The realm-server-hosted Boxel app blocks the route. The realm server serves host assets from dist/ via serve --single, but that SPA fallback catches /tests and serves the app root instead of tests/index.html.

What our Playwright test harness needs

The software-factory's Playwright tests run in a hermetic environment — an isolated realm server on random ports with its own postgres and synapse. There's no Ember dev server running at all. The test harness needs the exact same QUnit page capability, but fully self-contained — no external host app dependency, no network access to anything outside the test process. Additionally, the Ember config meta tag has hardcoded resolvedBaseRealmURL, realmServerURL, etc. from build time that won't match the harness's realm server on its random ports.

The solution: a self-hosted test page server

Rather than depending on a running Ember dev server or the realm-server-hosted Boxel app's routing, executeTestRunFromRealm starts its own minimal HTTP server. This serves both the software factory's factory:go flow and the hermetic Playwright test harness with identical code:

  1. Reads host/dist/tests/index.html at runtime to extract its <script>, <link>, and <meta> tags — including the chunk hashes that change with every Ember build. This is how we get the correct asset references without hardcoding them.

  2. Rewrites asset URLs from root-relative (/assets/vendor.js) to absolute (http://127.0.0.1:<port>/assets/vendor.js) pointing at our server.

  3. Rewrites the Ember config meta tag — replaces resolvedBaseRealmURL, realmServerURL, etc. with the browser-accessible realm proxy URL. This is needed for the hermetic test harness where the realm server is on random ports that don't match the build-time URLs. In production factory:go the URLs already match, but the rewrite is harmless.

  4. Serves all files from host/dist/ — JS chunks, CSS, WASM (SQLite), fonts, images — with correct MIME types. This includes test-support.js (which contains QUnit, @ember/test-helpers, qunit-dom) and all the webpack chunks that contain the test helper code.

  5. Injects QUnit result collection hooks: QUnit.on('testEnd') and QUnit.on('runEnd') callbacks that store structured results on window.__qunitResults, which Playwright reads after QUnit completes.

  6. Passes ?liveTest=true&realmURL=<targetRealm> as URL query params so the host's test-helper.js activates live-test mode, which discovers .test.gts files via _mtimes and imports them through the realm loader.
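Steps 2 and 3 above can be sketched as two small string transforms. This is a minimal illustration, not the code in test-run-execution.ts: the function names are hypothetical, the regular expressions assume double-quoted attributes, and the meta tag format (a name ending in /config/environment whose content is URL-encoded JSON) is the standard Ember CLI convention, assumed here to apply to the host build.

```typescript
// Hypothetical helpers for illustration; names and matching rules are assumptions.

/** Rewrite root-relative src/href values ("/assets/x.js") to absolute URLs on `origin`. */
function rewriteAssetUrls(html: string, origin: string): string {
  // Match src="/..." or href="/..." but skip protocol-relative "//host/..." URLs.
  return html.replace(
    /\b(src|href)="\/(?!\/)([^"]*)"/g,
    (_match, attr: string, rest: string) => `${attr}="${origin}/${rest}"`,
  );
}

/** Merge `overrides` into the URL-encoded JSON held in the Ember config meta tag. */
function rewriteConfigMeta(
  html: string,
  overrides: Record<string, unknown>,
): string {
  return html.replace(
    /(<meta\s+name="[^"]*\/config\/environment"\s+content=")([^"]*)(")/,
    (_match, open: string, encoded: string, close: string) => {
      const config = JSON.parse(decodeURIComponent(encoded));
      const merged = { ...config, ...overrides };
      return `${open}${encodeURIComponent(JSON.stringify(merged))}${close}`;
    },
  );
}
```

Because the overrides are merged over the decoded config, keys that are not overridden keep their build-time values, which matches the note that the rewrite is harmless when the URLs already agree.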

This approach is fully hermetic — no external host app needed. The only requirement is a built host/dist/ directory. The same code path serves production factory:go, the smoke:test-realm CLI, and the Playwright test harness.
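Steps 5 and 6 can be sketched in the same spirit. The query parameters, the two QUnit events, and window.__qunitResults come from the description above; the function names, the shape of the stored object, and the polling interval are assumptions made for illustration. The script polls for QUnit before attaching, so the hooks do not depend on script load order.

```typescript
// Hypothetical helpers; only the query params, event names, and
// window.__qunitResults are taken from the PR description.

/** Build the test page URL with the live-test query params. */
function buildTestPageUrl(port: number, targetRealmUrl: string): string {
  const url = new URL(`http://127.0.0.1:${port}/`);
  url.searchParams.set("liveTest", "true");
  url.searchParams.set("realmURL", targetRealmUrl);
  return url.toString();
}

/** Script text to inject into the page: wait for QUnit, then record results. */
function buildResultHookScript(pollMs = 25): string {
  return `
(function attach() {
  if (typeof QUnit === "undefined") { setTimeout(attach, ${pollMs}); return; }
  var results = (window.__qunitResults = { tests: [], summary: null, done: false });
  QUnit.on("testEnd", function (t) { results.tests.push(t); });
  QUnit.on("runEnd", function (r) { results.summary = r; results.done = true; });
})();`;
}
```

A driver could then wait until window.__qunitResults.done is true and read the object back from the page.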

Known limitation: production host builds (CS-10650)

The test page server requires a development or test host build. The Ember production build (ember build -prod) strips all test assets (tests/index.html, test-support.js, test helper chunks). This means the software factory cannot run QUnit card tests in a production deployment where the host was built in production mode. This works today because mise run dev-all uses a development build.

This is not just a software factory limitation — it's a deeper live-test limitation. Running card tests in Code Mode within the Boxel app (the end goal of the live-test infrastructure from PR #4191) will face the same problem: the production Boxel app has no QUnit or test helpers available. Solving this for one solves it for both.

Options are tracked in CS-10650.

Dotted filename resolution bug fix

Also fixes a bug in runtime-common/stream.ts where getFileWithFallbacks() checked for any dot in the filename to skip extension fallbacks. This meant hello.test (from hello.test.gts with .gts stripped by the live-test module discovery) was treated as already having an extension (.test), so the function never tried appending .gts to find the actual file. The fix: only skip fallbacks when the path has a known executable extension (.gts, .ts, .js, .gjs). The same fix was applied to realm.ts's fallbackHandle. A separate bug was filed for the same pattern in dependency-tracker.ts and dependency-normalization.ts (CS-10649).
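The difference between the old dot check and the new extension check can be shown with a small sketch. The extension list and the name hasExecutableExtension come from this PR; the body of that function and the fallbackCandidates helper (including the order in which extensions are tried) are assumptions for illustration, not the code in runtime-common.

```typescript
// Extension list as stated in the PR description.
const EXECUTABLE_EXTENSIONS = [".gts", ".ts", ".js", ".gjs"];

/** True only when the path already ends in a known executable extension. */
function hasExecutableExtension(path: string): boolean {
  return EXECUTABLE_EXTENSIONS.some((ext) => path.endsWith(ext));
}

/** Hypothetical helper: candidate paths to try, in order, for a module request. */
function fallbackCandidates(path: string): string[] {
  if (hasExecutableExtension(path)) {
    return [path]; // already fully qualified, no fallbacks needed
  }
  // "hello.test" contains a dot but has no executable extension,
  // so the extension fallbacks are still tried.
  return [path, ...EXECUTABLE_EXTENSIONS.map((ext) => `${path}${ext}`)];
}
```

With the old behavior, any dot in hello.test suppressed the fallbacks; with the extension check, hello.test.gts is among the candidates.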

Try it out — Smoke Test

Prerequisites

  • mise run dev-all running

Run the smoke test

The smoke test simulates the full factory workflow — the LLM implementation phase followed by QUnit-based testing.

Phase 1 — Simulate LLM implementation output. The smoke test creates a realm and writes what the LLM would have produced during the implementation phase:

  • A HelloCard card definition (hello.gts)
  • A Catalog Spec card (Spec/hello-card.json) pointing to the HelloCard definition
  • A passing QUnit test (hello.test.gts) — co-located with the card definition
  • A deliberately failing QUnit test (hello-fail.test.gts)

Phase 2 — Run QUnit tests via Playwright. The smoke test calls executeTestRunFromRealm, which:

  • Creates a TestRun card with status: running in the target realm's Test Runs/ folder
  • Serves a custom QUnit test page that loads the host app's test assets locally
  • Launches a Playwright browser and navigates to the QUnit page with ?liveTest=true&realmURL=<targetRealmUrl>
  • QUnit discovers all .test.gts files in the target realm via _mtimes, imports them through the realm loader, and runs any that export runTests()
  • Test instances are created in browser memory only — no test artifacts realm needed
  • Collects structured results via QUnit testEnd/runEnd callbacks
  • Completes the TestRun card with pass/fail results grouped by QUnit module
cd packages/software-factory

MATRIX_URL=http://localhost:8008 \
MATRIX_USERNAME=your-username \
MATRIX_PASSWORD=your-password \
pnpm smoke:test-realm -- \
  --target-realm-url http://localhost:4201/your-username/smoke-test-realm/

Note: The realm smoke-test-realm does not need to exist beforehand — the smoke test creates it and populates it. If the realm already exists from a previous run, the new content will be written into the existing realm.

What to expect on the command line:

=== Factory Test Realm Smoke Test (QUnit) ===

Target realm: http://localhost:4201/your-username/smoke-test-realm/

--- Phase 1: Writing LLM implementation output to target realm ---

  ✓ hello.gts
  ✓ Spec/hello-card.json
  ✓ hello.test.gts
  ✓ hello-fail.test.gts

--- Phase 2: Running QUnit tests ---

  TestRun ID:  Test Runs/hello-smoke-1
  Status:      failed

--- Results ---

  TestRun status: ✓ failed (as expected — one test passes, one deliberately fails)

✓ Smoke test passed! QUnit test execution works correctly.

What to expect in the Boxel app:

  • Navigate to your smoke-test-realm workspace
  • You'll see the HelloCard definition (hello.gts) with its co-located test (hello.test.gts), the Catalog Spec card (Spec/hello-card), and the sample instance
  • In Test Runs/ you'll find hello-smoke-1 — the TestRun card produced by the testing phase
  • The fitted view shows: status badge (failed), sequence number (#1), pass/fail counts, duration
  • The isolated view shows: full test results grouped by QUnit module, with individual test names, status per test, and failure details
  • No test artifacts realm is created — all test instances lived in browser memory during execution
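The grouping shown in the isolated view can be sketched as a fold over QUnit testEnd events. The event fields used here (name, suiteName, status) are part of QUnit's documented testEnd payload; the type and function names are hypothetical and do not reproduce the PR's QunitTestResult or TestModuleResultData types. Skipped and todo tests are counted as passed, following the behavior described in this PR (CS-10651 tracks surfacing them separately).

```typescript
// Hypothetical types for illustration; not the PR's actual result types.
interface SketchTestEnd {
  name: string;
  suiteName: string;
  status: "passed" | "failed" | "skipped" | "todo";
}

interface SketchModuleResult {
  moduleRef: string;
  passed: number;
  failed: number;
  tests: { name: string; status: "passed" | "failed" }[];
}

/** Group testEnd events by QUnit module, preserving first-seen module order. */
function groupByModule(events: SketchTestEnd[]): SketchModuleResult[] {
  const byModule = new Map<string, SketchModuleResult>();
  for (const event of events) {
    let entry = byModule.get(event.suiteName);
    if (!entry) {
      entry = { moduleRef: event.suiteName, passed: 0, failed: 0, tests: [] };
      byModule.set(event.suiteName, entry);
    }
    // skipped/todo are treated as terminal "passed" states in this sketch
    const status = event.status === "failed" ? "failed" : "passed";
    entry.tests.push({ name: event.name, status });
    if (status === "failed") entry.failed += 1;
    else entry.passed += 1;
  }
  return Array.from(byModule.values());
}

/** A run fails if any module has at least one failed test. */
function overallStatus(modules: SketchModuleResult[]): "passed" | "failed" {
  return modules.some((m) => m.failed > 0) ? "failed" : "passed";
}
```

For the smoke test above, one passing and one deliberately failing test would give an overall status of failed, which is the expected outcome.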

Try it out — Full Factory E2E

Prerequisites

  1. mise run dev-all running
  2. A brief card published in the software-factory realm (e.g., http://localhost:4201/software-factory/Wiki/sticky-note)
  3. An OpenRouter API key
  4. Matrix credentials (username/password) that can create realms on the server

Run the factory

cd packages/software-factory

MATRIX_URL=http://localhost:8008/ \
MATRIX_USERNAME=your-username \
MATRIX_PASSWORD=your-password \
OPENROUTER_API_KEY=sk-or-v1-your-key-here \
pnpm factory:go -- \
  --brief-url http://localhost:4201/software-factory/Wiki/sticky-note \
  --target-realm-url http://localhost:4201/your-username/my-test-realm/ \
  --debug

What to expect on the command line

[factory:go] mode=implement brief=http://localhost:4201/software-factory/Wiki/sticky-note
[factory:go] Starting bootstrap + implement flow...
[test-run-execution] Serving QUnit page at http://127.0.0.1:<port> for realm ...
[test-run-execution] QUnit completed in <N>ms: <N> test(s)
[factory-implement] Updated ticket status to done
[factory:go] Implement complete: outcome=tests_passed iterations=<N> toolCalls=<N>

What to expect in the Boxel host app (target realm)

| Folder / File | What it is |
| --- | --- |
| Projects/ | A Project card with the brief's objective and success criteria |
| Tickets/ | Ticket cards; the active ticket should show status done |
| Knowledge Articles/ | Context articles derived from the brief |
| *.gts | Card definition file(s) for the implemented card |
| *.test.gts | Co-located QUnit test file(s) |
| StickyNote/ (or similar) | Sample card instance(s) with realistic data |
| Spec/ | Catalog Spec card(s) linking to the card definition and sample instances |
| Test Runs/ | TestRun card(s) with structured pass/fail results grouped by QUnit module |

E2E screenshots

TestRun card — all 10 QUnit tests passing (Code Mode, showing the TestRun JSON and rendered card):

TestRun card with 10/10 tests passing

Co-located .test.gts file (Code Mode, showing the LLM-generated QUnit tests alongside the card definition). Note: the co-located test encounters a fetch error for @cardstack/host/tests/helpers when viewed in Code Mode — this is a known issue related to the test helpers not being available in the realm-server-hosted Boxel app's production build (CS-10650):

Co-located .test.gts file

StickyNote card definition and preview (Code Mode, showing the .gts source and rendered card preview):

StickyNote card definition and preview

Linear tickets

  • CS-10599 — Main ticket
  • CS-10649 — Follow-up: dependency tracker dotted filename bug
  • CS-10650 — Follow-up: production host build has no test assets
  • CS-10651 — Follow-up: surface skipped/todo tests instead of hiding as passed

Test plan

  • 386/386 unit tests pass (pnpm test:node)
  • 25/25 Playwright tests pass (pnpm test:playwright)
  • ESLint clean (pnpm lint:js)
  • TypeScript types clean (pnpm lint:types)
  • Prettier clean (pnpm lint:format)
  • Realm-server test: dotted filename resolution (hello.test → hello.test.gts)
  • pnpm smoke:test-realm against live app
  • Full E2E factory:go with QUnit test generation

🤖 Generated with Claude Code

habdelra and others added 8 commits April 6, 2026 18:35
Overhaul the software factory's testing infrastructure to use QUnit .test.gts
files that render cards in a real browser DOM, replacing Playwright .spec.ts
files that only did API round-trips.

Key changes:
- Test files are co-located with card definitions (hello.test.gts next to
  hello.gts), not in a separate Tests/ folder
- Test executor serves a custom QUnit page that loads the host app's test
  assets and uses the live-test infrastructure (PR 4191) for module discovery
- No test artifacts realm needed: QUnit tests use in-memory browser realms
- Rename SpecResultData -> TestModuleResultData, specResults -> moduleResults
  (purge "spec" in Playwright context; Catalog Spec unchanged)
- Self-hosted test page server serves host dist assets directly, rewriting
  Ember config meta tag to point resolvedBaseRealmURL at the actual realm
  server

Infrastructure:
- test-run-execution.ts: custom QUnit HTML page builder, local HTTP server
  for host assets, Playwright browser navigation with result collection
- test-run-parsing.ts: parseQunitResults() replaces Playwright JSON parsing
- test-run-types.ts: QunitTestResult, QunitRunSummary, QunitResults types
- realm/test-results.gts: TestModuleResult replaces SpecResult
- fixtures.ts: hostAppUrl on StartedFactoryRealm

Updated: skills, prompts, docs, smoke tests, all unit tests (385/385 pass)

Known issue: .test.gts module imports fail silently in the hermetic Playwright
harness (live-test discovers modules but can't import them). The QUnit page
infrastructure works end-to-end. Debugging the import chain is next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The realm server's getFileWithFallbacks() in stream.ts checked if a path
contained a dot and skipped extension fallbacks if so. This meant a request
for "hello.test" (without .gts) would never find "hello.test.gts" — the
dot in "hello.test" was treated as a file extension.

Fix: only skip fallbacks when the path already has a known executable
extension (.gts, .ts, .js, .gjs), using hasExecutableExtension() instead
of a generic dot check. Applied the same fix to:
- runtime-common/stream.ts (getFileWithFallbacks)
- runtime-common/realm.ts (fallbackHandle)
- runtime-common/dependency-tracker.ts (hasPathExtension)
- runtime-common/index-runner/dependency-normalization.ts (isExtensionlessPath)

Added realm-server test: GET /hello.test resolves to hello.test.gts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e changes

These operate in a different context and need a separate, more considered fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pec.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test page server only served /assets/* paths. SQLite WASM lives at
the dist root (e.g., c29fc2dacfd64764a6ad.wasm) and fonts at various
paths. Serve all dist files for any non-root URL request.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keep only essential log lines (server URL, completion stats).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR migrates software-factory card verification from factory-generated Playwright .spec.ts files (and a test-artifacts realm) to QUnit-based .test.gts files that run in a real browser DOM via the host’s live-test infrastructure, with results persisted back to TestRun cards.

Changes:

  • Replace Playwright spec-based test execution with a Playwright-driven QUnit live-test page that discovers and runs co-located .test.gts files.
  • Rename TestRun result structures from spec-oriented naming (SpecResultData, specResults, specRef) to module-oriented naming (TestModuleResultData, moduleResults, moduleRef).
  • Fix dotted-filename resolution by only skipping fallbacks when an executable extension is already present.

Reviewed changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| packages/software-factory/tests/fixtures.ts | Add hostAppUrl to realm fixture metadata for QUnit test runs. |
| packages/software-factory/tests/factory-tool-executor.spec.ts | Remove testRealmUrl usage from tool-building in tests. |
| packages/software-factory/tests/factory-tool-builder.test.ts | Update tool-builder tests for single-realm targeting + new run_tests contract. |
| packages/software-factory/tests/factory-test-realm.test.ts | Replace Playwright report parsing tests with QUnit results parsing tests + moduleResults renames. |
| packages/software-factory/tests/factory-test-realm.spec.ts | Update e2e to write/run .test.gts files and assert persisted moduleResults. |
| packages/software-factory/tests/factory-prompt-loader.test.ts | Adjust prompt assertions to reflect updated system prompt content checks. |
| packages/software-factory/tests/factory-implement.test.ts | Update expectations around derived test realm URL exposure in agent context. |
| packages/software-factory/tests/factory-agent.test.ts | Adjust system message assertions (now checks for read_file). |
| packages/software-factory/test-fixtures/test-realm-runner/hello.test.gts | Add QUnit test fixture co-located with card definition. |
| packages/software-factory/src/harness/support-services.ts | Stop rejecting Ember test builds in host dist validation. |
| packages/software-factory/src/factory-entrypoint.ts | Remove testRealmUrl from implement summary output. |
| packages/software-factory/src/cli/smoke-test-realm.ts | Update smoke test to generate .test.gts files and invoke new test runner. |
| packages/software-factory/scripts/smoke-tests/factory-tools-smoke.ts | Remove testRealmUrl from smoke tool config. |
| packages/software-factory/scripts/lib/test-run-types.ts | Introduce QUnit result types and switch TestRun attributes to moduleResults/moduleRef. |
| packages/software-factory/scripts/lib/test-run-parsing.ts | Implement parseQunitResults and remove Playwright/run-realm-tests parsing logic. |
| packages/software-factory/scripts/lib/test-run-execution.ts | Replace pull-and-run Playwright specs flow with self-hosted QUnit page + Playwright browser collection. |
| packages/software-factory/scripts/lib/test-run-cards.ts | Persist moduleResults instead of specResults in TestRun card lifecycle. |
| packages/software-factory/scripts/lib/factory-tool-builder.ts | Remove test-realm targeting and update run_tests tool to QUnit mode. |
| packages/software-factory/scripts/lib/factory-test-realm.ts | Re-export new QUnit parsing/types and drop test-artifacts realm helpers. |
| packages/software-factory/scripts/lib/factory-skill-loader.ts | Switch always-loaded testing reference from Playwright to QUnit. |
| packages/software-factory/scripts/lib/factory-implement.ts | Update test runner discovery/execution logic for .test.gts and new runner options. |
| packages/software-factory/realm/test-results.gts | Rename SpecResult → TestModuleResult and specResults → moduleResults in TestRun schema/UI. |
| packages/software-factory/prompts/ticket-test.md | Update agent instruction from Playwright specs to QUnit .test.gts files. |
| packages/software-factory/prompts/ticket-implement.md | Update implementation checklist to produce co-located QUnit tests. |
| packages/software-factory/prompts/system.md | Update global rule to require .test.gts tests. |
| packages/software-factory/docs/testing-strategy.md | Update testing strategy docs to remove test-artifacts realm and describe QUnit live-test flow. |
| packages/software-factory/docs/phase-1-plan.md | Update phase plan docs to reflect new QUnit-based execution model. |
| packages/software-factory/.agents/skills/software-factory-operations/SKILL.md | Update skill docs to describe QUnit test file creation/execution patterns. |
| packages/software-factory/.agents/skills/boxel-development/references/dev-qunit-testing.md | Add QUnit card testing reference doc for agents. |
| packages/software-factory/.agents/skills/boxel-development/references/dev-playwright-testing.md | Remove Playwright testing reference doc. |
| packages/runtime-common/stream.ts | Fix fallback behavior for dotted filenames using hasExecutableExtension. |
| packages/runtime-common/realm.ts | Fix server-side fallback handling for dotted filenames in fallbackHandle. |
| packages/realm-server/tests/cards/hello.test.gts | Add fixture card module to validate dotted filename resolution. |
| packages/realm-server/tests/card-source-endpoints-test.ts | Add test asserting /hello.test resolves to hello.test.gts. |


habdelra and others added 2 commits April 6, 2026 20:47
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…atus

- Validate asset server paths to prevent directory traversal (normalize,
  reject '..', verify resolved path stays within hostDistDir)
- Poll for QUnit availability instead of relying on window 'load' event
  to avoid race where QUnit starts before hooks are attached
- Map QUnit skipped/todo to 'passed' instead of 'pending' so they're
  terminal states that don't confuse resume logic or isComplete checks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
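The directory traversal check described in this commit (normalize, reject '..', verify the resolved path stays within hostDistDir) could look roughly like the sketch below. The function name and the exact order of checks are assumptions; only the three-part strategy comes from the commit message.

```typescript
import { resolve, sep } from "node:path";

/**
 * Hypothetical helper: map a request path to a file under hostDistDir,
 * or return null when the request tries to escape that directory.
 */
function resolveDistPath(hostDistDir: string, requestPath: string): string | null {
  let decoded: string;
  try {
    // Drop any query string, then decode percent-escapes such as %2e%2e.
    decoded = decodeURIComponent(requestPath.split("?")[0]);
  } catch {
    return null; // malformed percent-encoding
  }
  // Reject any ".." segment outright, using either slash style.
  if (decoded.split(/[\\/]/).includes("..")) {
    return null;
  }
  const root = resolve(hostDistDir);
  const relative = decoded.startsWith("/") ? `.${decoded}` : `./${decoded}`;
  const candidate = resolve(root, relative);
  // Final containment check on the fully resolved path.
  const inside = candidate === root || candidate.startsWith(root + sep);
  return inside ? candidate : null;
}
```

The explicit '..' rejection and the containment check overlap on purpose: either one alone would catch the common cases, and keeping both mirrors the defense-in-depth wording of the commit.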

github-actions bot commented Apr 7, 2026

Host Test Results

2 120 tests  ±0   2 105 ✅ ±0   2h 17m 17s ⏱️ - 1m 2s
    1 suites ±0      15 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 18b105e. ± Comparison against base commit ab883ef.

♻️ This comment has been updated with latest results.


github-actions bot commented Apr 7, 2026

Realm Server Test Results

  1 files  ±0    1 suites  ±0   13m 43s ⏱️ -14s
838 tests +1  838 ✅ +1  0 💤 ±0  0 ❌ ±0 
909 runs  +1  909 ✅ +1  0 💤 ±0  0 ❌ ±0 

Results for commit 18b105e. ± Comparison against base commit ab883ef.

♻️ This comment has been updated with latest results.

habdelra and others added 6 commits April 7, 2026 08:44
The test fixture directory now includes hello.test.gts, so the directory
GET response test needs to expect it in the listing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
live-test.js fetches _mtimes without auth headers, which fails on private
realms (401 Unauthorized). Use page.route() to intercept requests to the
realm origin and inject the Authorization header at the network level.

Also includes diagnostic console forwarding for live-test and error messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
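The header injection described in this commit can be sketched as follows. To keep the example self-contained, the interfaces below are minimal structural stand-ins for the parts of Playwright's page.route(), route.request().headers(), and route.continue({ headers }) that the sketch uses; the function name and the token format are assumptions.

```typescript
// Minimal structural stand-ins for the Playwright types used here.
interface RequestLike {
  headers(): Record<string, string>;
}
interface RouteLike {
  request(): RequestLike;
  continue(overrides?: { headers?: Record<string, string> }): Promise<void>;
}
interface PageLike {
  route(
    matcher: (url: URL) => boolean,
    handler: (route: RouteLike) => Promise<void>,
  ): Promise<void>;
}

/** Hypothetical helper: add an Authorization header to every request to the realm origin. */
async function injectRealmAuth(
  page: PageLike,
  realmUrl: string,
  authorization: string,
): Promise<void> {
  const realmOrigin = new URL(realmUrl).origin;
  await page.route(
    (url) => url.origin === realmOrigin,
    async (route) => {
      // Keep the original headers and add (or replace) authorization.
      const headers = { ...route.request().headers(), authorization };
      await route.continue({ headers });
    },
  );
}
```

Matching on origin rather than on the full realm URL means requests such as _mtimes and module fetches are covered by one route, while asset requests to the local test page server pass through untouched.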
Add debug option to ExecuteTestRunOptions. Browser console is only
forwarded to stderr when debug is enabled, reducing noise in normal runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each factory loop iteration should produce its own TestRun card, not
overwrite the previous one. Without forceNew, resolveTestRun found the
existing 'running' TestRun from the prior iteration and resumed it,
resulting in a single TestRun that only showed the final iteration's
results.

Add forceNew: true to both buildTestRunner() and the run_tests tool.
Add regression test verifying consecutive forceNew calls create separate
TestRuns with incrementing sequence numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These will be deleted after the PR description references are updated
to use GitHub-hosted URLs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@habdelra habdelra marked this pull request as ready for review April 7, 2026 13:39
habdelra and others added 2 commits April 7, 2026 09:40
Screenshots are now referenced by commit hash in the PR description
and no longer needed in the working tree.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document QUnit test page architecture, test artifacts realm removal,
private realm auth, dotted filename fix, forceNew per iteration,
skipped test handling, and production build limitation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3eebb50484


- Forward hostAppUrl from ImplementConfig into ToolBuilderConfig so the
  run_tests tool uses the browser-accessible compat proxy URL, not the
  internal realm server port (which the browser can't reach in the harness)
- Wait for written .test.gts files to be accessible in the realm before
  launching QUnit to avoid flaky failures from indexing delay

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
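The wait described in the second bullet of this commit can be sketched as a bounded polling loop. The function name, the default timeout and interval, and the injected isAccessible callback are all assumptions; in practice the callback would issue an authenticated request for each written .test.gts file and report whether it succeeded.

```typescript
/**
 * Hypothetical helper: poll until every URL is reported accessible,
 * or throw once the timeout has elapsed.
 */
async function waitUntilAccessible(
  urls: string[],
  isAccessible: (url: string) => Promise<boolean>,
  options: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<void> {
  const timeoutMs = options.timeoutMs ?? 10000;
  const intervalMs = options.intervalMs ?? 250;
  const deadline = Date.now() + timeoutMs;
  let pending = [...urls];
  while (true) {
    // Re-check only the files that were not yet accessible.
    const checks = await Promise.all(pending.map((url) => isAccessible(url)));
    pending = pending.filter((_url, i) => !checks[i]);
    if (pending.length === 0) return;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out waiting for: ${pending.join(", ")}`);
    }
    await new Promise<void>((done) => setTimeout(done, intervalMs));
  }
}
```

Passing the check as a callback keeps the loop independent of how the realm is queried, which also makes it easy to exercise with a fake in unit tests.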
@habdelra habdelra requested a review from a team April 7, 2026 14:21
@habdelra habdelra merged commit 4b4a823 into main Apr 7, 2026
79 of 81 checks passed