This doc is a working plan for refreshing the aging 2016 WMATA fixtures
that back the transitclockIntegration module. It's written for both
developers and future Claude sessions — if you're picking this up cold, start
at "Current state" and read through.
Related:
- `tools/wmata_capture/README.md`: the capture tool itself.
- GitHub issues `OneBusAway/thetransitclock#7` (detour test) and `#8` (prediction accuracy baseline).
- Step 1 capture: DONE. Output at `tools/wmata_capture/output/fixtures-20260423/`. 1,441 polls / 0 failures / 26,646 rows / 85 vehicles across D40, A40, D20.
- Step 2 vehicle selection: DONE. Picks:
  - `D40_5506` (736 rows, 100% BLOCK_ID) for the prediction accuracy test.
  - `A40_3151` (575 rows, 2,430 m off-route excursion over 26 AVL points, ends on-route) for the detour test. The initial pick, `A40_3115`, exited layover correctly but ended >10 min behind schedule at the end of AM-rush traffic, failing the test's adherence assertion; `A40_3151` ended at −7.5 min, within the 10-min threshold.
  - `D20_7198` (843 rows) kept as a spare for future tests.
- Step 3 promotion: DONE. Subsetted GTFS and the chosen AVL traces landed under `transitclockIntegration/src/test/resources/{gtfs,avl}/`. Old S2 and 3T fixtures removed. 5A fixtures left in place (their tests still pass).
- Step 4 test edits: DONE. `RecoverFromDetourTest` is un-ignored, points at `A40_3151`, and passes. `PredictionAccuracyIntegrationTest` has its path constants updated but remains `@Ignore`d; the blocker turned out to be deeper than fixture rot (see Step 4b).
- Step 4b pred baseline: BLOCKED (non-determinism). Attempted to generate `pred/D40_5506.csv` by adding a one-shot dumper that calls `session.createCriteria(Prediction.class).list()` after `PlaybackModule.runTrace`. The dump succeeds, but replaying the same AVL trace in a separate JVM produces different prediction counts on every run (observed 2155, 2696, and 2118 across three runs; AD counts also varied by >2x). A frozen CSV baseline therefore cannot serve as a regression signal: the `newTotalError <= oldTotalError` and `oldTotalPreds <= newTotalPreds` assertions fail on run-to-run variance, not on predictor regression. The most likely source is unsorted collection iteration order in the predictor pipeline carrying over into which predictions get generated / persisted, but that's speculation; a real fix needs a deterministic replay.
- Step 5 validate: DONE (for non-ignored tests). `mvn -P include-integration-tests test` on the module passes: 3 tests green (detour + two 5A), 1 skipped (`testPredictions`). Matches the pre-refresh test count, with one test now genuinely live instead of pinned to stale 2016 data.
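To make the Step 4b failure mode concrete, the sketch below takes the three observed prediction counts and checks the count assertion for every possible (baseline, rerun) pairing. The helper name `count_assertion_holds` is illustrative and not taken from the test suite; only the counts and the inequality come from the notes above.

```python
from itertools import permutations

# Prediction counts observed across three replays of the same AVL trace (Step 4b).
observed_counts = [2155, 2696, 2118]

def count_assertion_holds(old_total: int, new_total: int) -> bool:
    """Mirror of the test's `oldTotalPreds <= newTotalPreds` check."""
    return old_total <= new_total

# Treat each run as the frozen baseline and every other run as the "new" replay.
pairs = list(permutations(observed_counts, 2))
failures = [(old, new) for old, new in pairs if not count_assertion_holds(old, new)]

# Relative gap between the smallest and largest run.
spread = (max(observed_counts) - min(observed_counts)) / min(observed_counts)

print(f"{len(failures)} of {len(pairs)} baseline/rerun pairings fail the count assertion")
print(f"relative spread between smallest and largest run: {spread:.1%}")
```

With three distinct counts, half of the pairings fail purely because of which run happened to be frozen as the baseline, with no predictor change involved.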
Two plausible paths for future work:
1. Make the predictor deterministic. Find the source(s) of non-determinism, likely `HashMap` iteration order or thread scheduling in `AvlProcessor` / prediction generation, and replace them with deterministic equivalents (`LinkedHashMap`, single-thread replay mode). Then a CSV baseline works as originally intended.
2. Rewrite the assertions to be tolerance-based. Instead of exact CSV comparison, assert on properties that should be stable even across non-deterministic runs, e.g. "mean absolute prediction error over all observed stops is below X seconds" or "90th-percentile error is below Y seconds." No baseline CSV needed. Loses regression detection against prior predictor versions but gains stability.

Option 2 is the smaller change. Option 1 is the more honest fix and would also help other tests that replay traces.
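As a rough illustration of what the tolerance-based option could compute, here is a minimal Python sketch of the two statistics named above. It is not the test's code (the real assertions would live in Java); the function names, the nearest-rank percentile definition, the sample errors, and the thresholds standing in for X and Y are all made up for illustration.

```python
import math

def mean_absolute_error(errors_sec):
    """Average magnitude of prediction error, in seconds."""
    return sum(abs(e) for e in errors_sec) / len(errors_sec)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_tolerances(errors_sec, max_mae_sec, max_p90_sec):
    """Return (passed, mae, p90) for signed prediction errors in seconds."""
    mae = mean_absolute_error(errors_sec)
    p90 = percentile([abs(e) for e in errors_sec], 90)
    return (mae <= max_mae_sec and p90 <= max_p90_sec), mae, p90

# Made-up signed errors (predicted minus actual arrival), in seconds.
sample_errors = [-30, 45, 10, -120, 60, 15, -5, 90, 20, -40]
passed, mae, p90 = check_tolerances(sample_errors, max_mae_sec=60, max_p90_sec=120)
print(passed, mae, p90)
```

Because both statistics are computed over whatever predictions a run produces, the check does not depend on the prediction count being identical from run to run, which is the property the frozen CSV comparison lacks.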
Post-capture utilities now live in tools/wmata_capture/:
- `subset_gtfs.py`: slice the full feed to one route with referential closure
- `find_detour_candidates.py`: rank AVL traces by detour-likeness against GTFS shapes
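"Referential closure" here means keeping every row a route's trips point at, so the subset still loads as a valid feed. The sketch below shows the idea on in-memory tables; it is not the code in `subset_gtfs.py`, the function name `subset_feed` is hypothetical, and it ignores less common references such as `parent_station`. The column names are standard GTFS.

```python
def subset_feed(tables, route_id):
    """Return only the rows reachable from `route_id`.

    `tables` maps a GTFS file name to a list of row dicts.
    """
    trips = [t for t in tables["trips.txt"] if t["route_id"] == route_id]
    trip_ids = {t["trip_id"] for t in trips}
    service_ids = {t["service_id"] for t in trips}
    shape_ids = {t["shape_id"] for t in trips if t.get("shape_id")}
    stop_times = [s for s in tables["stop_times.txt"] if s["trip_id"] in trip_ids]
    stop_ids = {s["stop_id"] for s in stop_times}

    def keep(name, column, wanted):
        # Optional files (e.g. shapes.txt) may be absent from a feed.
        return [row for row in tables.get(name, []) if row[column] in wanted]

    return {
        "routes.txt": keep("routes.txt", "route_id", {route_id}),
        "trips.txt": trips,
        "stop_times.txt": stop_times,
        "stops.txt": keep("stops.txt", "stop_id", stop_ids),
        "shapes.txt": keep("shapes.txt", "shape_id", shape_ids),
        "calendar.txt": keep("calendar.txt", "service_id", service_ids),
        "calendar_dates.txt": keep("calendar_dates.txt", "service_id", service_ids),
    }

# Tiny synthetic feed: two routes sharing one stop.
feed = {
    "routes.txt": [{"route_id": "D40"}, {"route_id": "A40"}],
    "trips.txt": [
        {"route_id": "D40", "trip_id": "t1", "service_id": "WK", "shape_id": "s1"},
        {"route_id": "A40", "trip_id": "t2", "service_id": "SA", "shape_id": "s2"},
    ],
    "stop_times.txt": [
        {"trip_id": "t1", "stop_id": "100"},
        {"trip_id": "t1", "stop_id": "200"},
        {"trip_id": "t2", "stop_id": "200"},
        {"trip_id": "t2", "stop_id": "300"},
    ],
    "stops.txt": [{"stop_id": "100"}, {"stop_id": "200"}, {"stop_id": "300"}],
    "shapes.txt": [{"shape_id": "s1"}, {"shape_id": "s2"}],
    "calendar.txt": [{"service_id": "WK"}, {"service_id": "SA"}],
}
subset = subset_feed(feed, "D40")
print({name: len(rows) for name, rows in subset.items()})
```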
Four tests under transitclockIntegration/src/test/java/org/transitclock/integration_tests/
consume WMATA fixtures. Post-refresh: three pass, one stays @Ignored pending a
deterministic-replay or tolerance-based fix:
| Test | Route / vehicle fixture | Status | Issue |
|---|---|---|---|
| `prediction/PredictionAccuracyIntegrationTest` | `D40_5506` (avl; pred baseline blocked) | `@Ignore`: non-determinism, see Step 4b | #8 |
| `RecoverFromDetourTest` | `A40_3151` | passes | #7 (closed by this refresh) |
| `GenerateEffectiveScheduleDifferenceTest` | `5A_8062` | passes | n/a |
| `EffectiveScheduleDifferenceDuringLayoverTest` | `5A_8062` | passes | n/a |

Fixture layout:
- `transitclockIntegration/src/test/resources/gtfs/{A40,D20,D40,5A}/`: unpacked static GTFS per route. `D20` is a spare (no test consumes it yet) kept from the same capture window.
- `transitclockIntegration/src/test/resources/avl/{A40_3151,D20_7198,D40_5506,5A_8062}.csv`: AVL traces.
- `transitclockIntegration/src/test/resources/pred/`: predictor output baseline for the accuracy test. Currently empty; a `D40_5506.csv` baseline will land here once Step 4b is unblocked.

The tests all call PlaybackModule.runTrace(GTFS, AVL), which boots a
real Core against an in-memory database, replays the AVL CSV, and lets
the matcher/prediction pipeline run end-to-end. No mocks.
None of S2, 3T, or 5A exist in today's WMATA GTFS. WMATA's
"Better Bus Network" redesign (effective June 2025) renamed every
route to a <zone-letter><number> scheme. Current short names are all
in A*, C*, D*, F*, M*, P* ranges — 127 routes total, none
matching the 2016 fixture names.
So the refresh can't just re-capture the same route IDs. We have to pick current routes, capture fresh traces, and rename fixtures + update the route-ID constants in the tests.
The goal is to pick routes that (a) exist today, (b) have enough trips/vehicles during the capture window to produce dense traces, and (c) give us a shot at exercising the pinned test scenarios.
Block-id coverage is no longer a selection worry — the old README
caveat about patchy block_id in trips.txt is obsolete post-redesign.
Spot check: 100% of trips in the current GTFS carry a block_id.
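The spot check can be repeated against any future feed with a few lines of Python. This is a sketch rather than part of the capture tooling: the function name `block_id_coverage` is hypothetical, and the inline sample rows are synthetic. `block_id` is the standard GTFS column in `trips.txt`.

```python
import csv
import io

def block_id_coverage(trips_file):
    """Return the fraction of trips.txt rows with a non-empty block_id."""
    total = with_block = 0
    for row in csv.DictReader(trips_file):
        total += 1
        if (row.get("block_id") or "").strip():
            with_block += 1
    return with_block / total if total else 0.0

# Synthetic trips.txt content: two trips with a block_id, one without.
sample = io.StringIO(
    "route_id,service_id,trip_id,block_id\n"
    "D40,WK,t1,b100\n"
    "D40,WK,t2,b100\n"
    "A40,WK,t3,\n"
)
print(f"{block_id_coverage(sample):.0%} of trips carry a block_id")
```

Against a real feed, open the file with `open(path, newline="", encoding="utf-8-sig")` so a leading byte-order mark does not corrupt the first column name.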
Minimum-scope recommendation (fix only what's broken — #7 and #8):
| Replaces | New route | Rationale |
|---|---|---|
| S2 (16th St DC trunk) | D40 "7 St–Georgia Av" | High-frequency DC N-S trunk; 1,600+ trips/day; plenty of predictor signal |
| 3T (VA crosstown) | A40 "Columbia Pike–National Landing" | Busy VA arterial; dense urban section maximizes odds of finding a detour event for the detour test |
If you want variety for future tests, D20 ("H Street") is a good third pick at zero extra capture cost — one process can filter multiple routes. But nothing in the currently-failing test suite needs it.
Prereqs: WMATA_API_KEY in tools/wmata_capture/.env (already present
as of this writing — the capture script auto-loads it).
```bash
cd tools/wmata_capture
nohup uv run capture.py \
  --output-dir ./output/fixtures-$(date +%Y%m%d) \
  --duration-hours 12 \
  --poll-interval 30 \
  --routes D20,D40,A40 \
  > /dev/null 2>&1 &
```

Why these flags:
- `--duration-hours 12`: starting evening PT, 12h captures past DC AM rush (06:00–10:00 ET), which is the richest window per the capture tool's README.
- `--poll-interval 30`: WMATA's GTFS-RT updates roughly every 60s; the script de-dups overlapping observations within a run.
- `--routes D20,D40,A40`: all routes in one process (cheaper than separate processes, and dedup is shared). D40 and A40 are the ones the failing tests need; drop `D20` if you don't want the extra route for variety.

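The flag values can be sanity-checked with a little arithmetic, and the dedup behaviour sketched in a few lines. Both functions below are illustrative only: the fencepost assumption (one poll at the start plus one per interval) and the `(vehicle_id, timestamp)` dedup key are guesses about how `capture.py` behaves, not something taken from its source.

```python
def expected_polls(duration_hours: float, poll_interval_sec: int) -> int:
    """Polls at t=0 and after every interval, through the end of the window."""
    return int(duration_hours * 3600 // poll_interval_sec) + 1

def dedup(observations):
    """Drop repeated observations, keeping first-seen order.

    Assumes (vehicle_id, timestamp) identifies one observation.
    """
    seen = set()
    unique = []
    for obs in observations:
        key = (obs["vehicle_id"], obs["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(obs)
    return unique

print(expected_polls(12, 30))

# Polling every 30s against a feed that updates about every 60s sees most
# positions twice; the second sighting is discarded.
polled = [
    {"vehicle_id": "5506", "timestamp": 1000},
    {"vehicle_id": "5506", "timestamp": 1000},
    {"vehicle_id": "5506", "timestamp": 1060},
]
print(len(dedup(polled)))
```

Under that fencepost assumption a 12-hour run at 30-second intervals gives 1,441 polls, which is consistent with the poll count recorded for the Step 1 capture.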
Monitor live:
```bash
tail -f tools/wmata_capture/output/fixtures-*/capture.log
```

In the morning, survey output:

```bash
cd tools/wmata_capture/output/fixtures-<date>
# Fattest traces per route (more rows = more predictor signal)
ls -S avl/ | head -20
# Sanity: row counts per vehicle
wc -l avl/*.csv | sort -n -r | head -20
# Confirm block_id assignments came through (should be mostly BLOCK, not TRIP_ID)
awk -F, 'NR>1 {print $4}' avl/D40_*.csv | sort | uniq -c
```

Pick one vehicle per route to promote. Criteria:
- For the prediction accuracy test (replaces S2_2113): pick the longest/densest D40 trace; more prediction horizons = stronger signal. Cover the AM rush window if possible.
- For the detour test (replaces 3T_3757): this is the hard one. The test asserts a vehicle exits layover state after going off-route and returning. Not every vehicle contains a detour event. Options, in order of preference:
  1. Manually scan A40 traces for vehicles whose lat/lon briefly departs and rejoins the route shape; those are detour candidates. `transitclockIntegration/src/test/resources/avl/3T_3757.csv` is the reference for what a "good" detour trace looks like.
  2. If no clean detour is found, this test may need to be re-scoped to a more generally observable scenario (e.g. "vehicle recovers schedule adherence after an AVL gap") rather than one-for-one replaced. Flag this as a finding in the PR; don't silently loosen the assertions.

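The manual scan for "departs and rejoins the route shape" can be approximated by measuring each AVL point's distance to the nearest shape segment. The sketch below is one way to do that and is not the logic in `find_detour_candidates.py`; the function names, the 100 m off-route threshold, and the sample coordinates are all assumptions. It uses a flat-earth (equirectangular) projection, which is adequate over a few kilometres, and needs a shape with at least two points.

```python
import math

EARTH_RADIUS_M = 6_371_000.0

def _to_xy(lat, lon, ref_lat):
    """Project lat/lon to local metres (equirectangular approximation)."""
    x = math.radians(lon) * EARTH_RADIUS_M * math.cos(math.radians(ref_lat))
    y = math.radians(lat) * EARTH_RADIUS_M
    return x, y

def _dist_to_segment(p, a, b):
    """Distance from point p to the segment a-b, all in projected metres."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        t = 0.0
    else:
        # Clamp the projection so the nearest point stays on the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def off_route_distances(trace, shape):
    """Metres from each (lat, lon) AVL point to the nearest shape segment."""
    ref_lat = shape[0][0]
    shape_xy = [_to_xy(lat, lon, ref_lat) for lat, lon in shape]
    segments = list(zip(shape_xy, shape_xy[1:]))
    return [
        min(_dist_to_segment(_to_xy(lat, lon, ref_lat), a, b) for a, b in segments)
        for lat, lon in trace
    ]

def detour_summary(trace, shape, threshold_m=100.0):
    """Summarise how far and how often a trace strays from the shape."""
    distances = off_route_distances(trace, shape)
    off = [d > threshold_m for d in distances]
    return {
        "max_excursion_m": max(distances),
        "off_route_points": sum(off),
        "starts_on_route": not off[0],
        "ends_on_route": not off[-1],
    }

# Synthetic example: an east-west shape and a trace that strays north once.
shape = [(38.90, -77.05), (38.90, -77.00)]
trace = [(38.90, -77.04), (38.905, -77.03), (38.90, -77.01)]
print(detour_summary(trace, shape))
```

A trace that starts and ends on-route with a sizeable maximum excursion in between is the pattern to look for; traces that end off-route are poor candidates because the test needs the vehicle to rejoin the route.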
From the repo root, for each route (example below shows D40 replacing S2):
```bash
# 1. Swap GTFS
rm -rf transitclockIntegration/src/test/resources/gtfs/S2
cp -R tools/wmata_capture/output/fixtures-<date>/gtfs \
      transitclockIntegration/src/test/resources/gtfs/D40
# 2. Swap AVL CSV (pick the vehicle from Step 2)
rm transitclockIntegration/src/test/resources/avl/S2_2113.csv
cp tools/wmata_capture/output/fixtures-<date>/avl/D40_<vehicle>.csv \
   transitclockIntegration/src/test/resources/avl/D40_<vehicle>.csv
```

Repeat for A40 replacing 3T. The GTFS dir in step 1 is the whole feed
unpacked; `PlaybackModule.runTrace` only uses the route referenced by
the test, but it loads the full dir. Copying the full GTFS twice (once
per route directory) is fine and matches the current layout.
Java files to edit (these are the exact occurrences as of this writing):
- `transitclockIntegration/src/test/java/org/transitclock/integration_tests/prediction/PredictionAccuracyIntegrationTest.java`
  - `GTFS = "src/test/resources/gtfs/S2"` → new path (done in this refresh: `gtfs/D40`)
  - `AVL = "src/test/resources/avl/S2_2113.csv"` → new path (done: `avl/D40_5506.csv`)
  - `PREDICTIONS_CSV = "src/test/resources/pred/S2_2113.csv"` → new path (not yet pointed at a real baseline; see Step 4b)
  - `@Ignore(...)`: keep until the non-determinism blocker is resolved (see "Progress so far" / "Re-enabling testPredictions"). Removing it now reintroduces run-to-run flakes, not a regression signal.
- `transitclockIntegration/src/test/java/org/transitclock/integration_tests/RecoverFromDetourTest.java`
  - `GTFS`, `AVL`, `VEHICLE` → new values (done: `A40`, `A40_3151.csv`, `3151`)
  - `@Ignore(...)`: remove (done: the test is live and passing)

Do not touch the two 5A tests unless you're intentionally
expanding scope — they currently pass and changing their fixtures means
re-validating their assertions.
Blocked as of this refresh. Replaying the same AVL trace in separate JVMs produces different prediction counts run-to-run (see "Progress so far" Step 4b), so a frozen CSV baseline can't serve as a regression signal yet. Either resolve the non-determinism or move the test to tolerance-based assertions before following the steps below. Until then, leave `@Ignore` on `PredictionAccuracyIntegrationTest`.

The baseline is the output of the current predictor replayed against the new AVL trace. It's not produced by the capture tool. Process once Step 4b is unblocked:

1. With `@Ignore` removed on `PredictionAccuracyIntegrationTest` and the new fixtures in place, run the test once:

   ```bash
   mvn -pl transitclockIntegration test \
     -Dtest=PredictionAccuracyIntegrationTest \
     -P include-integration-tests
   ```

2. The test's `setUp()` fetches `List<Prediction>` from Hibernate (`PredictionAccuracyIntegrationTest.java:80`). Dump those rows to a CSV with the exact same header as the old `pred/S2_2113.csv`:

   ```
   id,affectedByWaitStop,avlTime,configRev,creationTime,gtfsStopSeq,isArrival,predictionTime,routeId,schedBasedPred,stopId,tripId,vehicleId
   ```

   Easiest approach: add a one-shot `@Before`/`@After` hook (or a scratch `@Test` that you delete afterwards) that writes the CSV before the assertions run. The existing test code already binds to the same `Prediction` entity, so reuse its fields.
3. Save the CSV as `transitclockIntegration/src/test/resources/pred/D40_5506.csv` and point `PREDICTIONS_CSV` at it.
4. Re-run the test. It compares new predictions against the baseline you just generated, so on first pass it should be close to a no-op (new ≈ old). The assertions (`newTotalPreds >= oldTotalPreds`, `newTotalError <= oldTotalError`, `oldBetter/bothTotalPreds <= 0.5`) then mean "the predictor must not regress against this captured baseline in future." That's the regression signal the fixture refresh is restoring.
5. Leave `@Ignore` off once the baseline is stable. Re-adding it defeats the refresh.

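Since the dumped baseline has to carry the exact same header as the old file, a quick check before committing it can catch a reordered or misspelled column. This is a small Python sketch, separate from the Java test code; the function name `header_matches` is hypothetical, and the column list is copied from the header quoted in the steps above.

```python
import csv
import io

# Column order of the legacy pred/S2_2113.csv baseline.
EXPECTED_HEADER = [
    "id", "affectedByWaitStop", "avlTime", "configRev", "creationTime",
    "gtfsStopSeq", "isArrival", "predictionTime", "routeId",
    "schedBasedPred", "stopId", "tripId", "vehicleId",
]

def header_matches(csv_file) -> bool:
    """True when the first CSV row equals the legacy header, in the same order."""
    header = next(csv.reader(csv_file), None)
    return header == EXPECTED_HEADER

good = io.StringIO(",".join(EXPECTED_HEADER) + "\n")
# Same columns with the first two swapped: should be rejected.
swapped = io.StringIO(",".join([EXPECTED_HEADER[1], EXPECTED_HEADER[0]] + EXPECTED_HEADER[2:]) + "\n")
print(header_matches(good), header_matches(swapped))
```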
```bash
# Run just the two previously-ignored tests
mvn -pl transitclockIntegration test -P include-integration-tests \
  -Dtest=PredictionAccuracyIntegrationTest,RecoverFromDetourTest
# Then run the whole integration suite to ensure 5A tests still pass
mvn -pl transitclockIntegration -am test -P include-integration-tests
# Then the full reactor with everything enabled
mvn install -P run-all-tests
```

- Don't commit the WMATA API key. `.env` is gitignored; verify with `git status` before committing fixture changes.
- Don't re-add `@Ignore` after regenerating the pred baseline. The whole point of the refresh is to make those tests live signal again. If the test fails on first live run, investigate; don't silence it.
- Don't break the passing 5A tests. They use fixtures named `5A` on disk. If you're tempted to rename `5A` → something current for tidy consistency, stop: that means re-validating the 5A tests' assertions against new data, which is out of scope for the "fix broken tests" goal. Leave 5A alone unless the user explicitly expands scope.
- Verify block-assignment type in the captured AVL. Column 4 (`assignmentType`) should be mostly `BLOCK`, with `TRIP_ID` fallback only for the minority of trips where WMATA didn't populate `block_id`. Current GTFS shows 100% `block_id` coverage, so `TRIP_ID` fallbacks should be rare. Heavy `TRIP_ID` usage means something changed and the promotion should pause.
- The capture's timezone is `America/New_York` regardless of the host clock. `BatchCsvAvlFeedModule` parses with the JVM default TZ, so tests must run with `-Duser.timezone=America/New_York` or equivalent. Check how the existing suite sets this before assuming it'll "just work" on a Pacific-time dev box.
- Don't resume into the same `--output-dir`. The capture's dedup state isn't persisted across runs; resuming produces duplicates. Always start a new timestamped dir.
- The detour test may need more than a fixture swap. Real detour events are rare in any 12-hour capture. If you can't find one, document that in the PR and propose either (a) a longer/targeted capture, or (b) re-scoping the test. Don't paper over it by weakening assertions.
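To see why the timezone item in the list above matters, the sketch below parses one zone-less timestamp under Eastern and Pacific offsets and measures the gap. It is a Python analogue of the JVM behaviour, not the `BatchCsvAvlFeedModule` code. The timestamp format and the sample value are assumptions, and the fixed UTC-4 / UTC-7 offsets are valid only because the capture date (2026-04-23) falls in daylight time for both zones.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the capture date; both zones are on daylight time in late April.
EASTERN = timezone(timedelta(hours=-4), "EDT")
PACIFIC = timezone(timedelta(hours=-7), "PDT")

def to_epoch(naive_text: str, tz: timezone) -> int:
    """Interpret a zone-less timestamp in `tz` and return epoch seconds."""
    naive = datetime.strptime(naive_text, "%Y-%m-%d %H:%M:%S")
    return int(naive.replace(tzinfo=tz).timestamp())

stamp = "2026-04-23 07:30:00"  # hypothetical AVL timestamp during the AM rush
skew = to_epoch(stamp, PACIFIC) - to_epoch(stamp, EASTERN)
print(f"parsing as Pacific shifts the trace by {skew // 3600} h")
```

A three-hour shift moves every AVL report well away from the scheduled trip times in the GTFS, so matching and schedule-adherence assertions would be evaluated against the wrong trips.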