[Spider2-E] spider2-snow: gold corrections and evaluation patches by AyanKBhowmick · Pull Request #191 · xlang-ai/Spider2

AyanKBhowmick · 2026-05-05T10:48:03Z

Summary

This PR upstreams the gold corrections and evaluation patches developed in EmergenceAI/Spider2-E — a community-maintained fork of spider2-snow — to bring its gold artifacts and evaluation behavior in line with what is reproducible against live Snowflake as of April 2026.

Every change in this PR has its own per-instance justification in the upstream Spider2-E repo. The PR squashes the changes into one commit for upstreaming convenience — happy to re-split into per-category commits or per-category PRs if you'd prefer.

Changes

1. 23 instances removed — gold no longer reproducible

Both the test entry and all gold artifacts (gold SQL if present, all gold result CSV variants, eval-config row) were deleted.

1a. Snowflake data shares revoked (3)

SHOW DATABASES returns a tombstone; queries return 003030 (02000): Shared database is no longer available for use.

| Instance | db_id |
| --- | --- |
| `sf009` | `NETHERLANDS_OPEN_MAP_DATA` |
| `sf013` | `NETHERLANDS_OPEN_MAP_DATA` |
| `sf029` | `AMAZON_VENDOR_ANALYTICS__SAMPLE_DATASET` |

These cannot be re-enabled until the publishers republish the shares.
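
A harness could flag this failure mode by matching the error code quoted above. The following is a minimal sketch, assuming the Snowflake error text reaches the caller as a plain string; `is_revoked_share_error` is a hypothetical helper and is not part of `evaluate.py`.

```python
import re

# Error code and SQLSTATE quoted in this PR for revoked data shares.
REVOKED_SHARE_PATTERN = re.compile(r"\b003030\s*\(02000\)")


def is_revoked_share_error(message: str) -> bool:
    """Return True if an error message carries the revoked-share code."""
    return bool(REVOKED_SHARE_PATTERN.search(message))
```

Matching on the numeric code rather than the English message keeps the check stable if the wording of the message changes.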

1b. Live data drift (20)

Prompts are well-formed and the SQL executes, but the underlying table has drifted (rolling windows, refreshed counts, regenerated identifiers, granularity shifts) since the upstream gold was captured. Capturing today's output as a new variant only buys correctness for one refresh cycle, so removal is the only stable answer.

| Instance | db_id | Drift symptom |
| --- | --- | --- |
| `sf006` | `FINANCE__ECONOMICS` | Refreshed branch counts and pct-change values |
| `sf008` | `US_REAL_ESTATE` | Home price index value drifted |
| `sf012` | `WEATHER__ENVIRONMENT` | Cybersyn-managed FEMA NFIP table is mutable; 2012 values disagree across the 3 upstream gold variants |
| `sf037` | `US_REAL_ESTATE` | POI_ID format changed from hex to UUID |
| `sf040` | `US_ADDRESSES__POI` | Top-10 northernmost addresses shift as records are added |
| `sf_bq009` | `GA360` | Revenue diff drifted |
| `sf_bq024` | `USFS_FIA` | EVALUATION_TYPE shifted |
| `sf_bq058` | `GOOG_BLOCKCHAIN` | Optimism blockchain data not present in this Snowflake share |
| `sf_bq063` | `DEPS_DEV_V1` | Top-result GitHub URL shifts as npm versions are published |
| `sf_bq102` | `GNOMAD` | Genomic start position moved (assembly/annotation update) |
| `sf_bq130` | `COVID19_NYT` | Result granularity shifted from county-level to state-level |
| `sf_bq165` | `TCGA_MITELMAN` | Cohort total and per-band counts shifted |
| `sf_bq190` | `THELOOK_ECOMMERCE` | Oldest-female-user count drifted |
| `sf_bq249` | `GITHUB_REPOS` | Whitespace-category counts shifted as repo data grew |
| `sf_bq256` | `CRYPTO` | Final Ethereum balance value sign flipped |
| `sf_bq275` | `GA360` | Sessions are rolling/dynamic; the `fullVisitorId` set changes per capture |
| `sf_bq366` | `THE_MET` | Same rows as gold, different ordering (font/drawing swap) |
| `sf_ga014` | `GA4` | Live GA4 session counts drift |
| `sf_ga018` | `GA4` | PLP-to-PDP conversion ratio changed |
| `sf_ga021` | `FIREBASE` | Cohort event-type now resolves to a different value |

2. Gold SQL changes (3 removed, 1 added)

| Instance | What changed |
| --- | --- |
| `sf_bq294` | Upstream gold embeds `EXTRACT(YEAR FROM CURRENT_DATE)`, which is calendar-dependent. Replaced with a deterministic version anchoring the reference year at literal `2025` (matching when the upstream gold result CSVs were captured). Verified to score 1 against all 3 upstream gold variants under the eval `condition_cols`. |
| `sf012` | Non-deterministic; instance also fully removed in §1b. |
| `sf040` | Non-deterministic; instance also fully removed in §1b. |

Net SQL file count: 120 → 118 (−3 + 1).
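
The calendar dependence behind the `sf_bq294` change can be illustrated outside SQL. This is a minimal Python sketch, assuming the query derives an age from a birth year; the function names are hypothetical and the logic is not taken from the gold SQL itself.

```python
from datetime import date

REFERENCE_YEAR = 2025  # literal anchor named in this PR for the replacement gold SQL


def age_calendar_dependent(birth_year: int) -> int:
    # Mirrors EXTRACT(YEAR FROM CURRENT_DATE): the result changes each January.
    return date.today().year - birth_year


def age_anchored(birth_year: int, reference_year: int = REFERENCE_YEAR) -> int:
    # Deterministic: the same input always yields the same age.
    return reference_year - birth_year
```

Any filter or bucket built on the first function can move rows across a threshold when the year rolls over, which is why a result captured in one year stops matching in the next.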

3. Gold result CSV variants added (5)

For these instances the agent's SQL is structurally correct but Snowflake legitimately produces different output from the upstream gold because of engine differences. Today's Snowflake output is added as an additional accepted variant — original gold variants are untouched; evaluate.py accepts any matching variant.

| Instance | New variant | Reason |
| --- | --- | --- |
| `sf_bq111` | `sf_bq111_f.csv` | BigQuery `corr_pvalue` UDF has no Snowflake equivalent |
| `sf_bq276` | `sf_bq276_d.csv` | Spatial predicate evaluates port/storm geometry differently |
| `sf_bq430` | `sf_bq430_f.csv` | UUID/hash generation from activity IDs and SMILES strings differs |
| `sf_bq458` | `sf_bq458_e.csv` | `UNNEST` vs `LATERAL FLATTEN` of article-vector arrays drops different rows |
| `sf_local344` | `sf_local344_c.csv` | Float-boundary edge case yields a different overtake count |

Note: sf_bq458_e.csv was captured on the SNOWFLAKE_LEARNING_WH warehouse. The default COMPUTE_WH_PARTICIPANT warehouse has a 120s statement timeout the query exceeds (~258s actual runtime).
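
The any-variant acceptance rule can be sketched as follows. This is a simplified illustration, assuming results are sequences of row tuples compared as unordered multisets; the actual comparison in `evaluate.py` (column selection via `condition_cols`, numeric tolerance) is not reproduced here, and `matches_any_variant` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Row = Tuple[object, ...]


def matches_any_variant(
    predicted: Sequence[Row], gold_variants: Iterable[Sequence[Row]]
) -> bool:
    """Return True if the predicted rows equal at least one gold variant,
    ignoring row order but respecting duplicate rows."""
    predicted_counts = Counter(predicted)
    return any(predicted_counts == Counter(variant) for variant in gold_variants)
```

Because a match against any single variant is enough, adding a variant can only turn failures into passes; it never invalidates a submission that matched an original variant.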

4. `evaluate.py` patches

4a. Hardcoded-SQL rejection (default on; opt-out via --allow_hardcoded)

Submissions that bake gold values into CASE lookup tables, or that return pure literals with no FROM clause, are rejected with score = 0 before execution instead of being allowed to trivially match the expected output.

| Signal | Triggers when | Action |
| --- | --- | --- |
| `NO_FROM` | `SELECT` with no `FROM` clause (pure literal return) | Reject |
| `HARDCODED_CASE_MAP` | ≥ 2 `WHEN 'literal' THEN 'literal'` pairs where THEN values are strings ≥ 20 chars | Reject |
| `VALUES_CLAUSE` | SQL contains a `VALUES (...)` constructor | Warn only |
| `UNION_ALL_LITERALS` | ≥ 2 pure-literal `SELECT` blocks joined by `UNION ALL` | Warn only |

HARDCODED_CASE_MAP has zero known false positives across the 547 upstream instances. Earlier draft signals (overlap between SQL literals and gold CSV cells, IN (...) filter contents) were dropped — they could not be distinguished from legitimate enum filtering on documented schemas (NAICS, SNOMED, DICOM transfer-syntax UIDs, well-known contract addresses, etc.).
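
The two rejecting signals can be approximated with regular expressions. The sketch below is an assumption about how such a check might look, not the code in `evaluate.py`; `detect_signals` and `should_reject` are hypothetical names, and the warn-only signals are left out.

```python
import re
from typing import List

# A single-quoted SQL string literal, allowing doubled '' escapes.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# WHEN '<literal>' THEN '<literal of at least 20 characters>'
_CASE_PAIR = re.compile(
    r"\bWHEN\s+'(?:[^']|'')*'\s+THEN\s+'((?:[^']|''){20,})'",
    re.IGNORECASE,
)


def detect_signals(sql: str) -> List[str]:
    """Return the rejecting signals (NO_FROM, HARDCODED_CASE_MAP) found in sql."""
    signals = []
    # Blank out literals so a 'from' inside a string is not mistaken for a keyword.
    stripped = _STRING_LITERAL.sub("''", sql)
    has_select = re.search(r"\bSELECT\b", stripped, re.IGNORECASE)
    # Naive keyword test: the FROM inside EXTRACT(YEAR FROM ...) also counts.
    has_from = re.search(r"\bFROM\b", stripped, re.IGNORECASE)
    if has_select and not has_from:
        signals.append("NO_FROM")
    if len(_CASE_PAIR.findall(sql)) >= 2:
        signals.append("HARDCODED_CASE_MAP")
    return signals


def should_reject(sql: str, allow_hardcoded: bool = False) -> bool:
    """Reject (score = 0) unless the opt-out flag is set."""
    return not allow_hardcoded and bool(detect_signals(sql))
```

A regex pass like this runs before any query is executed, so a rejected submission costs no warehouse time. A production check would likely need a real SQL parser to handle comments, nested subqueries, and quoted identifiers.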

4b. Empty-result handling

Upstream short-circuits any predicted SQL returning zero rows as a failure, even when the gold is also empty. Patched: empty-vs-empty matches; empty-vs-non-empty still fails. No change for non-empty results.
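
The patched rule reduces to a small decision function. This is a minimal sketch under the assumption that results are available as row sequences; `empty_result_verdict` is a hypothetical name, and the function only decides the cases described above.

```python
from typing import Optional, Sequence


def empty_result_verdict(
    predicted_rows: Sequence, gold_rows: Sequence
) -> Optional[bool]:
    """Decide the outcome when the predicted result is empty.

    Returns True for empty-vs-empty, False for empty-vs-non-empty,
    and None when the predicted result is non-empty, meaning the
    normal comparison should run unchanged.
    """
    if predicted_rows:
        return None
    return not gold_rows
```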

4c. condition_cols padding

When the per-instance eval config has fewer condition_cols entries than gold result variants, missing entries are padded with [] (compare full row) instead of crashing with IndexError. This is a defensive default for any future variant additions.
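
The padding step can be sketched in a few lines. The shape of `condition_cols` (one list of column indices per variant) is an assumption based on the description above, and `pad_condition_cols` is a hypothetical name.

```python
from typing import List


def pad_condition_cols(condition_cols: List[list], num_variants: int) -> List[list]:
    """Return condition_cols extended with [] so there is one entry per variant.

    An empty entry is taken to mean "compare the full row" for that variant.
    """
    missing = max(0, num_variants - len(condition_cols))
    return list(condition_cols) + [[] for _ in range(missing)]
```

For example, `pad_condition_cols([[0, 1]], 3)` yields `[[0, 1], [], []]`, so indexing by variant position no longer raises `IndexError`.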

Verification

- Test set: 547 → 524 instances (−23)
- Gold SQL: 120 → 118 files (−3 + 1)
- Gold result CSVs: 1,544 → 1,469 (−80 + 5)
- New sf_bq294.sql verified to score 1 against all 3 upstream gold variants under the existing condition_cols.
- The 5 new gold result variants were captured by direct execution of correct agent SQL against live Snowflake in late April 2026.
- Per-instance evidence (reproductions, error messages, before/after diffs) lives in the EmergenceAI/Spider2-E repo.

Out of scope

Deliberately not included:

- Any changes to resource/, schemas, or supporting docs (verbatim from upstream).
- Any changes to task instructions, prompts, or the eval-suite scaffolding (CLI surface, threading, output format unchanged).
- Any new dependencies.

Notes for reviewers

- Single commit by design: the four categories are forensic findings from one audit pass and are most coherent together. Happy to re-split if you'd prefer per-category commits, or to break this into separate PRs (e.g. (1) share-revocation + non-det gold, (2) data-drift removals, (3) engine-diff variants, (4) evaluate.py patches) if some categories are more controversial.
- Happy to address evaluate.py policy questions (the rejection rule is opinionated) in a separate thread if that's a barrier to merging the gold-artifact corrections.

Corrects gold artifacts and evaluation behavior for spider2-snow based on issues observed running the benchmark against live Snowflake in April 2026. Originated as the EmergenceAI Spider2-E fork; squashed here for upstreaming.

Test set & gold result CSVs (spider2-snow.jsonl, spider2snow_eval.jsonl, gold/exec_result/):
- Removed 23 instances whose gold can no longer be reproduced. 3 due to Snowflake data shares revoked by their publishers (sf009, sf013, sf029). 20 due to live data drift on rolling/refreshed datasets where capturing today's output only buys correctness for one cycle.
- Added 5 new gold result CSV variants for instances where Snowflake legitimately produces different output than upstream BigQuery-derived gold (sf_bq111, sf_bq276, sf_bq430, sf_bq458, sf_local344). Original variants are untouched; evaluate.py accepts any matching variant.

Gold SQL (gold/sql/):
- Removed 2 non-deterministic upstream gold SQLs whose instances are also removed above (sf012, sf040).
- Replaced sf_bq294's gold SQL: upstream embeds EXTRACT(YEAR FROM CURRENT_DATE) which makes age comparisons calendar-dependent. New SQL anchors the reference year at literal 2025 (the year the upstream gold result CSVs were captured) and reproduces all 3 upstream gold result variants under the eval condition_cols.

evaluation_suite/evaluate.py:
- Reject SQL submissions that bake gold values into CASE lookup tables (HARDCODED_CASE_MAP signal) or have no FROM clause (NO_FROM signal). --allow_hardcoded restores the prior behavior.
- Treat empty predicted result vs. empty gold result as a match instead of failing with "No data found for the specified query."
- Pad missing condition_cols entries with [] when the per-instance eval config has fewer entries than gold result variants, instead of IndexError.

AyanKBhowmick changed the title ~~spider2-snow: gold corrections and evaluation patches~~ [Spider2-E] spider2-snow: gold corrections and evaluation patches May 5, 2026

AyanKBhowmick wants to merge 1 commit into xlang-ai:main from AyanKBhowmick:EmergenceAI/Spider2-E
