[Spider2-E] spider2-snow: gold corrections and evaluation patches#191

Open
AyanKBhowmick wants to merge 1 commit into xlang-ai:main from AyanKBhowmick:EmergenceAI/Spider2-E

AyanKBhowmick commented May 5, 2026

Summary

This PR upstreams the gold corrections and evaluation patches developed in EmergenceAI/Spider2-E — a community-maintained fork of spider2-snow — to bring its gold artifacts and evaluation behavior in line with what is reproducible against live Snowflake as of April 2026.

Every change in this PR has its own per-instance justification in the upstream Spider2-E repo. The PR squashes the changes into one commit for upstreaming convenience — happy to re-split into per-category commits or per-category PRs if you'd prefer.

Changes

1. 23 instances removed — gold no longer reproducible

For each removed instance, both the test entry and all gold artifacts (gold SQL if present, all gold result CSV variants, eval-config row) are deleted.

1a. Snowflake data shares revoked (3)

SHOW DATABASES returns a tombstone; queries return 003030 (02000): Shared database is no longer available for use.

| Instance | db_id |
| --- | --- |
| sf009 | NETHERLANDS_OPEN_MAP_DATA |
| sf013 | NETHERLANDS_OPEN_MAP_DATA |
| sf029 | AMAZON_VENDOR_ANALYTICS__SAMPLE_DATASET |

These cannot be re-enabled until the publishers republish the shares.

1b. Live data drift (20)

Prompts are well-formed and the SQL executes, but the underlying table has drifted (rolling windows, refreshed counts, regenerated identifiers, granularity shifts) since the upstream gold was captured. Capturing today's output as a new variant only buys correctness for one cycle — removal is the only stable answer.

| Instance | db_id | Drift symptom |
| --- | --- | --- |
| sf006 | FINANCE__ECONOMICS | Refreshed branch counts and pct-change values |
| sf008 | US_REAL_ESTATE | Home price index value drifted |
| sf012 | WEATHER__ENVIRONMENT | Cybersyn-managed FEMA NFIP table is mutable; 2012 values disagree across the 3 upstream gold variants |
| sf037 | US_REAL_ESTATE | POI_ID format changed from hex to UUID |
| sf040 | US_ADDRESSES__POI | Top-10 northernmost addresses shift as records are added |
| sf_bq009 | GA360 | Revenue diff drifted |
| sf_bq024 | USFS_FIA | EVALUATION_TYPE shifted |
| sf_bq058 | GOOG_BLOCKCHAIN | Optimism blockchain data not present in this Snowflake share |
| sf_bq063 | DEPS_DEV_V1 | Top-result GitHub URL shifts as npm versions are published |
| sf_bq102 | GNOMAD | Genomic start position moved (assembly/annotation update) |
| sf_bq130 | COVID19_NYT | Result granularity shifted from county-level to state-level |
| sf_bq165 | TCGA_MITELMAN | Cohort total and per-band counts shifted |
| sf_bq190 | THELOOK_ECOMMERCE | Oldest-female-user count drifted |
| sf_bq249 | GITHUB_REPOS | Whitespace-category counts shifted as repo data grew |
| sf_bq256 | CRYPTO | Final Ethereum balance value sign flipped |
| sf_bq275 | GA360 | Sessions are rolling/dynamic; fullVisitorId set changes per capture |
| sf_bq366 | THE_MET | Same rows as gold, different ordering (font/drawing swap) |
| sf_ga014 | GA4 | Live GA4 session counts drift |
| sf_ga018 | GA4 | PLP-to-PDP conversion ratio changed |
| sf_ga021 | FIREBASE | Cohort event-type now resolves to a different value |

2. Gold SQL changes (3 removed, 1 added)

| Instance | What changed |
| --- | --- |
| sf_bq294 | Upstream gold embeds EXTRACT(YEAR FROM CURRENT_DATE), which is calendar-dependent. Replaced with a deterministic version anchoring the reference year at literal 2025 (matching when the upstream gold result CSVs were captured). Verified to score 1 against all 3 upstream gold variants under the eval condition_cols. |
| sf012 | Non-deterministic; instance also fully removed in §1b. |
| sf040 | Non-deterministic; instance also fully removed in §1b. |

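Net SQL file count: 120 → 118 (−3 + 1). The three removals are the old sf_bq294, sf012, and sf040 files; the one addition is the new sf_bq294 file.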
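
The effect of the sf_bq294 change can be illustrated with a small Python sketch. It assumes the query derives an age from a reference year, as the commit message describes; the function names are hypothetical and are not taken from the gold SQL.

```python
from datetime import date

# Literal anchor described for the new sf_bq294 gold SQL.
REFERENCE_YEAR = 2025


def age_calendar_dependent(birth_year: int, today: date) -> int:
    """Mirrors EXTRACT(YEAR FROM CURRENT_DATE): the result changes every January."""
    return today.year - birth_year


def age_anchored(birth_year: int) -> int:
    """Mirrors the replacement: the result is fixed regardless of the run date."""
    return REFERENCE_YEAR - birth_year


# The same row yields different ages depending on when the query runs...
print(age_calendar_dependent(1990, date(2025, 6, 1)))
print(age_calendar_dependent(1990, date(2026, 6, 1)))
# ...while the anchored version always agrees with the 2025 capture.
print(age_anchored(1990))
```

Any age threshold applied to the first function can move rows in or out of the result across a year boundary, which is why a gold result captured in 2025 stops matching later.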

3. Gold result CSV variants added (5)

For these instances the agent's SQL is structurally correct but Snowflake legitimately produces different output from the upstream gold because of engine differences. Today's Snowflake output is added as an additional accepted variant — original gold variants are untouched; evaluate.py accepts any matching variant.

| Instance | New variant | Reason |
| --- | --- | --- |
| sf_bq111 | sf_bq111_f.csv | BigQuery corr_pvalue UDF has no Snowflake equivalent |
| sf_bq276 | sf_bq276_d.csv | Spatial predicate evaluates port/storm geometry differently |
| sf_bq430 | sf_bq430_f.csv | UUID/hash generation from activity IDs and SMILES strings differs |
| sf_bq458 | sf_bq458_e.csv | UNNEST vs LATERAL FLATTEN of article-vector arrays drops different rows |
| sf_local344 | sf_local344_c.csv | Float-boundary edge case yields a different overtake count |

Note: sf_bq458_e.csv was captured on the SNOWFLAKE_LEARNING_WH warehouse. The default COMPUTE_WH_PARTICIPANT warehouse has a 120s statement timeout the query exceeds (~258s actual runtime).
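
The "any matching variant" rule can be sketched as follows. This is a deliberately simplified illustration, not the comparison evaluate.py performs: it assumes rows are tuples and treats a match as order-insensitive row equality, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Row = Tuple[object, ...]


def score_against_variants(pred_rows: Sequence[Row],
                           gold_variants: Iterable[Sequence[Row]]) -> int:
    """Return 1 if the prediction equals at least one gold variant, else 0.

    Rows are compared as multisets, so row order is ignored (an assumption
    made for this sketch).
    """
    pred = Counter(pred_rows)
    return int(any(pred == Counter(variant) for variant in gold_variants))


# Adding a new variant can only turn a 0 into a 1; it never invalidates
# a prediction that matched one of the original variants.
upstream_variant = [(1, "a"), (2, "b")]
snowflake_variant = [(1, "a"), (3, "c")]
print(score_against_variants([(3, "c"), (1, "a")],
                             [upstream_variant, snowflake_variant]))
```

Because variants are additive, leaving the original gold files untouched keeps previously passing submissions passing.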

4. evaluate.py patches

4a. Hardcoded-SQL rejection (default on; opt-out via --allow_hardcoded)

Submissions that bake gold values into CASE lookup tables, or that return pure literals with no FROM clause, are rejected with score = 0 before execution, instead of being allowed to trivially match the expected output.

| Signal | Triggers when | Action |
| --- | --- | --- |
| NO_FROM | SELECT with no FROM clause (pure literal return) | Reject |
| HARDCODED_CASE_MAP | ≥ 2 WHEN 'literal' THEN 'literal' pairs where THEN values are strings ≥ 20 chars | Reject |
| VALUES_CLAUSE | SQL contains a VALUES (...) constructor | Warn only |
| UNION_ALL_LITERALS | ≥ 2 pure-literal SELECT blocks joined by UNION ALL | Warn only |

HARDCODED_CASE_MAP has zero known false positives across the 547 upstream instances. Earlier draft signals (overlap between SQL literals and gold CSV cells, IN (...) filter contents) were dropped — they could not be distinguished from legitimate enum filtering on documented schemas (NAICS, SNOMED, DICOM transfer-syntax UIDs, well-known contract addresses, etc.).
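
The four signals in the table could be approximated with regular expressions along the following lines. This is a rough sketch under stated assumptions, not the code in evaluate.py: the function names, the regexes, and the dictionary return shape are all hypothetical, and a regex cannot tell a FROM clause apart from the FROM keyword inside EXTRACT(YEAR FROM ...).

```python
import re

_LITERAL = r"'(?:[^']|'')*'"  # single-quoted SQL string, '' is an escaped quote
_CASE_PAIR = re.compile(
    rf"WHEN\s+{_LITERAL}\s+THEN\s+'((?:[^']|'')*)'", re.IGNORECASE
)
_SELECT = re.compile(r"\bSELECT\b", re.IGNORECASE)
_FROM = re.compile(r"\bFROM\b", re.IGNORECASE)
_VALUES = re.compile(r"\bVALUES\s*\(", re.IGNORECASE)
_UNION_ALL = re.compile(r"\bUNION\s+ALL\b", re.IGNORECASE)


def detect_signals(sql: str) -> dict:
    """Map each triggered signal name to its action ("reject" or "warn")."""
    signals = {}
    # Blank out string literals so keywords inside them are not counted.
    stripped = re.sub(_LITERAL, "''", sql)

    if _SELECT.search(stripped) and not _FROM.search(stripped):
        signals["NO_FROM"] = "reject"

    # Count WHEN 'x' THEN 'y' pairs whose THEN literal has 20+ characters.
    long_thens = [m.group(1) for m in _CASE_PAIR.finditer(sql)
                  if len(m.group(1)) >= 20]
    if len(long_thens) >= 2:
        signals["HARDCODED_CASE_MAP"] = "reject"

    if _VALUES.search(stripped):
        signals["VALUES_CLAUSE"] = "warn"

    # A block is "pure literal" here if it has SELECT but no FROM.
    blocks = _UNION_ALL.split(stripped)
    literal_blocks = [b for b in blocks
                      if _SELECT.search(b) and not _FROM.search(b)]
    if len(literal_blocks) >= 2:
        signals["UNION_ALL_LITERALS"] = "warn"

    return signals


def is_rejected(signals: dict, allow_hardcoded: bool = False) -> bool:
    """Score 0 before execution unless the opt-out flag is set."""
    return (not allow_hardcoded) and "reject" in signals.values()
```

A legitimate enum filter such as `WHERE kind IN ('x', 'y')` triggers none of these signals, which is consistent with the decision to drop the IN (...) and literal-overlap checks.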

4b. Empty-result handling

Upstream short-circuits any predicted SQL returning zero rows as a failure, even when the gold is also empty. Patched: empty-vs-empty matches; empty-vs-non-empty still fails. No change for non-empty results.
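
A minimal sketch of the patched rule, assuming results are available as lists of rows. The function name and the convention of returning None to defer to the normal comparison are hypothetical.

```python
from typing import Optional, Sequence


def score_if_prediction_empty(pred_rows: Sequence, gold_rows: Sequence) -> Optional[int]:
    """Decide the score early only when the predicted result has zero rows.

    Returns 1 for empty-vs-empty, 0 for empty-vs-non-empty, and None when
    the prediction is non-empty so the usual comparison runs unchanged.
    """
    if len(pred_rows) == 0:
        return 1 if len(gold_rows) == 0 else 0
    return None


print(score_if_prediction_empty([], []))          # empty vs. empty
print(score_if_prediction_empty([], [(1,)]))      # empty vs. non-empty
print(score_if_prediction_empty([(1,)], [(1,)]))  # deferred to normal comparison
```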

4c. condition_cols padding

When per-instance eval config has fewer condition_cols entries than gold result variants, missing entries are padded with [] (compare full row) instead of crashing with IndexError. Defensive default for any future variant additions.
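
The padding step might look like the sketch below. The function name is hypothetical; the only behavior taken from the description above is that missing entries become [] and that [] means "compare the full row".

```python
from typing import List


def pad_condition_cols(condition_cols: List[List[int]],
                       n_variants: int) -> List[List[int]]:
    """Return one condition_cols entry per gold variant.

    Entries missing from the eval config are filled with [] so that
    indexing by variant position never raises IndexError.
    """
    padded = list(condition_cols)
    # range() is empty when the config already has enough entries.
    padded.extend([] for _ in range(n_variants - len(padded)))
    return padded


# One configured entry, three gold variants: the last two fall back to [].
print(pad_condition_cols([[0, 1]], 3))
```

The input list is copied rather than mutated, so the loaded eval config stays as it was on disk.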

Verification

  • Test set: 547 → 524 instances (−23)
  • Gold SQL: 120 → 118 files (−3 + 1)
  • Gold result CSVs: 1,544 → 1,469 (−80 + 5)
  • New sf_bq294.sql verified to score 1 against all 3 upstream gold variants under the existing condition_cols.
  • The 5 new gold result variants were captured by direct execution of correct agent SQL against live Snowflake in late April 2026.
  • Per-instance evidence (reproductions, error messages, before/after diffs) lives in the EmergenceAI/Spider2-E repo.

Out of scope

Deliberately not included:

  • Any changes to resource/, schemas, or supporting docs (verbatim from upstream).
  • Any changes to task instructions, prompts, or the eval-suite scaffolding (CLI surface, threading, output format unchanged).
  • Any new dependencies.

Notes for reviewers

  • Single commit by design — the four categories are forensic findings from one audit pass and are most coherent together. Happy to re-split if you'd prefer per-category commits, or to break this into separate PRs (e.g. (1) share-revocation + non-det gold, (2) data-drift removals, (3) engine-diff variants, (4) evaluate.py patches) if some categories are more controversial.
  • Happy to address evaluate.py policy questions (the rejection rule is opinionated) in a separate thread if that's a barrier to merging the gold-artifact corrections.

Corrects gold artifacts and evaluation behavior for spider2-snow based
on issues observed running the benchmark against live Snowflake in
April 2026. Originated as the EmergenceAI Spider2-E fork; squashed
here for upstreaming.

Test set & gold result CSVs (spider2-snow.jsonl, spider2snow_eval.jsonl,
gold/exec_result/):
- Removed 23 instances whose gold can no longer be reproduced. 3 due to
  Snowflake data shares revoked by their publishers (sf009, sf013,
  sf029). 20 due to live data drift on rolling/refreshed datasets where
  capturing today's output only buys correctness for one cycle.
- Added 5 new gold result CSV variants for instances where Snowflake
  legitimately produces different output than upstream BigQuery-derived
  gold (sf_bq111, sf_bq276, sf_bq430, sf_bq458, sf_local344). Original
  variants are untouched; evaluate.py accepts any matching variant.

Gold SQL (gold/sql/):
- Removed 2 non-deterministic upstream gold SQLs whose instances are
  also removed above (sf012, sf040).
- Replaced sf_bq294's gold SQL: upstream embeds
  EXTRACT(YEAR FROM CURRENT_DATE) which makes age comparisons
  calendar-dependent. New SQL anchors the reference year at literal
  2025 (the year the upstream gold result CSVs were captured) and
  reproduces all 3 upstream gold result variants under the eval
  condition_cols.

evaluation_suite/evaluate.py:
- Reject SQL submissions that bake gold values into CASE lookup tables
  (HARDCODED_CASE_MAP signal) or have no FROM clause (NO_FROM signal).
  --allow_hardcoded restores the prior behavior.
- Treat empty predicted result vs. empty gold result as a match
  instead of failing with "No data found for the specified query."
- Pad missing condition_cols entries with [] when the per-instance
  eval config has fewer entries than gold result variants, instead of
  IndexError.

AyanKBhowmick changed the title from "spider2-snow: gold corrections and evaluation patches" to "[Spider2-E] spider2-snow: gold corrections and evaluation patches" on May 5, 2026