[Spider2-E] spider2-snow: gold corrections and evaluation patches#191
Open
AyanKBhowmick wants to merge 1 commit into
Corrects gold artifacts and evaluation behavior for spider2-snow based on issues observed running the benchmark against live Snowflake in April 2026. Originated as the EmergenceAI Spider2-E fork; squashed here for upstreaming.

Test set & gold result CSVs (spider2-snow.jsonl, spider2snow_eval.jsonl, gold/exec_result/):
- Removed 23 instances whose gold can no longer be reproduced. 3 due to Snowflake data shares revoked by their publishers (sf009, sf013, sf029). 20 due to live data drift on rolling/refreshed datasets where capturing today's output only buys correctness for one cycle.
- Added 5 new gold result CSV variants for instances where Snowflake legitimately produces different output than upstream BigQuery-derived gold (sf_bq111, sf_bq276, sf_bq430, sf_bq458, sf_local344). Original variants are untouched; evaluate.py accepts any matching variant.

Gold SQL (gold/sql/):
- Removed 2 non-deterministic upstream gold SQLs whose instances are also removed above (sf012, sf040).
- Replaced sf_bq294's gold SQL: upstream embeds EXTRACT(YEAR FROM CURRENT_DATE) which makes age comparisons calendar-dependent. New SQL anchors the reference year at literal 2025 (the year the upstream gold result CSVs were captured) and reproduces all 3 upstream gold result variants under the eval condition_cols.

evaluation_suite/evaluate.py:
- Reject SQL submissions that bake gold values into CASE lookup tables (HARDCODED_CASE_MAP signal) or have no FROM clause (NO_FROM signal). --allow_hardcoded restores the prior behavior.
- Treat empty predicted result vs. empty gold result as a match instead of failing with "No data found for the specified query."
- Pad missing condition_cols entries with [] when the per-instance eval config has fewer entries than gold result variants, instead of IndexError.
Summary
This PR upstreams the gold corrections and evaluation patches developed in EmergenceAI/Spider2-E, a community-maintained fork of `spider2-snow`, to bring its gold artifacts and evaluation behavior in line with what is reproducible against live Snowflake as of April 2026.

Every change in this PR has its own per-instance justification in the EmergenceAI/Spider2-E repo. The PR squashes the changes into one commit for upstreaming convenience; happy to re-split into per-category commits or per-category PRs if you'd prefer.
Changes
1. 23 instances removed — gold no longer reproducible
For each removed instance, both the test entry and all gold artifacts (gold SQL if present, all gold result CSV variants, eval-config row) are deleted.
1a. Snowflake data shares revoked (3)
`SHOW DATABASES` returns a tombstone; queries return `003030 (02000): Shared database is no longer available for use.`

| Instance | Database |
|---|---|
| sf009 | NETHERLANDS_OPEN_MAP_DATA |
| sf013 | NETHERLANDS_OPEN_MAP_DATA |
| sf029 | AMAZON_VENDOR_ANALYTICS__SAMPLE_DATASET |

These cannot be re-enabled until the publishers republish the shares.
1b. Live data drift (20)
Prompts are well-formed and the SQL executes, but the underlying table has drifted (rolling windows, refreshed counts, regenerated identifiers, granularity shifts) since the upstream gold was captured. Capturing today's output as a new variant only buys correctness for one cycle — removal is the only stable answer.
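
| Instance | Database | Note |
|---|---|---|
| sf006 | FINANCE__ECONOMICS | |
| sf008 | US_REAL_ESTATE | |
| sf012 | WEATHER__ENVIRONMENT | |
| sf037 | US_REAL_ESTATE | |
| sf040 | US_ADDRESSES__POI | |
| sf_bq009 | GA360 | |
| sf_bq024 | USFS_FIA | |
| sf_bq058 | GOOG_BLOCKCHAIN | |
| sf_bq063 | DEPS_DEV_V1 | |
| sf_bq102 | GNOMAD | |
| sf_bq130 | COVID19_NYT | |
| sf_bq165 | TCGA_MITELMAN | |
| sf_bq190 | THELOOK_ECOMMERCE | |
| sf_bq249 | GITHUB_REPOS | |
| sf_bq256 | CRYPTO | |
| sf_bq275 | GA360 | `fullVisitorId` set changes per capture |
| sf_bq366 | THE_MET | |
| sf_ga014 | GA4 | |
| sf_ga018 | GA4 | |
| sf_ga021 | FIREBASE | |

2. Gold SQL changes (3 removed, 1 added)

| Instance | Change |
|---|---|
| sf_bq294 | Upstream gold uses `EXTRACT(YEAR FROM CURRENT_DATE)`, which is calendar-dependent. Replaced with a deterministic version anchoring the reference year at literal `2025` (matching when the upstream gold result CSVs were captured). Verified to score 1 against all 3 upstream gold variants under the eval `condition_cols`. |
| sf012 | Removed: non-deterministic gold SQL; the instance is also removed (see 1b). |
| sf040 | Removed: non-deterministic gold SQL; the instance is also removed (see 1b). |

Net SQL file count: 120 → 118 (−3 + 1). The sf_bq294 replacement counts as one removal and one addition.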
3. Gold result CSV variants added (5)
For these instances the agent's SQL is structurally correct but Snowflake legitimately produces different output from the upstream gold because of engine differences. Today's Snowflake output is added as an additional accepted variant; the original gold variants are untouched, and `evaluate.py` accepts any matching variant.

| Instance | New variant | Reason |
|---|---|---|
| sf_bq111 | sf_bq111_f.csv | `corr_pvalue` UDF has no Snowflake equivalent |
| sf_bq276 | sf_bq276_d.csv | |
| sf_bq430 | sf_bq430_f.csv | |
| sf_bq458 | sf_bq458_e.csv | `UNNEST` vs `LATERAL FLATTEN` of article-vector arrays drops different rows |
| sf_local344 | sf_local344_c.csv | |

Note: `sf_bq458_e.csv` was captured on the `SNOWFLAKE_LEARNING_WH` warehouse. The default `COMPUTE_WH_PARTICIPANT` warehouse has a 120s statement timeout, which the query exceeds (~258s actual runtime).

4. `evaluate.py` patches
4a. Hardcoded-SQL rejection (default on; opt-out via `--allow_hardcoded`)
Rejects submissions that bake gold values into `CASE` lookup tables, assigning `score = 0` before execution instead of letting them trivially match the expected output.

| Signal | Pattern |
|---|---|
| NO_FROM | `SELECT` with no `FROM` clause (pure literal return) |
| HARDCODED_CASE_MAP | `WHEN 'literal' THEN 'literal'` pairs where THEN values are strings ≥ 20 chars |
| VALUES_CLAUSE | `VALUES (...)` constructor |
| UNION_ALL_LITERALS | `SELECT` blocks joined by `UNION ALL` |

`HARDCODED_CASE_MAP` has zero known false positives across the 547 upstream instances. Earlier draft signals (overlap between SQL literals and gold CSV cells, `IN (...)` filter contents) were dropped because they could not be distinguished from legitimate enum filtering on documented schemas (NAICS, SNOMED, DICOM transfer-syntax UIDs, well-known contract addresses, etc.).

4b. Empty-result handling
Upstream short-circuits any predicted SQL returning zero rows as a failure, even when the gold is also empty. Patched: empty-vs-empty matches; empty-vs-non-empty still fails. No change for non-empty results.
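
A minimal sketch of the empty-result rule, written as a loop over gold variants so that it also shows the "any matching variant" behavior and the `condition_cols` padding described in 4c. This is not the code in `evaluate.py`: the function names, the row representation (lists of tuples), the index-based column selection, and the order-insensitive comparison are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Row = Sequence[object]


def pad_condition_cols(condition_cols: List[List[int]], n_variants: int) -> List[List[int]]:
    """Return one column list per gold variant, padding with [] (compare full row)."""
    missing = max(0, n_variants - len(condition_cols))
    return list(condition_cols) + [[] for _ in range(missing)]


def project(rows: List[Row], cols: List[int]) -> List[Tuple[object, ...]]:
    """Keep only the listed column indexes; an empty list keeps the whole row.

    Rows are sorted by repr so the comparison ignores row order
    (an assumption of this sketch, not a documented evaluate.py rule).
    """
    if cols:
        projected = [tuple(row[c] for c in cols) for row in rows]
    else:
        projected = [tuple(row) for row in rows]
    return sorted(projected, key=repr)


def matches_any_variant(
    pred: List[Row],
    gold_variants: List[List[Row]],
    condition_cols: List[List[int]],
) -> int:
    """Score 1 if the predicted rows match at least one gold variant, else 0."""
    cols_per_variant = pad_condition_cols(condition_cols, len(gold_variants))
    for gold, cols in zip(gold_variants, cols_per_variant):
        if not pred and not gold:
            return 1  # empty vs. empty counts as a match
        if not pred or not gold:
            continue  # empty vs. non-empty never matches this variant
        if project(pred, cols) == project(gold, cols):
            return 1
    return 0
```

With this shape, a submission that returns zero rows scores 1 only when some gold variant is also empty, and a config with a single `condition_cols` entry still works when a second variant is added later.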
4c. `condition_cols` padding
When the per-instance eval config has fewer `condition_cols` entries than gold result variants, missing entries are padded with `[]` (compare full row) instead of crashing with `IndexError`. Defensive default for any future variant additions.

Verification
- `sf_bq294.sql` verified to score 1 against all 3 upstream gold variants under the existing `condition_cols`.
- […] `EmergenceAI/Spider2-E` repo.

Out of scope
Deliberately not included:
- […] `resource/`, schemas, or supporting docs (verbatim from upstream).

Notes for reviewers
- […] `evaluate.py` patches) if some categories are more controversial.
- […] `evaluate.py` policy questions (the rejection rule is opinionated) in a separate thread if that's a barrier to merging the gold-artifact corrections.
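
To make the policy discussion concrete, here is a rough sketch of how the two signals named in the commit message (`NO_FROM` and `HARDCODED_CASE_MAP`) could be detected with regular expressions. It is an illustration only, not the patch itself: `hardcoding_signals` and its parameters are hypothetical names, and the minimum pair count (`min_pairs=3`) is an assumed value. Only the 20-character threshold on THEN values comes from the description above.

```python
import re
from typing import List

# 'literal' with '' as the escaped quote, as in standard SQL strings.
STRING_LIT = re.compile(r"'(?:[^']|'')*'")
# Line and block comments (heuristic: does not handle "--" inside a string).
COMMENT = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)
FROM_KW = re.compile(r"\bFROM\b", re.IGNORECASE)
# WHEN 'literal' THEN 'literal'; group 1 captures the THEN value.
CASE_PAIR = re.compile(
    r"WHEN\s+'(?:[^']|'')*'\s+THEN\s+'((?:[^']|'')*)'", re.IGNORECASE
)


def hardcoding_signals(sql: str, min_pairs: int = 3, min_then_len: int = 20) -> List[str]:
    """Return the names of the hardcoding signals that fire for this SQL text.

    min_pairs is an assumption of this sketch; min_then_len follows the
    ">= 20 chars" rule stated for HARDCODED_CASE_MAP.
    """
    signals: List[str] = []
    no_comments = COMMENT.sub(" ", sql)

    # Blank out string literals so the word "from" inside a string is ignored.
    # Limitation: FROM inside EXTRACT(YEAR FROM ...) still counts as a FROM.
    no_strings = STRING_LIT.sub("''", no_comments)
    if not FROM_KW.search(no_strings):
        signals.append("NO_FROM")

    long_then_values = [
        m.group(1)
        for m in CASE_PAIR.finditer(no_comments)
        if len(m.group(1)) >= min_then_len
    ]
    if len(long_then_values) >= min_pairs:
        signals.append("HARDCODED_CASE_MAP")

    return signals
```

A caller would treat a non-empty return value as `score = 0` unless the opt-out flag is set. Short `CASE` mappings such as `WHEN 'a' THEN 'yes'` do not fire, which is the property that keeps ordinary enum relabeling from being rejected.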