Skip to content

Use EIC codes as deterministic matching key before Duke fuzzy matching #287

@MaykThewessen

Description

@MaykThewessen

Problem

PPM currently routes all cross-source matching through Duke fuzzy matching on 5 fields: NAME, FUELTYPE, COUNTRY, CAPACITY, GEOPOSITION. EIC codes — unique European plant identifiers — are loaded from OPSD and ENTSOE, carried through the pipeline, and aggregated as sets in reduce_matched_dataframe(), but never used in the matching decision itself.

This leads to a class of bugs where co-located plants with similar names but different fuels get incorrectly merged. The canonical example:

Eemshaven harbour, Netherlands

Plant Operator Fuel Capacity EIC OPSD name
Eemshavencentrale RWE Hard Coal + biomass co-firing 2 × 780 MW (unique) Eemshavencentrale EM1/EM2
Eemscentrale ENGIE Natural Gas (peakers) 6 units, ~2200 MW 49W00000000008xG/K/O/S/W, etc. Eemscentrale EC3–EC7, EC20

Duke merges them into one "Eemshaven" entry labelled Hard Coal because:

  • Name similarity: "Eemshavencentrale" ≈ "Eemscentrale" (~0.86 JaroWinkler)
  • Geo distance: ~400m apart on the same peninsula (max 5000m comparator)
  • These two signals score well above the 0.965 threshold, overwhelming the fueltype mismatch (QGram low=0.09, high=0.7 — too weak to block)

The result: 2200 MW of Dutch gas peaker capacity is silently absorbed into a coal plant entry.


Proposed solution: EIC-first matching

The European energy market already solved plant identity — EIC (Energy Identification Code) is a unique identifier assigned to every market participant and generation unit. ENTSOE and OPSD both provide EIC codes for most plants.

Architecture

Before (current):
  All sources ──→ Duke fuzzy match ──→ reduce

After (proposed):
  All sources ──→ Deterministic EIC join ──→ matched set 1 (high confidence)
                  │
                  └→ EIC-unmatched residual ──→ Duke fuzzy match ──→ matched set 2
                  
                  matched set 1 + matched set 2 ──→ reduce

Implementation

Best insertion point: compare_two_datasets() in matching.py (line ~76)

def compare_two_datasets(dfs, dukeargs=None, ...):
    dfs = list(map(read_csv_if_string, dfs))
    
    # ── NEW: Deterministic EIC matching ──────────────────────
    eic_matches = _match_by_eic(dfs[0], dfs[1])
    
    # Remove EIC-matched rows from Duke input
    remaining = [
        dfs[0].drop(eic_matches['idx_0']),
        dfs[1].drop(eic_matches['idx_1']),
    ]
    
    # Duke handles the rest (fuzzy)
    duke_matches = _duke_match(remaining, dukeargs, ...)
    
    # Combine
    return pd.concat([eic_matches, duke_matches])

The _match_by_eic() function would:

  1. Inner join on non-null EIC values
  2. Handle multi-EIC plants (some have multiple EIC codes per unit)
  3. Return matches with a perfect confidence score (1.0)

Why this is better than fueltype blocking

  • Positive identification vs negative filtering — we know what matches, not just what shouldn't
  • No false negatives — fueltype blocking would prevent merging when sources disagree on fuel classification (e.g. "Biomass and biogas" vs "Hard Coal" for co-firing plants)
  • Generalises — works for any co-located plants (same fuel, different owner; industrial CHP next to utility plant; etc.)
  • Uses existing data — EIC is already in target_columns, loaded from ENTSOE (data.py:782), OPSD EU (Eic_Code), OPSD DE (Eic_Code_Plant), and GEM (WEIC)

Current EIC availability

Source EIC field Coverage
ENTSOE projectID (= EIC) ~100% for EU grid-connected plants
OPSD EU eic_code ~70% of entries
OPSD DE eic_code_plant ~80% of DE entries
GEM WEIC partial
GEO, GPD, others none

Even with partial coverage, EIC-first matching would deterministically resolve the majority of cross-source matches, leaving Duke to handle only the residual.


Impact

  • Eliminates co-located plant confusion (Eemshaven class of bugs) across all countries
  • Enables reliable cross-source data enrichment (e.g. OPSD efficiency mapped to the correct ENTSOE plant)
  • Reduces Duke's workload (fewer comparisons → faster matching)
  • Makes the matching pipeline auditable — EIC matches can be verified independently

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions