Problem
PPM currently routes all cross-source matching through Duke fuzzy matching on 5 fields: NAME, FUELTYPE, COUNTRY, CAPACITY, GEOPOSITION. EIC codes — unique European plant identifiers — are loaded from OPSD and ENTSOE, carried through the pipeline, and aggregated as sets in reduce_matched_dataframe(), but never used in the matching decision itself.
This leads to a class of bugs where co-located plants with similar names but different fuels get incorrectly merged. The canonical example:
Eemshaven harbour, Netherlands
| Plant |
Operator |
Fuel |
Capacity |
EIC |
OPSD name |
| Eemshavencentrale |
RWE |
Hard Coal + biomass co-firing |
2 × 780 MW |
(unique) |
Eemshavencentrale EM1/EM2 |
| Eemscentrale |
ENGIE |
Natural Gas (peakers) |
6 units, ~2200 MW |
49W00000000008xG/K/O/S/W, etc. |
Eemscentrale EC3–EC7, EC20 |
Duke merges them into one "Eemshaven" entry labelled Hard Coal because:
- Name similarity: "Eemshavencentrale" ≈ "Eemscentrale" (~0.86 JaroWinkler)
- Geo distance: ~400m apart on the same peninsula (max 5000m comparator)
- These two signals score well above the 0.965 threshold, overwhelming the fueltype mismatch (QGram low=0.09, high=0.7 — too weak to block)
The result: 2200 MW of Dutch gas peaker capacity is silently absorbed into a coal plant entry.
Proposed solution: EIC-first matching
The European energy market already solved plant identity — EIC (Energy Identification Code) is a unique identifier assigned to every market participant and generation unit. ENTSOE and OPSD both provide EIC codes for most plants.
Architecture
Before (current):
All sources ──→ Duke fuzzy match ──→ reduce
After (proposed):
All sources ──→ Deterministic EIC join ──→ matched set 1 (high confidence)
│
└→ EIC-unmatched residual ──→ Duke fuzzy match ──→ matched set 2
matched set 1 + matched set 2 ──→ reduce
Implementation
Best insertion point: compare_two_datasets() in matching.py (line ~76)
def compare_two_datasets(dfs, dukeargs=None, ...):
dfs = list(map(read_csv_if_string, dfs))
# ── NEW: Deterministic EIC matching ──────────────────────
eic_matches = _match_by_eic(dfs[0], dfs[1])
# Remove EIC-matched rows from Duke input
remaining = [
dfs[0].drop(eic_matches['idx_0']),
dfs[1].drop(eic_matches['idx_1']),
]
# Duke handles the rest (fuzzy)
duke_matches = _duke_match(remaining, dukeargs, ...)
# Combine
return pd.concat([eic_matches, duke_matches])
The _match_by_eic() function would:
- Inner join on non-null EIC values
- Handle multi-EIC plants (some have multiple EIC codes per unit)
- Return matches with a perfect confidence score (1.0)
Why this is better than fueltype blocking
- Positive identification vs negative filtering — we know what matches, not just what shouldn't
- No false negatives — fueltype blocking would prevent merging when sources disagree on fuel classification (e.g. "Biomass and biogas" vs "Hard Coal" for co-firing plants)
- Generalises — works for any co-located plants (same fuel, different owner; industrial CHP next to utility plant; etc.)
- Uses existing data — EIC is already in
target_columns, loaded from ENTSOE (data.py:782), OPSD EU (Eic_Code), OPSD DE (Eic_Code_Plant), and GEM (WEIC)
Current EIC availability
| Source |
EIC field |
Coverage |
| ENTSOE |
projectID (= EIC) |
~100% for EU grid-connected plants |
| OPSD EU |
eic_code |
~70% of entries |
| OPSD DE |
eic_code_plant |
~80% of DE entries |
| GEM |
WEIC |
partial |
| GEO, GPD, others |
— |
none |
Even with partial coverage, EIC-first matching would deterministically resolve the majority of cross-source matches, leaving Duke to handle only the residual.
Impact
- Eliminates co-located plant confusion (Eemshaven class of bugs) across all countries
- Enables reliable cross-source data enrichment (e.g. OPSD efficiency mapped to the correct ENTSOE plant)
- Reduces Duke's workload (fewer comparisons → faster matching)
- Makes the matching pipeline auditable — EIC matches can be verified independently
Related
Problem
PPM currently routes all cross-source matching through Duke fuzzy matching on 5 fields:
NAME,FUELTYPE,COUNTRY,CAPACITY,GEOPOSITION. EIC codes — unique European plant identifiers — are loaded from OPSD and ENTSOE, carried through the pipeline, and aggregated as sets inreduce_matched_dataframe(), but never used in the matching decision itself.This leads to a class of bugs where co-located plants with similar names but different fuels get incorrectly merged. The canonical example:
Eemshaven harbour, Netherlands
Duke merges them into one "Eemshaven" entry labelled Hard Coal because:
The result: 2200 MW of Dutch gas peaker capacity is silently absorbed into a coal plant entry.
Proposed solution: EIC-first matching
The European energy market already solved plant identity — EIC (Energy Identification Code) is a unique identifier assigned to every market participant and generation unit. ENTSOE and OPSD both provide EIC codes for most plants.
Architecture
Implementation
Best insertion point:
compare_two_datasets()inmatching.py(line ~76)The
_match_by_eic()function would:Why this is better than fueltype blocking
target_columns, loaded from ENTSOE (data.py:782), OPSD EU (Eic_Code), OPSD DE (Eic_Code_Plant), and GEM (WEIC)Current EIC availability
projectID(= EIC)eic_codeeic_code_plantWEICEven with partial coverage, EIC-first matching would deterministically resolve the majority of cross-source matches, leaving Duke to handle only the residual.
Impact
Related