Use EIC codes as deterministic matching key before Duke fuzzy matching

## Problem

PPM currently routes **all** cross-source matching through Duke fuzzy matching on 5 fields: `NAME`, `FUELTYPE`, `COUNTRY`, `CAPACITY`, `GEOPOSITION`. EIC codes — unique European plant identifiers — are loaded from OPSD and ENTSOE, carried through the pipeline, and aggregated as sets in `reduce_matched_dataframe()`, but **never used in the matching decision itself**.

This leads to a class of bugs where co-located plants with similar names but different fuels get incorrectly merged. The canonical example:

### Eemshaven harbour, Netherlands

| Plant | Operator | Fuel | Capacity | EIC | OPSD name |
|---|---|---|---|---|---|
| **Eemshavencentrale** | RWE | Hard Coal + biomass co-firing | 2 × 780 MW | _(unique)_ | Eemshavencentrale EM1/EM2 |
| **Eemscentrale** | ENGIE | Natural Gas (peakers) | 6 units, ~2200 MW | 49W00000000008xG/K/O/S/W, etc. | Eemscentrale EC3–EC7, EC20 |

Duke merges them into one "Eemshaven" entry labelled Hard Coal because:
- Name similarity: "Eemshavencentrale" ≈ "Eemscentrale" (~0.86 JaroWinkler)
- Geo distance: ~400m apart on the same peninsula (max 5000m comparator)
- These two signals score well above the 0.965 threshold, overwhelming the fueltype mismatch (QGram low=0.09, high=0.7 — too weak to block)

The result: 2200 MW of Dutch gas peaker capacity is silently absorbed into a coal plant entry.

---

## Proposed solution: EIC-first matching

The European energy market already solved plant identity — EIC (Energy Identification Code) is a unique identifier assigned to every market participant and generation unit. ENTSOE and OPSD both provide EIC codes for most plants.

### Architecture

```
Before (current):
  All sources ──→ Duke fuzzy match ──→ reduce

After (proposed):
  All sources ──→ Deterministic EIC join ──→ matched set 1 (high confidence)
                  │
                  └→ EIC-unmatched residual ──→ Duke fuzzy match ──→ matched set 2
                  
                  matched set 1 + matched set 2 ──→ reduce
```

### Implementation

**Best insertion point: `compare_two_datasets()` in `matching.py` (line ~76)**

```python
def compare_two_datasets(dfs, dukeargs=None, ...):
    dfs = list(map(read_csv_if_string, dfs))
    
    # ── NEW: Deterministic EIC matching ──────────────────────
    eic_matches = _match_by_eic(dfs[0], dfs[1])
    
    # Remove EIC-matched rows from Duke input
    remaining = [
        dfs[0].drop(eic_matches['idx_0']),
        dfs[1].drop(eic_matches['idx_1']),
    ]
    
    # Duke handles the rest (fuzzy)
    duke_matches = _duke_match(remaining, dukeargs, ...)
    
    # Combine
    return pd.concat([eic_matches, duke_matches])
```

The `_match_by_eic()` function would:
1. Inner join on non-null EIC values
2. Handle multi-EIC plants (some have multiple EIC codes per unit)
3. Return matches with a perfect confidence score (1.0)

### Why this is better than fueltype blocking

- **Positive identification** vs negative filtering — we know *what matches*, not just *what shouldn't*
- **No false negatives** — fueltype blocking would prevent merging when sources disagree on fuel classification (e.g. "Biomass and biogas" vs "Hard Coal" for co-firing plants)
- **Generalises** — works for any co-located plants (same fuel, different owner; industrial CHP next to utility plant; etc.)
- **Uses existing data** — EIC is already in `target_columns`, loaded from ENTSOE (`data.py:782`), OPSD EU (`Eic_Code`), OPSD DE (`Eic_Code_Plant`), and GEM (`WEIC`)

### Current EIC availability

| Source | EIC field | Coverage |
|---|---|---|
| ENTSOE | `projectID` (= EIC) | ~100% for EU grid-connected plants |
| OPSD EU | `eic_code` | ~70% of entries |
| OPSD DE | `eic_code_plant` | ~80% of DE entries |
| GEM | `WEIC` | partial |
| GEO, GPD, others | — | none |

Even with partial coverage, EIC-first matching would deterministically resolve the majority of cross-source matches, leaving Duke to handle only the residual.

---

## Impact

- Eliminates co-located plant confusion (Eemshaven class of bugs) across all countries
- Enables reliable cross-source data enrichment (e.g. OPSD efficiency mapped to the correct ENTSOE plant)
- Reduces Duke's workload (fewer comparisons → faster matching)
- Makes the matching pipeline auditable — EIC matches can be verified independently

## Related

- #286 — OPSD data quality (efficiency column, biomass co-firing): improves the data that flows into matching


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use EIC codes as deterministic matching key before Duke fuzzy matching #287

Problem

Eemshaven harbour, Netherlands

Proposed solution: EIC-first matching

Architecture

Implementation

Why this is better than fueltype blocking

Current EIC availability

Impact

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Plant	Operator	Fuel	Capacity	EIC	OPSD name
Eemshavencentrale	RWE	Hard Coal + biomass co-firing	2 × 780 MW	(unique)	Eemshavencentrale EM1/EM2
Eemscentrale	ENGIE	Natural Gas (peakers)	6 units, ~2200 MW	49W00000000008xG/K/O/S/W, etc.	Eemscentrale EC3–EC7, EC20

Source	EIC field	Coverage
ENTSOE	`projectID` (= EIC)	~100% for EU grid-connected plants
OPSD EU	`eic_code`	~70% of entries
OPSD DE	`eic_code_plant`	~80% of DE entries
GEM	`WEIC`	partial
GEO, GPD, others	—	none

Use EIC codes as deterministic matching key before Duke fuzzy matching #287

Description

Problem

Eemshaven harbour, Netherlands

Proposed solution: EIC-first matching

Architecture

Implementation

Why this is better than fueltype blocking

Current EIC availability

Impact

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions