Skip to content

[Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix data_subset_refuter bug with categorical columns#1408

Open
github-actions[bot] wants to merge 1 commit intomainfrom
repo-assist/fix-issue-1372-encoding-index-37b1ef561bdd1454
Open

[Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix data_subset_refuter bug with categorical columns#1408
github-actions[bot] wants to merge 1 commit intomainfrom
repo-assist/fix-issue-1372-encoding-index-37b1ef561bdd1454

Conversation

@github-actions
Copy link
Contributor

🤖 This is an automated pull request from Repo Assist.

Closes #1372

Root Cause

one_hot_encode() in dowhy/utils/encoding.py unconditionally called reset_index(drop=True) on both the "columns to keep" slice and the freshly-created encoded DataFrame. This reset every DataFrame's index to 0, 1, 2, … regardless of the original row indices.

DistanceMatchingEstimator.fit() stores the encoded common-causes DataFrame in self._observed_common_causes. When DataSubsetRefuter samples a subset of the original data, the subsetted data retains its original (non-sequential) index (e.g. [5, 10, 15, …]), while self._observed_common_causes always has [0, 1, 2, …].

In estimate_effect(), these two DataFrames are joined with pd.concat(..., axis=1), which aligns on index — causing silent NaN-filling. The immediately following boolean .loc indexing then raises:

pandas.errors.IndexingError: Unalignable boolean Series provided as indexer

Fix

  • Remove reset_index(drop=True) from the "columns to keep" slice (it already carries the original index).
  • Pass index=data_to_encode.index when constructing the encoded pd.DataFrame from the numpy array, so both parts of the final pd.concat share the original index.

This is a one-line effective change — no behavioural difference for callers that already had a sequential 0, 1, 2, … index.

Trade-offs

None identified. All callers downstream use .values for numerical operations, so preserving the index does not affect those code paths.

Test Status

  • Added test_one_hot_encode_preserves_index and test_one_hot_encode_preserves_index_no_categorical to tests/utils/test_encoding.py.
  • All 4 encoding unit tests pass (tests/utils/test_encoding.py).
  • All 15 estimator-consistency tests pass (tests/causal_estimators/test_estimator_consistency.py).

Generated by Repo Assist ·

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@b897c2f3e43bde9ff7923c8fa9211055b26e27cc

…futer with categorical columns

When one_hot_encode() reset the index to 0,1,2,..., the encoded
common-causes DataFrame stored in DistanceMatchingEstimator lost the
original row indices. If the estimator was called with a subset of the
original data (as done by DataSubsetRefuter), the pd.concat inside
estimate_effect() would silently misalign rows and the subsequent
boolean .loc indexing would raise an IndexingError.

Fix: preserve the original DataFrame index throughout one_hot_encode
by removing the reset_index(drop=True) calls and passing
index=data_to_encode.index when constructing df_encoded.

Adds two regression tests to tests/utils/test_encoding.py.

Closes #1372

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@emrekiciman
Copy link
Member

This needs some attention. The existing code in encoding.py was written specifically to fix another bug, where I think one-hot encoding required columns to maintain a specific order. #1112

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

data subset refuter bug when dataframe has categorical columns

1 participant