[Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix data_subset_refuter bug with categorical columns by github-actions[bot] · Pull Request #1408 · py-why/dowhy

github-actions · 2026-03-20T17:26:59Z

🤖 This is an automated pull request from Repo Assist.

Root Cause

one_hot_encode() in dowhy/utils/encoding.py unconditionally called reset_index(drop=True) on both the "columns to keep" slice and the freshly-created encoded DataFrame. This reset every DataFrame's index to 0, 1, 2, … regardless of the original row indices.

DistanceMatchingEstimator.fit() stores the encoded common-causes DataFrame in self._observed_common_causes. When DataSubsetRefuter samples a subset of the original data, the subsetted data retains its original (non-sequential) index (e.g. [5, 10, 15, …]), while self._observed_common_causes always has [0, 1, 2, …].

In estimate_effect(), these two DataFrames are joined with pd.concat(..., axis=1), which aligns on index — causing silent NaN-filling. The immediately following boolean .loc indexing then raises:

pandas.errors.IndexingError: Unalignable boolean Series provided as indexer

Fix

Remove reset_index(drop=True) from the "columns to keep" slice (it already carries the original index).
Pass index=data_to_encode.index when constructing the encoded pd.DataFrame from the numpy array, so both parts of the final pd.concat share the original index.

This is a one-line effective change — no behavioural difference for callers that already had a sequential 0, 1, 2, … index.

Trade-offs

None identified. All callers downstream use .values for numerical operations, so preserving the index does not affect those code paths.

Test Status

Added test_one_hot_encode_preserves_index and test_one_hot_encode_preserves_index_no_categorical to tests/utils/test_encoding.py.
All 4 encoding unit tests pass (tests/utils/test_encoding.py).
All 15 estimator-consistency tests pass (tests/causal_estimators/test_estimator_consistency.py).

Generated by Repo Assist · ◷

To install this agentic workflow, run
gh aw add githubnext/agentics/workflows/repo-assist.md@b897c2f3e43bde9ff7923c8fa9211055b26e27cc

…futer with categorical columns When one_hot_encode() reset the index to 0,1,2,..., the encoded common-causes DataFrame stored in DistanceMatchingEstimator lost the original row indices. If the estimator was called with a subset of the original data (as done by DataSubsetRefuter), the pd.concat inside estimate_effect() would silently misalign rows and the subsequent boolean .loc indexing would raise an IndexingError. Fix: preserve the original DataFrame index throughout one_hot_encode by removing the reset_index(drop=True) calls and passing index=data_to_encode.index when constructing df_encoded. Adds two regression tests to tests/utils/test_encoding.py. Closes #1372 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

emrekiciman · 2026-03-25T04:55:30Z

This needs some attention. The existing code in encoding.py was written specifically to fix another bug, where I think one-hot encoding required columns to maintain a specific order. #1112

github-actions bot added automation repo-assist labels Mar 20, 2026

This was referenced Mar 20, 2026

data subset refuter bug when dataframe has categorical columns #1372

Open

[Repo Assist] Monthly Activity 2026-03 #1403

Open

emrekiciman marked this pull request as ready for review March 25, 2026 04:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix data_subset_refuter bug with categorical columns#1408

[Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix data_subset_refuter bug with categorical columns#1408
github-actions[bot] wants to merge 1 commit intomainfrom
repo-assist/fix-issue-1372-encoding-index-37b1ef561bdd1454

github-actions bot commented Mar 20, 2026

Uh oh!

emrekiciman commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

github-actions bot commented Mar 20, 2026

Root Cause

Fix

Trade-offs

Test Status

Uh oh!

emrekiciman commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant