data subset refuter bug when dataframe has categorical columns #1372

@daludsblock

Describe the bug
The following error occurs when the distance matching estimator is used with the data subset refuter and the dataframe has a categorical column. It is caused by concatenating a reindexed dataframe with a dataframe that has not been reindexed, at the line here. The dataframe self._observed_common_causes is encoded in the script, which resets the index of the encoded dataframe. The bug appears only when the data subset refuter is used: the refuter samples the original dataframe, so the sampled rows keep their original index values while the encoded dataframe has a reset index, and the two no longer match. The distance matching estimator still works when no sampling is applied, because the index values are then the same for the original and encoded dataframes.

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
[Trace ID: 00-b9354c2840feea7fea571bd8e74bcf5e-52edfa00b0c57816-00]
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/databricks/python/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/dowhy/causal_refuters/data_subset_refuter.py", line 82, in _refute_once
    new_effect = new_estimator.estimate_effect(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/dowhy/causal_estimators/distance_matching_estimator.py", line 178, in estimate_effect
    treated = updated_df.loc[data[self._target_estimand.treatment_variable[0]] == 1]
              ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/pandas/core/indexing.py", line 1413, in _getitem_axis
    return self._getbool_axis(key, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/pandas/core/indexing.py", line 1209, in _getbool_axis
    key = check_bool_indexer(labels, key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/pandas/core/indexing.py", line 2662, in check_bool_indexer
    raise IndexingError(
pandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
"""
The encoder resets the index if the input dataframe has a categorical column.
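
The mechanism can be illustrated with pandas alone. This is only a sketch under assumptions: the column names and values are made up, `pd.get_dummies(...).reset_index(drop=True)` stands in for the DoWhy encoder, and a fixed `iloc` selection stands in for the refuter's random sampling so the result is deterministic.

```python
import pandas as pd

# Hypothetical data; the column names are illustrative, not taken from DoWhy.
df = pd.DataFrame({
    "treatment": [True, False, True, True, False, True],
    "city": pd.Categorical(["a", "b", "a", "c", "b", "c"]),
    "age": [30, 41, 25, 37, 52, 46],
})

# Stand-in for the refuter's sampling: the subset keeps the labels 1, 3, 5.
subset = df.iloc[[1, 3, 5]]

# Stand-in for the encoder: one-hot encode, then reset the index to 0, 1, 2.
encoded = pd.get_dummies(subset[["city"]]).reset_index(drop=True)

# Column-wise concat aligns on index labels, so the result carries the
# union of both indexes (0, 1, 2, 3, 5) and is padded with NaN.
updated = pd.concat([subset[["age"]], encoded], axis=1)

# The boolean mask is still indexed by 1, 3, 5 and cannot be aligned.
mask = subset["treatment"] == 1
try:
    updated.loc[mask]
    raised = False
except pd.errors.IndexingError:
    raised = True

print(raised)
```

If no rows are dropped, the original index is already 0..n-1, the reset is a no-op, and the same `.loc` call succeeds, which matches the observation that the estimator works without sampling.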

Steps to reproduce the behavior
Use a dataframe with a categorical column, estimate the effect with the distance matching estimator, and then run the data subset refuter.
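
A possible reproduction script is sketched below. It is not the reporter's code: the synthetic data, the column names, the `make_data` and `run_repro` helpers, and the method parameters are all assumptions chosen for illustration. It uses the public `CausalModel` workflow (`identify_effect`, `estimate_effect`, `refute_estimate`) and imports DoWhy lazily so the data builder can be inspected without it.

```python
import numpy as np
import pandas as pd


def make_data(n: int = 200, seed: int = 0) -> pd.DataFrame:
    """Build a synthetic dataframe with one categorical common cause (hypothetical)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "treatment": rng.integers(0, 2, size=n).astype(bool),  # binary treatment
        "age": rng.normal(40, 10, size=n),
        "region": pd.Categorical(rng.choice(["north", "south", "east"], size=n)),
        "outcome": rng.normal(size=n),
    })


def run_repro(df: pd.DataFrame):
    """Run distance matching, then the data subset refuter (hypothetical helper)."""
    from dowhy import CausalModel  # imported lazily; requires DoWhy to be installed

    model = CausalModel(
        data=df,
        treatment="treatment",
        outcome="outcome",
        common_causes=["age", "region"],
    )
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    estimate = model.estimate_effect(
        estimand,
        method_name="backdoor.distance_matching",
        target_units="att",
        method_params={"distance_metric": "minkowski", "p": 2},
    )
    # Assumption: on the affected version, the IndexingError surfaces here,
    # because the refuter re-estimates the effect on a sampled subset.
    return model.refute_estimate(
        estimand,
        estimate,
        method_name="data_subset_refuter",
        subset_fraction=0.8,
    )


# To try it with DoWhy installed:
# print(run_repro(make_data()))
```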

Expected behavior
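The indices of the original and encoded dataframes should align, so that the refuter runs without raising an IndexingError.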
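
One way the alignment could be restored is to re-attach the original index labels to the encoded frame before concatenating. The sketch below shows the idea in pandas only; `encode_keep_index` is a hypothetical helper, `pd.get_dummies(...).reset_index(drop=True)` again stands in for the DoWhy encoder, and none of this is the maintainers' fix.

```python
import pandas as pd


def encode_keep_index(data: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode `columns` and keep `data`'s index."""
    # Stand-in for an encoder that returns a frame with a reset index.
    encoded = pd.get_dummies(data[columns]).reset_index(drop=True)
    # Rows are in the same order as `data`, so the labels can be copied back.
    encoded.index = data.index
    return encoded


# Hypothetical data, with a fixed selection standing in for random sampling.
df = pd.DataFrame({
    "treatment": [True, False, True, True, False, True],
    "city": pd.Categorical(["a", "b", "a", "c", "b", "c"]),
    "age": [30, 41, 25, 37, 52, 46],
})
subset = df.iloc[[1, 3, 5]]

encoded = encode_keep_index(subset, ["city"])
updated = pd.concat([subset[["age"]], encoded], axis=1)

# The mask and the concatenated frame now share the labels 1, 3, 5.
treated = updated.loc[subset["treatment"] == 1]
print(treated.index.tolist())
```

Until a fix lands, calling `reset_index(drop=True)` on the input dataframe would not be enough on its own, since the refuter samples internally and the sampled subset would again carry non-contiguous labels.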

Version information:

  • DoWhy version 0.14

Labels: bug (Something isn't working)