fix(automl): fix data leakage in classification holdout validation #1418
commint-tian wants to merge 2 commits into microsoft:main
Conversation

- Stop duplicating X_first in both train/val sets
- Implement dynamic class balancing
- Preserve original dataset size
@microsoft-github-policy-service agree
thinkall left a comment:
Thank you @commint-tian for the PR. Please check my comments below.
    if len(first) < len(y_train_all) / 2:
        # Get X_rest and y_rest with drop; sparse matrix can't apply np.delete
        X_rest = (
            np.delete(X_train_all, first, axis=0)
            if isinstance(X_train_all, np.ndarray)
            else X_train_all.drop(first.tolist())
            if data_is_df
            else X_train_all[rest]
        )
        y_rest = (
            np.delete(y_train_all, first, axis=0)
            if isinstance(y_train_all, np.ndarray)
            else y_train_all.drop(first.tolist())
            if data_is_df
            else y_train_all[rest]
        )
    else:
        X_rest = (
            iloc_pandas_on_spark(X_train_all, rest)
            if is_spark_dataframe
            else X_train_all.iloc[rest]
            if data_is_df
            else X_train_all[rest]
        )
        y_rest = (
            iloc_pandas_on_spark(y_train_all, rest)
            if is_spark_dataframe
            else y_train_all.iloc[rest]
            if data_is_df
            else y_train_all[rest]
        )
thinkall: Why remove the second way of getting X_rest?
commint-tian: Oh, I made a mistake about it. The second part should be kept here.
    train_labels = np.unique(y_train)
    val_labels = np.unique(y_val)
thinkall: np.unique doesn't work for psSeries or psDataFrame. Check an example here.
commint-tian: It can be modified like this:

    if isinstance(y_train, (ps.Series, ps.DataFrame)):
        train_labels = y_train.unique() if isinstance(y_train, ps.Series) else y_train.iloc[:, 0].unique()
    else:
        train_labels = np.unique(y_train)
    if isinstance(y_val, (ps.Series, ps.DataFrame)):
        val_labels = y_val.unique() if isinstance(y_val, ps.Series) else y_val.iloc[:, 0].unique()
    else:
        val_labels = np.unique(y_val)
thinkall: Try reusing existing functions.
commint-tian: This is a pretty big project and I'm not very familiar with it yet, so I'm not sure whether there are existing functions I can reuse.
thinkall: Check an example here. You can reuse the function len_labels.
commint-tian: Yes, I noticed this function after seeing your correction; I hadn't paid attention to it before. Thank you!
commint-tian: This is the modified code:

    train_labels = len_labels(y_train)
    val_labels = len_labels(y_val)
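For illustration, the kind of type dispatch such a helper has to perform can be sketched with plain NumPy/pandas. The `unique_labels` function below is hypothetical, written only to show the idea; FLAML's real `len_labels` (referenced above) is the function actually reused in the PR:

```python
import numpy as np
import pandas as pd

def unique_labels(y):
    # Hypothetical helper for illustration only: return the distinct
    # labels of y whether it is an ndarray, a Series, or a one-column
    # DataFrame, sidestepping np.unique's limits on non-array inputs.
    if isinstance(y, pd.DataFrame):
        y = y.iloc[:, 0]
    if isinstance(y, pd.Series):
        return np.asarray(y.unique())
    return np.unique(y)

labels_np = unique_labels(np.array([2, 0, 1, 0]))
labels_pd = unique_labels(pd.Series(["b", "a", "b"]))
print(sorted(labels_np.tolist()))  # [0, 1, 2]
print(sorted(labels_pd.tolist()))  # ['a', 'b']
```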
    mask = (y_rest == label)
    X_train = concat(X_rest[mask], X_train)
    y_train = concat(y_rest[mask], y_train)
thinkall: X_rest[mask] may not work for a dataframe.
commint-tian: Perhaps this could work:

    mask = (y_rest == label)
    filtered_X_rest = X_rest.filter(mask)
    filtered_y_rest = y_rest.filter(mask)
    X_train = X_train.union(filtered_X_rest)
    y_train = y_train.union(filtered_y_rest)
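With plain pandas objects, boolean-mask indexing plus `concat` covers the same need of moving every row of a given label from the "rest" pool into the training split. A minimal sketch with toy data (illustrative only; the `filter`/`union` style above targets Spark-backed data):

```python
import pandas as pd

X_rest = pd.DataFrame({"f": [1.0, 2.0, 3.0]})
y_rest = pd.Series([0, 1, 0])
X_train = pd.DataFrame({"f": [4.0]})
y_train = pd.Series([1])

label = 0
mask = y_rest == label  # boolean Series selecting rows of the missing class

# pandas accepts a boolean mask directly; concat merges the splits.
X_train = pd.concat([X_rest[mask], X_train], ignore_index=True)
y_train = pd.concat([y_rest[mask], y_train], ignore_index=True)

print(X_train["f"].tolist())  # [1.0, 3.0, 4.0]
print(y_train.tolist())       # [0, 0, 1]
```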
    if missing_in_val:
        X_val = concat(X_first, X_val)
        y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
    # The training set supplements the missing categories with the remaining data.
    if missing_in_train:
thinkall: What if missing_in_val has only 1 missing value?
commint-tian: Indeed, the code can be optimized:

    if missing_in_val:
        if len(label_set) == 1:
            X_val = concat(X_first.iloc[[0]], X_val)
            y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
        else:
            X_val = concat(X_first, X_val)
            y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
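The `iloc[[0]]` selector above matters because a list index preserves the DataFrame shape, so the single supplemented row still concatenates cleanly with the validation frame. A toy sketch (illustrative names):

```python
import pandas as pd

X_first = pd.DataFrame({"f": [10.0, 20.0]})

one_row_df = X_first.iloc[[0]]    # list selector -> 1-row DataFrame
one_row_series = X_first.iloc[0]  # scalar selector -> Series (wrong shape for concat)

print(type(one_row_df).__name__)      # DataFrame
print(type(one_row_series).__name__)  # Series
print(one_row_df.shape)               # (1, 1)
```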
thinkall: A better way is to fill the missing labels in both train and val with first. For those not missing in either train or val, put them in train.
commint-tian:

    if missing_in_val:
        if data_is_df:
            X_val = concat([X_first, X_val])
            y_val = concat([label_set, y_val])
        else:
            X_val = np.concatenate([X_first, X_val], axis=0)
            y_val = np.concatenate([label_set, y_val])
    if missing_in_train:
        if data_is_df:
            X_train = concat([X_first, X_train])
            y_train = concat([label_set, y_train])
        else:
            X_train = np.concatenate([X_first, X_train], axis=0)
            y_train = np.concatenate([label_set, y_train])
    common_labels = set(train_labels) & set(val_labels)
    only_train_labels = set(train_labels) - set(val_labels)
    only_val_labels = set(val_labels) - set(train_labels)
    # Put the labels that are missing from train or val into train
    for label in only_val_labels:
        # Remove these labels from val
        mask = y_val != label
        X_val = X_val[mask]
        y_val = y_val[mask]
        # Add these labels to train
        if data_is_df:
            X_train = concat([X_train, X_first.loc[X_first[label_col] == label]])
            y_train = concat([y_train, label_set.loc[label_set[label_col] == label]])
        else:
            X_train = np.concatenate([X_train, X_first[y_first == label]], axis=0)
            y_train = np.concatenate([y_train, label_set[y_first == label]])
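The set arithmetic that drives the label bookkeeping above can be sanity-checked in isolation with toy label lists (illustrative values only):

```python
# Toy label sets: label 0 appears only in train, label 3 only in val.
train_labels = [0, 1, 2]
val_labels = [1, 2, 3]

common_labels = set(train_labels) & set(val_labels)      # seen in both splits
only_train_labels = set(train_labels) - set(val_labels)  # missing from val
only_val_labels = set(val_labels) - set(train_labels)    # missing from train

print(sorted(common_labels))      # [1, 2]
print(sorted(only_train_labels))  # [0]
print(sorted(only_val_labels))    # [3]
```

`only_val_labels` is exactly the set the loop iterates over: those rows are moved out of val and into train so every class is represented in training.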
thinkall: Hi @commint-tian, did you get a chance to revise the PR? Thanks.
commint-tian: I've finished revising the code; please check it again.
thinkall: Hi @commint-tian, have you pushed your changes to GitHub? I don't see them yet.
Why are these changes needed?

In generic_task.py, the unconditional merging of X_first into both the training and validation sets leads to data duplication: the total row count len(X_train) + len(X_val) exceeds the original dataset size. More critically, this introduces data leakage: since part of the validation data (X_first) also appears in the training set, the evaluation results may become unreliable.

To resolve this, we can dynamically check for missing classes using missing_in_train and missing_in_val to identify gaps in the training/validation sets, then selectively supplement data:
- The validation set is supplemented from X_first (ensuring validation set class coverage takes priority).
- The training set is supplemented from X_rest (avoiding reuse of X_first).

Related issue number

issue #1390
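The invariant this PR restores can be stated as a property check: after supplementing missing classes, the train and validation indices should partition the original rows, so their sizes sum to the original count and no row appears in both. A hypothetical sketch (not FLAML's actual split code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
indices = rng.permutation(n)

# A disjoint holdout split: no row is shared between train and validation.
train_idx, val_idx = indices[:80], indices[80:]

overlap = set(train_idx) & set(val_idx)
print(len(overlap))                        # 0
print(len(train_idx) + len(val_idx) == n)  # True
```

The pre-fix behavior violated both properties: duplicating X_first made the sizes sum to more than n and put identical rows in both splits, which is precisely the leakage described above.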