[Feature] distributed gauc #127

Merged
tiankongdeguiji merged 14 commits into alibaba:master from eric-gecheng:feature/dist_gauc
Mar 11, 2025

Conversation

@eric-gecheng
Collaborator

distributed calculation of gauc metrics

data_list_reduce = []
for data in data_list:
    # keep only the samples whose user key is assigned to this rank
    key_mask = data[2, :] % world_size == local_rank
    pred_selected = torch.masked_select(data[0, :], key_mask)
Collaborator

When both rank 0 and rank 1 have samples from the same user, could this cause that user's samples on one of the ranks to be filtered out?

Collaborator Author

No, it won't. During the reduce, GPU process_i pulls the qualifying samples from all GPUs and merges them onto gpu_i, so the per-GPU sample counts add up to the total number of evaluation samples. Adding a print at line 78,

print(f"preds.shape = {preds.shape}")

confirms this.
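
For context, here is a minimal sketch of the redistribution described above. It assumes each rank holds a (3, N) tensor whose rows are preds, labels and user keys; the helper name redistribute_by_user and the use of all_gather_object are illustrative assumptions, not the PR's exact implementation.

import torch
import torch.distributed as dist


def redistribute_by_user(
    local_data: torch.Tensor, world_size: int, local_rank: int
) -> torch.Tensor:
    """local_data: (3, N) tensor of [preds; labels; user keys] on this rank (sketch only)."""
    # Pull every rank's shard. Shard sizes differ across ranks, so gather
    # them as pickled objects rather than fixed-size tensors.
    data_list = [None] * world_size
    dist.all_gather_object(data_list, local_data.cpu())

    # Keep, from every shard, only the users assigned to this rank
    # (user_key % world_size == local_rank).
    data_list_reduce = []
    for data in data_list:
        key_mask = data[2, :] % world_size == local_rank
        data_list_reduce.append(data[:, key_mask])

    # Summed over ranks, these merged parts cover every evaluation sample exactly once,
    # and all samples of a given user land on a single rank.
    return torch.cat(data_list_reduce, dim=1)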

tzrec/main.py Outdated
# gather each rank's metric value (NCCL needs CUDA tensors), then average over ranks
dist.all_gather(
    gather_metric_list[k], v.cuda() if dist.get_backend() == "nccl" else v
)
metric_result[k] = torch.mean(torch.stack(gather_metric_list[k]))
Collaborator

Under what circumstances does torchmetrics require us to all_gather the metric from all ranks ourselves and average it?

Collaborator Author

During distributed execution, torch splits the samples across the GPUs and each GPU process computes its own GAUC independently; at the end, the per-GPU GAUC values have to be gathered back (all_gather) and averaged once more.
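
A rough sketch of that gather-then-average step, assuming the process group is already initialized and gauc_local is the scalar GAUC computed on this rank (the helper name aggregate_gauc is made up for illustration):

import torch
import torch.distributed as dist


def aggregate_gauc(gauc_local: torch.Tensor) -> torch.Tensor:
    # NCCL only gathers CUDA tensors; other backends (e.g. gloo) take CPU tensors.
    if dist.get_backend() == "nccl":
        gauc_local = gauc_local.cuda()
    gathered = [torch.zeros_like(gauc_local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, gauc_local)
    # Unweighted mean over ranks, matching the diff above.
    return torch.mean(torch.stack(gathered))

Note that this mean weights every rank equally regardless of how many users it held, which is what the diff above does as well.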

@tiankongdeguiji merged commit 8c5c752 into alibaba:master on Mar 11, 2025
5 checks passed