[Feature] distributed gauc #127

Merged
tiankongdeguiji merged 14 commits into alibaba:master from eric-gecheng:feature/dist_gauc
Mar 11, 2025

Conversation

@eric-gecheng
Collaborator

distributed calculation of gauc metrics

data_list_reduce = []
for data in data_list:
    # keep only the samples whose user key is assigned to this rank
    key_mask = data[2, :] % world_size == local_rank
    pred_selected = torch.masked_select(data[0, :], key_mask)
Collaborator

When both rank 0 and rank 1 have samples from the same user, could this cause that user's samples on one of the ranks to be filtered out?

Collaborator Author

No, it won't. During the reduce, GPU process_i pulls the qualifying samples from all GPUs and merges them onto gpu_i, so the per-GPU sample counts add up to the total number of evaluation samples. Adding a print at line 78,

print(f"preds.shape = {preds.shape}")

confirms this.
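
For context, here is a minimal sketch of the redistribution described above. It assumes each rank holds a (3, N) tensor whose rows are preds, labels and user keys; the helper name redistribute_by_user and the use of all_gather_object are illustrative assumptions, not the PR's exact implementation.

import torch
import torch.distributed as dist


def redistribute_by_user(
    local_data: torch.Tensor, world_size: int, local_rank: int
) -> torch.Tensor:
    """local_data: (3, N) tensor of [preds; labels; user keys] on this rank (sketch only)."""
    # Pull every rank's shard. Shard sizes differ across ranks, so gather
    # them as pickled objects rather than fixed-size tensors.
    data_list = [None] * world_size
    dist.all_gather_object(data_list, local_data.cpu())

    # Keep, from every shard, only the users assigned to this rank
    # (user_key % world_size == local_rank).
    data_list_reduce = []
    for data in data_list:
        key_mask = data[2, :] % world_size == local_rank
        data_list_reduce.append(data[:, key_mask])

    # Summed over ranks, these merged parts cover every evaluation sample exactly once,
    # and all samples of a given user land on a single rank.
    return torch.cat(data_list_reduce, dim=1)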

tzrec/main.py Outdated
# gather each rank's metric value (NCCL needs CUDA tensors), then average over ranks
dist.all_gather(
    gather_metric_list[k], v.cuda() if dist.get_backend() == "nccl" else v
)
metric_result[k] = torch.mean(torch.stack(gather_metric_list[k]))
Collaborator

Under what circumstances does torchmetrics require us to all_gather the metric from all ranks ourselves and average it?

Collaborator Author

During distributed execution, torch splits the samples across the GPUs and each GPU process computes its own GAUC independently; at the end, the per-GPU GAUC values have to be gathered back (all_gather) and averaged once more.
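
A rough sketch of that gather-then-average step, assuming the process group is already initialized and gauc_local is the scalar GAUC computed on this rank (the helper name aggregate_gauc is made up for illustration):

import torch
import torch.distributed as dist


def aggregate_gauc(gauc_local: torch.Tensor) -> torch.Tensor:
    # NCCL only gathers CUDA tensors; other backends (e.g. gloo) take CPU tensors.
    if dist.get_backend() == "nccl":
        gauc_local = gauc_local.cuda()
    gathered = [torch.zeros_like(gauc_local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, gauc_local)
    # Unweighted mean over ranks, matching the diff above.
    return torch.mean(torch.stack(gathered))

Note that this mean weights every rank equally regardless of how many users it held, which is what the diff above does as well.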

@tiankongdeguiji merged commit 8c5c752 into alibaba:master on Mar 11, 2025
5 checks passed