[Feature] distributed gauc #127
Merged
tiankongdeguiji merged 14 commits into alibaba:master on Mar 11, 2025
tzrec/metrics/grouped_auc.py
Outdated
    data_list_reduce = []
    for data in data_list:
        key_mask = data[2, :] % world_size == local_rank
        pred_selected = torch.masked_select(data[0, :], key_mask)
Collaborator
When rank 0 and rank 1 both hold samples from the same user, could this cause the samples on one of the ranks to be filtered out?
Collaborator
Author
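No. During the reduce step, GPU process i pulls the samples that match its condition from all GPUs and merges them onto GPU i, so the sample counts across GPUs add up to the total number of evaluation samples. You can verify this by adding a print at line 78:
print(f"preds.shape = {preds.shape}")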
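The resharding described in this reply can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: plain Python lists stand in for the gathered tensors, and `per_rank_samples` and `reshard_by_key` are hypothetical names that do not appear in the PR. It shows that when every rank keeps only the keys with `key % world_size == local_rank` from the data of all ranks, no sample is lost and each user ends up on exactly one rank.

```python
# Single-process simulation of key-based resharding (no torch.distributed needed).
# Each sample is (pred, label, user_key); the values are made up for illustration.
world_size = 2
per_rank_samples = [
    [(0.9, 1, 10), (0.2, 0, 11), (0.7, 1, 10)],  # samples initially on rank 0
    [(0.4, 0, 10), (0.8, 1, 11), (0.1, 0, 12)],  # samples initially on rank 1
]


def reshard_by_key(per_rank_samples, world_size):
    """For each rank, collect the samples from ALL ranks whose key maps to it."""
    resharded = []
    for local_rank in range(world_size):
        owned = [
            sample
            for rank_samples in per_rank_samples  # stands in for the gathered data_list
            for sample in rank_samples
            if sample[2] % world_size == local_rank  # same rule as key_mask
        ]
        resharded.append(owned)
    return resharded


resharded = reshard_by_key(per_rank_samples, world_size)
total_before = sum(len(samples) for samples in per_rank_samples)
total_after = sum(len(samples) for samples in resharded)
users_per_rank = [{key for _, _, key in shard} for shard in resharded]
print(total_before, total_after, users_per_rank)
```

User 10 starts out split across both ranks, yet after resharding all three of its samples sit on rank 0, which is the property the reviewer asked about.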
tzrec/main.py
Outdated
    dist.all_gather(
        gather_metric_list[k], v.cuda() if dist.get_backend() == "nccl" else v
    )
    metric_result[k] = torch.mean(torch.stack(gather_metric_list[k]))
Collaborator
With torchmetrics, in which cases do we need to all_gather the metric from every rank ourselves and then take the mean?
Collaborator
Author
In distributed execution, torch distributes the samples across the GPUs, and each GPU process computes its own GAUC independently. At the end, the per-GPU GAUC values have to be gathered back (all_gather) and averaged once more.
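The gather-then-average step described here can be sketched without a process group. The following is an illustration under assumptions, not the PR's implementation: `auc`, `local_gauc`, and `shards` are hypothetical names, the per-user AUC is weighted by that user's sample count (one common GAUC definition, which may differ from the one in `tzrec/metrics/grouped_auc.py`), and a plain list copy stands in for `dist.all_gather`.

```python
def auc(preds, labels):
    """Pairwise AUC for one user; None if the user has only one class."""
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def local_gauc(samples):
    """Sample-count-weighted mean of per-user AUC (assumed weighting)."""
    by_user = {}
    for pred, label, key in samples:
        by_user.setdefault(key, []).append((pred, label))
    total, weight = 0.0, 0
    for rows in by_user.values():
        score = auc([p for p, _ in rows], [y for _, y in rows])
        if score is not None:  # skip users without both classes
            total += score * len(rows)
            weight += len(rows)
    return total / weight if weight else 0.0


# Samples already resharded so that each user lives on exactly one rank.
shards = [
    [(0.9, 1, 10), (0.4, 0, 10), (0.7, 1, 10), (0.1, 1, 12), (0.3, 0, 12)],  # rank 0
    [(0.2, 0, 11), (0.8, 1, 11)],  # rank 1
]
per_rank_gauc = [local_gauc(shard) for shard in shards]
gathered = list(per_rank_gauc)  # stands in for dist.all_gather across ranks
metric_result = sum(gathered) / len(gathered)  # mirrors torch.mean(torch.stack(...))
print(per_rank_gauc, metric_result)
```

Note that an unweighted mean of per-rank values matches a single global GAUC only when the ranks carry comparable weight; in this toy data rank 0 holds five samples and rank 1 holds two, so the two quantities would differ.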
…ing dist object to be initialized
tiankongdeguiji approved these changes on Mar 11, 2025
distributed calculation of gauc metrics