[Cherry-Pick][BugFix] Seperate prometheus multiproc dir for single-server multi-dp services (#8059) #8067
Closed
liyonghua0910 wants to merge 3 commits into
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-22 18:16:01
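📋 Review Summary
PR overview: Fixes metric cross-talk caused by sharing one Prometheus multiprocess directory in single-server multi-DP deployments.
Scope of changes: metrics helper, engine/common_engine DP startup, OpenAI multi_api_server, and related unit tests.
Impact tags: [Engine] [APIServer]
Issues
No blocking issues found. PR convention issues are reported in the section below and are not repeated here.
📝 PR Convention Check
The PR description still contains template instructions and placeholder content, and the Checklist has not been ticked to reflect this cherry-pick and its test changes. The title format meets the cherry-pick requirements, but consider fixing the "Seperate" typo while you are at it.
Suggested title (can be copied directly):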
[Cherry-Pick][BugFix] Separate prometheus multiproc dir for single-server multi-DP services (#8059)
Suggested PR description (can be copied directly):
## Motivation
Fix inaccurate Prometheus multiprocess metrics in single-server multi-DP services. Each DP rank should write metrics to an isolated `PROMETHEUS_MULTIPROC_DIR` subdirectory instead of sharing one directory.
## Modifications
- Add original Prometheus multiprocess dir tracking and `setup_dp_prometheus_dir()` to create per-DP `dp{rank}` directories.
- Move existing DP0 `.db` files into `dp0/` and switch DP worker launch paths to the per-DP directory when `FD_ENABLE_INTERNAL_ADAPTER` is enabled.
- Reuse the same helper in OpenAI `multi_api_server` so each API server subprocess receives an isolated metrics directory.
- Update related metrics, multi API server, cache transfer manager, and graph optimization tests.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
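- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
This round reviewed, in order of risk, the Prometheus multiprocess directory initialization, the DP worker startup order, the environment passing in the OpenAI multi API server, and the related test patches. No definite implementation issue that should block merging was found. Please complete the PR description so that the motivation and verification scope of this release cherry-pick can be traced later.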
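For readers who have not opened the diff, the per-DP directory isolation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the name `setup_dp_prometheus_dir` comes from the PR description, but its signature and body here are assumptions, `env_for_dp_worker` is a hypothetical helper, and the `FD_ENABLE_INTERNAL_ADAPTER` gating is not modeled.

```python
import os
import shutil
from pathlib import Path


def setup_dp_prometheus_dir(base_dir: str, dp_rank: int) -> str:
    """Create <base_dir>/dp<rank> and return its path.

    Assumed signature; the real helper in the PR may differ.
    """
    base = Path(base_dir)
    dp_dir = base / f"dp{dp_rank}"
    dp_dir.mkdir(parents=True, exist_ok=True)
    if dp_rank == 0:
        # DP0 may already have written .db files into the shared directory;
        # move them into dp0/ so they are not mixed with other ranks.
        for db_file in list(base.glob("*.db")):
            shutil.move(str(db_file), str(dp_dir / db_file.name))
    return str(dp_dir)


def env_for_dp_worker(base_dir: str, dp_rank: int) -> dict:
    """Hypothetical helper: build a child-process environment whose
    PROMETHEUS_MULTIPROC_DIR points at the per-DP subdirectory."""
    env = os.environ.copy()
    env["PROMETHEUS_MULTIPROC_DIR"] = setup_dp_prometheus_dir(base_dir, dp_rank)
    return env
```

The returned environment would be passed to whatever launches the DP worker or API server subprocess (for example via the `env` argument of `subprocess.Popen`), so that each rank writes its metric files into its own directory.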