[Cherry-Pick][BugFix] Seperate prometheus multiproc dir for single-server multi-dp services (#8059) #8067

Closed
liyonghua0910 wants to merge 3 commits into PaddlePaddle:release/online/20260415 from liyonghua0910:release/online/20260415+20260622_fix_dp_metrics

@liyonghua0910

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-22 18:16:01

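📋 Review Summary

PR overview: Fixes metric cross-contamination caused by DP ranks sharing one Prometheus multiprocess directory in single-server multi-DP deployments.
Scope of changes: metrics helper, engine/common_engine DP startup, OpenAI multi_api_server, and related unit tests.
Affected tags: [Engine] [APIServer]

Issues

No blocking issues found. PR convention issues are reported in the section below and are not repeated here.

📝 PR Convention Check

The PR description still contains the template instructions and placeholder content, and the Checklist has not been ticked to reflect this cherry-pick and its test changes. The title format meets the cherry-pick requirements, but consider fixing the spelling of "Seperate" while you are at it.

Suggested title (can be copied directly):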

  • [Cherry-Pick][BugFix] Separate prometheus multiproc dir for single-server multi-DP services (#8059)
Suggested PR description (can be copied directly):
## Motivation
Fix inaccurate Prometheus multiprocess metrics in single-server multi-DP services. Each DP rank should write metrics to an isolated `PROMETHEUS_MULTIPROC_DIR` subdirectory instead of sharing one directory.

## Modifications
- Add original Prometheus multiprocess dir tracking and `setup_dp_prometheus_dir()` to create per-DP `dp{rank}` directories.
- Move existing DP0 `.db` files into `dp0/` and switch DP worker launch paths to the per-DP directory when `FD_ENABLE_INTERNAL_ADAPTER` is enabled.
- Reuse the same helper in OpenAI `multi_api_server` so each API server subprocess receives an isolated metrics directory.
- Update related metrics, multi API server, cache transfer manager, and graph optimization tests.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
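
For readers tracing the Modifications list above, the per-DP directory layout can be sketched roughly as follows. This is an illustrative sketch only, not the PR's implementation: `make_dp_metrics_dir` and `demo` are hypothetical names, the real `setup_dp_prometheus_dir()` may take different arguments, and treating every pre-existing top-level `.db` file as belonging to DP0 is an assumption drawn from the description.

```python
import os
import shutil
import tempfile
from pathlib import Path


def make_dp_metrics_dir(base_dir: str, dp_rank: int) -> str:
    """Create <base_dir>/dp<rank> and return its path (hypothetical helper)."""
    base = Path(base_dir)
    dp_dir = base / f"dp{dp_rank}"
    dp_dir.mkdir(parents=True, exist_ok=True)
    if dp_rank == 0:
        # Assumption: metric files written before the split belong to DP0.
        # list() snapshots the matches so moving files does not disturb iteration.
        for db_file in list(base.glob("*.db")):
            shutil.move(str(db_file), str(dp_dir / db_file.name))
    return str(dp_dir)


def demo() -> dict:
    """Build the layout for two DP ranks in a throwaway directory."""
    with tempfile.TemporaryDirectory() as base:
        # Simulate a metric file that was written to the shared directory.
        (Path(base) / "counter_123.db").write_bytes(b"")
        dirs = [make_dp_metrics_dir(base, rank) for rank in range(2)]
        return {
            "names": [os.path.basename(d) for d in dirs],
            "dp0_files": sorted(os.listdir(dirs[0])),
            "dp1_files": sorted(os.listdir(dirs[1])),
            "top_level_db": sorted(p.name for p in Path(base).glob("*.db")),
        }
```

With this layout each rank writes only under its own `dp{rank}` subdirectory, so no rank writes into another rank's files.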

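Overall Assessment

This round of review prioritized by risk and covered Prometheus multiprocess directory initialization, DP worker startup order, environment-variable passing in the OpenAI multi API server, and the related test patches. No definite implementation issue that should block merging was found. Please fill in the PR description so that the motivation and verification scope of this release cherry-pick can be traced later.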
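
The env-passing concern mentioned in the assessment can be illustrated with a small sketch. It assumes each server subprocess should receive its own `PROMETHEUS_MULTIPROC_DIR` through a copied environment. `launch_with_metrics_dir` is a hypothetical name, and the child process here is a stand-in that only echoes the variable, not the actual API server.

```python
import os
import subprocess
import sys
import tempfile


def launch_with_metrics_dir(base_dir: str, dp_rank: int) -> str:
    """Start a stand-in child with a per-DP metrics dir and return what it saw."""
    dp_dir = os.path.join(base_dir, f"dp{dp_rank}")
    os.makedirs(dp_dir, exist_ok=True)

    # Copy the environment so the parent's own setting is left untouched.
    env = os.environ.copy()
    env["PROMETHEUS_MULTIPROC_DIR"] = dp_dir

    # Stand-in child: a real server would import its metrics client here,
    # after the variable is already present in its environment.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['PROMETHEUS_MULTIPROC_DIR'])",
    ]
    done = subprocess.run(child, env=env, capture_output=True, text=True, check=True)
    return done.stdout.strip()


def demo_launch() -> bool:
    """Check that two children each see their own directory."""
    with tempfile.TemporaryDirectory() as base:
        seen = [launch_with_metrics_dir(base, rank) for rank in range(2)]
        expected = [os.path.join(base, f"dp{rank}") for rank in range(2)]
        return seen == expected
```

Setting the variable in the child's environment before launch, rather than mutating `os.environ` in the parent, keeps each subprocess isolated and ensures the variable is already set when the child process starts.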
