feat: Fast slow separate #3213

Merged
chejinge merged 14 commits into OpenAtomFoundation:3.6 from Z-G-H1:fast-slow-separate on Jan 26, 2026

Z-G-H1 (Collaborator) commented Jan 3, 2026

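1. Overview

Built around "command thread pool borrowing", this PR splits command processing into a fast and a slow thread pool and introduces a cross-pool borrowing mechanism. It also improves the thread pool monitoring output and online configuration, raising resource utilization and observability under unbalanced load.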

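2. Core features

2.1 Fast/slow command thread pool separation (Fast / Slow)

During scheduling, each command-processing task is assigned to one of:

  • Fast pool: handles the vast majority of commands by default
  • Slow pool: used when slow-cmd-pool is enabled and the command is classified as a slow command (slow-cmd-list)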

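2.2 Cross-pool borrowing (Borrow) scheduling policy

Borrowing logic is introduced: when one pool is busy and the other is idle, tasks may be submitted to the other pool to share the load.

  • Fast pool busy, slow pool idle → FAST borrows SLOW
  • Slow pool busy, fast pool idle → SLOW borrows FAST

The decision entry point is PikaServer::DecidePoolType(...), and the routing entry point is PikaServer::ScheduleClientPool(...).
pool_type is recorded in the task arguments and written into the connection context at the execution entry point, so that:

  1. Thread pool statistics (scheduled/active_tasks/latency) are attributed to the pool that actually executes the task;
  2. Behavior can be observed per thread pool.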
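
The borrowing rule above can be sketched as a small pure function. This is only an illustration: PoolKind, PoolState and DecidePool are hypothetical names, not the PR's actual TaskPoolType enum or the real signature of PikaServer::DecidePoolType, and the busy/idle flags are assumed to be computed elsewhere.

```cpp
#include <cstdint>

// Hypothetical stand-ins; the PR's real types live in pika_define.h / pika_server.h.
enum class PoolKind : uint8_t { kFast, kSlow };

struct PoolState {
  bool busy;  // result of the busy check for this pool
  bool idle;  // result of the stricter idle check for this pool
};

// Returns the pool that should execute the task.
// `home` is the pool chosen by the normal fast/slow split.
inline PoolKind DecidePool(PoolKind home, const PoolState& fast,
                           const PoolState& slow, bool borrow_enabled) {
  if (!borrow_enabled) return home;
  // FAST borrows SLOW: the fast pool is busy while the slow pool is idle.
  if (home == PoolKind::kFast && fast.busy && slow.idle) return PoolKind::kSlow;
  // SLOW borrows FAST: the slow pool is busy while the fast pool is idle.
  if (home == PoolKind::kSlow && slow.busy && fast.idle) return PoolKind::kFast;
  return home;
}
```

The returned value is what would be stored as pool_type in the task arguments, so that statistics follow the executing pool rather than the home pool.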

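3. Busy/idle indicators

Queue wait time and thread pool task queue occupancy are used as indicators.

Queue wait time (queue-wait) EMA indicator

  • Queue wait time: queue_wait_us = dequeue_ts_ - enqueue_ts_
  • An EMA (exponential moving average) is used to smooth queue-wait.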
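
A minimal sketch of the queue-wait EMA, assuming the smoothing factor is the ratio of the two integer settings threadpool-ema-alpha-numerator and threadpool-ema-alpha-denominator. The function names and the clamping of negative waits are assumptions, not taken from the PR.

```cpp
#include <cstdint>

// Queue wait in microseconds; clamps to 0 if the timestamps are out of order (assumption).
inline uint64_t QueueWaitUs(uint64_t enqueue_ts_us, uint64_t dequeue_ts_us) {
  return dequeue_ts_us >= enqueue_ts_us ? dequeue_ts_us - enqueue_ts_us : 0;
}

// Integer EMA with alpha = alpha_num / alpha_den.
// Assumes alpha_den > 0 and alpha_num <= alpha_den.
inline uint64_t UpdateEma(uint64_t prev_ema_us, uint64_t sample_us,
                          uint64_t alpha_num, uint64_t alpha_den) {
  return (alpha_num * sample_us + (alpha_den - alpha_num) * prev_ema_us) / alpha_den;
}
```

With alpha = 1/8, a previous EMA of 800 us and a new sample of 1600 us give (1600 + 7 * 800) / 8 = 900 us, so a single slow task moves the estimate only slightly.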

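Queue length ratio for the busy/idle decision:

  • threadpool-borrow-threshold-percent
  • threadpool-idle-threshold-percent

Active task ratio (busy_by_active)

  • active_tasks tracks in real time the number of tasks the thread pool is executing
  • When active_tasks >= (thread_num * 80%) → the pool is considered busy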

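Combined rules (Busy/Idle for borrowing)

  • Busy: busy = busy_by_queue || busy_by_ema || busy_by_active
  • Idle: idle = idle_by_queue && idle_by_ema

Busy determination:
If any signal indicates "busy", the pool is considered busy.

Busy = busy_by_queue || busy_by_ema || busy_by_active

Idle determination: more conservative
Both signals must indicate "idle" before the pool is considered idle and borrowing is allowed.

Idle = idle_by_queue && idle_by_ema && idle_by_active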
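
The combined rules might look roughly like the following. The struct and function names are hypothetical, the comparison operators (>= and <=) are assumptions, and the idle check uses the two-signal form (queue and EMA); the description also lists a variant that includes idle_by_active.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical snapshot of one pool; assumes max_queue_size > 0 and thread_num > 0.
struct PoolSnapshot {
  size_t queue_size;
  size_t max_queue_size;
  size_t active_tasks;
  size_t thread_num;
  uint64_t ema_wait_us;  // EMA of queue wait, microseconds
};

struct Thresholds {
  size_t borrow_threshold_percent;  // threadpool-borrow-threshold-percent
  size_t idle_threshold_percent;    // threadpool-idle-threshold-percent
  uint64_t busy_wait_us;            // e.g. threadpool-fast-busy-threshold
  uint64_t idle_wait_us;            // e.g. threadpool-fast-idle-threshold
};

// Busy if ANY signal says busy.
inline bool IsBusy(const PoolSnapshot& p, const Thresholds& t) {
  const bool by_queue = p.queue_size * 100 >= p.max_queue_size * t.borrow_threshold_percent;
  const bool by_ema = p.ema_wait_us >= t.busy_wait_us;
  const bool by_active = p.active_tasks * 100 >= p.thread_num * 80;  // 80% of threads
  return by_queue || by_ema || by_active;
}

// Idle only if ALL considered signals say idle (two-signal form).
inline bool IsIdle(const PoolSnapshot& p, const Thresholds& t) {
  const bool by_queue = p.queue_size * 100 <= p.max_queue_size * t.idle_threshold_percent;
  const bool by_ema = p.ema_wait_us <= t.idle_wait_us;
  return by_queue && by_ema;
}
```

Using OR for busy and AND for idle makes the policy asymmetric: a pool is flagged as overloaded early, but it is only offered as a borrowing target when every considered signal agrees it has spare capacity.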

4. Observability

Thread pool runtime information (info threadpool)
The thread pool output is extended to include:

  • fast/slow pool size, queue_size, max_queue_size, usage
  • Borrow statistics: borrow_attempts
  • Scheduling statistics: tasks_scheduled

5. Dynamic parameter adjustment

5.1 New/extended CONFIG GET

Queries are supported for:

  • threadpool-borrow-enable
  • threadpool-borrow-threshold-percent
  • threadpool-idle-threshold-percent

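5.2 New/extended CONFIG SET

Online setting is supported for:

  • thread-pool-size (dynamic resize of the fast pool thread count; allowed when the queue is empty)
  • slow-cmd-thread-pool-size (dynamic resize of the slow pool thread count; allowed when the queue is empty)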
  • threadpool-borrow-enable
  • threadpool-borrow-threshold-percent
  • threadpool-idle-threshold-percent

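6. New pika.conf options (thread pool EMA parameters)

The following parameters are added to the configuration file to control the EMA and the thresholds (unit: microseconds):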

  • threadpool-ema-alpha-numerator
  • threadpool-ema-alpha-denominator
  • threadpool-fast-busy-threshold
  • threadpool-fast-idle-threshold
  • threadpool-slow-busy-threshold
  • threadpool-slow-idle-threshold

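7. Scheduling path (from receiving a request to pool submission/borrowing)

  1. After the network thread parses a command, it builds the background task arguments (including redis_cmds, pool_type, etc.).
  2. The scheduling entry point ScheduleClientPool calls DecidePoolType to choose the target pool:
  • Default fast/slow split
  • Cross-pool submission when the borrow conditions are met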

Copilot AI review requested due to automatic review settings January 3, 2026 14:19

coderabbitai bot commented Jan 3, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.


Copilot AI left a comment

Pull request overview

This PR implements a fast/slow thread pool separation mechanism with cross-pool borrowing to improve resource utilization under unbalanced load conditions. Commands are classified and routed to either a fast or slow pool, with intelligent borrowing when one pool is busy and another is idle.

Key changes include:

  • Thread pool task borrowing mechanism based on queue occupancy and EMA-smoothed wait times
  • Enhanced observability with thread pool metrics (latency buckets, scheduled/completed task counts, borrow attempts)
  • Dynamic thread pool resizing via CONFIG SET commands

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| conf/pika.conf | Adds thread pool borrowing configuration parameters, EMA thresholds, and enables slow-cmd-pool by default |
| tools/pika_migrate/conf/pika.conf | Contains an unintended typo in userblacklist and adds blank line |
| include/pika_define.h | Defines TaskPoolType enum for fast/slow pool classification |
| include/pika_conf.h | Adds configuration fields and accessors for borrowing thresholds and EMA parameters |
| include/pika_server.h | Introduces ThreadPoolMetrics struct and methods for pool management, borrowing logic, and EMA tracking |
| include/pika_admin.h | Extends INFO command to support "threadpool" section |
| include/pika_client_conn.h | Adds pool_type tracking to BgTaskArg and connection state |
| include/pika_binlog.h | Fixes printf format specifier from %llu to %lu |
| src/pika_conf.cc | Loads borrowing and EMA configuration from pika.conf |
| src/pika_server.cc | Implements core borrowing logic, metrics tracking, EMA calculations, pool resizing, and decision algorithms |
| src/pika_client_conn.cc | Records queue wait times, updates EMA statistics, and tracks per-pool metrics during task execution |
| src/pika_admin.cc | Implements CONFIG GET/SET for borrowing parameters and INFO threadpool output |

}

bool PikaServer::ResizeSlowCmdThreadPool(size_t new_size) {
if (new_size == 0 || new_size > 24) {

Copilot AI Jan 3, 2026

The upper bound check is inconsistent with the error message. The code checks for 24 but the error message states the limit is 1024. For slow command thread pool, a limit of 1024 would be reasonable, but the validation should be consistent. Consider either changing the check to allow up to 1024 or updating the error message to reflect the actual limit of 24.

Suggested change
- if (new_size == 0 || new_size > 24) {
+ if (new_size == 0 || new_size > 1024) {

// QUEUE_SIZE_THRESHOLD_PERCENTAGE is used to represent a percentage value and should be within the range of 0 to 100.
const size_t QUEUE_SIZE_THRESHOLD_PERCENTAGE = 75;

// ThreadPoolMetrics

Copilot AI Jan 3, 2026

The comment has a typo: "threadpool metrics" should be capitalized or include proper punctuation as "ThreadPool Metrics" or "thread pool metrics".

Suggested change
- // ThreadPoolMetrics
+ // ThreadPool Metrics

- role_(PIKA_ROLE_SINGLE) {
+ role_(PIKA_ROLE_SINGLE),
+ fast_pool_metrics_(new ThreadPoolMetrics()),
+ slow_pool_metrics_(new ThreadPoolMetrics()){

Copilot AI Jan 3, 2026

The initialization uses curly braces instead of parentheses which is inconsistent with the previous line (line 110). While both are valid, consistency within the same initialization list is preferred.

# [NOTICE] If this user password is the same as admin password (including both being empty),
# the value of this parameter will be ignored and all users are considered as administrators,
# in this scenario, users are not subject to the restrictions imposed by the userblacklist.
# in this scenario, users are not subject to the restrictions imposed by the userb lacklist.

Copilot AI Jan 3, 2026

There's an extra space inserted in the middle of "userblacklist", changing it to "userb lacklist". This appears to be an unintentional edit and should be reverted.

Suggested change
- # in this scenario, users are not subject to the restrictions imposed by the userb lacklist.
+ # in this scenario, users are not subject to the restrictions imposed by the userblacklist.

Comment on lines +940 to +942
while (SlowCmdThreadPoolCurQueueSize() != 0) {
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

Copilot AI Jan 3, 2026

Blocking wait in resize operation could cause performance issues. The code waits in a tight loop checking queue size every 100ms, which blocks the CONFIG SET command execution. If the queue never drains (e.g., due to continuous load), this could hang indefinitely. Consider adding a timeout or a maximum retry count to prevent indefinite blocking.
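
One way to bound the wait, sketched under the assumption that the queue length can be polled through a callable. WaitForDrain and its parameters are hypothetical and not part of the PR; in Pika the callable would wrap something like SlowCmdThreadPoolCurQueueSize().

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

// Polls `queue_size` until it reports 0 or `timeout` elapses.
// Returns true if the queue drained in time, false on timeout.
inline bool WaitForDrain(
    const std::function<size_t()>& queue_size,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds poll_interval = std::chrono::milliseconds(100)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (queue_size() != 0) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // caller can report an error instead of hanging
    }
    std::this_thread::sleep_for(poll_interval);
  }
  return true;
}
```

On a false return the resize could be rejected with an error reply, which keeps CONFIG SET from blocking indefinitely under continuous load.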

res_.AppendStringRaw("-ERR Can't resize slow-cmd-pool-size when there are tasks in the queue\r\n");
return;
}
long int slow_cmd_thread_pool_size = (1 > ival || 24 < ival) ? 1 : ival;

Copilot AI Jan 3, 2026

The validation bound (24) is inconsistent with the upper bound check in line 2733 which allows values up to 1024. This creates a mismatch where ConfigSet allows larger values than ResizeSlowCmdThreadPool can actually handle. The limits should be consistent across both validation points.
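
A possible way to keep both validation points in agreement is a single shared constant and predicate. The names below are hypothetical, and the value 24 is only a placeholder, since the review leaves open whether 24 or 1024 is the intended limit.

```cpp
#include <cstddef>

// Hypothetical shared limit; the actual value (24 vs 1024) is for the authors to decide.
constexpr size_t kMaxSlowCmdThreadPoolSize = 24;

// Single predicate that both CONFIG SET and the resize function could call.
inline bool IsValidSlowCmdPoolSize(long requested) {
  return requested >= 1 &&
         static_cast<size_t>(requested) <= kMaxSlowCmdThreadPoolSize;
}
```

Rejecting an out-of-range value explicitly also avoids silently falling back to 1, as the ternary in the quoted line does.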

Comment on lines +2153 to +2158
if (last > 0) {
uint64_t elapsed = now - last;
if (elapsed > 1000) { // If last update was more than 1ms ago, increase alpha to adapt faster
alpha_num = std::min(alpha_num * 2, alpha_den / 2);
}
}

Copilot AI Jan 3, 2026

The adaptive alpha calculation could cause issues. When elapsed > 1000 microseconds (1ms), alpha_num is doubled, but this modification is only applied locally and doesn't update the global EMA parameters. This means the adaptation effect is inconsistent - it only affects this one update. Additionally, if multiple threads call this simultaneously with large elapsed times, they could all compute different alpha values leading to unpredictable EMA behavior. Consider either removing this adaptive logic or implementing it more carefully with proper synchronization.

Suggested change
- if (last > 0) {
- uint64_t elapsed = now - last;
- if (elapsed > 1000) { // If last update was more than 1ms ago, increase alpha to adapt faster
- alpha_num = std::min(alpha_num * 2, alpha_den / 2);
- }
- }
+ (void)last; // last is currently unused; retained for potential future logic
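
If the EMA is shared between worker threads, one common lock-free pattern is a compare-and-swap loop with a fixed alpha, so every update applies the same smoothing factor. This is a sketch with a hypothetical function name, not the PR's implementation.

```cpp
#include <atomic>
#include <cstdint>

// Applies one EMA step to `ema_us` atomically.
// Assumes alpha_den > 0 and alpha_num <= alpha_den.
inline void UpdateEmaAtomic(std::atomic<uint64_t>& ema_us, uint64_t sample_us,
                            uint64_t alpha_num, uint64_t alpha_den) {
  uint64_t prev = ema_us.load(std::memory_order_relaxed);
  uint64_t next;
  do {
    next = (alpha_num * sample_us + (alpha_den - alpha_num) * prev) / alpha_den;
    // On failure compare_exchange_weak reloads `prev`, so the step is recomputed
    // from the latest value instead of overwriting a concurrent update.
  } while (!ema_us.compare_exchange_weak(prev, next, std::memory_order_relaxed));
}
```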

std::atomic<uint64_t> tasks_completed{0};
std::atomic<uint64_t> borrow_attempts{0};

// latency(1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s, >5s)

Copilot AI Jan 3, 2026

The comment is unclear and has a typo. "latency(1ms" uses Chinese parentheses instead of English ones, and the description "latency" should be more specific like "Latency buckets in microseconds".

Suggested change
- // latency(1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s, >5s
+ // Latency buckets (1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s, >5s)
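
For illustration, the nine buckets named in that comment could be indexed as below. The names, the microsecond unit for the bounds, and the inclusive upper bounds are assumptions; the PR's ThreadPoolMetrics may store them differently.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Upper bounds in microseconds for 1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s.
constexpr std::array<uint64_t, 8> kBucketUpperUs = {
    1000, 5000, 10000, 50000, 100000, 500000, 1000000, 5000000};

// Returns 0..7 for the bounded buckets and 8 for the ">5s" overflow bucket.
inline size_t LatencyBucketIndex(uint64_t latency_us) {
  for (size_t i = 0; i < kBucketUpperUs.size(); ++i) {
    if (latency_us <= kBucketUpperUs[i]) return i;
  }
  return kBucketUpperUs.size();
}

// Hypothetical histogram with one atomic counter per bucket.
struct LatencyHistogram {
  std::array<std::atomic<uint64_t>, 9> buckets{};

  void Record(uint64_t latency_us) {
    buckets[LatencyBucketIndex(latency_us)].fetch_add(1, std::memory_order_relaxed);
  }
};
```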

}
pika_client_processor_->SchedulePool(func, arg);
// if slow cmd thread pool disabled
if(!g_pika_conf->slow_cmd_pool()) {

Copilot AI Jan 3, 2026

Missing space after 'if'. The code has 'if(!g_pika_conf->slow_cmd_pool())' but should be 'if (!g_pika_conf->slow_cmd_pool())' for consistent code style.

Suggested change
- if(!g_pika_conf->slow_cmd_pool()) {
+ if (!g_pika_conf->slow_cmd_pool()) {

}
size_t current_queue_size = g_pika_server->SlowCmdThreadPoolCurQueueSize();
if (current_queue_size > 0) {
res_.AppendStringRaw("-ERR Can't resize slow-cmd-pool-size when there are tasks in the queue\r\n");

Copilot AI Jan 3, 2026

The error message uses inconsistent terminology. Earlier in the code and function name, it's referred to as "slow-cmd-thread-pool-size", but the error message says "slow-cmd-pool-size". The error message should match the actual configuration parameter name for clarity.

Suggested change
- res_.AppendStringRaw("-ERR Can't resize slow-cmd-pool-size when there are tasks in the queue\r\n");
+ res_.AppendStringRaw("-ERR Can't resize slow-cmd-thread-pool-size when there are tasks in the queue\r\n");

Z-G-H1 changed the title from "Fast slow separate" to "feat: Fast slow separate" Jan 4, 2026
github-actions bot added the ✏️ Feature (New feature or request) label and removed the Invalid PR Title label Jan 4, 2026
Z-G-H1 force-pushed the fast-slow-separate branch 7 times, most recently from aa9a567 to cb1696b Jan 5, 2026 06:22
Z-G-H1 force-pushed the fast-slow-separate branch from cb1696b to f030ae5 Jan 6, 2026 01:56
chejinge changed the base branch from 3.5 to 3.6 January 19, 2026 02:11
chejinge merged commit d1a606e into OpenAtomFoundation:3.6 Jan 26, 2026
2 checks passed
Labels

✏️ Feature New feature or request

5 participants