
[Feature] rewrite rope kernel; remove flashinfer dependencies #18844

Open

DarkSharpness wants to merge 4 commits into sgl-project:main from DarkSharpness:jit_rope

Conversation

@DarkSharpness (Collaborator)

Motivation

TL;DR: replace the flashinfer RoPE implementation with a native SGLang JIT kernel.

Modifications

Implement RoPE and a fused RoPE + KV-cache store kernel for head_dim = 64, 128, 256, and 512.

TODO: integrate the new kernels.
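For reviewers unfamiliar with the two rotary styles, here is a minimal PyTorch reference of what such a kernel computes, assuming the usual NeoX (rotate-half) and interleaved (GPT-J) conventions; the tensor names and layouts are illustrative, not the kernel's actual API:

```python
import torch

def rope_reference(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, is_neox: bool) -> torch.Tensor:
    """Apply rotary position embedding to x.

    x:        (num_tokens, num_heads, head_dim)
    cos, sin: (num_tokens, head_dim // 2), already gathered by token position
    """
    cos = cos[:, None, :]  # broadcast over the head dimension
    sin = sin[:, None, :]
    if is_neox:
        # NeoX style: rotate the two contiguous halves of head_dim.
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
    # Interleaved (GPT-J) style: rotate adjacent even/odd element pairs.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x2 * cos + x1 * sin
    return out
```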

Accuracy Tests

Benchmarking and Profiling

Performance improves slightly over FlashInfer and the previous SGL kernels (see the v2 columns in the tables below).
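For context on how per-call numbers like those below are typically collected, here is a sketch of CUDA-event timing; the actual measurement logic lives in bench_rope.py and may differ, and the tables do not state their units explicitly (lower is better either way):

```python
import torch

def bench_cuda(fn, warmup: int = 10, iters: int = 100) -> float:
    """Return the mean latency of fn() in microseconds, timed with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # elapsed_time is in ms
```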

Details

rope-performance

| num_q_k_heads | is_neox | batch_size | FlashInfer | SGL RoPE v0 | SGL RoPE v1 | SGL RoPE v2 |
|---|---|---|---|---|---|---|
| 8,1 | True | 1 | 1.465905 | 1.688739 | 1.466271 | 1.050686 |
| 8,1 | True | 2 | 1.496625 | 1.721555 | 1.498425 | 1.052896 |
| 8,1 | True | 4 | 1.480840 | 1.699827 | 1.483009 | 1.047787 |
| 8,1 | True | 8 | 1.494758 | 1.779481 | 1.496295 | 1.094511 |
| 8,1 | True | 16 | 1.526189 | 1.797929 | 1.528095 | 1.116982 |
| 8,1 | True | 32 | 1.562607 | 1.835823 | 1.562705 | 1.160891 |
| 8,1 | True | 64 | 1.571138 | 1.846398 | 1.572661 | 1.162693 |
| 8,1 | True | 128 | 1.699871 | 1.979522 | 1.700327 | 1.084263 |
| 8,1 | True | 256 | 1.856436 | 2.144919 | 1.840478 | 1.447725 |
| 8,1 | True | 512 | 2.150097 | 2.587559 | 2.169062 | 1.852024 |
| 8,1 | True | 1024 | 2.682306 | 4.136819 | 2.693590 | 2.621214 |
| 8,1 | True | 2048 | 3.903368 | 6.687196 | 3.896879 | 3.857555 |
| 8,1 | True | 4096 | 6.068143 | 11.577039 | 6.048273 | 4.612054 |
| 8,1 | True | 8192 | 10.199478 | 21.367086 | 10.135196 | 6.557919 |
| 8,1 | True | 16384 | 20.660779 | 52.216889 | 20.606564 | 16.801611 |
| 8,1 | True | 32768 | 46.883762 | 118.681792 | 46.869635 | 41.676848 |
| 8,1 | False | 1 | 1.375527 | 1.699136 | 1.375482 | 1.061899 |
| 8,1 | False | 2 | 1.400020 | 1.732768 | 1.401262 | 1.067356 |
| 8,1 | False | 4 | 1.384943 | 1.702849 | 1.386458 | 1.061088 |
| 8,1 | False | 8 | 1.404279 | 1.767364 | 1.405318 | 1.094284 |
| 8,1 | False | 16 | 1.427348 | 1.807284 | 1.428448 | 1.126933 |
| 8,1 | False | 32 | 1.470702 | 1.842470 | 1.472000 | 1.162220 |
| 8,1 | False | 64 | 1.481351 | 1.846485 | 1.482182 | 1.173171 |
| 8,1 | False | 128 | 1.599877 | 1.971486 | 1.601497 | 1.098079 |
| 8,1 | False | 256 | 1.701565 | 2.113714 | 1.727705 | 1.422117 |
| 8,1 | False | 512 | 1.951569 | 2.571708 | 1.996244 | 1.853644 |
| 8,1 | False | 1024 | 2.326750 | 4.102251 | 2.328822 | 2.620368 |
| 8,1 | False | 2048 | 3.503764 | 6.673601 | 3.507369 | 3.861049 |
| 8,1 | False | 4096 | 5.066251 | 11.600890 | 5.048945 | 4.692637 |
| 8,1 | False | 8192 | 8.227467 | 21.366927 | 8.180892 | 6.819223 |
| 8,1 | False | 16384 | 19.668075 | 52.819877 | 19.620339 | 16.941540 |
| 8,1 | False | 32768 | 44.016149 | 119.019832 | 44.008486 | 42.057362 |
| 16,2 | True | 1 | 1.500070 | 1.992493 | 1.501611 | 1.053603 |
| 16,2 | True | 2 | 1.496236 | 1.977119 | 1.497117 | 1.046030 |
| 16,2 | True | 4 | 1.490956 | 1.982266 | 1.492431 | 1.093157 |
| 16,2 | True | 8 | 1.522642 | 2.074182 | 1.523706 | 1.120162 |
| 16,2 | True | 16 | 1.563096 | 2.088641 | 1.564316 | 1.171524 |
| 16,2 | True | 32 | 1.571848 | 2.102617 | 1.573702 | 1.177443 |
| 16,2 | True | 64 | 1.698376 | 2.146727 | 1.698157 | 1.259440 |
| 16,2 | True | 128 | 1.850087 | 2.268625 | 1.838429 | 1.457342 |
| 16,2 | True | 256 | 2.153008 | 2.431773 | 2.155328 | 1.853566 |
| 16,2 | True | 512 | 2.750640 | 3.372808 | 2.721494 | 2.615994 |
| 16,2 | True | 1024 | 3.876177 | 5.576473 | 3.862385 | 3.859709 |
| 16,2 | True | 2048 | 6.046595 | 9.303536 | 6.033190 | 4.594606 |
| 16,2 | True | 4096 | 10.149801 | 16.638324 | 10.125561 | 6.489297 |
| 16,2 | True | 8192 | 20.583990 | 38.158450 | 20.582179 | 14.529920 |
| 16,2 | True | 16384 | 45.494797 | 92.231773 | 45.441596 | 40.066955 |
| 16,2 | True | 32768 | 88.105310 | 181.552292 | 88.083548 | 77.640854 |
| 16,2 | False | 1 | 1.411144 | 1.998317 | 1.406448 | 1.054829 |
| 16,2 | False | 2 | 1.403795 | 1.991069 | 1.405814 | 1.057253 |
| 16,2 | False | 4 | 1.403175 | 1.999909 | 1.404495 | 1.116034 |
| 16,2 | False | 8 | 1.426203 | 2.083990 | 1.427465 | 1.118297 |
| 16,2 | False | 16 | 1.469662 | 2.101696 | 1.471450 | 1.154722 |
| 16,2 | False | 32 | 1.481905 | 2.132956 | 1.482911 | 1.179482 |
| 16,2 | False | 64 | 1.590724 | 2.148521 | 1.592570 | 1.269481 |
| 16,2 | False | 128 | 1.699139 | 2.231800 | 1.703096 | 1.442501 |
| 16,2 | False | 256 | 1.956735 | 2.417129 | 1.979635 | 1.854525 |
| 16,2 | False | 512 | 2.367511 | 3.291859 | 2.323186 | 2.619185 |
| 16,2 | False | 1024 | 3.340365 | 5.417764 | 3.340495 | 3.855119 |
| 16,2 | False | 2048 | 4.936565 | 9.128251 | 4.909766 | 4.713610 |
| 16,2 | False | 4096 | 8.060439 | 16.313987 | 7.999094 | 6.786168 |
| 16,2 | False | 8192 | 17.756383 | 36.414209 | 17.683300 | 14.569438 |
| 16,2 | False | 16384 | 41.868449 | 86.039222 | 41.837568 | 40.435153 |
| 16,2 | False | 32768 | 83.802112 | 168.793572 | 83.767932 | 78.007713 |
| 32,8 | True | 1 | 1.547714 | 2.565703 | 1.538442 | 1.094401 |
| 32,8 | True | 2 | 1.533345 | 2.582451 | 1.534082 | 1.095742 |
| 32,8 | True | 4 | 1.549325 | 2.593576 | 1.551083 | 1.118481 |
| 32,8 | True | 8 | 1.558672 | 2.652998 | 1.559190 | 1.141171 |
| 32,8 | True | 16 | 1.588127 | 2.645978 | 1.588875 | 1.178735 |
| 32,8 | True | 32 | 1.708854 | 2.698090 | 1.709635 | 1.284519 |
| 32,8 | True | 64 | 1.867873 | 2.727647 | 1.883694 | 1.495413 |
| 32,8 | True | 128 | 2.204598 | 2.843131 | 2.218983 | 1.954961 |
| 32,8 | True | 256 | 2.768494 | 3.317246 | 2.745605 | 2.772988 |
| 32,8 | True | 512 | 4.114256 | 5.104745 | 4.114662 | 3.952286 |
| 32,8 | True | 1024 | 6.439529 | 9.060432 | 6.435383 | 4.885927 |
| 32,8 | True | 2048 | 11.052336 | 16.165705 | 11.033120 | 6.997610 |
| 32,8 | True | 4096 | 23.494529 | 41.299627 | 23.489955 | 17.679204 |
| 32,8 | True | 8192 | 49.825750 | 93.401235 | 49.823003 | 43.197811 |
| 32,8 | True | 16384 | 96.854418 | 182.907113 | 96.843483 | 83.881681 |
| 32,8 | True | 32768 | 190.183927 | 362.858949 | 190.068465 | 166.388568 |
| 32,8 | False | 1 | 1.452906 | 2.577525 | 1.450533 | 1.072778 |
| 32,8 | False | 2 | 1.433787 | 2.593793 | 1.435516 | 1.112655 |
| 32,8 | False | 4 | 1.453016 | 2.598063 | 1.454140 | 1.129278 |
| 32,8 | False | 8 | 1.468181 | 2.685703 | 1.469687 | 1.161137 |
| 32,8 | False | 16 | 1.495218 | 2.670711 | 1.496430 | 1.182766 |
| 32,8 | False | 32 | 1.611368 | 2.731814 | 1.612246 | 1.287726 |
| 32,8 | False | 64 | 1.712166 | 2.747212 | 1.718795 | 1.475284 |
| 32,8 | False | 128 | 1.929803 | 2.861852 | 1.930260 | 1.951967 |
| 32,8 | False | 256 | 2.369279 | 3.309460 | 2.386835 | 2.773668 |
| 32,8 | False | 512 | 3.456697 | 4.971324 | 3.457744 | 3.947887 |
| 32,8 | False | 1024 | 5.134408 | 8.631698 | 5.120338 | 5.014669 |
| 32,8 | False | 2048 | 8.579153 | 15.383846 | 8.508154 | 7.178327 |
| 32,8 | False | 4096 | 21.057723 | 34.422907 | 21.035541 | 17.589021 |
| 32,8 | False | 8192 | 46.204745 | 74.307438 | 46.159339 | 43.452207 |
| 32,8 | False | 16384 | 91.796304 | 145.091873 | 91.847622 | 84.395289 |
| 32,8 | False | 32768 | 178.914035 | 285.918139 | 178.837525 | 166.952004 |

rope-store-performance

| num_q_k_heads | is_neox | batch_size | SGL RoPE + Store v0 | SGL RoPE + Store v1 | SGL RoPE + Store v2 |
|---|---|---|---|---|---|
| 8,1 | True | 1 | 1.501530 | 1.069705 | 1.146772 |
| 8,1 | True | 2 | 1.855672 | 1.283424 | 1.147284 |
| 8,1 | True | 4 | 1.871656 | 1.303827 | 1.167137 |
| 8,1 | True | 8 | 1.924820 | 1.304778 | 1.195446 |
| 8,1 | True | 16 | 1.958561 | 1.338893 | 1.219059 |
| 8,1 | True | 32 | 2.004317 | 1.350598 | 1.258091 |
| 8,1 | True | 64 | 2.097294 | 1.426414 | 1.276438 |
| 8,1 | True | 128 | 2.201477 | 1.463960 | 1.393515 |
| 8,1 | True | 256 | 2.408261 | 1.698295 | 1.571760 |
| 8,1 | True | 512 | 2.938759 | 2.136045 | 2.012780 |
| 8,1 | True | 1024 | 3.786154 | 2.861506 | 2.735244 |
| 8,1 | True | 2048 | 5.386977 | 4.293168 | 4.116696 |
| 8,1 | True | 4096 | 6.829920 | 7.085961 | 5.167309 |
| 8,1 | True | 8192 | 10.183406 | 10.740274 | 7.273840 |
| 8,1 | True | 16384 | 27.338417 | 30.349545 | 24.180913 |
| 8,1 | True | 32768 | 52.692111 | 55.325235 | 47.226159 |
| 8,1 | False | 1 | 1.893212 | 1.101923 | 1.161132 |
| 8,1 | False | 2 | 1.869795 | 1.155493 | 1.124730 |
| 8,1 | False | 4 | 1.903683 | 1.173441 | 1.163394 |
| 8,1 | False | 8 | 1.940271 | 1.251988 | 1.194802 |
| 8,1 | False | 16 | 1.953642 | 1.265088 | 1.195717 |
| 8,1 | False | 32 | 2.074449 | 1.101788 | 1.251232 |
| 8,1 | False | 64 | 2.122945 | 1.289947 | 1.280454 |
| 8,1 | False | 128 | 2.222100 | 1.391274 | 1.382383 |
| 8,1 | False | 256 | 2.429395 | 1.574957 | 1.543677 |
| 8,1 | False | 512 | 2.924664 | 1.986138 | 1.975842 |
| 8,1 | False | 1024 | 3.795915 | 2.774116 | 2.872588 |
| 8,1 | False | 2048 | 5.371883 | 4.314102 | 4.118552 |
| 8,1 | False | 4096 | 6.939690 | 7.136623 | 5.093441 |
| 8,1 | False | 8192 | 10.404934 | 12.827360 | 7.206322 |
| 8,1 | False | 16384 | 27.702673 | 26.512637 | 23.768610 |
| 8,1 | False | 32768 | 52.997290 | 49.990181 | 46.448503 |
| 16,2 | True | 1 | 1.519934 | 1.244739 | 1.133463 |
| 16,2 | True | 2 | 1.879703 | 1.283250 | 1.174973 |
| 16,2 | True | 4 | 1.928291 | 1.292576 | 1.217183 |
| 16,2 | True | 8 | 1.938168 | 1.333630 | 1.213554 |
| 16,2 | True | 16 | 2.005221 | 1.363596 | 1.263786 |
| 16,2 | True | 32 | 2.097003 | 1.408978 | 1.296604 |
| 16,2 | True | 64 | 2.184783 | 1.469828 | 1.390116 |
| 16,2 | True | 128 | 2.411589 | 1.701521 | 1.579916 |
| 16,2 | True | 256 | 2.870595 | 2.122580 | 2.025193 |
| 16,2 | True | 512 | 3.729159 | 2.869728 | 2.905558 |
| 16,2 | True | 1024 | 5.029063 | 4.310820 | 3.929155 |
| 16,2 | True | 2048 | 6.135365 | 7.093782 | 5.169978 |
| 16,2 | True | 4096 | 8.830602 | 12.747241 | 7.129128 |
| 16,2 | True | 8192 | 24.566463 | 32.711641 | 22.889283 |
| 16,2 | True | 16384 | 49.337367 | 57.151719 | 45.760788 |
| 16,2 | True | 32768 | 96.290678 | 103.881654 | 89.097847 |
| 16,2 | False | 1 | 1.879413 | 1.138956 | 1.118635 |
| 16,2 | False | 2 | 1.911116 | 1.159444 | 1.163345 |
| 16,2 | False | 4 | 1.955946 | 1.196264 | 1.210980 |
| 16,2 | False | 8 | 1.942596 | 1.277328 | 1.193927 |
| 16,2 | False | 16 | 2.021247 | 1.294169 | 1.252746 |
| 16,2 | False | 32 | 2.119432 | 1.139106 | 1.275601 |
| 16,2 | False | 64 | 2.239173 | 1.387136 | 1.381653 |
| 16,2 | False | 128 | 2.421391 | 1.579352 | 1.548451 |
| 16,2 | False | 256 | 2.866294 | 1.965325 | 1.992771 |
| 16,2 | False | 512 | 3.732518 | 2.759365 | 2.866410 |
| 16,2 | False | 1024 | 5.039539 | 4.259567 | 4.094223 |
| 16,2 | False | 2048 | 6.220263 | 7.132135 | 5.100184 |
| 16,2 | False | 4096 | 9.056353 | 12.771845 | 7.088739 |
| 16,2 | False | 8192 | 24.813617 | 25.854980 | 22.488359 |
| 16,2 | False | 16384 | 49.744501 | 51.050385 | 44.826011 |
| 16,2 | False | 32768 | 96.879442 | 95.046490 | 87.288818 |
| 32,8 | True | 1 | 1.555933 | 1.277164 | 1.178751 |
| 32,8 | True | 2 | 2.041247 | 1.298421 | 1.221050 |
| 32,8 | True | 4 | 2.041440 | 1.315357 | 1.222321 |
| 32,8 | True | 8 | 2.021853 | 1.362363 | 1.258056 |
| 32,8 | True | 16 | 2.130635 | 1.408542 | 1.324837 |
| 32,8 | True | 32 | 2.264945 | 1.489806 | 1.406744 |
| 32,8 | True | 64 | 2.486641 | 1.728049 | 1.644317 |
| 32,8 | True | 128 | 3.033159 | 2.151436 | 2.182893 |
| 32,8 | True | 256 | 3.933767 | 3.070315 | 3.241516 |
| 32,8 | True | 512 | 5.504190 | 4.593622 | 4.450881 |
| 32,8 | True | 1024 | 7.116248 | 7.701377 | 5.064358 |
| 32,8 | True | 2048 | 10.899568 | 14.743538 | 8.543146 |
| 32,8 | True | 4096 | 30.605005 | 33.021432 | 27.836558 |
| 32,8 | True | 8192 | 61.512406 | 77.742546 | 53.569546 |
| 32,8 | True | 16384 | 119.979191 | 133.753835 | 105.092343 |
| 32,8 | True | 32768 | 235.423054 | 245.938993 | 207.621564 |
| 32,8 | False | 1 | 1.905801 | 1.142943 | 1.157248 |
| 32,8 | False | 2 | 2.058847 | 1.184524 | 1.211946 |
| 32,8 | False | 4 | 2.058650 | 1.211242 | 1.212214 |
| 32,8 | False | 8 | 2.088970 | 1.298377 | 1.234327 |
| 32,8 | False | 16 | 2.109849 | 1.300255 | 1.314047 |
| 32,8 | False | 32 | 2.242838 | 1.245760 | 1.391347 |
| 32,8 | False | 64 | 2.476000 | 1.625049 | 1.609899 |
| 32,8 | False | 128 | 3.010011 | 2.041704 | 2.143426 |
| 32,8 | False | 256 | 3.953410 | 2.906295 | 3.195842 |
| 32,8 | False | 512 | 5.485879 | 4.548299 | 4.372080 |
| 32,8 | False | 1024 | 7.296242 | 7.675740 | 5.296103 |
| 32,8 | False | 2048 | 11.167942 | 14.129414 | 8.386598 |
| 32,8 | False | 4096 | 30.868913 | 31.681977 | 27.283209 |
| 32,8 | False | 8192 | 61.956734 | 62.784673 | 52.545371 |
| 32,8 | False | 16384 | 120.109954 | 122.055160 | 102.835000 |
| 32,8 | False | 32768 | 235.621713 | 230.134886 | 203.241230 |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@DarkSharpness DarkSharpness requested a review from BBuf as a code owner February 14, 2026 13:42
@gemini-code-assist (Contributor)

Summary of Changes

Hello @DarkSharpness, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the Rotary Position Embedding (RoPE) implementation within the SGLang JIT kernel. The core objective is to eliminate the reliance on the external flashinfer library by integrating a custom-built, highly optimized RoPE kernel. This new kernel not only efficiently applies RoPE but also introduces a fused operation that combines RoPE application with KV cache storage, resulting in notable performance gains across various configurations.

Highlights

  • FlashInfer Dependency Removal: The external flashinfer library's Rotary Position Embedding (RoPE) implementation has been replaced with a custom, optimized SGLang kernel.
  • New RoPE Kernel (v2): A new rope_v2 kernel has been introduced, capable of both standard RoPE application and a fused RoPE + KV cache store operation, supporting head_dim values of 64, 128, 256, and 512 (an unfused reference sketch follows these highlights).
  • Performance Improvements: Benchmarking results indicate significant performance enhancements with the new rope_v2 kernel compared to previous SGLang versions and the FlashInfer implementation.
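To make the fused operation concrete, here is an unfused PyTorch reference of the two steps the fused kernel folds into one pass; the cache layout, slot indexing, and cos/sin cache format are assumptions for illustration, not the PR's actual data structures:

```python
import torch

def rope_store_reference(k, positions, cos_sin_cache, k_cache, slot_ids):
    """Unfused reference for the fused RoPE + KV-store path (NeoX style only).

    k:             (num_tokens, num_k_heads, head_dim)
    cos_sin_cache: (max_pos, head_dim), cos in the first half, sin in the second
    k_cache:       (num_slots, num_k_heads, head_dim) flat token-slot KV buffer
    slot_ids:      (num_tokens,) destination slot of each token (assumed layout)
    """
    half = k.shape[-1] // 2
    cos = cos_sin_cache[positions, :half][:, None, :]
    sin = cos_sin_cache[positions, half:][:, None, :]
    k1, k2 = k.chunk(2, dim=-1)
    rotated = torch.cat([k1 * cos - k2 * sin, k2 * cos + k1 * sin], dim=-1)
    # The fused kernel writes rotated values straight to the cache, avoiding
    # this extra round trip of k through global memory.
    k_cache[slot_ids] = rotated.to(k_cache.dtype)
```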


Changelog
  • python/sglang/jit_kernel/benchmark/bench_rope.py
    • Added a new benchmark script to evaluate the performance of the new rope_v2 kernel, comparing it against FlashInfer and previous SGLang RoPE implementations for both standalone RoPE and fused RoPE + KV cache store operations.
  • python/sglang/jit_kernel/csrc/elementwise/rope.cuh
    • Removed the previous FlashInfer-based CUDA kernel implementation for Rotary Position Embedding.
  • python/sglang/jit_kernel/csrc/elementwise/rope_v2.cuh
    • Added a new CUDA kernel (fused_rope_kernel and fused_rope_store_kernel) that implements optimized RoPE and fused RoPE with KV cache storage, supporting different head dimensions and NeoX/interleaved styles.
  • python/sglang/jit_kernel/include/sgl_kernel/utils.cuh
    • Included <type_traits> header for C++ utilities.
    • Added load_as and store_as templated utility functions for safe type-aware loading and storing from void pointers on the device.
  • python/sglang/jit_kernel/rope_v2.py
    • Added Python binding functions (fused_rope_inplace and fused_rope_inplace_with_kvcache) to expose the new rope_v2 CUDA kernels to the Python API; a rough usage sketch follows this changelog.
  • python/sglang/jit_kernel/tests/test_rope.py
    • Updated the test suite to remove old benchmarking utilities and introduce new parameterized tests for the rope_v2 and fused_rope_store kernels, validating their correctness against FlashInfer's implementation and ensuring support for various configurations and position data types.
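As a rough illustration of how the new bindings might be invoked once integrated; only the function and module names come from this PR, while the signature, argument names, and shapes below are guesses:

```python
import torch
from sglang.jit_kernel.rope_v2 import fused_rope_inplace  # binding added in this PR

# Hypothetical shapes and arguments, for illustration only.
num_tokens, head_dim = 8, 128
q = torch.randn(num_tokens, 8, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(num_tokens, 1, head_dim, device="cuda", dtype=torch.float16)
positions = torch.arange(num_tokens, device="cuda", dtype=torch.int64)
cos_sin_cache = torch.randn(4096, head_dim, device="cuda", dtype=torch.float32)

# Hypothetical call: rotate q and k in place; is_neox selects the rotary style.
fused_rope_inplace(q, k, positions, cos_sin_cache, is_neox=True)
```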
Activity
  • The pull request author, DarkSharpness, has provided a detailed description of the motivation, modifications, and extensive benchmarking results for the new RoPE kernel.
  • No external review comments or activity have been recorded yet.

@gemini-code-assist (bot) left a comment


Code Review

This is an excellent pull request that rewrites the RoPE kernel, removing the dependency on flashinfer and introducing a new, highly optimized custom CUDA kernel. The new implementation is well-structured and leverages modern C++ and CUDA features for performance. The test suite is comprehensive, covering a wide range of parameters and ensuring correctness. The benchmarks also demonstrate the performance benefits of the new kernel. I have a couple of minor suggestions to improve clarity and consistency in the benchmark and Python wrapper code.

@DarkSharpness DarkSharpness marked this pull request as draft February 14, 2026 16:51
@DarkSharpness (Collaborator, Author)

/tag-and-rerun-ci
