[Feature] rewrite rope kernel; remove flashinfer dependencies by DarkSharpness · Pull Request #18844 · sgl-project/sglang

DarkSharpness · 2026-02-14T13:42:57Z

Motivation

TL;DR: remove the flashinfer implementation.

Modifications

Implement rope and fused rope + store kv, for head_dim = 64, 128, 256, 512.
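
For readers unfamiliar with the two layouts selected by `is_neox`, the rotation can be sketched in plain Python. This is a reference sketch only, not the kernel in this PR: the function name `rope_reference`, the frequency base of 10000, and rotating the full `head_dim` are assumptions made for illustration, and the real kernel may read cos/sin values from a precomputed cache instead.

```python
import math


def rope_reference(x, position, is_neox, base=10000.0):
    """Return a rotated copy of one head vector `x` (list of floats, even length).

    Hypothetical helper: assumes rotary_dim == head_dim and base == 10000.
    """
    head_dim = len(x)
    half = head_dim // 2
    out = list(x)
    for i in range(half):
        # Rotation angle for the i-th pair at this token position.
        theta = position * base ** (-2.0 * i / head_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # NeoX style pairs element i with i + half;
        # interleaved style pairs element 2i with 2i + 1.
        a, b = (i, i + half) if is_neox else (2 * i, 2 * i + 1)
        out[a] = x[a] * cos_t - x[b] * sin_t
        out[b] = x[a] * sin_t + x[b] * cos_t
    return out


# Example: rotate a head_dim = 64 vector at position 5 in both layouts.
vec = [float(i + 1) for i in range(64)]
neox_out = rope_reference(vec, 5, is_neox=True)
interleaved_out = rope_reference(vec, 5, is_neox=False)
```

Both layouts apply the same 2D rotation to each pair and differ only in which elements form a pair, so each call preserves the vector norm.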

TODO: integrate it.

Accuracy Tests

Benchmarking and Profiling

Slightly improves performance (see the v2 columns in the tables below).

rope-performance

num_q_k_heads	is_neox	batch_size	FlashInfer	SGL RoPE v0	SGL RoPE v1	SGL RoPE v2
8,1	True	1	1.465905	1.688739	1.466271	1.050686
8,1	True	2	1.496625	1.721555	1.498425	1.052896
8,1	True	4	1.480840	1.699827	1.483009	1.047787
8,1	True	8	1.494758	1.779481	1.496295	1.094511
8,1	True	16	1.526189	1.797929	1.528095	1.116982
8,1	True	32	1.562607	1.835823	1.562705	1.160891
8,1	True	64	1.571138	1.846398	1.572661	1.162693
8,1	True	128	1.699871	1.979522	1.700327	1.084263
8,1	True	256	1.856436	2.144919	1.840478	1.447725
8,1	True	512	2.150097	2.587559	2.169062	1.852024
8,1	True	1024	2.682306	4.136819	2.693590	2.621214
8,1	True	2048	3.903368	6.687196	3.896879	3.857555
8,1	True	4096	6.068143	11.577039	6.048273	4.612054
8,1	True	8192	10.199478	21.367086	10.135196	6.557919
8,1	True	16384	20.660779	52.216889	20.606564	16.801611
8,1	True	32768	46.883762	118.681792	46.869635	41.676848
8,1	False	1	1.375527	1.699136	1.375482	1.061899
8,1	False	2	1.400020	1.732768	1.401262	1.067356
8,1	False	4	1.384943	1.702849	1.386458	1.061088
8,1	False	8	1.404279	1.767364	1.405318	1.094284
8,1	False	16	1.427348	1.807284	1.428448	1.126933
8,1	False	32	1.470702	1.842470	1.472000	1.162220
8,1	False	64	1.481351	1.846485	1.482182	1.173171
8,1	False	128	1.599877	1.971486	1.601497	1.098079
8,1	False	256	1.701565	2.113714	1.727705	1.422117
8,1	False	512	1.951569	2.571708	1.996244	1.853644
8,1	False	1024	2.326750	4.102251	2.328822	2.620368
8,1	False	2048	3.503764	6.673601	3.507369	3.861049
8,1	False	4096	5.066251	11.600890	5.048945	4.692637
8,1	False	8192	8.227467	21.366927	8.180892	6.819223
8,1	False	16384	19.668075	52.819877	19.620339	16.941540
8,1	False	32768	44.016149	119.019832	44.008486	42.057362
16,2	True	1	1.500070	1.992493	1.501611	1.053603
16,2	True	2	1.496236	1.977119	1.497117	1.046030
16,2	True	4	1.490956	1.982266	1.492431	1.093157
16,2	True	8	1.522642	2.074182	1.523706	1.120162
16,2	True	16	1.563096	2.088641	1.564316	1.171524
16,2	True	32	1.571848	2.102617	1.573702	1.177443
16,2	True	64	1.698376	2.146727	1.698157	1.259440
16,2	True	128	1.850087	2.268625	1.838429	1.457342
16,2	True	256	2.153008	2.431773	2.155328	1.853566
16,2	True	512	2.750640	3.372808	2.721494	2.615994
16,2	True	1024	3.876177	5.576473	3.862385	3.859709
16,2	True	2048	6.046595	9.303536	6.033190	4.594606
16,2	True	4096	10.149801	16.638324	10.125561	6.489297
16,2	True	8192	20.583990	38.158450	20.582179	14.529920
16,2	True	16384	45.494797	92.231773	45.441596	40.066955
16,2	True	32768	88.105310	181.552292	88.083548	77.640854
16,2	False	1	1.411144	1.998317	1.406448	1.054829
16,2	False	2	1.403795	1.991069	1.405814	1.057253
16,2	False	4	1.403175	1.999909	1.404495	1.116034
16,2	False	8	1.426203	2.083990	1.427465	1.118297
16,2	False	16	1.469662	2.101696	1.471450	1.154722
16,2	False	32	1.481905	2.132956	1.482911	1.179482
16,2	False	64	1.590724	2.148521	1.592570	1.269481
16,2	False	128	1.699139	2.231800	1.703096	1.442501
16,2	False	256	1.956735	2.417129	1.979635	1.854525
16,2	False	512	2.367511	3.291859	2.323186	2.619185
16,2	False	1024	3.340365	5.417764	3.340495	3.855119
16,2	False	2048	4.936565	9.128251	4.909766	4.713610
16,2	False	4096	8.060439	16.313987	7.999094	6.786168
16,2	False	8192	17.756383	36.414209	17.683300	14.569438
16,2	False	16384	41.868449	86.039222	41.837568	40.435153
16,2	False	32768	83.802112	168.793572	83.767932	78.007713
32,8	True	1	1.547714	2.565703	1.538442	1.094401
32,8	True	2	1.533345	2.582451	1.534082	1.095742
32,8	True	4	1.549325	2.593576	1.551083	1.118481
32,8	True	8	1.558672	2.652998	1.559190	1.141171
32,8	True	16	1.588127	2.645978	1.588875	1.178735
32,8	True	32	1.708854	2.698090	1.709635	1.284519
32,8	True	64	1.867873	2.727647	1.883694	1.495413
32,8	True	128	2.204598	2.843131	2.218983	1.954961
32,8	True	256	2.768494	3.317246	2.745605	2.772988
32,8	True	512	4.114256	5.104745	4.114662	3.952286
32,8	True	1024	6.439529	9.060432	6.435383	4.885927
32,8	True	2048	11.052336	16.165705	11.033120	6.997610
32,8	True	4096	23.494529	41.299627	23.489955	17.679204
32,8	True	8192	49.825750	93.401235	49.823003	43.197811
32,8	True	16384	96.854418	182.907113	96.843483	83.881681
32,8	True	32768	190.183927	362.858949	190.068465	166.388568
32,8	False	1	1.452906	2.577525	1.450533	1.072778
32,8	False	2	1.433787	2.593793	1.435516	1.112655
32,8	False	4	1.453016	2.598063	1.454140	1.129278
32,8	False	8	1.468181	2.685703	1.469687	1.161137
32,8	False	16	1.495218	2.670711	1.496430	1.182766
32,8	False	32	1.611368	2.731814	1.612246	1.287726
32,8	False	64	1.712166	2.747212	1.718795	1.475284
32,8	False	128	1.929803	2.861852	1.930260	1.951967
32,8	False	256	2.369279	3.309460	2.386835	2.773668
32,8	False	512	3.456697	4.971324	3.457744	3.947887
32,8	False	1024	5.134408	8.631698	5.120338	5.014669
32,8	False	2048	8.579153	15.383846	8.508154	7.178327
32,8	False	4096	21.057723	34.422907	21.035541	17.589021
32,8	False	8192	46.204745	74.307438	46.159339	43.452207
32,8	False	16384	91.796304	145.091873	91.847622	84.395289
32,8	False	32768	178.914035	285.918139	178.837525	166.952004
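
To read the table as relative speedups, the ratio of the FlashInfer column to the SGL RoPE v2 column can be computed per row. The sketch below uses four rows copied from the table above; the helper name `speedup` is an assumption for illustration, and because the table does not state its time unit, only the unit-free ratio is meaningful.

```python
# Rows copied from the rope-performance table:
# (num_q_k_heads, is_neox, batch_size, FlashInfer, SGL RoPE v2)
ROWS = [
    ("8,1", True, 1, 1.465905, 1.050686),
    ("8,1", True, 8192, 10.199478, 6.557919),
    ("32,8", False, 256, 2.369279, 2.773668),
    ("32,8", True, 32768, 190.183927, 166.388568),
]


def speedup(baseline, candidate):
    """Ratio above 1 means the candidate took less time than the baseline."""
    return baseline / candidate


speedups = {
    (heads, is_neox, batch): speedup(flashinfer, v2)
    for heads, is_neox, batch, flashinfer, v2 in ROWS
}

for key, ratio in speedups.items():
    print(key, f"{ratio:.2f}x")
```

A ratio below 1, as in the 32,8 non-NeoX row at batch size 256, marks a configuration where v2 is slower than FlashInfer.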

rope-store-performance

num_q_k_heads	is_neox	batch_size	SGL RoPE + Store v0	SGL RoPE + Store v1	SGL RoPE + Store v2
8,1	True	1	1.501530	1.069705	1.146772
8,1	True	2	1.855672	1.283424	1.147284
8,1	True	4	1.871656	1.303827	1.167137
8,1	True	8	1.924820	1.304778	1.195446
8,1	True	16	1.958561	1.338893	1.219059
8,1	True	32	2.004317	1.350598	1.258091
8,1	True	64	2.097294	1.426414	1.276438
8,1	True	128	2.201477	1.463960	1.393515
8,1	True	256	2.408261	1.698295	1.571760
8,1	True	512	2.938759	2.136045	2.012780
8,1	True	1024	3.786154	2.861506	2.735244
8,1	True	2048	5.386977	4.293168	4.116696
8,1	True	4096	6.829920	7.085961	5.167309
8,1	True	8192	10.183406	10.740274	7.273840
8,1	True	16384	27.338417	30.349545	24.180913
8,1	True	32768	52.692111	55.325235	47.226159
8,1	False	1	1.893212	1.101923	1.161132
8,1	False	2	1.869795	1.155493	1.124730
8,1	False	4	1.903683	1.173441	1.163394
8,1	False	8	1.940271	1.251988	1.194802
8,1	False	16	1.953642	1.265088	1.195717
8,1	False	32	2.074449	1.101788	1.251232
8,1	False	64	2.122945	1.289947	1.280454
8,1	False	128	2.222100	1.391274	1.382383
8,1	False	256	2.429395	1.574957	1.543677
8,1	False	512	2.924664	1.986138	1.975842
8,1	False	1024	3.795915	2.774116	2.872588
8,1	False	2048	5.371883	4.314102	4.118552
8,1	False	4096	6.939690	7.136623	5.093441
8,1	False	8192	10.404934	12.827360	7.206322
8,1	False	16384	27.702673	26.512637	23.768610
8,1	False	32768	52.997290	49.990181	46.448503
16,2	True	1	1.519934	1.244739	1.133463
16,2	True	2	1.879703	1.283250	1.174973
16,2	True	4	1.928291	1.292576	1.217183
16,2	True	8	1.938168	1.333630	1.213554
16,2	True	16	2.005221	1.363596	1.263786
16,2	True	32	2.097003	1.408978	1.296604
16,2	True	64	2.184783	1.469828	1.390116
16,2	True	128	2.411589	1.701521	1.579916
16,2	True	256	2.870595	2.122580	2.025193
16,2	True	512	3.729159	2.869728	2.905558
16,2	True	1024	5.029063	4.310820	3.929155
16,2	True	2048	6.135365	7.093782	5.169978
16,2	True	4096	8.830602	12.747241	7.129128
16,2	True	8192	24.566463	32.711641	22.889283
16,2	True	16384	49.337367	57.151719	45.760788
16,2	True	32768	96.290678	103.881654	89.097847
16,2	False	1	1.879413	1.138956	1.118635
16,2	False	2	1.911116	1.159444	1.163345
16,2	False	4	1.955946	1.196264	1.210980
16,2	False	8	1.942596	1.277328	1.193927
16,2	False	16	2.021247	1.294169	1.252746
16,2	False	32	2.119432	1.139106	1.275601
16,2	False	64	2.239173	1.387136	1.381653
16,2	False	128	2.421391	1.579352	1.548451
16,2	False	256	2.866294	1.965325	1.992771
16,2	False	512	3.732518	2.759365	2.866410
16,2	False	1024	5.039539	4.259567	4.094223
16,2	False	2048	6.220263	7.132135	5.100184
16,2	False	4096	9.056353	12.771845	7.088739
16,2	False	8192	24.813617	25.854980	22.488359
16,2	False	16384	49.744501	51.050385	44.826011
16,2	False	32768	96.879442	95.046490	87.288818
32,8	True	1	1.555933	1.277164	1.178751
32,8	True	2	2.041247	1.298421	1.221050
32,8	True	4	2.041440	1.315357	1.222321
32,8	True	8	2.021853	1.362363	1.258056
32,8	True	16	2.130635	1.408542	1.324837
32,8	True	32	2.264945	1.489806	1.406744
32,8	True	64	2.486641	1.728049	1.644317
32,8	True	128	3.033159	2.151436	2.182893
32,8	True	256	3.933767	3.070315	3.241516
32,8	True	512	5.504190	4.593622	4.450881
32,8	True	1024	7.116248	7.701377	5.064358
32,8	True	2048	10.899568	14.743538	8.543146
32,8	True	4096	30.605005	33.021432	27.836558
32,8	True	8192	61.512406	77.742546	53.569546
32,8	True	16384	119.979191	133.753835	105.092343
32,8	True	32768	235.423054	245.938993	207.621564
32,8	False	1	1.905801	1.142943	1.157248
32,8	False	2	2.058847	1.184524	1.211946
32,8	False	4	2.058650	1.211242	1.212214
32,8	False	8	2.088970	1.298377	1.234327
32,8	False	16	2.109849	1.300255	1.314047
32,8	False	32	2.242838	1.245760	1.391347
32,8	False	64	2.476000	1.625049	1.609899
32,8	False	128	3.010011	2.041704	2.143426
32,8	False	256	3.953410	2.906295	3.195842
32,8	False	512	5.485879	4.548299	4.372080
32,8	False	1024	7.296242	7.675740	5.296103
32,8	False	2048	11.167942	14.129414	8.386598
32,8	False	4096	30.868913	31.681977	27.283209
32,8	False	8192	61.956734	62.784673	52.545371
32,8	False	16384	120.109954	122.055160	102.835000
32,8	False	32768	235.621713	230.134886	203.241230

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-02-14T13:43:18Z

Summary of Changes

Hello @DarkSharpness, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the Rotary Position Embedding (RoPE) implementation within the SGLang JIT kernel. The core objective is to eliminate the reliance on the external flashinfer library by integrating a custom-built, highly optimized RoPE kernel. This new kernel not only efficiently applies RoPE but also introduces a fused operation that combines RoPE application with KV cache storage, resulting in notable performance gains across various configurations.

Highlights

FlashInfer Dependency Removal: The external flashinfer library's Rotary Position Embedding (RoPE) implementation has been replaced with a custom, optimized SGLang kernel.
New RoPE Kernel (v2): A new rope_v2 kernel has been introduced, capable of both standard RoPE application and a fused RoPE + KV cache store operation, supporting head_dim values of 64, 128, 256, and 512.
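Performance Improvements: Benchmarking results show the new rope_v2 kernel running faster than the FlashInfer implementation and the earlier SGLang versions in most benchmarked configurations. In some non-NeoX runs at mid-range batch sizes (for example num_q_k_heads 32,8 at batch size 256), FlashInfer remains faster.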
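
The fused RoPE + KV cache store operation mentioned in the highlights can be sketched as a plain-Python reference. This is an illustration under stated assumptions, not the PR's kernel or its Python binding: the names `rotate` and `rope_and_store_reference`, the nested-list tensor layout, the per-token `slots` argument, and the frequency base of 10000 are all hypothetical.

```python
import math


def rotate(vec, position, is_neox, base=10000.0):
    """Rotated copy of one head vector (assumes rotary_dim == head_dim)."""
    head_dim = len(vec)
    half = head_dim // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / head_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = (i, i + half) if is_neox else (2 * i, 2 * i + 1)
        out[a] = vec[a] * c - vec[b] * s
        out[b] = vec[a] * s + vec[b] * c
    return out


def rope_and_store_reference(q, k, v, positions, slots, k_cache, v_cache, is_neox):
    """Apply RoPE to q and k in place, then write k and v into the cache.

    q, k, v are nested lists shaped [num_tokens][num_heads][head_dim].
    slots[t] is the (hypothetical) cache row assigned to token t.
    """
    for t, (pos, slot) in enumerate(zip(positions, slots)):
        q[t] = [rotate(head, pos, is_neox) for head in q[t]]
        k[t] = [rotate(head, pos, is_neox) for head in k[t]]
        # Fused step: the rotated K and the unrotated V go to the cache in the
        # same pass, instead of a separate store after the RoPE call.
        k_cache[slot] = [list(head) for head in k[t]]
        v_cache[slot] = [list(head) for head in v[t]]


# Example: 2 tokens, 1 head, head_dim = 4, written to cache rows 2 and 0.
q = [[[1.0, 2.0, 3.0, 4.0]], [[4.0, 3.0, 2.0, 1.0]]]
k = [[[0.5, 1.5, 2.5, 3.5]], [[3.5, 2.5, 1.5, 0.5]]]
v = [[[9.0, 8.0, 7.0, 6.0]], [[6.0, 7.0, 8.0, 9.0]]]
k_cache = [None] * 4
v_cache = [None] * 4
rope_and_store_reference(q, k, v, [0, 3], [2, 0], k_cache, v_cache, is_neox=True)
```

The apparent motivation for fusing is that K is rotated and stored in one pass rather than being written out and read back by a separate store step. The rope-store-performance table in the PR description compares the three versions of this fused path.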


Changelog

python/sglang/jit_kernel/benchmark/bench_rope.py
- Added a new benchmark script to evaluate the performance of the new rope_v2 kernel, comparing it against FlashInfer and previous SGLang RoPE implementations for both standalone RoPE and fused RoPE + KV cache store operations.
python/sglang/jit_kernel/csrc/elementwise/rope.cuh
- Removed the previous FlashInfer-based CUDA kernel implementation for Rotary Position Embedding.
python/sglang/jit_kernel/csrc/elementwise/rope_v2.cuh
- Added a new CUDA kernel (fused_rope_kernel and fused_rope_store_kernel) that implements optimized RoPE and fused RoPE with KV cache storage, supporting different head dimensions and NeoX/interleaved styles.
python/sglang/jit_kernel/include/sgl_kernel/utils.cuh
- Included <type_traits> header for C++ utilities.
- Added load_as and store_as templated utility functions for safe type-aware loading and storing from void pointers on the device.
python/sglang/jit_kernel/rope_v2.py
- Added Python binding functions (fused_rope_inplace and fused_rope_inplace_with_kvcache) to expose the new rope_v2 CUDA kernels to the Python API.
python/sglang/jit_kernel/tests/test_rope.py
- Updated the test suite to remove old benchmarking utilities and introduce new parameterized tests for the rope_v2 and fused_rope_store kernels, validating their correctness against FlashInfer's implementation and ensuring support for various configurations and position data types.

Activity

The pull request author, DarkSharpness, has provided a detailed description of the motivation, modifications, and extensive benchmarking results for the new RoPE kernel.
No external review comments or activity have been recorded yet.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This is an excellent pull request that rewrites the RoPE kernel, removing the dependency on flashinfer and introducing a new, highly optimized custom CUDA kernel. The new implementation is well-structured and leverages modern C++ and CUDA features for performance. The test suite is comprehensive, covering a wide range of parameters and ensuring correctness. The benchmarks also demonstrate the performance benefits of the new kernel. I have a couple of minor suggestions to improve clarity and consistency in the benchmark and Python wrapper code.

python/sglang/jit_kernel/benchmark/bench_rope.py

python/sglang/jit_kernel/rope_v2.py


DarkSharpness · 2026-02-15T06:14:48Z

/tag-and-rerun-ci

feat: rewrite rope kernel; remove flashinfer dependencies

fa7483c

DarkSharpness requested a review from BBuf as a code owner February 14, 2026 13:42

gemini-code-assist bot reviewed Feb 14, 2026


Update python/sglang/jit_kernel/rope_v2.py

d36849c

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

DarkSharpness marked this pull request as draft February 14, 2026 16:51

DarkSharpness added 2 commits February 15, 2026 14:04

feat: integrate into srt and clean up old rop

0e5721d

fix: minor fix shape

f6bc9be

DarkSharpness marked this pull request as ready for review February 15, 2026 06:14

DarkSharpness requested review from Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners February 15, 2026 06:14

github-actions bot added the run-ci label Feb 15, 2026

DarkSharpness wants to merge 4 commits into sgl-project:main from DarkSharpness:jit_rope
