ggml-cuda : add rope f16, restore performance with parallel decoding#3272
Merged
ggerganov merged 4 commits into custom-attention-mask from Sep 20, 2023