abort_callback and encoder_begin_callback do not interrupt mid-computation (mid-transcription) #3718

@Econ01

Problem

The abort_callback field in whisper_full_params does not actually interrupt an in-progress transcription. It only fires after each encode/decode step completes, which makes it ineffective for real-time cancellation and makes it impossible to build a responsive stop feature in applications that use whisper_full(). It can also leave ghost processes behind in certain applications if we rely purely on the operating system's garbage collection (e.g. on macOS).

Root Cause

There are two overloads of the internal ggml_graph_compute_helper:

  1. The ggml_cgraph * version (line 169): Correctly accepts and wires abort_callback into ggml backend.
  2. The ggml_backend_sched_t version (line 191): Has no abort_callback parameter at all.

All actual encoder and decoder compute calls inside whisper_encode_internal and whisper_decode_internal use the second (sched) overload exclusively. As a result, abort_callback is never passed to the ggml backend during computation. The only places it fires are the post-hoc checks at the very end of those functions (lines 2447 and 2977), after all the work is already done.
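To make the difference concrete, here is a minimal, self-contained sketch of the two patterns. It is not whisper.cpp or ggml code: `run_result`, `compute_then_check`, `compute_with_checks`, and the `work` hook are hypothetical names invented for illustration, and each loop iteration merely stands in for one unit of graph computation.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for an abort callback: returns true to request abort.
using abort_fn = std::function<bool()>;
// Hypothetical stand-in for one unit of compute work (step index passed in).
using work_fn  = std::function<void(int)>;

struct run_result {
    int  steps_done; // how many work units were actually executed
    bool aborted;    // whether the abort request was observed
};

// Post-hoc pattern: all work runs first, the callback is consulted once at the end.
run_result compute_then_check(int n_steps, const work_fn & work, const abort_fn & should_abort) {
    assert(n_steps > 0);
    int done = 0;
    for (int i = 0; i < n_steps; ++i) {
        work(i);
        ++done;
    }
    return { done, should_abort() };
}

// Cooperative pattern: the callback is consulted before every work unit,
// so a stop request takes effect at the next step boundary.
run_result compute_with_checks(int n_steps, const work_fn & work, const abort_fn & should_abort) {
    assert(n_steps > 0);
    int done = 0;
    for (int i = 0; i < n_steps; ++i) {
        if (should_abort()) {
            return { done, true };
        }
        work(i);
        ++done;
    }
    return { done, false };
}
```

If a stop is requested while the third of ten steps is running, the post-hoc variant still executes all ten steps before reporting the abort, while the cooperative variant stops after three. The first variant corresponds to the behaviour described above; the second is roughly what passing the callback down to the backend is meant to achieve.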

Additionally, the main token sampling loop (whisper_full_with_state, line 6783) has no abort check at all. It runs up to n_text_ctx / 2 iterations with no opportunity to exit early.

encoder_begin_callback does work correctly: it fires before each audio chunk, but this only helps with multi-chunk audio. For short clips processed as a single chunk, by the time a user requests a stop the chunk is already being processed, and encoder_begin_callback will not fire again.

Proposed Fix

I propose 3 changes to whisper.cpp with no API changes:

  1. Add abort_callback support to the sched overload of ggml_graph_compute_helper, using the same ggml_backend_set_abort_callback pattern already present in the non-sched overload.
  2. Pass abort_callback and abort_callback_user_data through the ggml_graph_compute_helper(sched, ...) calls inside whisper_encode_internal (lines 2406, 2431, 2447) and inside whisper_decode_internal (line 2944). Note that whisper_decode_internal is called from 4 external sites, but lines 3940 and 8847 already pass nullptr and are unrelated to user-initiated abort.
  3. Add an abort_callback check at the top of the token sampling loop in whisper_full_with_state so it can exit between token generations.
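As a rough illustration of change 3, the sketch below shows an abort check at the top of a token loop. It is a standalone mock, not the actual whisper_full_with_state code: `sketch_params`, `sample_tokens`, and `abort_if_flag_set` are hypothetical names, the pushed integers stand in for a real decode-and-sample step, and the callback type only mirrors the "return true to abort" shape of ggml's abort callback.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Same shape as an abort callback taking opaque user data: return true to abort.
typedef bool (*abort_callback_t)(void * user_data);

// Hypothetical subset of the parameters relevant to cancellation.
struct sketch_params {
    abort_callback_t abort_callback;
    void *           abort_callback_user_data;
};

// Example callback: user_data points to an atomic flag set by another thread (e.g. a UI).
bool abort_if_flag_set(void * user_data) {
    return static_cast<std::atomic<bool> *>(user_data)->load();
}

// Hypothetical sampling loop with the abort check placed at the top of each iteration,
// so the loop can exit between token generations.
std::vector<int> sample_tokens(const sketch_params & params, int n_max, bool & aborted) {
    assert(n_max >= 0);
    std::vector<int> tokens;
    aborted = false;
    for (int i = 0; i < n_max; ++i) {
        if (params.abort_callback && params.abort_callback(params.abort_callback_user_data)) {
            aborted = true;
            break;
        }
        tokens.push_back(i); // stand-in for decoding and sampling one token
    }
    return tokens;
}
```

A null callback leaves the loop's behaviour unchanged, which is why this kind of check should not require any API change. The real loop would also need to decide what to return and how to clean up on abort, which this sketch does not cover.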

Discussion

Before implementing these, I wanted to check whether this is the right layer to fix it, or whether you would prefer the abort mechanism to live deeper in ggml. Also, do you have any concerns with the proposed approach?
