Unusual dropping of sentences, infinite duplication of a sentence for longer recordings. #3729

@vincenthawke

Input video sample: https://mega.nz/file/qF0RjQra#qrzSmYrg-ou21bIZoQmTsi9f2W9ooDXy5muXqDwB2_Y

Whisper was run with:
whisper-server.exe -m models\ggml-large-v3.bin --host 0.0.0.0 --port 8080 --convert

Output:

{
    "text": " To je Dnevnik Televizije Slovenija.\n 
Potem v kodi Zinez Kočar.\n 
Zgodba o dobrodelnem rokovskem festivalu na Velikem Platnu.\n 
Dokumentarni film premjerno predstavili v Cankarjevem domu.\n 
Lepo pozdravljeni. Več kot 30 tisoč ljudi se je danes zbr\nalo v planici.\n 
Zgodba o dobrodelnem rokovskem festivalu na Velikem Platnu.\n 
Premljali so dramatično dogajanje na letalnici, a se tudi\n 
veselili nove zmage za Slovenijo.\n 
Med moškimi skakalnimi ekipami so po težavah v naši ekipi\n 
slavili avstrici,\n 
slovenci so bili peti, med skakavkami pa je že štiridesetič\n 
na najviše stopničko stopila Nika Prevce.\n"
}

Expected output, according to the video's native subtitles:

00:03 - To je Dnevnik Televizije Slovenija.
00:08 - Prva ženska tekma v smučarskih poletih pod Poncami. Zmagala je Nika
00:12 - Prevc, letos skoraj nepremagljiva šampionka.
00:16 - Po hudi ujmi na severu Slovenije čas za oceno škode in hitro obnovo.
00:20 - Veter najbolj pustošil v občinah Žirovnica in Tržič.
00:25 - Kako in kdaj lahko izseliš najemnika, ki ne plačuje ne stroškov
00:28 - ne najemnine in povzroča škodo? O tem v Kodi z Ines Kočar.
00:33 - Zgodba o dobrodelnem rokovskem festivalu na vélikem platnu.
00:36 - Dokumentarni film premierno predstavili v Cankarjevem domu.
00:56 - Lepo pozdravljeni. Več kot 30 tisoč ljudi se je danes zbralo v Planici.
01:00 - Spremljali so dramatično dogajanje na letalnici, veselili so se nove zmage
01:04 - za Slovenijo. Med moškimi skakalnimi ekipami so po težavah v naši ekipi
01:09 - slavili Avstrijci, Slovenci so bili peti. Med skakalkami pa
01:13 - je že 40-ič na najvišjo stopničko stopila Nika Prevc.
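
To quantify which subtitle sentences never made it into the transcript, a rough comparison along these lines may help. This is only a minimal sketch: the helper names `sentences` and `missing_sentences` and the 0.6 similarity cutoff are illustrative assumptions, not part of whisper.cpp, and the sample strings are a shortened excerpt of the text above.

```python
import difflib
import re


def sentences(text: str) -> list[str]:
    """Collapse all whitespace, then split after sentence-ending punctuation."""
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.?!])\s+", flat) if s]


def missing_sentences(reference: str, hypothesis: str, cutoff: float = 0.6) -> list[str]:
    """Return reference sentences with no sufficiently similar sentence in the hypothesis.

    The cutoff is an assumed threshold; small spelling differences should still match.
    """
    hyp = sentences(hypothesis)
    return [
        ref
        for ref in sentences(reference)
        if not difflib.get_close_matches(ref, hyp, n=1, cutoff=cutoff)
    ]


# Shortened excerpt: three subtitle sentences vs. two transcribed ones.
reference = (
    "To je Dnevnik Televizije Slovenija. "
    "Veter najbolj pustošil v občinah Žirovnica in Tržič. "
    "Dokumentarni film premierno predstavili v Cankarjevem domu."
)
hypothesis = (
    " To je Dnevnik Televizije Slovenija.\n"
    " Dokumentarni film premjerno predstavili v Cankarjevem domu.\n"
)

print(missing_sentences(reference, hypothesis))
```

Fuzzy matching is used instead of exact comparison so that minor transcription differences (for example "premjerno" vs. "premierno") are not counted as dropped sentences.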

If I feed it a longer 30-minute audio recording, it starts repeating a sentence early on and keeps repeating it for the rest of the recording. I thought the issue was length, but I have another 1-minute video in the same language where it performs nearly perfectly. In the example above, it dropped whole segments of speech. I also re-encoded the video to rule out DRM (unlikely), but the results are the same: whole segments are missing.
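
For longer recordings, the repetition can be spotted automatically by counting identical lines in the returned `text` field. The following is a minimal sketch under stated assumptions: `repeated_lines` and its `min_len` parameter are hypothetical names chosen for illustration, and the sample is a shortened excerpt of the output above.

```python
from collections import Counter


def repeated_lines(text: str, min_len: int = 20) -> dict[str, int]:
    """Return transcript lines that occur more than once, with their counts.

    Lines shorter than min_len are ignored so that short fragments
    do not produce false positives.
    """
    lines = [ln.strip() for ln in text.split("\n")]
    counts = Counter(ln for ln in lines if len(ln) >= min_len)
    return {ln: n for ln, n in counts.items() if n > 1}


# Shortened excerpt of the "text" field from the server response.
sample = (
    " To je Dnevnik Televizije Slovenija.\n"
    " Zgodba o dobrodelnem rokovskem festivalu na Velikem Platnu.\n"
    " Dokumentarni film premjerno predstavili v Cankarjevem domu.\n"
    " Zgodba o dobrodelnem rokovskem festivalu na Velikem Platnu.\n"
)

print(repeated_lines(sample))
```

A count that grows with recording length would be consistent with the looping behaviour described here, though it does not by itself identify the cause.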

Inference was done over the HTTP endpoint:

curl -L '127.0.0.1:8080/inference' \
-F 'file=@"/C:/Users/User/temp/short_sample.mp4"' \
-F 'temperature="0.0"' \
-F 'temperature_inc="0.2"' \
-F 'response_format="json"' \
-F 'language="auto"'

Server log:

ffmpeg is available.
whisper_init_from_file_with_params_no_state: loading model from 'models\ggml-large-v3.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 2
whisper_init_with_params_no_state: backends   = 2
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load:        CUDA0 total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_backend_init_gpu: device 0: CUDA0 (type: 1)
whisper_backend_init_gpu: found GPU device 0: CUDA0 (type: 1, cnt: 0)
whisper_backend_init_gpu: using CUDA0 backend
whisper_init_state: kv self size  =   83.89 MB
whisper_init_state: kv cross size =  251.66 MB
whisper_init_state: kv pad  size  =    7.86 MB
whisper_init_state: compute buffer (conv)   =   37.69 MB
whisper_init_state: compute buffer (encode) =   55.35 MB
whisper_init_state: compute buffer (cross)  =    9.27 MB
whisper_init_state: compute buffer (decode) =  100.04 MB

OS: Windows 10.
