Fix bpe_gpt2_preprocess #5446

bobqianic wants to merge 12 commits into ggml-org:master from bobqianic:master

Conversation
There is a tokenization test that is failing with this change, i.e. the string …
It's odd, but this PR seems to undo the bug fix introduced in PR #3567, which aimed to address the issue reported in Issue #3502. I'll check whether reverting to the version before PR #3567 eliminates the error encountered in …

Edit: After rolling back to the version prior to PR #3567, the situation became even worse. The error rate surged to …
We might need to add some additional code to ensure our tokenizer aligns with the Falcon tokenizer's behavior. That said, there don't seem to be any major issues with this pull request. Echoing @apage43's comments in this discussion: to match the Falcon tokenizer exactly, we should consider developing specific functions aimed at this requirement, rather than altering the …
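For context on what the pre-tokenization step being discussed does: GPT-2-style BPE first splits the input on a fixed regex before any byte-pair merges are applied, and mismatches at this stage are one common source of tokenizer divergence. The sketch below is a simplified, ASCII-only approximation of OpenAI's published GPT-2 split pattern (the real pattern uses the Unicode categories `\p{L}` and `\p{N}`, which Python's stdlib `re` does not support); it is illustrative only, not the llama.cpp or Falcon implementation.

```python
import re

# ASCII-only approximation of the GPT-2 pre-tokenization regex.
# The original pattern uses \p{L} / \p{N} (any letter / any number),
# which need the third-party `regex` module; [A-Za-z] / [0-9] stand in here.
GPT2_SPLIT_ASCII = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"  # common English contractions
    r"| ?[A-Za-z]+"             # optional leading space + run of letters
    r"| ?[0-9]+"                # optional leading space + run of digits
    r"| ?[^\sA-Za-z0-9]+"       # optional leading space + punctuation run
    r"|\s+(?!\S)"               # trailing whitespace kept together
    r"|\s+"                     # any remaining whitespace
)

def pre_tokenize(text: str) -> list[str]:
    """Split text into pre-tokens; BPE merges would run on each piece."""
    return GPT2_SPLIT_ASCII.findall(text)

print(pre_tokenize("Hello world's test"))
# → ['Hello', ' world', "'s", ' test']
```

Note how each pre-token keeps its leading space; a tokenizer that splits whitespace away instead will disagree on almost every multi-word input, which is the kind of behavioral mismatch discussed above.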
This time, I will conduct some stricter tests, using a test set that is …

Edit: Great news. No errors were detected!
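A stricter test of the kind described above can be as simple as running two tokenizers over a corpus and counting disagreements. The harness below is a generic sketch; `reference_tokenize` and `candidate_tokenize` are hypothetical stand-ins for a reference tokenizer and the implementation under test, and the toy tokenizers in the demo are placeholders, not real llama.cpp or Falcon APIs.

```python
def tokenizer_error_rate(texts, reference_tokenize, candidate_tokenize):
    """Fraction of inputs on which the candidate disagrees with the reference."""
    if not texts:
        return 0.0
    mismatches = sum(
        1 for text in texts
        if candidate_tokenize(text) != reference_tokenize(text)
    )
    return mismatches / len(texts)

# Toy demo: whitespace split vs. lowercasing whitespace split.
ref = lambda s: s.split()
cand = lambda s: s.lower().split()
rate = tokenizer_error_rate(["Hello World", "all lower", "MiXeD"], ref, cand)
print(f"{rate:.2%}")  # → 66.67% (two of three inputs disagree)
```

In practice the reference side would be the upstream tokenizer for the model in question, and the test set would be large and adversarial (unusual whitespace, punctuation runs, non-ASCII text), since those are exactly the cases where pre-tokenization regexes diverge.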
Ok, got it. We should however fix the test - maybe just use the new set of tokens and add a short comment explaining why there is a difference (or just link to this discussion)
Btw, #5464 claims to fix some issues in the BPE preprocess function - might be useful
I attempted to build the PR you mentioned, but unfortunately it didn't succeed.
Good, nearly all of Falcon's tokenizer tests have failed, which indicates that our implementation of …
This is a byproduct of the work on whisper.cpp. Could anyone run a few tests to verify whether it resolves the issue detailed in #3502? @ggerganov
In the development of whisper.cpp, I've performed tests on a relatively large dataset, and the results have been quite promising. You can find more details about this testing at ggml-org/whisper.cpp#1854.