Fix for loading tokenizers using non-utf8 strings by ThomasProg · Pull Request #55 · mlc-ai/tokenizers-cpp

ThomasProg · 2025-01-13T14:23:56Z

After upgrading to Tokenizers 0.20.0 or higher, BPEs can now encode into strings with non-utf8 characters.
It makes the current version of tokenizers-cpp impossible to load a tokenizer created with a newer version of Tokenizers.

This pull request is to solve that issue, simply replacing std::string::from_utf8(), causing a previous exception, into String::from_utf8_lossy(), which allows non-utf8 strings.

…n utf8 data error in case of BPE use

tqchen · 2025-02-24T15:35:09Z

Thanks @ThomasProg !this is now merged

ThomasProg added 3 commits January 13, 2025 23:03

std::str::from_utf8() replaced by String::from_utf8_lossy() to fix no…

0658eef

…n utf8 data error in case of BPE use

Merge remote-tracking branch 'upstream/main' into utf8fix

873f641

fixed invalid type

085a8ca

tqchen merged commit c290994 into mlc-ai:main Feb 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for loading tokenizers using non-utf8 strings#55

Fix for loading tokenizers using non-utf8 strings#55
tqchen merged 3 commits intomlc-ai:mainfrom
ThomasProg:utf8fix

ThomasProg commented Jan 13, 2025

Uh oh!

tqchen commented Feb 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ThomasProg commented Jan 13, 2025

Uh oh!

tqchen commented Feb 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants