Skip to content

Fix for loading tokenizers using non-utf8 strings#55

Merged
tqchen merged 3 commits intomlc-ai:mainfrom
ThomasProg:utf8fix
Feb 24, 2025
Merged

Fix for loading tokenizers using non-utf8 strings#55
tqchen merged 3 commits intomlc-ai:mainfrom
ThomasProg:utf8fix

Conversation

@ThomasProg
Copy link
Copy Markdown
Contributor

After upgrading to Tokenizers 0.20.0 or higher, BPEs can now encode into strings with non-utf8 characters.
It makes the current version of tokenizers-cpp impossible to load a tokenizer created with a newer version of Tokenizers.

This pull request is to solve that issue, simply replacing std::string::from_utf8(), causing a previous exception, into String::from_utf8_lossy(), which allows non-utf8 strings.

@tqchen tqchen merged commit c290994 into mlc-ai:main Feb 24, 2025
@tqchen
Copy link
Copy Markdown
Contributor

tqchen commented Feb 24, 2025

Thanks @ThomasProg !this is now merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants