add initial implementation of parallel encoding #1254

Merged. taku910 merged 11 commits into master from paralle-encoding on Jun 2, 2026

@taku910 taku910 commented May 28, 2026

Collaborator

Added an initial implementation of parallel encoding, synced from the internal repos.

taku910 and others added 11 commits May 28, 2026 17:59
WORD tokenization emits pieces with the escaped whitespace prefix, so
user_defined_symbols must include the prefixed form during training.

Verify that a user-defined symbol is encoded as its own piece instead of
unk when training a word model.

Train a word model with a user-defined symbol and assert encode output
uses the symbol piece rather than unk.

Keep the end-to-end piece output assertion while the C++ regression test
covers id-level behavior for user_defined_symbols in WORD models.
…ed-symbols

Fix user_defined_symbols encoding for WORD model
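
To make the commit messages above concrete, here is a toy sketch of why the prefixed form matters. It is not the SentencePiece implementation and uses no SentencePiece API. The helpers `word_pieces`, `build_vocab`, and `encode`, and the `prefix_symbols` flag, are hypothetical names invented for illustration. It assumes a WORD-style tokenizer that splits on whitespace and prefixes every word with the escaped whitespace marker "▁" (U+2581).

```python
# Toy illustration only: NOT SentencePiece code. All helper names are hypothetical.
WS = "\u2581"  # the escaped-whitespace marker


def word_pieces(text: str) -> list[str]:
    """Split on whitespace and prefix each word with the marker, WORD-model style."""
    return [WS + word for word in text.split()]


def build_vocab(corpus: list[str], user_defined_symbols: list[str],
                prefix_symbols: bool) -> dict[str, int]:
    """Build a piece -> id table; optionally register symbols in prefixed form."""
    vocab = {"<unk>": 0}
    for sym in user_defined_symbols:
        piece = WS + sym if prefix_symbols else sym
        vocab.setdefault(piece, len(vocab))
    for line in corpus:
        for piece in word_pieces(line):
            vocab.setdefault(piece, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[str]:
    """Map each emitted piece to itself if known, otherwise to <unk>."""
    return [p if p in vocab else "<unk>" for p in word_pieces(text)]


corpus = ["hello world"]  # the symbol never appears in the training text
unprefixed = build_vocab(corpus, ["<sep>"], prefix_symbols=False)
prefixed = build_vocab(corpus, ["<sep>"], prefix_symbols=True)

print(encode("hello <sep> world", unprefixed))  # middle piece falls back to <unk>
print(encode("hello <sep> world", prefixed))    # middle piece is the symbol piece
```

In this toy setup the tokenizer only ever emits "▁<sep>", so a vocabulary holding the bare "<sep>" never matches and the symbol falls back to unk. Registering the prefixed form is what lets it come out as its own piece.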
@taku910 taku910 merged commit 0e9e7fb into master Jun 2, 2026
32 of 37 checks passed
@taku910 taku910 deleted the paralle-encoding branch June 3, 2026 04:41

2 participants