PreTrainedTokenizer (slow) strip tokens that are around unique_no_split_tokens #21120
Closed
Labels
Core: Tokenization (Internals of the library; Tokenization)
Description
System Info
- `transformers` version: 4.24.0
- Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
- Python version: 3.10.8
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:
1. Load a `PreTrainedTokenizer` that contains `unique_no_split_tokens`, e.g. `EleutherAI/gpt-j-6B`:
   ```python
   tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')
   ```
2. Use the tokenizer to split a string that contains one of the `unique_no_split_tokens`, e.g. `" <|extratoken_1|> "`:
   ```python
   print(tokenizer(" <|extratoken_1|> ").input_ids)
   ```

Expected behavior
The tokenizer should split the string into 3 tokens (`" "`, `"<|extratoken_1|>"` and `" "`) and give their ids, `[220, 50257, 220]`. This is the behavior of `PreTrainedTokenizerFast`.
But the actual behavior is that the slow `PreTrainedTokenizer` only gives the id of `"<|extratoken_1|>"`, i.e. `[50257]`: the whitespace around the no-split token is stripped.
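A minimal sketch of the slow tokenizer's `split_on_token` helper (simplified from `PreTrainedTokenizer` in transformers 4.24; not the exact library code, and the `AddedToken` handling is omitted) shows where the spaces are lost:

```python
def split_on_token(tok, text):
    # Simplified sketch of the slow tokenizer's split-on-no-split-token step.
    result = []
    split_text = text.split(tok)
    for i, sub_text in enumerate(split_text):
        # The fragment before the special token is right-stripped, and the
        # fragment after it is left-stripped, so adjacent whitespace vanishes.
        if i < len(split_text) - 1:
            sub_text = sub_text.rstrip()
        if i > 0:
            sub_text = sub_text.lstrip()
        if i == 0 and not sub_text:
            result.append(tok)
        elif i == len(split_text) - 1:
            if sub_text:
                result.append(sub_text)
        else:
            if sub_text:
                result.append(sub_text)
            result.append(tok)
    return result

print(split_on_token("<|extratoken_1|>", " <|extratoken_1|> "))
# prints ['<|extratoken_1|>'], i.e. the surrounding spaces are gone
```

Because the stripped fragments around `"<|extratoken_1|>"` become empty, only the special token itself reaches the sub-tokenization step, which matches the single id `[50257]` reported above.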