[TokenizerSlow] replace_additional_special_tokens is not doing much #24276
Closed
Labels
Core: Tokenization (Internals of the library; Tokenization)
Description
Just flagging this, as the add_special_tokens method got pretty complicated: it now takes a kwarg, replace_additional_special_tokens, that is supposed to control whether the self._additional_special_tokens attribute is replaced.
For any tokenizer, setting it to True will remove the previous tokens from the list, but will not update the internal trie, so it has no effect on tokenization at all:
>>> from transformers import XLMRobertaTokenizer
>>> tokenizer_a = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
>>> tokenizer_a.add_special_tokens({"additional_special_tokens":["<//s>"]})
>>> tokenizer_a.additional_special_tokens
['<//s>']
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
>>> tokenizer_a.add_special_tokens({"additional_special_tokens": ["<///s>"]}, replace_additional_special_tokens=True)
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']

This will be addressed in #23909.
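To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual transformers implementation; the Trie class below is a simplified stand-in) of why replacing the special-tokens list without rebuilding the internal trie changes nothing: the trie, not the list, is what drives the splitting step of tokenization.

```python
class Trie:
    """Tiny stand-in for the tokenizer's internal trie: splits text
    on whichever registered tokens it still contains."""

    def __init__(self):
        self.tokens = set()

    def add(self, token):
        self.tokens.add(token)

    def split(self, text):
        # Naive scan: at each position, take the longest registered
        # token that matches, otherwise accumulate plain characters.
        out, buf, i = [], "", 0
        while i < len(text):
            match = max(
                (t for t in self.tokens if text.startswith(t, i)),
                key=len,
                default=None,
            )
            if match:
                if buf:
                    out.append(buf)
                    buf = ""
                out.append(match)
                i += len(match)
            else:
                buf += text[i]
                i += 1
        if buf:
            out.append(buf)
        return out


# Initial state: one additional special token, registered in both places.
special_tokens = ["<//s>"]
trie = Trie()
for tok in special_tokens:
    trie.add(tok)

# "Replacing" the special tokens, as replace_additional_special_tokens=True
# currently does: the list is swapped out and the new token is added...
special_tokens = ["<///s>"]
trie.add("<///s>")

# ...but the stale "<//s>" entry is never removed from the trie,
# so it is still split out exactly as before.
print(trie.split("This is a <//s>"))
```

Running the last line still yields `['This is a ', '<//s>']` even though `"<//s>"` is no longer in `special_tokens`, mirroring the tokenize output above.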