PreTrainedTokenizer (slow) strip tokens that are around unique_no_split_tokens #21120
Closed
Labels
Core: Tokenization (Internals of the library; Tokenization)
Description
System Info
- `transformers` version: 4.24.0
- Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
- Python version: 3.10.8
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:
1. Load a `PreTrainedTokenizer` that contains `unique_no_split_tokens`, e.g. `EleutherAI/gpt-j-6B`:
   ```python
   tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')
   ```
2. Use the tokenizer to split a string that contains one of the `unique_no_split_tokens`, e.g. `" <|extratoken_1|> "`:
   ```python
   print(tokenizer(" <|extratoken_1|> ").input_ids)
   ```

Expected behavior
The tokenizer should split the string into 3 tokens (`" "`, `"<|extratoken_1|>"` and `" "`) and give their ids, `[220, 50257, 220]`. This is the behavior of `PreTrainedTokenizerFast`.
But the actual behavior is that the slow `PreTrainedTokenizer` only gives the id of `"<|extratoken_1|>"`, i.e. `[50257]`: the whitespace around the no-split token is stripped.
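A minimal sketch of the slow tokenizer's `split_on_token` helper (simplified from `PreTrainedTokenizer` in transformers 4.24; not the exact library code, and the `AddedToken` handling is omitted) shows where the spaces are lost:

```python
def split_on_token(tok, text):
    # Simplified sketch of the slow tokenizer's split-on-no-split-token step.
    result = []
    split_text = text.split(tok)
    for i, sub_text in enumerate(split_text):
        # The fragment before the special token is right-stripped, and the
        # fragment after it is left-stripped, so adjacent whitespace vanishes.
        if i < len(split_text) - 1:
            sub_text = sub_text.rstrip()
        if i > 0:
            sub_text = sub_text.lstrip()
        if i == 0 and not sub_text:
            result.append(tok)
        elif i == len(split_text) - 1:
            if sub_text:
                result.append(sub_text)
        else:
            if sub_text:
                result.append(sub_text)
            result.append(tok)
    return result

print(split_on_token("<|extratoken_1|>", " <|extratoken_1|> "))
# prints ['<|extratoken_1|>'], i.e. the surrounding spaces are gone
```

Because the stripped fragments around `"<|extratoken_1|>"` become empty, only the special token itself reaches the sub-tokenization step, which matches the single id `[50257]` reported above.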