Description
Currently the process module has the following functions:
| function | kind | explanation |
|---|---|---|
| extractOne | one x many | returns the best match as (choice, score, index/key) |
| extract | one x many | returns the best matches until limit as list[(choice, score, index/key)] |
| extract_iter | one x many | generator yielding (choice, score, index/key). Usage is not really recommended, since it is far slower than the others |
| cdist | many x many | returns all results as numpy matrix |
It would be nice to have equivalents of extractOne / extract for many x many. They would need less memory than cdist, which can take a large amount of memory when len(queries) and len(choices) are large.
| function | kind | explanation |
|---|---|---|
| - | many x many | returns the best matches as list[(choice, score, index)] |
| - | many x many | returns the best matches until limit as list[list[(choice, score, index)]] |
| - | one x many | returns all results without any sorting, like cdist |
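
To make the memory argument concrete, here is a minimal sketch of what the first row of the table above could look like. The name `extract_one_many` is hypothetical, and the scorer is a `difflib`-based stand-in rather than one of the library's scorers. Only one result per query is kept, so memory grows with len(queries) instead of len(queries) * len(choices).

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer returning 0-100; not the library's implementation."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_one_many(queries, choices, scorer=ratio, score_cutoff=0.0):
    """Hypothetical many x many equivalent of extractOne.

    Returns one (choice, score, index) tuple per query, or None when no
    choice reaches score_cutoff.
    """
    results = []
    for query in queries:
        best = None
        for index, choice in enumerate(choices):
            score = scorer(query, choice)
            # Keep only the best candidate seen so far for this query.
            if score >= score_cutoff and (best is None or score > best[1]):
                best = (choice, score, index)
        results.append(best)
    return results


queries = ["hello", "wrld"]
choices = ["world", "hello", "help"]
best_matches = extract_one_many(queries, choices)
```

A variant with a `limit` parameter could keep a small heap per query instead of a single tuple, which would cover the second row of the table.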
A first thought might be to overload the existing extractOne / extract on the type passed as query / queries. However, this is not possible, since the following is a valid usage of these methods:
`extractOne(["hello", "world"], [["hello", "world"]])`
This is a one x many call whose query is itself a list, so it cannot be distinguished from a many x many call. For this reason these functions need a new API.
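
The ambiguity can be illustrated with a small sketch. It uses a `difflib`-based stand-in scorer (an assumption, not the library's scorer) that, like the call above, accepts any sequence of hashable items. The same arguments give different results depending on whether the first argument is read as one query or as a list of queries.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer returning 0-100 for any two sequences of hashables."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


query = ["hello", "world"]      # one query made of two tokens
choices = [["hello", "world"]]  # one choice made of two tokens

# Read as one x many: the whole list is compared with each choice.
one_x_many = [ratio(query, choice) for choice in choices]

# Read as many x many: every element of the list is its own query.
many_x_many = [[ratio(q, choice) for choice in choices] for q in query]
```

Both readings are legal for the same input, so the type of the first argument alone cannot select the behavior.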
Besides this, in many cases users are not actually interested in the exact scores, but only care whether there are elements with a score better than score_cutoff. These cases could potentially be implemented more efficiently, since the implementation could stop as soon as it is known that a score is better than score_cutoff. Possible cases:
| function | kind | explanation |
|---|---|---|
| - | many x many | returns matrix of bool |
| - | one x many | returns list of bool when there is a matching choice (e.g. https://stackoverflow.com/questions/70770842/matching-strings-within-two-lists/70780527#70780527) |
This could be done automatically when the user passes dtype=bool.
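
The early exit described above can be sketched as follows. The names `has_match` and `has_match_many` are hypothetical, the scorer is a `difflib`-based stand-in, and returning one flag per query is only one possible reading of the proposal. The call counter is there just to show that scoring stops at the first choice that reaches score_cutoff.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer returning 0-100; not the library's implementation."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def has_match(query, choices, score_cutoff, scorer=ratio):
    """Hypothetical helper: True if any choice reaches score_cutoff."""
    # any() short-circuits, so no further choices are scored after a hit.
    return any(scorer(query, choice) >= score_cutoff for choice in choices)


def has_match_many(queries, choices, score_cutoff, scorer=ratio):
    """Hypothetical helper: one bool per query."""
    return [has_match(q, choices, score_cutoff, scorer) for q in queries]


calls = []  # records every scorer invocation


def counting_ratio(a, b):
    calls.append((a, b))
    return ratio(a, b)


flags = has_match_many(
    ["apple", "zzz"], ["apple", "apply", "maple"], 90, counting_ratio
)
```

Here the first query stops after a single comparison, while the second has to scan all three choices before reporting no match.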
Any suggestions on the naming of these new APIs are welcome.