Extend process module #188

@maxbachmann

Currently the process module has the following functions:

| function | kind | explanation |
| --- | --- | --- |
| extractOne | one x many | returns the best match as (choice, score, index/key) |
| extract | one x many | returns the best matches until limit as list[(choice, score, index/key)] |
| extract_iter | one x many | generator yielding (choice, score, index/key). Usage is not really recommended, since it is far slower than the others |
| cdist | many x many | returns all results as numpy matrix |

It would be nice to have equivalents of extractOne / extract for many x many. They would need less memory than cdist, which can take a large amount of memory when len(queries) and len(choices) are large.

| function | kind | explanation |
| --- | --- | --- |
| - | many x many | returns the best matches as list[(choice, score, index)] |
| - | many x many | returns the best matches until limit as list[list[(choice, score, index)]] |
| - | one x many | returns all results without any sorting, like cdist |
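
To make the memory argument concrete, here is a minimal pure-Python sketch of the first two rows. The names `extract_one_many` and `extract_many` are hypothetical placeholders (the naming is exactly what this issue asks about), and `ratio` is a stand-in scorer built on the standard library's `difflib`, not one of the library's scorers. Only the current best results per query are kept, so no len(queries) x len(choices) matrix is ever materialized.

```python
import heapq
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer in the range 0-100 (assumption: not the library's scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def _scored(query, choices, scorer):
    # Lazily yield (choice, score, index) so nothing is stored per choice.
    for index, choice in enumerate(choices):
        yield choice, scorer(query, choice), index


def extract_one_many(queries, choices, scorer=ratio):
    """Hypothetical many x many extractOne: list[(choice, score, index)]."""
    # max() only remembers the best tuple seen so far for each query.
    # Note: raises ValueError if choices is empty.
    return [max(_scored(q, choices, scorer), key=lambda r: r[1]) for q in queries]


def extract_many(queries, choices, scorer=ratio, limit=5):
    """Hypothetical many x many extract: list[list[(choice, score, index)]]."""
    # heapq.nlargest keeps at most `limit` tuples in memory per query.
    return [
        heapq.nlargest(limit, _scored(q, choices, scorer), key=lambda r: r[1])
        for q in queries
    ]


queries = ["apple", "berry"]
choices = ["apple", "bery", "cherry"]
best = extract_one_many(queries, choices)
top = extract_many(queries, choices, limit=2)
```

With this shape the result size is O(len(queries) * limit) rather than O(len(queries) * len(choices)), which is the saving over cdist described above.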

A first thought might be to overload the existing extractOne / extract on the type passed as query / queries. However, this is not possible, since the following is a valid usage of these methods:

extractOne(["hello", "world"], [["hello", "world"]])

which cannot be distinguished from many x many. For this reason, these functions need a new API.
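
The ambiguity can be illustrated with a small sketch. It assumes a stand-in scorer based on `difflib.SequenceMatcher`, which, like the call above, accepts any sequence of hashable items; `extract_one` and `looks_like_many_queries` are hypothetical helpers written only for this illustration.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer (0-100) that works on any sequences of hashables."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_one(query, choices, scorer=ratio):
    """Simplified one x many lookup returning (choice, score, index)."""
    scored = (
        (choice, scorer(query, choice), index)
        for index, choice in enumerate(choices)
    )
    return max(scored, key=lambda r: r[1])


def looks_like_many_queries(query):
    """Naive type check that an overloaded API would have to rely on."""
    return isinstance(query, list) and all(isinstance(q, str) for q in query)


# A single query that happens to be a list of strings (one x many).
query = ["hello", "world"]
single = extract_one(query, [["hello", "world"], ["hello", "there"]])

# The same object also passes the "list of string queries" check,
# so type-based dispatch cannot tell the two intentions apart.
ambiguous = looks_like_many_queries(query)
```

Here `single` is a legitimate one x many result, yet `ambiguous` is true for the very same argument.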

Besides this, in many cases users are not actually interested in the exact score, but only care about finding elements with a score that is better than the score_cutoff. These could potentially be implemented more efficiently, since the implementation could quit as soon as it is known that the score is better than score_cutoff. Possible cases:

| function | kind | explanation |
| --- | --- | --- |
| - | many x many | returns matrix of bool |
| - | one x many | returns list of bool when there is a matching choice (e.g. https://stackoverflow.com/questions/70770842/matching-strings-within-two-lists/70780527#70780527) |

This could be automatically done when the user passes dtype=bool.
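
A rough sketch of the boolean variants, under the same assumptions as before: `ratio` is a `difflib`-based stand-in scorer, and `has_match`, `any_match_per_query` and `match_matrix` are hypothetical names. It only shows early exit across choices via `any()`; a real implementation could additionally stop inside the scorer once the cutoff decision is known.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer in the range 0-100 (assumption: not the library's scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def has_match(query, choices, score_cutoff, scorer=ratio):
    """True if at least one choice scores >= score_cutoff."""
    # any() short-circuits: scoring stops at the first choice that passes.
    return any(scorer(query, choice) >= score_cutoff for choice in choices)


def any_match_per_query(queries, choices, score_cutoff, scorer=ratio):
    """One bool per query: does any choice match it?"""
    return [has_match(q, choices, score_cutoff, scorer) for q in queries]


def match_matrix(queries, choices, score_cutoff, scorer=ratio):
    """Bool matrix as nested lists; every pair has to be scored here."""
    return [[scorer(q, c) >= score_cutoff for c in choices] for q in queries]


# Count scorer calls to make the early exit visible.
calls = []


def counting_scorer(a, b):
    calls.append((a, b))
    return ratio(a, b)


found = has_match("apple", ["apple", "bery", "cherry"], 90, scorer=counting_scorer)
per_query = any_match_per_query(["apple", "zzz"], ["apple", "bery"], 90)
matrix = match_matrix(["apple", "zzz"], ["apple", "bery"], 90)
```

In this run the counting scorer is invoked once, because the first choice already passes the cutoff.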

Any suggestions on the naming of these new APIs are welcome.
