v5.4.0

@pudo pudo released this 13 May 12:35

Note: Upgrading to this release triggers a full index rebuild (search-side analyzer changes for weak aliases and de-duplicated name fields).

The headline of this release is performance: with the upgraded rigour and nomenklatura stacks, end-to-end matching throughput on the /match endpoint is roughly 50% higher. Scoring quality has improved on several axes alongside the speedup, and we've carefully guarded against regressions across our existing test corpora. The gain comes from two complementary changes upstream: a rewritten name-comparison core (nomenklatura#303, #305) and pre-match pruning that avoids scoring query × candidate name pairs that share no script or symbolic evidence. In practice that means lower latency under load and meaningfully reduced compute cost per million matches.
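To make the pruning idea concrete, here is a minimal sketch in plain Python. It is not the nomenklatura implementation: the script detection (first word of each letter's Unicode character name), the `symbols` lookup table, and the symbol IDs are all assumptions made for illustration. It only shows the general shape of skipping query × candidate pairs that share neither a script nor a symbol before any expensive scoring happens.

```python
import unicodedata
from itertools import product


def scripts(name: str) -> frozenset[str]:
    """Rough script detection: first word of each letter's Unicode name."""
    found = set()
    for ch in name:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"; unnamed characters are skipped.
            label = unicodedata.name(ch, "")
            if label:
                found.add(label.split(" ")[0])
    return frozenset(found)


def comparable(query: str, candidate: str, symbols: dict[str, set[str]]) -> bool:
    """Keep a pair only if it shares a script or at least one symbol."""
    if scripts(query) & scripts(candidate):
        return True
    return bool(symbols.get(query, set()) & symbols.get(candidate, set()))


def pruned_pairs(queries, candidates, symbols):
    """Yield only the query x candidate pairs worth passing to a scorer."""
    for query, candidate in product(queries, candidates):
        if comparable(query, candidate, symbols):
            yield query, candidate


# Hypothetical symbol table: name -> set of symbol IDs (made-up identifiers).
SYMBOLS = {
    "Ivan Petrov": {"NAME:ivan", "NAME:petrov"},
    "Иван Петров": {"NAME:ivan", "NAME:petrov"},
}

if __name__ == "__main__":
    candidates = ["Иван Петров", "Iwan Petrow", "محمد"]
    for pair in pruned_pairs(["Ivan Petrov"], candidates, SYMBOLS):
        print(pair)
```

In this toy setup the Cyrillic candidate survives because it shares symbols with the query, the Latin variant survives because it shares a script, and the Arabic name with no shared symbols is dropped before scoring.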

Alongside the speedup, this release tightens matching quality and observability:

  • Better cross-script handling. When a Latin-script candidate already carries the symbolic evidence of a same-script match, an Arabic / Cyrillic / etc. rendering of the same name no longer outranks it. This fixes a class of bugs where the wrong-script alias was being preferred despite identical symbols.

  • ES query weighting aligned with logic-v2. The Elasticsearch candidate-fetch stage now boosts typed matches (IDs, dates, addresses) and weights name fields in line with how the second-stage scorer actually values them, so the candidate set fed to scoring is closer to the final ranking.

  • Weak aliases are now indexed and queried. Names that previously only contributed at scoring time are visible to the candidate-fetch stage, improving recall for entities with many partial-name aliases.

  • Concurrent matcher execution. The /match endpoint now runs the full per-query matching pipeline concurrently rather than phase-by-phase, which improves utilisation when many queries are batched.

  • OpenTelemetry instrumentation. Yente now ships with OTel base instrumentation: automatic FastAPI and elasticsearch-py spans, plus manual spans around scoring and the OpenSearch provider. See the documentation on monitoring for how to configure it. For debugging, the README has a short note on running tracing on your local machine. Thanks @dimoschi for making that happen!

  • Reconciliation queries are validated. Recon API requests now go through a Pydantic model, so malformed queries payloads return a clear 422 instead of a downstream error (fixes #1127).

  • Error handling. FastAPI error handlers are now mounted properly, bulk indexing errors are surfaced with a sample failure, and InvalidData from value parsing is handled cleanly rather than 500-ing.

  • Security. The ReDoc bundle served from /openapi is now pinned with Subresource Integrity (#1113).

  • New measurement & validation harness. Two tools land alongside this release for users who want to track matching behaviour across versions:

    • yente's contrib/validation_report runs the API against a fixture set and produces an HTML report on accuracy and ID-match rates at a configurable threshold.
    • In nomenklatura, the new contrib/name_bench harness exercises the name-distance scorer against a labelled set of name pairs, with both an accuracy mode (confusion matrix, per-quality calibration check) and a perf mode (μs mean/p50/p95 plus a slowest-cases leaderboard).

    validation_report measures the whole-entity API surface; name_bench measures the name-distance primitive that sits underneath it. Both are aimed at making it easier to spot regressions and recalibrate thresholds when integrating a new yente version.
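As a rough illustration of what the two harnesses described above report, the sketch below computes a confusion matrix and accuracy at a configurable threshold, plus a mean/p50/p95 latency summary. It does not use or reproduce either contrib tool. The function names, the `(score, is_true_match)` input shape, and the sample data are assumptions for this example.

```python
from statistics import mean, quantiles


def confusion_at(threshold: float, scored: list[tuple[float, bool]]) -> dict[str, int]:
    """Count outcomes for (score, is_true_match) pairs at a given cut-off."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, is_match in scored:
        predicted = score >= threshold
        if predicted and is_match:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif is_match:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts: dict[str, int]) -> float:
    """Share of pairs classified correctly; 0.0 for an empty input."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


def latency_summary(samples_us: list[float]) -> dict[str, float]:
    """Mean, median and 95th percentile of timings in microseconds.

    Needs at least two samples, as required by statistics.quantiles.
    """
    cuts = quantiles(samples_us, n=100, method="inclusive")  # 99 cut points
    return {"mean": mean(samples_us), "p50": cuts[49], "p95": cuts[94]}


if __name__ == "__main__":
    # Made-up labelled scores, purely for demonstration.
    scored = [(0.9, True), (0.8, False), (0.6, True), (0.2, False)]
    for threshold in (0.5, 0.7):
        counts = confusion_at(threshold, scored)
        print(threshold, counts, accuracy(counts))
```

Running the same labelled set through an old and a new version and comparing the counts at each candidate threshold is one simple way to decide whether a threshold needs recalibrating after an upgrade.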

As usual, this release also contains the standard round of dependency bumps across the whole stack (FastAPI, uvicorn, opensearch-py, cryptography, python-multipart, followthemoney, rigour, nomenklatura).

Elasticsearch 8: end of the line

This is the last yente release that supports Elasticsearch 8. The next release will require an index server running Elasticsearch 9.x. If you haven't yet upgraded, please move your Elastic server to 8.19.x first, restart, then upgrade to 9.x; the transition is documented here. The docker-compose.yml shipped with this release pins ES 8.19.13 to give you a clean stepping stone.
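The upgrade order above can be expressed as a small decision helper. This is only a sketch of the sequence described in these notes, not a tool shipped with yente; the function names are hypothetical. The version string is assumed to be the value your cluster reports, for example in the `version.number` field of the Elasticsearch root endpoint response.

```python
def parse_version(number: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '8.19.13'."""
    major, minor, *_ = number.split(".")
    return int(major), int(minor)


def next_upgrade_step(number: str) -> str:
    """Suggest the next step on the 8.x -> 8.19.x -> 9.x path."""
    major, minor = parse_version(number)
    if major >= 9:
        return "already on 9.x or later"
    if major == 8 and minor >= 19:
        return "upgrade to 9.x"
    if major == 8:
        return "upgrade to 8.19.x, restart, then upgrade to 9.x"
    # Versions older than 8 are outside what these notes describe.
    return "not covered by these release notes"


if __name__ == "__main__":
    for version in ("8.14.3", "8.19.13", "9.0.1"):
        print(version, "->", next_upgrade_step(version))
```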