[Store] Add HybridStore with BM25 ranking for PostgreSQL #783
ahmed-bhs wants to merge 13 commits into symfony:main
Pull Request Overview
This PR introduces PostgresHybridStore, a new vector store implementation that combines semantic vector search (pgvector) with PostgreSQL Full-Text Search (FTS) using Reciprocal Rank Fusion (RRF), following Supabase's hybrid search approach.
Key changes:
- Implements configurable hybrid search with adjustable semantic ratio (0.0 for pure FTS, 1.0 for pure vector, 0.5 for balanced)
- Uses RRF algorithm with k=60 default to merge vector similarity and ts_rank_cd rankings
- Supports multilingual content through configurable PostgreSQL text search configurations
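To make the fusion step concrete, here is a minimal Python sketch of weighted Reciprocal Rank Fusion. It is not the PR's PHP implementation: the function name `rrf_fuse`, the use of the semantic ratio as a linear weight on each list, and the sample document ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(vector_ranking, fts_ranking, semantic_ratio=0.5, k=60):
    """Fuse two ranked id lists with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (k + rank) per document, rank starting at 1.
    Assumption: semantic_ratio weights the vector list and
    (1 - semantic_ratio) weights the full-text list.
    """
    if not 0.0 <= semantic_ratio <= 1.0:
        raise ValueError("semantic_ratio must be between 0.0 and 1.0")

    scores = defaultdict(float)
    weighted_lists = (
        (semantic_ratio, vector_ranking),
        (1.0 - semantic_ratio, fts_ranking),
    )
    for weight, ranking in weighted_lists:
        if weight == 0.0:
            continue  # a list with no weight contributes no documents
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]  # hypothetical ids, best match first
    fts_hits = ["b", "d", "a"]
    for doc_id, score in rrf_fuse(vector_hits, fts_hits):
        print(doc_id, round(score, 5))
```

Because only ranks enter the formula, the two result lists never need comparable raw scores; a larger `k` flattens the difference between neighbouring ranks.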
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/store/src/Bridge/Postgres/PostgresHybridStore.php | Core implementation of hybrid store with vector/FTS query building, RRF fusion logic, and table setup with tsvector generation |
| src/store/tests/Bridge/Postgres/PostgresHybridStoreTest.php | Comprehensive test coverage for constructor validation, setup, pure vector/FTS queries, hybrid RRF queries, and various configuration options |
chr-hertel left a comment
In general this is a super cool feature - some copilot findings seem valid to me - please check.
On top, I was unsure if all sprintf need to be sprintf or some values can/should be a prepared parameter - that'd be great to double check as well please.
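As a generic illustration of the distinction raised here, the sketch below interpolates only an allow-listed identifier into the SQL text and binds every value as a parameter. It uses Python's standard-library `sqlite3` rather than PDO and PostgreSQL, and the function `search_titles`, the table names, and the sample rows are hypothetical.

```python
import sqlite3


def search_titles(conn, table, term, limit):
    """Interpolate only a validated identifier; bind every value."""
    # Identifiers (table or column names) cannot be bound as parameters,
    # so they are checked against an allow-list before interpolation.
    if table not in {"movies", "articles"}:
        raise ValueError(f"unexpected table name: {table!r}")

    sql = f"SELECT title FROM {table} WHERE title LIKE ? ORDER BY title LIMIT ?"
    # Values travel separately from the SQL text, so a quote inside `term`
    # cannot change the structure of the statement.
    rows = conn.execute(sql, (f"%{term}%", limit))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?)",
    [("Apollo 13",), ("Alien",), ("O'Brother",)],
)
print(search_titles(conn, "movies", "Apollo", 10))
```

The same split usually applies in PHP: table and column names stay in the formatted string after validation, while search terms, limits, and ratios go through placeholders.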
Side-by-side comparison of FTS, Hybrid (RRF), and Semantic search. Uses Supabase (pgvector + PostgreSQL FTS). 30 sample articles with interactive Live Component. Related: symfony/ai#783 Author: Ahmed EBEN HASSINE <ahmedbhs123@gmail.com>
@ahmed-bhs could you please have a look at the pipeline failures - I think there are still some minor parts open
Open to finish this PR, @ahmed-bhs?
Hi @OskarStark,
The work on my side is complete, you can take a look whenever you have time.
To give you a bit of context on the evolution of this work: initially, I wanted to propose a hybrid search implementation on PostgreSQL, combining semantic search (pgvector) with PostgreSQL’s native text search based on TF-IDF ( However, TF-IDF has well-known scoring limitations:
That’s why I then suggested using BM25 (the algorithm used by Elasticsearch, Meilisearch, Lucene), which addresses these issues through saturation and document length normalization. BM25, however, requires the
I also extracted the RRF (Reciprocal Rank Fusion) logic into a dedicated class for reusability.
Feel free to reach out if you have any questions or feedback!
chr-hertel left a comment
Hi @ahmed-bhs, coming back to this proposal now - I think it's great to promote Postgres more - since for quite some folks this is a great use case to combine it with given infrastructure.
One thing I'd like to see here tho is to identify and leverage synergies with Symfony\AI\Store\Bridge\Postgres\Store implementation.
For example:
- do we need to separate bundle configs with `postgres` and `postgres_hybrid` - I'd say `hybrid` could be a keyword below `postgres` instead.
- let's extract some code in separate classes please, see `toPgvector` + `fromPgvector` or the different queries maybe?
Thanks already!
@chr-hertel I'll wait for the #1570 PR to be merged first. After that, I'll resume this one and rebase my work on top of it. For reference, I've put together a small demo here:
Merged #1570 today - let's see if it works for your PR or if we need some adaptations
Hey @ahmed-bhs, still want to continue here?
This PR introduces a new HybridStore implementation for PostgreSQL that combines:
- Semantic search via pgvector (cosine similarity)
- Full-text search via BM25 ranking (superior to ts_rank)
- Fuzzy matching via pg_trgm (typo tolerance)
- Reciprocal Rank Fusion (RRF) algorithm for optimal result ranking
Key Features:
- Configurable semantic_ratio (0.0-1.0) to balance vector vs text search
- BM25 text search strategy with language-specific stemming
- Field-specific boosting (title, overview, etc.)
- Score normalization to 0-100 range for better UX
- Fuzzy matching with configurable thresholds
- Comprehensive test suite with real-world examples
Configuration Example:
```yaml
ai:
    store:
        postgres:
            movies:
                dsn: 'pgsql:host=localhost;dbname=hybrid_search'
                table_name: 'movies'
                vector_field: 'embedding'
                distance: cosine
                hybrid:
                    enabled: true
                    content_field: 'content'
                    semantic_ratio: 0.3
                    text_search_strategy: 'bm25'
                    bm25_language: 'en'
                    rrf_k: 10
                    normalize_scores: true
                    searchable_attributes: []
```
Performance (31,944 movies dataset):
- Import: ~13 movies/sec with Ollama embeddings
- Search: 50-200ms depending on query complexity
Demo project with comparison:
https://github.com/ahmed-bhs/symfony-hybrid-search-comparison-postgres-typesense
Co-authored-by: Christopher Hertel <mail@christopher-hertel.de>
Hi @chr-hertel, yes! I just rebased on main and resolved all conflicts. What do you need from me to get this merged?
Hi @ahmed-bhs, thanks for your patience and endurance on this one 🙏 The obvious open topics for me are the php-cs-fixer pipeline and that the example doesn't work on my end: I see the
[image]
Problem
Vector search and full-text search each have limitations:
Users often need both: "Find documents about space travel that mention Apollo"
Solution
New PostgreSQL HybridStore combining three search methods with Reciprocal Rank Fusion (RRF):
Why BM25 over native PostgreSQL FTS?
Native PostgreSQL uses TF-IDF, which has known limitations:
BM25 fixes these issues with saturation and length normalization — that's why Elasticsearch, Meilisearch, and Lucene all use it.
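The two properties named above can be seen in the standard Okapi BM25 per-term formula. The sketch below is a generic Python rendering of that formula, not the code of the `plpgsql_bm25` extension; the function name, the Lucene-style IDF variant, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen because they are common.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, n_docs_with_term,
                    k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document."""
    # Lucene-style IDF, which stays non-negative for very common terms.
    idf = math.log(
        1.0 + (n_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5)
    )
    # Length normalization: documents longer than average get a larger
    # denominator and therefore a lower score for the same tf.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Saturation: the fraction approaches (k1 + 1) as tf grows, so
    # repeating a term many times yields diminishing returns.
    return idf * tf * (k1 + 1.0) / (tf + norm)


if __name__ == "__main__":
    for tf in (1, 2, 10, 100):
        score = bm25_term_score(tf, doc_len=100, avg_doc_len=100,
                                n_docs=1000, n_docs_with_term=10)
        print(tf, round(score, 3))
```

With a raw term-frequency weight the score would keep growing linearly with `tf`; here it levels off, and setting `b=0` would switch the length normalization off entirely.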
Text Search Strategy Selection
The behavior depends on which strategy the user explicitly configures:
- `ts_rank_cd` - works with any PostgreSQL installation
- `plpgsql_bm25` for better ranking - requires extension
If BM25 is configured but the extension is not installed, a clear RuntimeException is thrown with instructions to run `ai:store:setup`.
Automatic BM25 Installation
The `plpgsql_bm25` extension is automatically installed when running:
This executes the bundled `plpgsql_bm25.sql` which creates all required functions (`bm25topk`, `bm25createindex`, etc.).
Features
- 0.0 = keyword-only → 1.0 = vector-only
- `pg_trgm`
Configuration Example
Usage
References