[Store] Add HybridStore with BM25 ranking for PostgreSQL #783

Open
ahmed-bhs wants to merge 13 commits into symfony:main from ahmed-bhs:feature/postgres-hybrid-search

@ahmed-bhs
Contributor

@ahmed-bhs ahmed-bhs commented Oct 15, 2025

| Q | A |
| --- | --- |
| Bug fix? | no |
| New feature? | yes |
| Docs? | no |
| License | MIT |

Problem

Vector search and full-text search each have limitations:

  • Vector search: Great for semantic similarity, but may rank exact term matches lower.
  • Full-text search: Great for exact matches, but misses semantic relationships.

Users often need both: "Find documents about space travel that mention Apollo"

Solution

New PostgreSQL HybridStore combining three search methods with Reciprocal Rank Fusion (RRF):

| Method | Extension | Purpose |
| --- | --- | --- |
| Semantic | pgvector | Conceptual similarity |
| Keyword | BM25 or native FTS | Exact term matching |
| Fuzzy | pg_trgm | Typo tolerance |
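
To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion. It is not the implementation from this PR: the function name `rrfFuse`, the way weights are applied, and the sample document IDs are assumptions for illustration. Each ranked list contributes `weight / (k + rank)` to a document's score, with ranks starting at 1.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the PR): fuses several ranked ID lists
 * with weighted Reciprocal Rank Fusion.
 *
 * @param array<string, list<string>> $rankings list name => IDs, best first
 * @param array<string, float>        $weights  list name => weight (defaults to 1.0)
 * @param int                         $k        smoothing constant
 *
 * @return array<string, float> ID => fused score, highest first
 */
function rrfFuse(array $rankings, array $weights = [], int $k = 60): array
{
    $scores = [];
    foreach ($rankings as $name => $ids) {
        $weight = $weights[$name] ?? 1.0;
        foreach (array_values($ids) as $position => $id) {
            $rank = $position + 1; // ranks are 1-based
            $scores[$id] = ($scores[$id] ?? 0.0) + $weight / ($k + $rank);
        }
    }
    arsort($scores); // sort by score, descending, keeping the IDs as keys

    return $scores;
}

// Assumed meaning of the ratio: the share of weight given to the vector ranking.
$semanticRatio = 0.3;

$fused = rrfFuse(
    [
        'vector' => ['doc-a', 'doc-b', 'doc-c'],
        'keyword' => ['doc-b', 'doc-d', 'doc-a'],
    ],
    ['vector' => $semanticRatio, 'keyword' => 1.0 - $semanticRatio],
    k: 10,
);

print_r($fused);
```

Because only ranks enter the formula, the lists can come from scorers with incompatible scales (cosine distance, BM25, trigram similarity) without any prior normalization.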

Why BM25 over native PostgreSQL FTS?

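PostgreSQL's native ranking functions (ts_rank, ts_rank_cd) score by term frequency and proximity, without corpus-wide IDF statistics, and have known limitations:

  • No document length normalization by default (long documents score higher unfairly)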
  • Term frequency grows unbounded (repeating a word 100x inflates score)

BM25 fixes these issues with term-frequency saturation and length normalization, which is why Lucene-based engines such as Elasticsearch use it as their default ranking function.
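
For reference, the textbook Okapi BM25 scoring function is shown below. This is the standard form; the exact variant and the parameter defaults used by plpgsql_bm25 are not stated in this PR, so treat the symbols as generic.

```latex
\operatorname{score}(D, Q) = \sum_{q_i \in Q} \operatorname{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the document length, and $\mathrm{avgdl}$ is the average document length in the collection. As $f(q_i, D)$ grows, the fraction approaches $k_1 + 1$ instead of growing without bound (saturation), and $b$ controls how strongly longer-than-average documents are penalized (length normalization).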

Text Search Strategy Selection

The behavior depends on which strategy the user explicitly configures:

  • PostgresTextSearchStrategy (default): uses ts_rank_cd and works with any PostgreSQL installation
  • Bm25TextSearchStrategy: uses plpgsql_bm25 for better ranking and requires the extension

// Native FTS (default, no extension required)
$store = new HybridStore($pdo, 'movies');

// BM25 for better ranking (requires plpgsql_bm25 extension)
$store = new HybridStore(
    $pdo,
    'movies',
    textSearchStrategy: new Bm25TextSearchStrategy(bm25Language: 'en')
);

If BM25 is configured but the extension is not installed, a clear RuntimeException is thrown with instructions to run ai:store:setup.

Automatic BM25 Installation

The plpgsql_bm25 extension is automatically installed when running:

php bin/console ai:store:setup ai.store.postgres.movies

This executes the bundled plpgsql_bm25.sql which creates all required functions (bm25topk, bm25createindex, etc.).

Features

  • Pluggable text search: BM25 or native PostgreSQL FTS
  • RRF fusion: Merges vector + keyword + fuzzy rankings
  • Configurable ratio: 0.0 = keyword-only → 1.0 = vector-only
  • Fuzzy matching: Typo tolerance via pg_trgm

Configuration Example

ai:
    platform:
        ollama:
            host_url: 'http://127.0.0.1:11434'

    store:
        postgres:
            movies:
                dsn: 'pgsql:host=localhost;dbname=hybrid_search'
                username: 'postgres'
                password: 'postgres'
                table_name: 'movies'
                vector_field: 'embedding'
                distance: cosine  # Cosine distance for normalized embeddings

                hybrid:
                    enabled: true
                    content_field: 'content'
                    semantic_ratio: 0.3        # 30% semantic, 70% BM25
                    language: 'english'
                    bm25_language: 'en'
                    text_search_strategy: 'bm25'
                    rrf_k: 10
                    default_min_score: 0
                    normalize_scores: true
                    fuzzy_weight: 0.4

Usage

$store = new HybridStore($pdo, 'movies', semanticRatio: 0.7);
$results = $store->query($vector, ['q' => 'space adventure', 'limit' => 10]);

@carsonbot carsonbot added the Feature, Store, and Status: Needs Review labels on Oct 15, 2025
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch 2 times, most recently from 3807878 to 8d4ccfe, on October 16, 2025 07:36
@chr-hertel chr-hertel requested a review from Copilot October 23, 2025 19:06

Copilot AI left a comment

Pull Request Overview

This PR introduces PostgresHybridStore, a new vector store implementation that combines semantic vector search (pgvector) with PostgreSQL Full-Text Search (FTS) using Reciprocal Rank Fusion (RRF), following Supabase's hybrid search approach.

Key changes:

  • Implements configurable hybrid search with adjustable semantic ratio (0.0 for pure FTS, 1.0 for pure vector, 0.5 for balanced)
  • Uses RRF algorithm with k=60 default to merge vector similarity and ts_rank_cd rankings
  • Supports multilingual content through configurable PostgreSQL text search configurations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/store/src/Bridge/Postgres/PostgresHybridStore.php | Core implementation of hybrid store with vector/FTS query building, RRF fusion logic, and table setup with tsvector generation |
| src/store/tests/Bridge/Postgres/PostgresHybridStoreTest.php | Comprehensive test coverage for constructor validation, setup, pure vector/FTS queries, hybrid RRF queries, and various configuration options |

Member

@chr-hertel chr-hertel left a comment

In general this is a super cool feature. Some Copilot findings seem valid to me, please check.

On top of that, I was unsure whether every sprintf call needs to be sprintf, or whether some values can/should be prepared-statement parameters. It'd be great to double-check that as well, please.
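
As an illustration of that distinction, here is a sketch that is not the PR's code: identifiers such as table and column names cannot be bound as parameters, so they are validated and interpolated with sprintf, while user-supplied values go through named placeholders. The function name `buildFtsQuery`, the `id` column, and the `content_tsv` tsvector column are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical query builder (not part of the PR). Identifiers are validated
 * and interpolated; values are returned separately so they can be bound.
 *
 * @return array{0: string, 1: array<string, string|int>} SQL and its parameters
 */
function buildFtsQuery(string $table, string $tsvectorColumn, string $language, string $query, int $limit): array
{
    foreach ([$table, $tsvectorColumn] as $identifier) {
        // Allow plain, unquoted SQL identifiers only.
        if (1 !== preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $identifier)) {
            throw new InvalidArgumentException(sprintf('Invalid identifier "%s".', $identifier));
        }
    }

    // sprintf is used only for the validated identifiers; each placeholder
    // appears exactly once, so the statement also works without emulation.
    $sql = sprintf(
        'SELECT id, ts_rank_cd(%2$s, q) AS score'
        .' FROM %1$s, plainto_tsquery(CAST(:language AS regconfig), :query) AS q'
        .' WHERE %2$s @@ q'
        .' ORDER BY score DESC'
        .' LIMIT :limit',
        $table,
        $tsvectorColumn,
    );

    return [$sql, ['language' => $language, 'query' => $query, 'limit' => $limit]];
}

[$sql, $params] = buildFtsQuery('movies', 'content_tsv', 'english', "space'; DROP TABLE movies; --", 10);

echo $sql, PHP_EOL;
```

With PDO, the returned parameters would then be passed to `PDOStatement::bindValue()`, using `PDO::PARAM_INT` for `limit`, so the search text never becomes part of the SQL string.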

ahmed-bhs added a commit to ahmed-bhs/symfony-ai-component-demo-application that referenced this pull request Oct 30, 2025
Side-by-side comparison of FTS, Hybrid (RRF), and Semantic search.
Uses Supabase (pgvector + PostgreSQL FTS).
30 sample articles with interactive Live Component.

Related: symfony/ai#783
Author: Ahmed EBEN HASSINE <ahmedbhs123@gmail.com>
@chr-hertel
Member

@ahmed-bhs could you please have a look at the pipeline failures? I think there are still some minor parts open.

@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch 3 times, most recently from 19623bb to 2c7b49a, on November 7, 2025 13:57
@ahmed-bhs ahmed-bhs changed the title from "[Store] Add PostgresHybridStore with RRF following Supabase approach" to "[Store] Add HybridStore with BM25 ranking for PostgreSQL" on Nov 23, 2025
@OskarStark
Contributor

Open to finishing this PR, @ahmed-bhs?

@ahmed-bhs
Contributor Author

Hi @OskarStark,

The work on my side is complete, you can take a look whenever you have time.

To give you a bit of context on the evolution of this work:

Initially, I wanted to propose a hybrid search implementation on PostgreSQL, combining semantic search (pgvector) with PostgreSQL’s native full-text ranking (ts_rank_cd).

However, that native ranking has well-known scoring limitations:

  • No length normalization by default: longer documents are unfairly favored
  • Unbounded term frequency: repeating a word 100 times artificially inflates the score

That’s why I then suggested using BM25 (the default ranking function in Lucene-based engines such as Elasticsearch), which addresses these issues through saturation and document length normalization.

BM25, however, requires the plpgsql_bm25 extension, which is not installed by default. So I implemented a fallback architecture:

  • Default: PostgresTextSearchStrategy using native FTS (works everywhere)
  • Optional: Bm25TextSearchStrategy for better ranking (requires the extension)

I also extracted the RRF (Reciprocal Rank Fusion) logic into a dedicated class for reusability.

Feel free to reach out if you have any questions or feedback!

@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch 2 times, most recently from 27954f9 to d7446d5, on November 26, 2025 04:12
Member

@chr-hertel chr-hertel left a comment

Hi @ahmed-bhs, coming back to this proposal now. I think it's great to promote Postgres more, since for quite a few folks it is a great fit with their existing infrastructure.

One thing I'd like to see here, though, is to identify and leverage synergies with the Symfony\AI\Store\Bridge\Postgres\Store implementation.

For example:

  1. Do we need two separate bundle configs, postgres and postgres_hybrid? I'd say hybrid could be a key below postgres instead.
  2. Let's extract some code into separate classes, please: see toPgvector + fromPgvector, or maybe the different queries?
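
For context on the second point, the conversion in question is small enough to live in its own unit. The sketch below is an assumption about what such helpers could look like, written as standalone functions rather than the bridge's actual methods; it relies only on pgvector's text format, a bracketed, comma-separated list of numbers.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical standalone version: PHP numbers to a pgvector text literal.
 *
 * @param list<float|int> $vector
 */
function toPgvector(array $vector): string
{
    return '['.implode(',', $vector).']';
}

/**
 * Hypothetical standalone version: pgvector text literal back to PHP floats.
 *
 * @return list<float>
 */
function fromPgvector(string $literal): array
{
    $inner = trim($literal, "[] \n\t");
    if ('' === $inner) {
        return [];
    }

    return array_map('floatval', explode(',', $inner));
}

echo toPgvector([0.1, 0.2, 1.0]), PHP_EOL;
```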

Thanks already!

@ahmed-bhs
Contributor Author

@chr-hertel I'll wait for the #1570 PR to be merged first. After that, I'll resume this one and rebase my work on top of it.

For reference, I've put together a small demo here:
https://github.com/ahmed-bhs/symfony-ai-hybrid-search-demo

@chr-hertel
Member

Merged #1570 today. Let's see if it works for your PR or whether we need some adaptations.

@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch 7 times, most recently from fdb9834 to 09c9ea2, on February 17, 2026 21:04
@chr-hertel
Member

Hey @ahmed-bhs, do you still want to continue here?

ahmed-bhs and others added 12 commits April 6, 2026 21:56
This PR introduces a new HybridStore implementation for PostgreSQL that combines:
- Semantic search via pgvector (cosine similarity)
- Full-text search via BM25 ranking (superior to ts_rank)
- Fuzzy matching via pg_trgm (typo tolerance)
- Reciprocal Rank Fusion (RRF) algorithm for optimal result ranking

Key Features:
- Configurable semantic_ratio (0.0-1.0) to balance vector vs text search
- BM25 text search strategy with language-specific stemming
- Field-specific boosting (title, overview, etc.)
- Score normalization to 0-100 range for better UX
- Fuzzy matching with configurable thresholds
- Comprehensive test suite with real-world examples

Configuration Example:
```yaml
ai:
    store:
        postgres:
            movies:
                dsn: 'pgsql:host=localhost;dbname=hybrid_search'
                table_name: 'movies'
                vector_field: 'embedding'
                distance: cosine
                hybrid:
                    enabled: true
                    content_field: 'content'
                    semantic_ratio: 0.3
                    text_search_strategy: 'bm25'
                    bm25_language: 'en'
                    rrf_k: 10
                    normalize_scores: true
                    searchable_attributes: []
```

Performance (31,944 movies dataset):
- Import: ~13 movies/sec with Ollama embeddings
- Search: 50-200ms depending on query complexity

Demo project with comparison:
https://github.com/ahmed-bhs/symfony-hybrid-search-comparison-postgres-typesense
Co-authored-by: Christopher Hertel <mail@christopher-hertel.de>
Co-authored-by: Christopher Hertel <mail@christopher-hertel.de>
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch from 09c9ea2 to 1ed2593 on April 6, 2026 19:58
@ahmed-bhs
Contributor Author

Hi @chr-hertel, yes! I just rebased on main and resolved all conflicts. What do you need from me to get this merged?

@chr-hertel
Member

Hi @ahmed-bhs, thanks for your patience and endurance on this one 🙏

The obvious open topics for me are the php-cs-fixer pipeline and that the example doesn't work on my end:
[screenshot]

I see the setup is being called, but I did not investigate further why the extension does not kick in. Any clue what's missing there?

