[Feature] Neo4j Knowledge Graph Integration for Document Processing and RAG Retrieval
Summary
Integrate Neo4j as an optional knowledge graph backend to:
- Document processing: Update the knowledge graph when documents are processed (when Neo4j is available)
- RAG retrieval: Use Neo4j for graph-based retrieval during document Q&A (when Neo4j is available)
The integration should be optional and degrade gracefully: without Neo4j the system continues to work, falling back to the existing PostgreSQL-based knowledge graph and BM25+Vector ensemble.
Background / Current State
Existing Knowledge Graph (PostgreSQL)
The codebase already has a PostgreSQL-based knowledge graph:
| Component | Location | Purpose |
|---|---|---|
| Schema | src/server/db/schema/knowledge-graph.ts | kg_entities, kg_entity_mentions, kg_relationships |
| Entity extraction | src/lib/ingestion/entity-extraction.ts | Calls sidecar /extract-entities, stores in PostgreSQL |
| Graph retriever | src/lib/tools/rag/retrievers/graph-retriever.ts | Traverses 1–2 hops via Drizzle/SQL |
| Pipeline hook | src/lib/tools/doc-ingestion/index.ts | maybeExtractEntities() runs after storeDocument() when sidecar is available |
Current RAG Retrieval
- Ensemble: BM25 + Vector only (weights [0.4, 0.6])
- Graph retriever: Implemented but not wired into the ensemble
- Reranking: Optional sidecar cross-encoder when SIDECAR_URL is set
Document Processing Flow
Upload → Ingest → Chunk → Embed → Store (pgvector) → [Optional] Extract Entities → [Proposed] Sync to Neo4j
Motivation
- Graph-native traversal: Neo4j excels at multi-hop graph traversal and path queries; PostgreSQL uses recursive CTEs which can be slower on large graphs.
- Cypher expressiveness: Cypher allows concise, readable graph queries (e.g. variable-length paths, pattern matching).
- Scalability: For companies with large document corpora (100K+ entities), Neo4j can provide better query performance.
- Future extensibility: Enables graph algorithms (PageRank, community detection) and richer relationship types without schema changes.
Proposed Solution
1. Document Processing: Sync to Neo4j When Available
Trigger: After entity extraction completes (PostgreSQL write), if NEO4J_URI is configured, sync entities and relationships to Neo4j.
Data flow:
- Read from kg_entities, kg_entity_mentions, kg_relationships (or use the in-memory result from extraction)
- Batch-write to Neo4j using MERGE for idempotency
- Link nodes to documentId and companyId via properties for scoping
Cypher model (conceptual):
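// Statement 1: upsert the entity, its section, and the mention edge
MERGE (e:Entity {name: $name, label: $label, companyId: $companyId})
ON CREATE SET e.displayName = $displayName, e.confidence = $confidence
MERGE (s:Section {id: $sectionId, documentId: $documentId})
MERGE (e)-[m:MENTIONED_IN]->(s)
SET m.confidence = $conf
// Statement 2 (run separately): bind both entities first, otherwise MERGE creates new empty nodes.
// The $source*/$target* parameter names are illustrative.
MATCH (e1:Entity {name: $sourceName, label: $sourceLabel, companyId: $companyId})
MATCH (e2:Entity {name: $targetName, label: $targetLabel, companyId: $companyId})
MERGE (e1)-[r:CO_OCCURS]->(e2)
ON CREATE SET r.weight = 0.5, r.evidenceCount = 1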
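To make the "batch-write with MERGE" idea concrete, here is a minimal sketch that builds one parameterised UNWIND statement per batch instead of one round trip per entity. The EntityRow shape, the buildEntityUpserts helper, the string companyId and the default batch size are all assumptions for illustration, not existing code. Executing the statements through the driver is left out.

```typescript
// Hypothetical row shape; the real columns live in kg_entities.
interface EntityRow {
  name: string;
  label: string;
  displayName: string;
  confidence: number;
}

interface CypherStatement {
  cypher: string;
  params: Record<string, unknown>;
}

// One UNWIND + MERGE statement upserts a whole batch; re-running it is idempotent.
const UPSERT_ENTITIES = `
UNWIND $rows AS row
MERGE (e:Entity {name: row.name, label: row.label, companyId: $companyId})
ON CREATE SET e.displayName = row.displayName, e.confidence = row.confidence
`.trim();

// Split rows into fixed-size batches and pair each batch with the statement.
function buildEntityUpserts(
  rows: EntityRow[],
  companyId: string,
  batchSize = 500, // assumed default, to be tuned against real data
): CypherStatement[] {
  if (batchSize < 1) throw new Error("batchSize must be >= 1");
  const statements: CypherStatement[] = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    statements.push({
      cypher: UPSERT_ENTITIES,
      params: { companyId, rows: rows.slice(i, i + batchSize) },
    });
  }
  return statements;
}
```

Each returned statement could then be run in its own write transaction, so a failed batch can be retried without replaying the others.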
Implementation:
- Add src/lib/graph/neo4j-client.ts: Neo4j driver wrapper, connection pooling, health check
- Add src/lib/graph/neo4j-sync.ts: maps entities/relationships to Cypher MERGE statements
- In maybeExtractEntities() or a new step maybeSyncToNeo4j(), call sync after the PostgreSQL write
- Keep PostgreSQL as source of truth; Neo4j as optional read-optimized layer
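A rough sketch of how the optional step could stay gracefully degradable: it does nothing when NEO4J_URI is unset, and it reports a failed sync instead of throwing, since PostgreSQL remains the source of truth. The SyncDeps interface, the injected syncDocument and warn functions, and the string documentId are hypothetical and exist only so the step can be shown without a live Neo4j.

```typescript
type SyncOutcome = "skipped" | "synced" | "failed";

// Hypothetical dependencies, injected so the step can be exercised without Neo4j.
interface SyncDeps {
  env: Record<string, string | undefined>;
  syncDocument: (documentId: string) => Promise<void>; // e.g. backed by neo4j-sync.ts
  warn: (message: string) => void;
}

async function maybeSyncToNeo4j(
  documentId: string,
  deps: SyncDeps,
): Promise<SyncOutcome> {
  // Neo4j features are disabled entirely when NEO4J_URI is not configured.
  if (!deps.env["NEO4J_URI"]) return "skipped";
  try {
    await deps.syncDocument(documentId);
    return "synced";
  } catch (err) {
    // Never fail ingestion because of Neo4j: the PostgreSQL write already succeeded.
    const reason = err instanceof Error ? err.message : String(err);
    deps.warn(`Neo4j sync failed for document ${documentId}: ${reason}`);
    return "failed";
  }
}
```

In the pipeline this would be called after maybeExtractEntities(), passing process.env as env. The returned outcome could feed a metric or a later re-sync.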
2. RAG Retrieval: Neo4j-Aware Graph Retrieval
Trigger: When performing ensemble search, if ENABLE_GRAPH_RETRIEVAL is set, include a graph retriever: Neo4j when NEO4J_URI is configured, otherwise the existing PostgreSQL GraphRetriever as fallback.
Flow:
- Extract query terms from user question
- If Neo4j available: run Cypher traversal in Neo4j
- Else if graph enabled: use existing PostgreSQL GraphRetriever
- Else: skip graph retrieval
- Fuse graph results with BM25 + Vector via RRF (Reciprocal Rank Fusion)
- Optional reranking via sidecar
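The fusion step in this flow can be illustrated with a small weighted Reciprocal Rank Fusion sketch. Each retriever contributes weight / (k + rank) per result, and the contributions are summed per section. The function name, the use of plain section IDs, and k = 60 (a commonly used constant) are assumptions here. The actual ensemble implementation in the codebase may differ.

```typescript
// Each inner array is one retriever's result list, best match first.
// Assumes a list does not contain the same section ID twice.
function reciprocalRankFusion(
  rankings: string[][],
  weights: number[],
  k = 60, // smoothing constant; 60 is a common choice
): string[] {
  if (rankings.length !== weights.length) {
    throw new Error("one weight per ranking is required");
  }
  const scores = new Map<string, number>();
  rankings.forEach((ranking, i) => {
    const weight = weights[i] ?? 0;
    ranking.forEach((sectionId, position) => {
      // position is 0-based, so the rank is position + 1
      const gain = weight / (k + position + 1);
      scores.set(sectionId, (scores.get(sectionId) ?? 0) + gain);
    });
  });
  // Highest fused score first; ties keep first-seen order (stable sort).
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([sectionId]) => sectionId);
}
```

Because only ranks are used, the BM25, vector and graph retrievers do not need comparable score scales, which is convenient when a graph retriever returns unscored section IDs.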
Example Cypher for retrieval:
MATCH (e:Entity)
WHERE e.companyId = $companyId AND toLower(e.name) CONTAINS toLower($term)
MATCH (e)-[:MENTIONED_IN]->(s:Section)
WHERE s.documentId IN $documentIds
// DISTINCT: a section mentioned by several matching entities is returned once
WITH DISTINCT s LIMIT $topK
RETURN s.id AS sectionId
Implementation:
- Add src/lib/tools/rag/retrievers/neo4j-graph-retriever.ts: Neo4jGraphRetriever extending LangChain BaseRetriever
- Same interface as GraphRetriever: _getRelevantDocuments(query) returns Document[]
- Section content fetched from PostgreSQL documentSections (Neo4j stores section IDs only)
- Wire into createDocumentEnsembleRetriever, createCompanyEnsembleRetriever, createMultiDocEnsembleRetriever with configurable weight (e.g. [0.3, 0.5, 0.2] for BM25, Vector, Graph)
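The two-store lookup described above (section IDs from Neo4j, content from PostgreSQL) might look roughly like this. To keep the sketch self-contained it does not import LangChain. In the real retriever this logic would sit inside _getRelevantDocuments of a class extending BaseRetriever. RetrievedDoc is a stand-in for LangChain's Document. The querySectionIds and fetchSections functions, the SectionRow shape and the use of the whole question as the search term are assumptions.

```typescript
// Minimal stand-in for LangChain's Document (pageContent + metadata).
interface RetrievedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical row shape for documentSections.
interface SectionRow {
  id: string;
  documentId: string;
  content: string;
}

// Hypothetical dependencies: a Neo4j query and a PostgreSQL lookup.
interface GraphRetrieverDeps {
  querySectionIds: (term: string, topK: number) => Promise<string[]>;
  fetchSections: (ids: string[]) => Promise<SectionRow[]>;
}

async function retrieveViaGraph(
  query: string,
  topK: number,
  deps: GraphRetrieverDeps,
): Promise<RetrievedDoc[]> {
  // Simplification: the trimmed question is used as a single search term.
  const term = query.trim();
  if (term === "") return [];
  const ids = await deps.querySectionIds(term, topK);
  if (ids.length === 0) return [];
  const rows = await deps.fetchSections(ids);
  const byId = new Map(rows.map((row) => [row.id, row] as const));
  // A SQL IN (...) lookup does not preserve order, so restore the graph order here.
  const docs: RetrievedDoc[] = [];
  for (const id of ids) {
    const row = byId.get(id);
    if (row) {
      docs.push({
        pageContent: row.content,
        metadata: { sectionId: row.id, documentId: row.documentId, source: "neo4j-graph" },
      });
    }
  }
  return docs;
}
```

Section IDs that exist in Neo4j but no longer in PostgreSQL are silently dropped, which keeps stale graph data from surfacing deleted content.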
3. Configuration
| Env var | Required | Description |
|---|---|---|
| NEO4J_URI | No | Neo4j connection URI (e.g. neo4j://localhost:7687). If unset, Neo4j features are disabled. |
| NEO4J_USER | No | Neo4j username (default: neo4j) |
| NEO4J_PASSWORD | No | Neo4j password |
| ENABLE_GRAPH_RETRIEVAL | No | If true, include graph retriever in ensemble. Default: false until validated. |
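One possible way to turn these variables into a typed config, following the retrieval flow from section 2 (Neo4j if available, else PostgreSQL if graph retrieval is enabled, else none). The readGraphConfig name, the GraphConfig shape and the exact parsing of "true" are assumptions. The environment is passed in as a plain record so the function stays pure.

```typescript
type GraphBackend = "neo4j" | "postgres" | "none";

interface Neo4jSettings {
  uri: string;
  user: string;
  password: string | undefined;
}

interface GraphConfig {
  // Which graph retriever, if any, joins the ensemble.
  backend: GraphBackend;
  // Present whenever NEO4J_URI is set, so sync can run even if retrieval is off.
  neo4j: Neo4jSettings | undefined;
}

function readGraphConfig(env: Record<string, string | undefined>): GraphConfig {
  const uri = (env["NEO4J_URI"] ?? "").trim();
  // Assumption: only the literal "true" (case-insensitive) enables graph retrieval.
  const graphEnabled =
    (env["ENABLE_GRAPH_RETRIEVAL"] ?? "false").trim().toLowerCase() === "true";
  const neo4j: Neo4jSettings | undefined =
    uri === ""
      ? undefined
      : {
          uri,
          user: env["NEO4J_USER"] ?? "neo4j", // documented default
          password: env["NEO4J_PASSWORD"],
        };
  let backend: GraphBackend = "none";
  if (graphEnabled) backend = neo4j ? "neo4j" : "postgres";
  return { backend, neo4j };
}
```

Calling readGraphConfig(process.env) once at startup would give both the ingestion step and the ensemble factories a single place to check what is enabled.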
Alternatives Considered
| Approach | Pros | Cons |
|---|---|---|
| Stay with PostgreSQL only | No new infra, schema exists | Traversal may slow on large graphs; no Cypher |
| Apache AGE | Cypher in PostgreSQL, single DB | Less mature, operational unknowns |
| Neo4j as primary graph | Best graph performance | Migration from PostgreSQL KG, more infra |
| Neo4j as optional sync (chosen) | Gradual adoption, fallback to PG | Dual-write complexity, eventual consistency |
Risks and Trade-offs
Operational Complexity
- Risk: Neo4j is another service to deploy, monitor, and back up.
- Mitigation: Make Neo4j strictly optional. System works without it.
Data Consistency
- Risk: Dual-write (PostgreSQL + Neo4j) can diverge if Neo4j write fails.
- Mitigation: PostgreSQL is source of truth. Consider async sync via queue for resilience.
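As a rough illustration of the queue idea, the sketch below retries failed syncs a bounded number of times and parks jobs that keep failing, so they can be replayed later from PostgreSQL. It is an in-memory toy under stated assumptions: the Neo4jSyncQueue class, its method names and maxAttempts = 3 are hypothetical. A production version would need a persistent queue and backoff between attempts.

```typescript
interface SyncJob {
  documentId: string;
  attempts: number;
}

class Neo4jSyncQueue {
  private readonly pending: SyncJob[] = [];
  // Jobs that exhausted their attempts; candidates for a later re-sync.
  readonly deadLetters: SyncJob[] = [];
  private readonly sync: (documentId: string) => Promise<void>;
  private readonly maxAttempts: number;

  constructor(sync: (documentId: string) => Promise<void>, maxAttempts = 3) {
    this.sync = sync;
    this.maxAttempts = maxAttempts;
  }

  enqueue(documentId: string): void {
    this.pending.push({ documentId, attempts: 0 });
  }

  // Process jobs until the queue is empty; failed jobs go to the back of the queue.
  async drain(): Promise<void> {
    let job = this.pending.shift();
    while (job !== undefined) {
      try {
        await this.sync(job.documentId);
      } catch {
        const retry = { documentId: job.documentId, attempts: job.attempts + 1 };
        if (retry.attempts < this.maxAttempts) this.pending.push(retry);
        else this.deadLetters.push(retry);
      }
      job = this.pending.shift();
    }
  }
}
```

Because the Cypher writes use MERGE, retrying a job that partially succeeded should be safe.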
Graph Retriever Not Yet in Ensemble
- Note: The existing GraphRetriever is not wired into the ensemble. Phase 1 could be: wire PostgreSQL GraphRetriever first, measure impact, then add Neo4j.
Entity Extraction Quality
- Note: Graph retrieval is only as good as extracted entities. Current extraction uses NER + CO_OCCURS. Improving relation extraction may yield more benefit than Neo4j alone.
Phased Rollout (Recommended)
- Phase 1: Wire existing PostgreSQL GraphRetriever into ensemble. Measure recall/latency. Validate graph retrieval adds value.
- Phase 2: Add Neo4j sync in document processing. Keep PostgreSQL as source of truth.
- Phase 3: Implement Neo4jGraphRetriever; use when NEO4J_URI is set.
- Phase 4: Consider improving entity/relationship extraction before scaling.
Implementation Tasks
- Add neo4j-driver dependency
- Add src/lib/graph/neo4j-client.ts (driver, health check, connection handling)
- Add src/lib/graph/neo4j-sync.ts (entity/relationship sync from PostgreSQL to Neo4j)
- Add maybeSyncToNeo4j() step in document ingestion pipeline (after maybeExtractEntities)
- Add src/lib/tools/rag/retrievers/neo4j-graph-retriever.ts
Acceptance Criteria
- When NEO4J_URI is set and a document is processed with entity extraction, entities and relationships are synced to Neo4j
- When NEO4J_URI and ENABLE_GRAPH_RETRIEVAL are set, RAG queries include graph-based results in the ensemble
References
- src/server/db/schema/knowledge-graph.ts
- src/lib/tools/rag/retrievers/graph-retriever.ts