Summary
To maximize the value of the upcoming Neo4j graph integration (and the existing PostgreSQL graph), we need to improve the quality of the data entering the graph and how we query it. This sub-issue focuses on two key enhancements:
- Refining Document Ingestion: Moving beyond simple CO_OCCURS relationships by identifying richer, semantic relationships between entities, and refining/deduplicating the entities themselves.
- Query-Side Entity Extraction: Applying Named Entity Recognition (NER) to the user's query at runtime to identify precise seed nodes for graph traversal, improving retrieval accuracy.
Proposed Solution
1. Refine Entity & Relationship Extraction (Ingestion)
- Semantic Relationships: Update the sidecar /extract-entities logic (or the LLM prompt powering it) to identify specific relationship types rather than just co-occurrence. Examples include BELONGS_TO, DEPENDS_ON, REPORTS_TO, and SIMILAR_TO.
- Entity Refinement & Resolution: Introduce a deduplication or canonicalization step during ingestion. For example, resolving "AWS" and "Amazon Web Services" to the same underlying entity, or merging entities with high embedding similarity before syncing to the database/Neo4j.
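The two ingestion changes above could be sketched roughly as follows. This is a minimal illustration, not the actual sidecar or entity-extraction.ts code: the type names (RelationType, ExtractedEntity, ExtractedRelation), the alias table, the 0.9 similarity threshold, and both helper functions are hypothetical, and embeddings are assumed to be plain number arrays of equal length.

```typescript
// Hypothetical relationship vocabulary; the real list would come from the sidecar / LLM prompt.
type RelationType = "CO_OCCURS" | "BELONGS_TO" | "DEPENDS_ON" | "REPORTS_TO" | "SIMILAR_TO";

interface ExtractedEntity {
  name: string;
  embedding: number[]; // assumed: one embedding vector per extracted entity
}

interface ExtractedRelation {
  source: string;
  target: string;
  type: RelationType;
}

// Assumed alias table (lowercased alias -> canonical name). Illustrative only.
const ALIASES = new Map<string, string>([["aws", "Amazon Web Services"]]);

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Maps every extracted name to a canonical name. An entity is merged into an
// earlier one when its alias-resolved name matches, or when the embeddings are
// at least `threshold` similar. The first-seen entity wins as the canonical form.
function resolveEntities(entities: ExtractedEntity[], threshold = 0.9): Map<string, string> {
  const canonical: ExtractedEntity[] = [];
  const mapping = new Map<string, string>();
  for (const entity of entities) {
    const trimmed = entity.name.trim();
    const aliased = ALIASES.get(trimmed.toLowerCase()) ?? trimmed;
    const match = canonical.find(
      (c) =>
        c.name.toLowerCase() === aliased.toLowerCase() ||
        cosineSimilarity(c.embedding, entity.embedding) >= threshold,
    );
    if (match) {
      mapping.set(entity.name, match.name);
    } else {
      canonical.push({ name: aliased, embedding: entity.embedding });
      mapping.set(entity.name, aliased);
    }
  }
  return mapping;
}

// Rewrites relation endpoints to canonical names, then drops self-loops and
// exact duplicates created by the merge.
function canonicalizeRelations(
  relations: ExtractedRelation[],
  mapping: Map<string, string>,
): ExtractedRelation[] {
  const seen = new Set<string>();
  const result: ExtractedRelation[] = [];
  for (const relation of relations) {
    const source = mapping.get(relation.source) ?? relation.source;
    const target = mapping.get(relation.target) ?? relation.target;
    if (source === target) continue;
    const key = `${source}|${relation.type}|${target}`;
    if (seen.has(key)) continue;
    seen.add(key);
    result.push({ source, target, type: relation.type });
  }
  return result;
}
```

Running resolution before the database/Neo4j sync means both stores only ever see canonical names. The pairwise comparison here is quadratic in the number of entities, so a real implementation would likely need an index or batching for large documents.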
2. Query-Side Entity Extraction (Retrieval)
- User Query NER: Before hitting the RAG ensemble, pass the user query through a lightweight extraction step (either a fast LLM call or a dedicated NER model in the sidecar) to identify key entities.
- Targeted Graph Traversal: Pass these extracted entities to the GraphRetriever and the proposed Neo4jGraphRetriever. Use these exact entity names/labels as the starting nodes for multi-hop graph traversals.
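The retrieval flow described above might look something like the sketch below. To stay self-contained it swaps the proposed LLM/NER call for a simple dictionary match against known entity names, and it uses an in-memory adjacency map in place of GraphRetriever / Neo4jGraphRetriever. The extra knownEntities parameter, the Edge and Graph types, and traverseFromSeeds are all hypothetical.

```typescript
interface Edge {
  target: string;
  type: string;
}

// Assumed in-memory stand-in for the graph store: node name -> outgoing edges.
type Graph = Map<string, Edge[]>;

// Stand-in for the NER step: returns the known entity names that appear in the
// query as whole tokens (case-insensitive). A real version would call a fast
// LLM or the sidecar's NER model instead.
function extractQueryEntities(query: string, knownEntities: string[]): string[] {
  const normalized = query.toLowerCase();
  return knownEntities.filter((name) => {
    // Escape regex metacharacters so names like "c++" match literally.
    const escaped = name.toLowerCase().replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp(`(^|[^a-z0-9])${escaped}([^a-z0-9]|$)`).test(normalized);
  });
}

// Breadth-first expansion from the seed nodes, following outgoing edges for at
// most `maxHops` steps. Returns every node reached, including the seeds.
function traverseFromSeeds(graph: Graph, seeds: string[], maxHops: number): Set<string> {
  const visited = new Set<string>(seeds);
  let frontier = Array.from(visited);
  for (let hop = 0; hop < maxHops && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const edge of graph.get(node) ?? []) {
        if (!visited.has(edge.target)) {
          visited.add(edge.target);
          next.push(edge.target);
        }
      }
    }
    frontier = next;
  }
  return visited;
}
```

Seeding the traversal with extracted entities keeps the search anchored to nodes the user actually named. The hop limit bounds how far the retriever wanders from those seeds.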
Implementation Tasks
- Update src/lib/ingestion/entity-extraction.ts (and the sidecar) to output semantic relationship types.
- [...] kg_entities.
- Create an extractQueryEntities(query: string) utility function to process user queries.
- Update the retrievers (src/lib/tools/rag/retrievers/graph-retriever.ts and the future Neo4j retriever) to accept extracted entities as search parameters.
Acceptance Criteria
- Relationships other than CO_OCCURS are successfully identified and stored in PostgreSQL/Neo4j.