Complete guide to evaluating and optimizing RAGDOC retrieval quality using standard RAG metrics.
- Quick Start
- Understanding RAG Metrics
- Running Evaluations
- Interpreting Results
- Creating Custom Datasets
- Troubleshooting
RAGDOC must be fully set up with:
- ChromaDB database indexed
- Voyage AI and Cohere API keys configured
- Python environment with all dependencies
# 1. Generate synthetic test dataset (30 queries)
python scripts/generate_test_dataset.py --n_queries 30
# 2. Run evaluation with default settings
python tests/evaluate_ragdoc.py
# 3. View results
cat tests/results/evaluation_report_latest.md
Evaluation takes ~2-3 minutes for 30 queries.
Output files:
- evaluation_report_latest.md - Human-readable comparison report
- evaluation_detailed_latest.json - Full results for analysis
- evaluation_aggregate_latest.csv - Metrics table (Excel-friendly)
RAGDOC uses four standard Information Retrieval metrics (plus F1, derived from Precision and Recall) to measure retrieval quality.
What it measures: What proportion of relevant documents were found in the top-K results?
Formula: |relevant ∩ retrieved@K| / |relevant|
Example:
- Relevant documents: {doc2, doc4, doc7}
- Top-10 results: [doc1, doc2, doc3, doc4, doc5, ...]
- Found in top-10: {doc2, doc4} = 2 out of 3
- Recall@10 = 0.67 (67%)
Interpretation:
- 0.70-0.85: Good - Finding most relevant docs
- 0.85-0.95: Excellent - Finding almost all relevant docs
- 0.95-1.00: Outstanding - Finding all relevant docs
When to prioritize: When you need comprehensive coverage (research, legal)
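The formula can be sketched as a small Python helper (illustrative only, not RAGDOC's actual implementation):

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-K results."""
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)

# Example from above: 2 of 3 relevant docs appear in the top-10
relevant = {"doc2", "doc4", "doc7"}
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
print(round(recall_at_k(relevant, retrieved, k=10), 2))  # 0.67
```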
What it measures: What proportion of retrieved documents are relevant?
Formula: |relevant ∩ retrieved@K| / K
Example:
- Top-10 results: [doc1, doc2, doc3, doc4, doc5, ...]
- Relevant in top-10: {doc2, doc4} = 2 out of 10
- Precision@10 = 0.20 (20%)
Interpretation:
- 0.40-0.60: Good - Low noise in results
- 0.60-0.80: Excellent - Very clean results
- 0.80-1.00: Outstanding - Nearly perfect results
When to prioritize: When you need high-quality results (RAG with LLMs)
Note: Precision@K is often lower than Recall@K: when a query has only 2-3 relevant chunks in a large corpus, Precision@10 is capped at 0.2-0.3 no matter how good the retrieval is.
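The same worked example in code (an illustrative helper, not RAGDOC's implementation):

```python
def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of the top-K results that are relevant."""
    hits = relevant & set(retrieved[:k])
    return len(hits) / k

# Example from above: 2 relevant docs out of 10 results
relevant = {"doc2", "doc4"}
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5",
             "doc6", "doc7", "doc8", "doc9", "doc10"]
print(precision_at_k(relevant, retrieved, k=10))  # 0.2
```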
What it measures: How early does the first relevant document appear?
Formula: 1 / rank_of_first_relevant
Example:
- Top-10 results: [doc1, doc2, doc3, doc4, ...]
- First relevant: doc3 at position 3
- MRR = 1/3 = 0.33 (33%)
Interpretation:
- MRR = 1.0: First result is relevant (perfect!)
- MRR = 0.5: First relevant at position 2
- MRR = 0.33: First relevant at position 3
- MRR = 0.1: First relevant at position 10
- MRR = 0.0: No relevant results found
Interpretation ranges:
- 0.50-0.70: Good - Relevant result in top 2-3
- 0.70-0.85: Excellent - Relevant result in top 1-2
- 0.85-1.00: Outstanding - First result usually relevant
When to prioritize: When users only look at top results (search engines)
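Per query, the reciprocal rank can be computed as below; MRR is the mean of this value over all queries (sketch, not RAGDOC's actual code):

```python
def reciprocal_rank(relevant: set, retrieved: list) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# First relevant document (doc3) appears at position 3
print(round(reciprocal_rank({"doc3"}, ["doc1", "doc2", "doc3", "doc4"]), 2))  # 0.33
```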
What it measures: How well are documents ranked? (considers graded relevance and position)
How it works:
- Assigns higher scores to relevant docs that appear earlier
- Normalizes by the ideal ranking (perfect order)
- Range: 0.0 (worst) to 1.0 (perfect ranking)
Formula: DCG@K / IDCG@K
Interpretation:
- 0.60-0.75: Good - Decent ranking quality
- 0.75-0.85: Excellent - High-quality ranking
- 0.85-1.00: Outstanding - Near-perfect ranking
When to prioritize: When ranking quality matters more than just presence (recommendations)
Note: NDCG is especially useful when you have graded relevance (highly relevant, relevant, marginally relevant).
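A minimal NDCG@K sketch with graded relevance, assuming the standard log2 positional discount (the grading scheme and function are illustrative):

```python
import math

def ndcg_at_k(relevance: dict, retrieved: list, k: int) -> float:
    """NDCG@K = DCG@K / IDCG@K, using graded relevance scores."""
    def dcg(scores):
        # Position i (0-based) is discounted by log2(i + 2)
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores))
    gains = [relevance.get(doc, 0) for doc in retrieved[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

# Graded relevance: 3 = highly relevant, 1 = marginally relevant
relevance = {"doc2": 3, "doc4": 1}
print(round(ndcg_at_k(relevance, ["doc2", "doc1", "doc4"], k=3), 3))  # 0.964
```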
What it measures: Harmonic mean of Precision and Recall
Formula: 2 * (Precision * Recall) / (Precision + Recall)
When to use: When you want a single metric balancing both Precision and Recall
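The harmonic mean can be checked against the worked Precision/Recall examples above (illustrative helper):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of Precision and Recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Using the worked examples above: Precision@10 = 0.20, Recall@10 = 0.67
print(round(f1_score(0.20, 0.67), 3))  # 0.308
```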
Evaluate with default settings (alpha 0.3, 0.5, 0.7, 1.0):
python tests/evaluate_ragdoc.py
Test specific alpha configurations:
# Test pure BM25 vs pure Semantic
python tests/evaluate_ragdoc.py --alpha 0.0 1.0
# Test hybrid configurations
python tests/evaluate_ragdoc.py --alpha 0.4 0.5 0.6 0.7
# Test many configurations for fine-tuning
python tests/evaluate_ragdoc.py --alpha 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Alpha parameter explained:
- alpha=0.0: Pure BM25 (lexical matching only)
- alpha=0.5: Balanced hybrid (50% BM25, 50% Semantic)
- alpha=0.7: Semantic-favored hybrid (30% BM25, 70% Semantic)
- alpha=1.0: Pure Semantic (embeddings only)
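As a rough sketch, the alpha weighting can be read as a linear interpolation of normalized scores; the actual RAGDOC fusion may differ (e.g. it could use rank-based fusion instead):

```python
def hybrid_score(bm25_score: float, semantic_score: float, alpha: float) -> float:
    """Linear interpolation: alpha weights the semantic side.

    Assumes both scores are already normalized to [0, 1]; this is an
    illustration of the alpha semantics, not RAGDOC's actual fusion code.
    """
    return (1 - alpha) * bm25_score + alpha * semantic_score

print(hybrid_score(0.8, 0.4, alpha=0.0))  # pure BM25 -> 0.8
print(hybrid_score(0.8, 0.4, alpha=1.0))  # pure semantic -> 0.4
```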
Test different result counts:
python tests/evaluate_ragdoc.py --k_values 3 5 10
Use your own test queries:
python tests/evaluate_ragdoc.py --dataset tests/test_datasets/my_custom_dataset.json
python tests/evaluate_ragdoc.py \
--dataset tests/test_datasets/synthetic_ragdoc_qa.json \
--alpha 0.5 0.7 0.9 \
--k_values 5 10 20 \
  --output_dir tests/results/experiment_001
Example report:
## Performance by Alpha Value
| Alpha | Mode | Recall@10 | Precision@10 | MRR | NDCG@10 | F1@10 | Time(s) |
|-------|------|-----------|--------------|-----|---------|-------|---------|
| 0.50 | hybrid | 0.9667 | 0.0967 | 0.9190 | 0.9298 | 0.1758 | 24.40 |
| 0.70 | hybrid | 0.9667 | 0.0967 | 0.9167 | 0.9298 | 0.1758 | 25.69 |
How to interpret this:
- Recall@10 = 0.9667 (96.67%)
  - RAGDOC finds 96.67% of relevant documents in top-10
  - Verdict: Excellent! Nearly comprehensive retrieval
- Precision@10 = 0.0967 (9.67%)
  - 9.67% of top-10 results are relevant
  - Context: With 25,000+ chunks, this is actually good
  - For RAG: Reranking and LLM will filter further
- MRR = 0.9190 (91.90%)
  - First relevant result appears at position ~1.1
  - Verdict: Outstanding! Top result is almost always relevant
- NDCG@10 = 0.9298 (92.98%)
  - Ranking quality is near-perfect
  - Verdict: Excellent ordering of results
- Time = 24.40s for 30 queries
  - ~0.8s per query (acceptable for batch evaluation)
  - Production MCP uses caching (much faster)
When alpha=0.5 beats alpha=0.7:
- Queries benefit from exact term matching
- Scientific terminology, formulas, acronyms
- Action: Consider lowering default alpha
When alpha=0.7 beats alpha=0.5:
- Queries are more conceptual
- Semantic understanding helps
- Action: Keep current default
When all alphas are similar:
- System is robust across configurations
- Action: Stick with alpha=0.5 (current default - balanced hybrid)
Based on RAG literature and industry standards:
| Metric | Acceptable | Good | Excellent | Outstanding |
|---|---|---|---|---|
| Recall@10 | 0.60+ | 0.70-0.85 | 0.85-0.95 | 0.95+ |
| Precision@10 | 0.30+ | 0.40-0.60 | 0.60-0.80 | 0.80+ |
| MRR | 0.40+ | 0.50-0.70 | 0.70-0.85 | 0.85+ |
| NDCG@10 | 0.50+ | 0.60-0.75 | 0.75-0.85 | 0.85+ |
RAGDOC's typical performance:
- Recall@10: 0.95-0.97 (Outstanding)
- Precision@10: 0.10-0.15 (Context-dependent*)
- MRR: 0.90-0.95 (Outstanding)
- NDCG@10: 0.90-0.95 (Outstanding)
*Precision appears low due to large corpus (25K+ chunks) and binary relevance. With reranking and LLM filtering, effective precision is much higher.
# Generate 50 diverse queries
python scripts/generate_test_dataset.py \
--n_queries 50 \
--diversity_mode balanced \
  --output tests/test_datasets/my_dataset.json
Diversity modes:
- balanced: Equal queries per document (recommended)
- random: Pure random sampling
- mixed: 70% balanced + 30% random
Create a JSON file with this structure:
{
"name": "My Custom Test Set",
"version": "1.0",
"description": "Manually curated queries for glacier research",
"created": "2025-11-15",
"num_queries": 10,
"strategy": "manual",
"queries": [
{
"id": "q001",
"query": "How does black carbon affect glacier albedo?",
"relevant_chunks": ["1982_RGSP_chunk_042", "Bond_2013_chunk_105"],
"relevant_docs": ["1982_RGSP.md", "Bond_et_al_2013.md"],
"relevance_scores": {
"1982_RGSP_chunk_042": 3,
"Bond_2013_chunk_105": 2
},
"category": "impurities",
"difficulty": "medium"
},
{
"id": "q002",
"query": "Remote sensing techniques for ice mass balance",
"relevant_chunks": ["Bolch_2010_chunk_023"],
"relevant_docs": ["Bolch_et_al_2010.md"],
"relevance_scores": {
"Bolch_2010_chunk_023": 3
},
"category": "remote_sensing",
"difficulty": "hard"
}
]
}
How to create manual queries:
- Think of real research questions your users would ask
- Find relevant chunks using RAGDOC:
  # In Cursor/Claude Desktop
  search_by_source("your query", sources=["DocumentName.md"])
- Add chunk IDs and relevance scores:
  - Score 3: Highly relevant (answers the query directly)
  - Score 2: Relevant (contains related information)
  - Score 1: Marginally relevant (tangentially related)
  - Score 0: Not relevant
- Save to JSON and test:
  python tests/evaluate_ragdoc.py --dataset tests/test_datasets/my_manual_set.json
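Before running an evaluation on a hand-written dataset, a quick sanity check of the JSON can save a debugging round. The validate_dataset helper below is illustrative, not part of RAGDOC:

```python
import json

REQUIRED_QUERY_FIELDS = {"id", "query", "relevant_chunks", "relevant_docs"}

def validate_dataset(path: str) -> list:
    """Return a list of problems found in a manual test dataset file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    problems = []
    for q in data.get("queries", []):
        missing = REQUIRED_QUERY_FIELDS - q.keys()
        if missing:
            problems.append(f"{q.get('id', '?')}: missing fields {sorted(missing)}")
        for chunk_id, score in q.get("relevance_scores", {}).items():
            if score not in (0, 1, 2, 3):
                problems.append(f"{q.get('id', '?')}: invalid score {score} for {chunk_id}")
    return problems

# An empty list means the dataset passed these basic checks:
# print(validate_dataset("tests/test_datasets/my_manual_set.json"))
```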
- Generate 30 synthetic queries (fast baseline)
- Add 10-20 manual queries (high-quality validation)
- Combine into single dataset
# Generate synthetic
python scripts/generate_test_dataset.py --n_queries 30 --output synthetic.json
# Manually create manual.json with 10 queries
# Combine (manually merge JSON files or write script)
# Use combined dataset for evaluation
Problem: All metrics near 0.0
Possible causes:
- Dataset chunk IDs don't match ChromaDB IDs
- ChromaDB collection name mismatch
- Documents not indexed
Solutions:
# Check ChromaDB collection
python -c "import chromadb; client = chromadb.PersistentClient(path='chroma_db_new'); print(client.get_collection('zotero_research_context_hybrid_v3').count())"
# Verify dataset chunk IDs exist
# Open tests/test_datasets/synthetic_ragdoc_qa.json
# Check if chunk IDs match format: "DocumentName_chunk_042"
# Reindex if needed
python scripts/index_incremental.py
Problem: Evaluation takes >5 minutes for 30 queries
Possible causes:
- Network latency (Voyage API)
- Too many queries
- Large K values
Solutions:
# Reduce queries
python tests/evaluate_ragdoc.py --dataset small_dataset.json
# Reduce K values
python tests/evaluate_ragdoc.py --k_values 10
# Reduce alpha configurations
python tests/evaluate_ragdoc.py --alpha 0.5 0.7
Problem: Results vary significantly between runs
Possible causes:
- Non-deterministic BM25 (shouldn't happen)
- Database changes between runs
- Different random seeds
Solutions:
# Use same dataset with fixed seed
python scripts/generate_test_dataset.py --seed 42
# Verify database hasn't changed
# Check indexed_date in metadata
Problem: ModuleNotFoundError: No module named 'fastmcp'
Solution:
# Use correct Python environment
"C:\Users\thier\miniforge3\envs\ragdoc-env\python.exe" tests/evaluate_ragdoc.py
# Or activate environment first
source ragdoc-env/bin/activate # macOS/Linux
.\ragdoc-env\Scripts\activate # Windows
python tests/evaluate_ragdoc.py
Compare two configurations statistically:
from scipy import stats
# Load results from two evaluations
results_a = ... # Alpha=0.5 per-query recalls
results_b = ... # Alpha=0.7 per-query recalls
# Paired t-test
t_stat, p_value = stats.ttest_rel(results_a, results_b)
if p_value < 0.05:
    print("Difference is statistically significant!")
Split your dataset for more robust evaluation:
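A minimal sketch of such a split, with a fixed seed for reproducibility (split_dataset is a hypothetical helper, not part of RAGDOC):

```python
import random

def split_dataset(queries: list, tune_frac: float = 0.7, seed: int = 42):
    """Shuffle once with a fixed seed, then split into tuning/validation sets."""
    rng = random.Random(seed)
    shuffled = queries[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * tune_frac)
    return shuffled[:cut], shuffled[cut:]

# 30 queries -> 21 for tuning, 9 for final validation
tune, validation = split_dataset([{"id": f"q{i:03d}"} for i in range(30)])
print(len(tune), len(validation))  # 21 9
```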
# 70% queries for tuning, 30% for final validation
# Prevents overfitting to test set
Track metrics over time as you improve RAGDOC:
# Run evaluation weekly
python tests/evaluate_ragdoc.py --output_dir results/week_$(date +%V)
# Compare trends
python scripts/compare_temporal_results.py
- Start with synthetic datasets (fast, reproducible)
- Add manual queries for validation (10-20 is enough)
- Evaluate multiple alpha values (0.5, 0.7 at minimum)
- Track metrics over time (detect regressions)
- Focus on Recall and MRR for RAG applications
- Don't over-optimize on a small test set
- Information Retrieval Metrics (Wikipedia)
- RAGAS Framework - RAG-specific evaluation
- MS MARCO Benchmark - Large-scale IR evaluation
Questions or issues? Open an issue on GitHub.
Built for the scientific research community 🔬