2 changes: 2 additions & 0 deletions .github/workflows/branch.yaml
@@ -14,6 +14,8 @@ jobs:
SLACK_BOT: ${{ secrets.SLACK_BOT }}
NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
GOOGLE_CONTAINER_ID: ${{ secrets.GOOGLE_CONTAINER_ID }}
+WEAVIATE_VIBE_EVAL_URL: ${{ secrets.WEAVIATE_VIBE_EVAL_URL }}
+WEAVIATE_VIBE_EVAL_KEY: ${{ secrets.WEAVIATE_VIBE_EVAL_KEY }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
9 changes: 5 additions & 4 deletions docs/weaviate/benchmarks/index.md
@@ -7,12 +7,13 @@ image: og/docs/benchmarks.jpg
---


-You can find the following vector database performance benchmarks:
+You can find the following benchmarks:

1. [ANN (unfiltered vector search) latencies and throughput](./ann.md)
-2. Filtered ANN (benchmark coming soon)
-2. Scalar filters / Inverted Index (benchmark coming soon)
-3. Large-scale ANN (benchmark coming soon)
+2. [LLM Weaviate code generation](./vibe-coding-evaluation.mdx) — how well LLMs generate correct Weaviate v4 Python client code
+3. Filtered ANN (benchmark coming soon)
+4. Scalar filters / Inverted Index (benchmark coming soon)
+5. Large-scale ANN (benchmark coming soon)

## Benchmark code

73 changes: 73 additions & 0 deletions docs/weaviate/benchmarks/vibe-coding-evaluation.mdx
@@ -0,0 +1,73 @@
---
title: LLM Weaviate Code Generation Benchmark
sidebar_position: 2
description: "Benchmark evaluating how well LLMs generate correct Weaviate v4 Python client code across zero-shot and few-shot scenarios."
---

import VibeEvalDashboard from "@site/src/components/VibeEvalDashboard";

This benchmark evaluates how well large language models (LLMs) generate **working Weaviate v4 Python client code** when given natural language task descriptions. It measures whether an LLM can produce code that actually connects to a Weaviate cluster and performs the requested operation without errors.

## Results

<VibeEvalDashboard />

## What is being tested

Each LLM is prompted to generate Python code for a specific Weaviate operation. The generated code is then executed inside a Docker container against a real Weaviate Cloud cluster. A task **passes** if the code runs with exit code 0, and **fails** otherwise.
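The pass/fail check just described reduces to running the extracted code and inspecting its exit code. A minimal sketch of that logic (the helper name and timeout are assumptions, and the real harness executes inside Docker rather than a bare subprocess):

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout_s: int = 120) -> bool:
    """Execute a generated snippet; the task passes iff the exit code is 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode == 0
```

Because only the exit code is checked, an uncaught exception anywhere in the snippet marks the whole task as failed.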

The benchmark covers these operations:

| Task | What it tests |
| ------------------------- | ------------------------------------------------------------------- |
| **connect** | Connecting to a Weaviate Cloud instance and verifying readiness |
| **create_collection** | Creating a collection with typed properties (text, number, boolean) |
| **batch_import** | Batch importing 50 objects into a collection |
| **basic_semantic_search** | Running a `near_text` semantic search query |
| **complex_hybrid_query** | Hybrid search with filters, metadata, and multiple conditions |
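For reference, the kind of code a model must produce for `basic_semantic_search` looks roughly like the following v4 client sketch. The collection name and environment variable names here are placeholders, not part of the benchmark:

```python
import os

def semantic_search(query: str, limit: int = 3):
    """Run a near_text query with the Weaviate v4 Python client (sketch)."""
    import weaviate  # v4 client
    from weaviate.classes.query import MetadataQuery

    # Placeholder env vars; the benchmark injects its own cluster credentials.
    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=os.environ["WEAVIATE_URL"],
        auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_KEY"]),
    )
    try:
        articles = client.collections.get("Article")
        response = articles.query.near_text(
            query=query,
            limit=limit,
            return_metadata=MetadataQuery(distance=True),
        )
        return [(o.properties, o.metadata.distance) for o in response.objects]
    finally:
        client.close()
```

A model that emits the deprecated v3 `weaviate.Client(...)` pattern here would fail the task at runtime.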

### Task variants

Each task is run in multiple variants to measure the effect of providing examples:

- **Zero-shot** — The LLM receives only the task description with no code examples
- **Simple example** — The LLM receives one concise code example alongside the task
- **Extensive examples** — The LLM receives full API documentation as in-context examples

This lets you see how much a model improves when given reference code versus relying purely on its training data.
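Conceptually, the three variants differ only in what context is appended to the task description. A hypothetical sketch of the prompt assembly (names and format are illustrative, not the benchmark's actual templates):

```python
def build_prompt(task: str, variant: str, example: str = "", docs: str = "") -> str:
    """Assemble the prompt for one task variant (illustrative templates)."""
    if variant == "zero_shot":
        return task
    if variant == "simple_example":
        return f"{task}\n\nExample:\n{example}"
    if variant == "extensive_examples":
        return f"{task}\n\nAPI documentation:\n{docs}"
    raise ValueError(f"unknown variant: {variant!r}")
```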

## How to interpret the results

- **Pass rate** is the primary metric — the percentage of tasks where the generated code executed successfully. A higher pass rate means the model produces more reliable Weaviate client code.
- **Avg duration** includes both the LLM generation time and the Docker execution time. It's useful for comparing relative speed but not absolute latency, since it depends on API response times.
- **Similarity score** (1–5, when available) is an LLM-judged comparison of the generated code against a canonical implementation, focusing on correct Weaviate API usage rather than general code style.
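The two headline numbers can be reproduced from raw run records in a few lines. A sketch assuming each record carries an exit code and a duration (the field names are illustrative):

```python
from statistics import mean

def summarize(runs: list[dict]) -> dict:
    """Compute pass rate and average duration over a list of run records."""
    return {
        "pass_rate": sum(r["exit_code"] == 0 for r in runs) / len(runs),
        "avg_duration_s": mean(r["duration_s"] for r in runs),
    }
```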

### What a failure means

A failure means the generated code exited with a non-zero status, typically because of an uncaught Python exception. Common causes include:

- Using deprecated v3 client syntax instead of the current v4 API
- Incorrect method names, parameter names, or import paths
- Missing authentication setup or wrong connection patterns
- Hallucinated API methods that don't exist in the Weaviate client

The **Task Breakdown** tab shows per-task results. When LLM judge analysis is enabled, you can expand failed tasks to see the diagnosed root cause and suggested fix.

### Limitations

- Results reflect a point in time. LLM providers update their models, and results may change between runs.
- The benchmark uses `temperature=0.1` for near-deterministic output, but some variance is expected. When multiple repetitions are run, the pass rate is averaged across them.
- Tasks test the Weaviate Python v4 client specifically. Results don't generalize to other Weaviate clients (TypeScript, Go, Java) or other database APIs.
- Pass/fail is binary based on exit code. A task can pass with suboptimal code or fail due to a minor syntax issue.

## How the benchmark is generated

The benchmark is run monthly via a [GitHub Actions workflow](https://github.com/weaviate-tutorials/weaviate-vibe-eval) and can also be triggered manually. The process is:

1. Each model is prompted with each task variant
2. Python code is extracted from the LLM response
3. The code is executed in a sandboxed Docker container with network access to a Weaviate Cloud cluster
4. Results (pass/fail, duration, generated code, stdout/stderr) are stored in a remote Weaviate cluster
5. During the docs build, results are fetched and rendered in the dashboard below
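Step 2 above, pulling runnable Python out of a chat response, is typically a fenced-code-block regex. A simplified sketch (the real harness may handle more edge cases, such as multiple blocks):

```python
import re

def extract_python(response: str) -> str:
    """Return the first fenced code block, or the whole reply as a fallback."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response.strip()
```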

The benchmark source code, task definitions, and full methodology are available at [github.com/weaviate-tutorials/weaviate-vibe-eval](https://github.com/weaviate-tutorials/weaviate-vibe-eval).
3 changes: 2 additions & 1 deletion package.json
@@ -6,7 +6,8 @@
"scripts": {
"docusaurus": "docusaurus",
"start": "docusaurus start",
-"build": "docusaurus build",
+"fetch-vibe-eval": "node tools/fetch-vibe-eval-results.js",
+"build": "npm run fetch-vibe-eval; docusaurus build",
"build-dev": "docusaurus build --config docusaurus.dev.config.js --out-dir build.dev",
"validate-links-dev": "node ./_build_scripts/validate-links-pr.js",
"swizzle": "docusaurus swizzle",
2 changes: 1 addition & 1 deletion sidebars.js
@@ -878,7 +878,7 @@ const sidebars = {
type: "doc",
id: "weaviate/benchmarks/index",
},
-items: ["weaviate/benchmarks/ann"],
+items: ["weaviate/benchmarks/ann", "weaviate/benchmarks/vibe-coding-evaluation"],
},
{
type: "category",