🕷️ CrawlMind

AI-Powered Web Crawling & Research Platform

Next.js Cloudflare Prisma License: MIT

Paste a URL. Describe your research. Let AI do the rest.

CrawlMind combines Cloudflare's crawl infrastructure with AI-powered URL discovery and multi-hop research synthesis, turning any query into structured, crawled knowledge.

Getting Started · Features · Architecture · Deploy


✨ Features

Core Crawling

  • Smart Input: Auto-detects URLs vs. natural language; just paste or type
  • Cloudflare-Powered: Fast, reliable crawling via Cloudflare's Browser Rendering API
  • Multi-Format Output: Markdown, HTML, plaintext, or cleaned readable HTML
  • JS Rendering: Crawl JavaScript-heavy SPAs with headless rendering
  • Advanced Controls: Depth, page limits, subdomain inclusion, URL patterns, date filters
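
The Smart Input behaviour could be approximated along the following lines. This is a minimal sketch, not the project's actual detection logic: the `classifyInput` name, the `InputKind` type, and the heuristics (whitespace means a query, a dotted hostname means a URL) are all assumptions made for illustration.

```typescript
// Hypothetical helper: name, return type, and heuristics are illustrative assumptions.
type InputKind = "url" | "query";

function classifyInput(raw: string): InputKind {
  const text = raw.trim();
  // Empty input or anything containing whitespace is treated as natural language.
  if (text === "" || /\s/.test(text)) return "query";

  // Accept bare domains such as "example.com/docs" by assuming https.
  const candidate = /^https?:\/\//i.test(text) ? text : `https://${text}`;
  try {
    const { hostname } = new URL(candidate);
    // Require a dotted hostname with an alphabetic TLD so single words stay queries.
    return /^[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}$/i.test(hostname) ? "url" : "query";
  } catch {
    return "query"; // not parseable as a URL
  }
}
```

Under these assumptions, `classifyInput("example.com/docs")` is classified as a URL, while `classifyInput("best rust web frameworks")` and the single word `"hello"` fall through to the natural-language path.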

🧠 AI Discovery (New)

  • AI URL Discovery: Describe what you need; Groq finds the best sources to crawl
  • Depth Tiers: Quick (~30s), Deep Dive (~2min), or Multi-hop Research (~5min)
  • Multi-Hop Research: Crawl → analyze gaps → discover follow-up sources → repeat (up to 3 rounds)
  • AI Synthesis: NVIDIA NIM generates a comprehensive research report from all crawled data
  • Parent-Child Jobs: Research jobs manage multiple sub-crawls independently, without interfering with normal crawls
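
The multi-hop loop described above (crawl, analyze gaps, discover follow-up sources, repeat) might be structured roughly as follows. This is a sketch under stated assumptions rather than the code in `lib/research.ts`: the `ResearchSteps` interface and `multiHopResearch` function are hypothetical, the three steps are injected so the loop runs without any network calls, and `maxRounds` is assumed to count every crawl round including the first.

```typescript
// Hypothetical interface: each step is injected so the loop can run without network access.
interface ResearchSteps {
  discover(query: string): Promise<string[]>; // proposes URLs for a query
  crawl(urls: string[]): Promise<string[]>; // returns crawled page contents
  findGaps(query: string, pages: string[]): Promise<string[]>; // follow-up queries, empty when done
}

async function multiHopResearch(
  query: string,
  steps: ResearchSteps,
  maxRounds = 3, // assumption: counts every crawl round, including the first
): Promise<{ pages: string[]; rounds: number }> {
  const seen = new Set<string>();
  const pages: string[] = [];
  let openQueries: string[] = [query];
  let rounds = 0;

  while (rounds < maxRounds && openQueries.length > 0) {
    // Collect URLs for every open question, skipping ones already crawled.
    const fresh: string[] = [];
    for (const q of openQueries) {
      for (const url of await steps.discover(q)) {
        if (!seen.has(url)) {
          seen.add(url);
          fresh.push(url);
        }
      }
    }
    if (fresh.length === 0) break; // nothing new to crawl, so stop early

    rounds++;
    pages.push(...(await steps.crawl(fresh)));
    // Ask which parts of the original query are still unanswered.
    openQueries = await steps.findGaps(query, pages);
  }
  return { pages, rounds };
}
```

Tracking `seen` URLs keeps later rounds from re-crawling sources that an earlier round already covered, and the loop ends as soon as gap analysis returns no follow-up queries.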

Platform

  • AI Chat: Ask questions about crawl results with full context awareness
  • Soft-Delete Library: Archive, restore, and manage past crawls
  • Analytics Dashboard: Track crawl usage, search patterns, and AI queries
  • Plan-Based Limits: Tiered pricing with Stripe integration
  • Auth: GitHub, Google, and email sign-in via Better Auth

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        USER INPUT                               │
│         URL / Natural Language / AI Discovery Toggle            │
└────────────┬────────────────────────────┬───────────────────────┘
             │                            │
       URL detected                AI Discovery ON
             │                            │
             ▼                            ▼
    ┌──────────────────┐    ┌───────────────────────────┐
    │  POST /api/crawl │    │   POST /api/research      │
    │  Normal Pipeline │    │   AI Research Pipeline    │
    └────────┬─────────┘    └─────────────┬─────────────┘
             │                            │
             ▼                            ▼
    ┌──────────────────┐    ┌───────────────────────────┐
    │ Cloudflare Crawl │    │ Groq: Discover URLs       │
    │ Single Job       │    │ (llama-3.3-70b-versatile) │
    └────────┬─────────┘    └─────────────┬─────────────┘
             │                            │
             │                            ▼
             │              ┌───────────────────────────┐
             │              │ Spawn Parallel Sub-Crawls │
             │              │ via Cloudflare Crawl API  │
             │              └─────────────┬─────────────┘
             │                            │
             │              ┌─────────────▼─────────────┐
             │              │ RESEARCH tier only:       │
             │              │ NIM Gap Analysis →        │
             │              │ Follow-up Crawls (×3)     │
             │              └─────────────┬─────────────┘
             │                            │
             │                            ▼
             │              ┌───────────────────────────┐
             │              │ NIM: Synthesis Report     │
             │              │ (nemotron-super-49b)      │
             ▼              └─────────────┬─────────────┘
    ┌──────────────────┐                  │
    │  Neon PostgreSQL │◄─────────────────┘
    │  (Prisma ORM)    │
    └──────────────────┘
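
In the research branch of the diagram, one parent job fans out into several sub-crawls. One plausible way to derive the parent's state from its children is sketched below. The status names and the `rollUpStatus` function are hypothetical, and the actual Prisma schema may model this differently.

```typescript
// Status names are hypothetical; the real schema may use different values.
type JobStatus = "queued" | "running" | "completed" | "failed";

function rollUpStatus(children: JobStatus[]): JobStatus {
  if (children.length === 0) return "queued";

  // Any unfinished child keeps the parent from finishing.
  if (children.some((s) => s === "queued" || s === "running")) {
    return children.every((s) => s === "queued") ? "queued" : "running";
  }

  // All children finished: the parent fails only if every sub-crawl failed,
  // so one unreachable source does not discard the rest of the research.
  return children.every((s) => s === "failed") ? "failed" : "completed";
}
```

Treating a partially failed batch as completed is a design choice for this sketch; a stricter policy could surface a separate "partial" state instead.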

🛠️ Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Framework | Next.js 15 (App Router) | Full-stack React with server components |
| Database | Neon PostgreSQL + Prisma | Serverless Postgres with type-safe ORM |
| Auth | Better Auth | GitHub, Google, email authentication |
| Crawling | Cloudflare Crawl API | Browser rendering + web crawling at scale |
| AI (Fast) | Groq (llama-3.3-70b) | URL discovery (~200ms responses) |
| AI (Deep) | NVIDIA NIM (nemotron-super-49b) | Gap analysis + synthesis reports |
| AI Chat | Vercel AI SDK | Streaming chat over crawl results |
| Payments | Stripe | Subscription billing + webhooks |
| Styling | Tailwind CSS + shadcn/ui | Utility-first CSS + accessible components |
| Deployment | Vercel | Edge-optimized serverless hosting |

🚀 Getting Started

Prerequisites

Quick Start

# Clone
git clone https://github.com/pantha704/CrawlMind.git
cd CrawlMind

# Install
bun install

# Configure
cp .env.example .env.local
# Edit .env.local with your keys (see below)

# Database
bunx prisma db push
bunx prisma generate

# Run
bun run dev

Environment Variables

# Database (Neon)
DATABASE_URL=postgresql://...

# Auth
BETTER_AUTH_SECRET=your-secret
BETTER_AUTH_URL=http://localhost:3001
GITHUB_CLIENT_ID=...
GITHUB_CLIENT_SECRET=...
GOOGLE_CLIENT_ID=...
GOOGLE_CLIENT_SECRET=...

# Cloudflare
CLOUDFLARE_API_TOKEN=...
CLOUDFLARE_ACCOUNT_ID=...

# AI
GROQ_API_KEY=...          # For URL discovery (Groq)
NVIDIA_NIM_API_KEY=...    # For synthesis (NVIDIA NIM)

# Stripe
STRIPE_SECRET_KEY=...
STRIPE_WEBHOOK_SECRET=...
NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY=...

# App
NEXT_PUBLIC_APP_URL=http://localhost:3001
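
A small startup check can catch a missing key before the first request fails. The sketch below is a hypothetical helper, not something the repository is known to ship. It treats every variable listed above as required, which may be stricter than the app needs (for example, if only one OAuth provider is used).

```typescript
// Names mirror the list above; the helper itself is a hypothetical addition.
const REQUIRED_ENV = [
  "DATABASE_URL",
  "BETTER_AUTH_SECRET",
  "BETTER_AUTH_URL",
  "GITHUB_CLIENT_ID",
  "GITHUB_CLIENT_SECRET",
  "GOOGLE_CLIENT_ID",
  "GOOGLE_CLIENT_SECRET",
  "CLOUDFLARE_API_TOKEN",
  "CLOUDFLARE_ACCOUNT_ID",
  "GROQ_API_KEY",
  "NVIDIA_NIM_API_KEY",
  "STRIPE_SECRET_KEY",
  "STRIPE_WEBHOOK_SECRET",
  "NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY",
  "NEXT_PUBLIC_APP_URL",
] as const;

// Returns the names of variables that are unset or blank.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]?.trim());
}
```

In a Node or Next.js server context this could be called as `missingEnv(process.env)` during startup, logging or throwing if the returned list is non-empty.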

📁 Project Structure

src/
├── app/
│   ├── api/
│   │   ├── crawl/              # Crawl CRUD, results proxy, cancel
│   │   ├── research/           # AI Discovery: create, poll, active
│   │   ├── chat/               # AI chat endpoint
│   │   ├── stripe/             # Payment webhooks
│   │   └── user/               # Usage tracking & settings
│   ├── dashboard/
│   │   ├── page.tsx            # Main dashboard
│   │   ├── jobs/               # Crawl job list + detail
│   │   ├── research/           # AI research detail page
│   │   ├── chat/               # AI chat interface
│   │   ├── library/            # Archived results
│   │   └── analytics/          # Usage analytics
│   ├── pricing/                # Pricing page
│   └── (auth)/                 # Sign in / sign up
├── components/
│   ├── dashboard/              # Dashboard UI (crawl-input, active-jobs, etc.)
│   ├── landing/                # Landing page components
│   └── ui/                     # shadcn/ui primitives
└── lib/
    ├── auth.ts                 # Better Auth config
    ├── cloudflare.ts           # Cloudflare Crawl API client
    ├── research.ts             # AI Discovery: Groq + NIM integration
    ├── ai.ts                   # AI model configuration
    ├── prisma.ts               # Prisma client
    └── stripe.ts               # Stripe client

🧠 AI Discovery: How It Works

| Tier | What Happens | Sources | Time |
|---|---|---|---|
| ⚡ Quick | AI finds 3-5 relevant sources, crawls them | 3-5 | ~30s |
| 🔍 Deep Dive | AI discovers 10-15 categorized sources | 10-15 | ~2min |
| 🧠 Research | Multi-hop: crawl → gap analysis → follow-up crawls (×3 rounds) → synthesis | 15-30+ | ~5min |
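
The tiers in the table could be captured as a lookup that the research pipeline reads from. In this sketch the numbers come from the table, while the key names, field names, and the `capSources` helper are hypothetical. The table gives "15-30+" for the Research tier, so `maxSources` there is a soft cap chosen for illustration.

```typescript
// Tier keys and field names are hypothetical; the numbers come from the table above.
type Tier = "quick" | "deep" | "research";

interface TierConfig {
  minSources: number;
  maxSources: number; // "research" is listed as "15-30+", so 30 is only a soft cap here
  maxRounds: number; // the README does not say whether "3 rounds" includes the initial crawl
  approxSeconds: number;
}

const TIERS: Record<Tier, TierConfig> = {
  quick: { minSources: 3, maxSources: 5, maxRounds: 1, approxSeconds: 30 },
  deep: { minSources: 10, maxSources: 15, maxRounds: 1, approxSeconds: 120 },
  research: { minSources: 15, maxSources: 30, maxRounds: 3, approxSeconds: 300 },
};

// Trims a list of discovered URLs to the tier's upper bound.
function capSources(tier: Tier, urls: string[]): string[] {
  return urls.slice(0, TIERS[tier].maxSources);
}
```

Keeping the limits in one table-like object means a tier change touches a single place rather than several branches of pipeline code.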

Models used:

  • Groq (llama-3.3-70b-versatile): Fast URL discovery (~200ms)
  • NVIDIA NIM (nemotron-super-49b-v1.5): Deep analysis & comprehensive synthesis

💳 Pricing Tiers

| Plan | Price | Crawls/day | Pages/crawl | AI Chat | JS Render |
|---|---|---|---|---|---|
| Spark | Free | 2 | 30 | 3 queries | ❌ |
| Pro | $12/mo | 25 | 500 | Unlimited | ✅ |
| Pro+ | $24/mo | 75 | 1,000 | Unlimited | ✅ |
| Scale | $39/mo | 150 | 5,000 | Unlimited | ✅ |
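
Plan-based limits like these are typically enforced before a crawl is submitted. The following sketch encodes the table as data and checks a request against it. The plan keys, the `PlanLimits` shape, and `checkCrawl` are hypothetical names, and the table does not state the period for the Spark plan's 3 AI chat queries, so the sketch stores the number without interpreting it.

```typescript
// Plan keys and field names are hypothetical; the limits come from the table above.
type Plan = "spark" | "pro" | "proPlus" | "scale";

interface PlanLimits {
  crawlsPerDay: number;
  pagesPerCrawl: number;
  chatQueries: number; // Infinity stands for "Unlimited"; the period for Spark's 3 is not stated
  jsRender: boolean;
}

const PLAN_LIMITS: Record<Plan, PlanLimits> = {
  spark: { crawlsPerDay: 2, pagesPerCrawl: 30, chatQueries: 3, jsRender: false },
  pro: { crawlsPerDay: 25, pagesPerCrawl: 500, chatQueries: Infinity, jsRender: true },
  proPlus: { crawlsPerDay: 75, pagesPerCrawl: 1000, chatQueries: Infinity, jsRender: true },
  scale: { crawlsPerDay: 150, pagesPerCrawl: 5000, chatQueries: Infinity, jsRender: true },
};

interface CrawlRequest {
  pages: number;
  jsRender: boolean;
}

// Returns a rejection reason, or null when the request fits the plan.
function checkCrawl(plan: Plan, usedToday: number, req: CrawlRequest): string | null {
  const limits = PLAN_LIMITS[plan];
  if (usedToday >= limits.crawlsPerDay) return "daily crawl limit reached";
  if (req.pages > limits.pagesPerCrawl) return "page limit exceeded";
  if (req.jsRender && !limits.jsRender) return "JS rendering not available on this plan";
  return null;
}
```

For example, a Spark user who has already run 2 crawls today would be rejected with the daily-limit reason, while the same request on Pro would pass.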

🚢 Deploy

Vercel (Recommended)

  1. Push to GitHub
  2. Import in Vercel
  3. Add all environment variables
  4. Set NEXT_PUBLIC_APP_URL to your Vercel domain
  5. Deploy

Note: Ensure NEXT_PUBLIC_APP_URL points to your deployed domain (not localhost) for webhooks and auth callbacks.
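
The note above can be turned into a quick pre-deploy sanity check. This is a hypothetical helper, not part of the project as far as the README shows. It assumes the deployed domain is served over HTTPS, which holds for Vercel domains but is an assumption for other hosts.

```typescript
// Hypothetical pre-deploy check: rejects localhost-style and non-HTTPS app URLs.
function isPublicAppUrl(value: string): boolean {
  try {
    const url = new URL(value);
    const localHosts = ["localhost", "127.0.0.1", "[::1]"];
    // Assumption: production deployments are served over HTTPS.
    return url.protocol === "https:" && localHosts.indexOf(url.hostname) === -1;
  } catch {
    return false; // not a parseable absolute URL
  }
}
```

With this check, the local default `http://localhost:3001` is rejected, while a deployed HTTPS domain passes.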


📄 License

MIT. See LICENSE for details.


Built with ☕ and curiosity
