Canonical entity references from Wikipedia - Structured data extraction from Wikipedia articles using dumpster-dive and wtf_wikipedia.
This package provides:
- Canonical URLs for Wikipedia entities: `https://wikipedia.org.ai/EntityName`
- Structured data extraction from Wikipedia articles (infoboxes, categories, links, etc.)
- Semantic references for entities across the `.do` platform
Instead of using raw Wikipedia URLs, we use wikipedia.org.ai as the canonical namespace:
```ts
// ❌ Don't use raw Wikipedia URLs
const anthropic = { $id: 'https://en.wikipedia.org/wiki/Anthropic' }

// ✅ Use canonical wikipedia.org.ai URLs
const anthropic = { $id: 'https://wikipedia.org.ai/Anthropic' }
```

This provides:
- Consistency - Same format across all entity references
- Permanence - Our URLs don't change even if Wikipedia reorganizes
- Integration - Works seamlessly with our semantic triple patterns
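As a rough sketch of the convention, a raw Wikipedia URL maps to the canonical form by reusing the page title. The `toCanonical` helper below is illustrative only, not an export of this package:

```ts
// Hypothetical helper: rewrite a raw English Wikipedia URL into the
// canonical wikipedia.org.ai form by reusing the /wiki/<PageTitle> segment.
function toCanonical(wikiUrl: string): string {
  const title = new URL(wikiUrl).pathname.replace(/^\/wiki\//, '')
  return `https://wikipedia.org.ai/${title}`
}

// toCanonical('https://en.wikipedia.org/wiki/Anthropic')
// => 'https://wikipedia.org.ai/Anthropic'
```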
```bash
pnpm add wikipedia.org.ai
```

```ts
import { articles, getArticle, searchArticles } from 'wikipedia.org.ai/data'

// Get specific article
const anthropic = getArticle('Anthropic')
console.log(anthropic.$id) // https://wikipedia.org.ai/Anthropic
console.log(anthropic.summary) // First paragraph

// Search articles
const aiArticles = searchArticles('artificial intelligence')
```
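If `searchArticles` returns article objects shaped like `getArticle`'s result (an assumption; the return type is not shown above), the results can be used directly as entity references:

```ts
// Assumes each search result carries $id and title like the article above.
for (const article of searchArticles('artificial intelligence')) {
  console.log(`${article.title}: ${article.$id}`)
}
```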
```ts
import { getEntityUrl } from 'wikipedia.org.ai/data'

// Generate canonical URLs
const url = getEntityUrl('Anthropic')
// https://wikipedia.org.ai/Anthropic

// Use in semantic triples
const company = {
  $type: 'Organization',
  $id: url,
  name: 'Anthropic',
  sameAs: {
    $id: 'https://en.wikipedia.org/wiki/Anthropic',
  },
}
```
```ts
import { $ } from 'sdk.do'

// Reference Wikipedia entities
const anthropic = {
  $type: $.Organization,
  $id: 'https://wikipedia.org.ai/Anthropic',
  name: 'Anthropic',
}

// Query relationships
const founders = await db.related(anthropic, $.foundedBy, $.Person)
```
Import ~10 sample articles for testing:

```bash
pnpm import
```

This uses the Wikipedia MediaWiki API to fetch sample articles.
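For reference, here is a minimal sketch of the kind of request such an import can make against the public MediaWiki API. The actual import script is not shown here, and the `fetchSummary` helper is a hypothetical name:

```ts
// Hypothetical helper: fetch a plain-text lead section for one page title
// via the standard MediaWiki action API (TextExtracts extension).
async function fetchSummary(title: string): Promise<string | undefined> {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'extracts',
    exintro: '1', // lead section only
    explaintext: '1', // plain text instead of HTML
    format: 'json',
    origin: '*', // CORS-friendly when called from a browser
    titles: title,
  })
  const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`)
  const data = (await res.json()) as { query?: { pages?: Record<string, { extract?: string }> } }
  const page = Object.values(data.query?.pages ?? {})[0]
  return page?.extract
}

// e.g. const summary = await fetchSummary('Anthropic')
```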
For the full Wikipedia dump, we need to set up Cloudflare infrastructure:

```bash
pnpm import:full # Shows TODO requirements
```

Requirements for full import:
- Cloudflare Containers - Run `dumpster-dive` workers to process dumps
- Cloudflare Pipelines - Stream processed data
- R2 Storage - Store in Apache Iceberg format
- R2 SQL - Query structured data catalog
See Cloudflare Pipeline Setup below.
```ts
interface WikipediaArticle {
  $type: 'WikipediaArticle'
  $id: string // https://wikipedia.org.ai/PageTitle
  title: string
  pageId?: number
  wikiUrl: string // Original Wikipedia URL
  summary?: string // Lead paragraph
  infobox?: Infobox // Structured data
  categories?: Category[]
  images?: Image[]
  links?: Link[]
  citations?: Citation[]
  sameAs?: Ref[] // Wikidata, DBpedia, etc.
}
```
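To illustrate how such a record could be assembled, here is a hedged sketch using `wtf_wikipedia` (the parser named above). The `toWikipediaArticle` helper, the field mapping, and the use of the first sentence as a stand-in for the lead paragraph are assumptions for illustration, not this package's actual import code:

```ts
import wtf from 'wtf_wikipedia'

// Hypothetical helper: parse one live article and map it onto the
// WikipediaArticle shape above (Infobox/Category/... helper types elided).
async function toWikipediaArticle(title: string) {
  const doc = await wtf.fetch(title) // fetches and parses via the MediaWiki API
  if (!doc || Array.isArray(doc)) return undefined
  return {
    $type: 'WikipediaArticle' as const,
    $id: `https://wikipedia.org.ai/${doc.title().replace(/ /g, '_')}`,
    title: doc.title(),
    pageId: doc.pageID(),
    wikiUrl: doc.url(), // original Wikipedia URL
    summary: doc.sentences()[0]?.text(), // rough stand-in for the lead paragraph
    infobox: doc.infoboxes()[0]?.json(),
    categories: doc.categories(),
  }
}
```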
Canonical entity reference with extracted data:

```ts
interface WikipediaEntity {
  $type: string // Person, Organization, Place, etc.
  $id: string // https://wikipedia.org.ai/EntityName
  wikipediaArticle: string // Link to WikipediaArticle
  wikiUrl: string
  infobox?: Infobox
  summary?: string
  wikidataId?: string
  dbpediaUri?: string
}
```
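One plausible way to derive an entity record from an article is to infer `$type` from the infobox template type. The mapping table and `toEntity` helper below are illustrative assumptions, not part of the package:

```ts
// Hypothetical mapping from common infobox template types to entity $type values.
const INFOBOX_TYPE_MAP: Record<string, string> = {
  company: 'Organization',
  person: 'Person',
  settlement: 'Place',
}

// Hypothetical helper: derive a WikipediaEntity from a WikipediaArticle,
// assuming the infobox template type string is passed in separately.
function toEntity(article: WikipediaArticle, infoboxType?: string): WikipediaEntity {
  return {
    $type: (infoboxType && INFOBOX_TYPE_MAP[infoboxType]) || 'Thing',
    $id: article.$id,
    wikipediaArticle: article.$id,
    wikiUrl: article.wikiUrl,
    infobox: article.infobox,
    summary: article.summary,
  }
}
```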
For production deployment with full Wikipedia data:

```bash
# English Wikipedia (13GB compressed, ~60GB uncompressed)
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

Create workers to process the dump using dumpster-dive:
```ts
// workers/wikipedia-import/src/index.ts
import { dumpsterDive } from 'dumpster-dive'

export default {
  async scheduled(event: ScheduledEvent, env: Env) {
    // Process Wikipedia dump in chunks
    await dumpsterDive({
      file: env.WIKIPEDIA_DUMP,
      output: env.WIKIPEDIA_PIPELINE,
      workers: 8, // Parallel processing
    })
  },
}
```
Set up Cloudflare Pipelines for streaming:

```toml
# wrangler.toml
[[pipelines]]
name = "wikipedia-structured-data"
source = { binding = "WIKIPEDIA_DUMP" }
destination = { r2_bucket = "wikipedia-catalog" }
transform = "workers/wikipedia-transform"
```
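The `workers/wikipedia-transform` step referenced above would reshape parsed articles into rows for the table defined next. A hedged sketch of that mapping follows; the `toRow` helper, its field handling, and the Infobox/Category assumptions are illustrative and not wired to the actual Pipelines transformation API:

```ts
// Hypothetical transform helper: flatten a parsed WikipediaArticle into a row
// matching the wikipedia.articles Iceberg schema defined below. Assumes the
// Infobox exposes `type`/`data` and that categories/links stringify cleanly.
function toRow(article: WikipediaArticle) {
  return {
    id: article.$id,
    title: article.title,
    page_id: article.pageId ?? null,
    summary: article.summary ?? null,
    infobox: {
      type: article.infobox?.type ?? null,
      data: article.infobox?.data ?? {},
    },
    categories: (article.categories ?? []).map(String),
    links: (article.links ?? []).map(String),
    last_modified: new Date().toISOString(),
  }
}
```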
Define schema for R2 storage:

```sql
CREATE TABLE wikipedia.articles (
  id STRING,
  title STRING,
  page_id INT,
  summary STRING,
  infobox STRUCT<type: STRING, data: MAP<STRING, STRING>>,
  categories ARRAY<STRING>,
  links ARRAY<STRING>,
  last_modified TIMESTAMP
)
USING iceberg
LOCATION 'r2://wikipedia-catalog/articles'
PARTITIONED BY (days(last_modified))
```

Query the catalog:
```sql
-- Find all AI companies
SELECT title, infobox.data
FROM wikipedia.articles
WHERE infobox.type = 'company'
  AND array_contains(categories, 'Artificial intelligence companies')
```

Create these GitHub issues for full implementation:
- Setup Cloudflare Pipeline for Wikipedia ingestion
  - Configure containers for `dumpster-dive` workers
  - Set up scheduled dump processing
  - Configure streaming to R2
- Apache Iceberg schema for Wikipedia structured data
  - Design table schema for articles
  - Define partitioning strategy
  - Set up catalog in R2
- R2 SQL catalog for Wikipedia entities
  - Configure SQL access to Iceberg tables
  - Create views for common queries
  - Set up entity resolution
MIT
This package uses data from Wikipedia, which is licensed under CC BY-SA 3.0.
When using Wikipedia data, include attribution:
This page uses content from Wikipedia. The original article can be found at [URL]. Wikipedia content is licensed under CC BY-SA 3.0.
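A small hedged sketch of a helper that fills in this attribution string (the `attributionFor` name is made up for illustration):

```ts
// Hypothetical helper: build the CC BY-SA attribution line for one article.
function attributionFor(wikiUrl: string): string {
  return (
    'This page uses content from Wikipedia. ' +
    `The original article can be found at ${wikiUrl}. ` +
    'Wikipedia content is licensed under CC BY-SA 3.0.'
  )
}

// e.g. attributionFor('https://en.wikipedia.org/wiki/Anthropic')
```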
- Wikipedia Dumps: https://dumps.wikimedia.org/
- dumpster-dive: https://github.com/spencermountain/dumpster-dive
- wtf_wikipedia: https://github.com/spencermountain/wtf_wikipedia
- Cloudflare Pipelines: https://developers.cloudflare.com/pipelines/
- Apache Iceberg: https://iceberg.apache.org/