Canonical entity references from Wikipedia - Structured data extraction from Wikipedia articles using dumpster-dive and wtf_wikipedia.
This package provides:
- Canonical URLs for Wikipedia entities: `https://wikipedia.org.ai/EntityName`
- Structured data extraction from Wikipedia articles (infoboxes, categories, links, etc.)
- Semantic references for entities across the `.do` platform
Instead of using raw Wikipedia URLs, we use wikipedia.org.ai as the canonical namespace:
```ts
// ❌ Don't use raw Wikipedia URLs
const anthropic = { $id: 'https://en.wikipedia.org/wiki/Anthropic' }

// ✅ Use canonical wikipedia.org.ai URLs
const anthropic = { $id: 'https://wikipedia.org.ai/Anthropic' }
```

This provides:
- Consistency - Same format across all entity references
- Permanence - Our URLs don't change even if Wikipedia reorganizes
- Integration - Works seamlessly with our semantic triple patterns
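As a rough sketch of the convention, a raw Wikipedia URL maps to the canonical form by reusing the page title. The `toCanonical` helper below is illustrative only, not an export of this package:

```ts
// Hypothetical helper: rewrite a raw English Wikipedia URL into the
// canonical wikipedia.org.ai form by reusing the /wiki/<PageTitle> segment.
function toCanonical(wikiUrl: string): string {
  const title = new URL(wikiUrl).pathname.replace(/^\/wiki\//, '')
  return `https://wikipedia.org.ai/${title}`
}

// toCanonical('https://en.wikipedia.org/wiki/Anthropic')
// => 'https://wikipedia.org.ai/Anthropic'
```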
```bash
pnpm add wikipedia.org.ai
```

```ts
import { articles, getArticle, searchArticles } from 'wikipedia.org.ai/data'

// Get specific article
const anthropic = getArticle('Anthropic')
console.log(anthropic.$id) // https://wikipedia.org.ai/Anthropic
console.log(anthropic.summary) // First paragraph

// Search articles
const aiArticles = searchArticles('artificial intelligence')
```
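If `searchArticles` returns article objects shaped like `getArticle`'s result (an assumption; the return type is not shown above), the results can be used directly as entity references:

```ts
// Assumes each search result carries $id and title like the article above.
for (const article of searchArticles('artificial intelligence')) {
  console.log(`${article.title}: ${article.$id}`)
}
```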
```ts
import { getEntityUrl } from 'wikipedia.org.ai/data'

// Generate canonical URLs
const url = getEntityUrl('Anthropic')
// https://wikipedia.org.ai/Anthropic

// Use in semantic triples
const company = {
  $type: 'Organization',
  $id: url,
  name: 'Anthropic',
  sameAs: {
    $id: 'https://en.wikipedia.org/wiki/Anthropic',
  },
}
```
```ts
import { $ } from 'sdk.do'

// Reference Wikipedia entities
const anthropic = {
  $type: $.Organization,
  $id: 'https://wikipedia.org.ai/Anthropic',
  name: 'Anthropic',
}

// Query relationships
const founders = await db.related(anthropic, $.foundedBy, $.Person)
```
Import ~10 sample articles for testing:

```bash
pnpm import
```

This uses the Wikipedia MediaWiki API to fetch sample articles.
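For reference, here is a minimal sketch of the kind of request such an import can make against the public MediaWiki API. The actual import script is not shown here, and the `fetchSummary` helper is a hypothetical name:

```ts
// Hypothetical helper: fetch a plain-text lead section for one page title
// via the standard MediaWiki action API (TextExtracts extension).
async function fetchSummary(title: string): Promise<string | undefined> {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'extracts',
    exintro: '1', // lead section only
    explaintext: '1', // plain text instead of HTML
    format: 'json',
    origin: '*', // CORS-friendly when called from a browser
    titles: title,
  })
  const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`)
  const data = (await res.json()) as { query?: { pages?: Record<string, { extract?: string }> } }
  const page = Object.values(data.query?.pages ?? {})[0]
  return page?.extract
}

// e.g. const summary = await fetchSummary('Anthropic')
```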
For the full Wikipedia dump, we need to set up Cloudflare infrastructure:

```bash
pnpm import:full # Shows TODO requirements
```

Requirements for full import:
- Cloudflare Containers - Run `dumpster-dive` workers to process dumps
- Cloudflare Pipelines - Stream processed data
- R2 Storage - Store in Apache Iceberg format
- R2 SQL - Query structured data catalog
See Cloudflare Pipeline Setup below.
```ts
interface WikipediaArticle {
  $type: 'WikipediaArticle'
  $id: string // https://wikipedia.org.ai/PageTitle
  title: string
  pageId?: number
  wikiUrl: string // Original Wikipedia URL
  summary?: string // Lead paragraph
  infobox?: Infobox // Structured data
  categories?: Category[]
  images?: Image[]
  links?: Link[]
  citations?: Citation[]
  sameAs?: Ref[] // Wikidata, DBpedia, etc.
}
```
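To illustrate how such a record could be assembled, here is a hedged sketch using `wtf_wikipedia` (the parser named above). The `toWikipediaArticle` helper, the field mapping, and the use of the first sentence as a stand-in for the lead paragraph are assumptions for illustration, not this package's actual import code:

```ts
import wtf from 'wtf_wikipedia'

// Hypothetical helper: parse one live article and map it onto the
// WikipediaArticle shape above (Infobox/Category/... helper types elided).
async function toWikipediaArticle(title: string) {
  const doc = await wtf.fetch(title) // fetches and parses via the MediaWiki API
  if (!doc || Array.isArray(doc)) return undefined
  return {
    $type: 'WikipediaArticle' as const,
    $id: `https://wikipedia.org.ai/${doc.title().replace(/ /g, '_')}`,
    title: doc.title(),
    pageId: doc.pageID(),
    wikiUrl: doc.url(), // original Wikipedia URL
    summary: doc.sentences()[0]?.text(), // rough stand-in for the lead paragraph
    infobox: doc.infoboxes()[0]?.json(),
    categories: doc.categories(),
  }
}
```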
Canonical entity reference with extracted data:

```ts
interface WikipediaEntity {
  $type: string // Person, Organization, Place, etc.
  $id: string // https://wikipedia.org.ai/EntityName
  wikipediaArticle: string // Link to WikipediaArticle
  wikiUrl: string
  infobox?: Infobox
  summary?: string
  wikidataId?: string
  dbpediaUri?: string
}
```
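One plausible way to derive an entity record from an article is to infer `$type` from the infobox template type. The mapping table and `toEntity` helper below are illustrative assumptions, not part of the package:

```ts
// Hypothetical mapping from common infobox template types to entity $type values.
const INFOBOX_TYPE_MAP: Record<string, string> = {
  company: 'Organization',
  person: 'Person',
  settlement: 'Place',
}

// Hypothetical helper: derive a WikipediaEntity from a WikipediaArticle,
// assuming the infobox template type string is passed in separately.
function toEntity(article: WikipediaArticle, infoboxType?: string): WikipediaEntity {
  return {
    $type: (infoboxType && INFOBOX_TYPE_MAP[infoboxType]) || 'Thing',
    $id: article.$id,
    wikipediaArticle: article.$id,
    wikiUrl: article.wikiUrl,
    infobox: article.infobox,
    summary: article.summary,
  }
}
```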
For production deployment with full Wikipedia data:

```bash
# English Wikipedia (13GB compressed, ~60GB uncompressed)
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

Create workers to process the dump using dumpster-dive:
```ts
// workers/wikipedia-import/src/index.ts
import { dumpsterDive } from 'dumpster-dive'

export default {
  async scheduled(event: ScheduledEvent, env: Env) {
    // Process Wikipedia dump in chunks
    await dumpsterDive({
      file: env.WIKIPEDIA_DUMP,
      output: env.WIKIPEDIA_PIPELINE,
      workers: 8, // Parallel processing
    })
  },
}
```
Set up Cloudflare Pipelines for streaming:

```toml
# wrangler.toml
[[pipelines]]
name = "wikipedia-structured-data"
source = { binding = "WIKIPEDIA_DUMP" }
destination = { r2_bucket = "wikipedia-catalog" }
transform = "workers/wikipedia-transform"
```
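The `workers/wikipedia-transform` step referenced above would reshape parsed articles into rows for the table defined next. A hedged sketch of that mapping follows; the `toRow` helper, its field handling, and the Infobox/Category assumptions are illustrative and not wired to the actual Pipelines transformation API:

```ts
// Hypothetical transform helper: flatten a parsed WikipediaArticle into a row
// matching the wikipedia.articles Iceberg schema defined below. Assumes the
// Infobox exposes `type`/`data` and that categories/links stringify cleanly.
function toRow(article: WikipediaArticle) {
  return {
    id: article.$id,
    title: article.title,
    page_id: article.pageId ?? null,
    summary: article.summary ?? null,
    infobox: {
      type: article.infobox?.type ?? null,
      data: article.infobox?.data ?? {},
    },
    categories: (article.categories ?? []).map(String),
    links: (article.links ?? []).map(String),
    last_modified: new Date().toISOString(),
  }
}
```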
Define schema for R2 storage:

```sql
CREATE TABLE wikipedia.articles (
  id STRING,
  title STRING,
  page_id INT,
  summary STRING,
  infobox STRUCT<type: STRING, data: MAP<STRING, STRING>>,
  categories ARRAY<STRING>,
  links ARRAY<STRING>,
  last_modified TIMESTAMP
)
USING iceberg
LOCATION 'r2://wikipedia-catalog/articles'
PARTITIONED BY (days(last_modified))
```

Query the catalog:
```sql
-- Find all AI companies
SELECT title, infobox.data
FROM wikipedia.articles
WHERE infobox.type = 'company'
  AND array_contains(categories, 'Artificial intelligence companies')
```

Create these GitHub issues for full implementation:
- Setup Cloudflare Pipeline for Wikipedia ingestion
  - Configure containers for `dumpster-dive` workers
  - Set up scheduled dump processing
  - Configure streaming to R2
- Apache Iceberg schema for Wikipedia structured data
  - Design table schema for articles
  - Define partitioning strategy
  - Set up catalog in R2
- R2 SQL catalog for Wikipedia entities
  - Configure SQL access to Iceberg tables
  - Create views for common queries
  - Set up entity resolution
MIT
This package uses data from Wikipedia, which is licensed under CC BY-SA 3.0.
When using Wikipedia data, include attribution:
This page uses content from Wikipedia. The original article can be found at [URL]. Wikipedia content is licensed under CC BY-SA 3.0.
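A small hedged sketch of a helper that fills in this attribution string (the `attributionFor` name is made up for illustration):

```ts
// Hypothetical helper: build the CC BY-SA attribution line for one article.
function attributionFor(wikiUrl: string): string {
  return (
    'This page uses content from Wikipedia. ' +
    `The original article can be found at ${wikiUrl}. ` +
    'Wikipedia content is licensed under CC BY-SA 3.0.'
  )
}

// e.g. attributionFor('https://en.wikipedia.org/wiki/Anthropic')
```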
- Wikipedia Dumps: https://dumps.wikimedia.org/
- dumpster-dive: https://github.com/spencermountain/dumpster-dive
- wtf_wikipedia: https://github.com/spencermountain/wtf_wikipedia
- Cloudflare Pipelines: https://developers.cloudflare.com/pipelines/
- Apache Iceberg: https://iceberg.apache.org/