OscarSouth/dataplatform-template

Data Platform Template

A reusable data platform built on Dagster + Neo4j + Polars + marimo, with a comprehensive MCP server ecosystem for agent-assisted development. Ships with an example domain that demonstrates every architectural pattern.

What This Is

A production-quality scaffold for building analytics platforms where data naturally forms a graph. The template provides:

  • Dagster orchestration with medallion-layer asset organization (raw/stg/enr/dim/fct)
  • Neo4j graph persistence via a custom IO Manager that bridges Polars DataFrames to graph nodes
  • Polars for all in-memory computation (no pandas)
  • marimo reactive notebooks for analysis and visualization
  • MCP servers (neo4j, dagster, serena, context7, memory, marimo) enabling agent-assisted development where the agent can query the graph, trigger pipelines, inspect results, and navigate code — all within a single conversation

An example domain is included to demonstrate every pattern. See EXAMPLE.md for its specification.

Quick Start

Prerequisites: Python 3.12+, uv, Docker, just (recommended)

just setup    # Install deps, start Neo4j, apply schema
just dagster  # Start pipeline UI at localhost:3000 (separate terminal)
just notebook # Start notebook editor at localhost:2718 (separate terminal)

Or without just:

./dev.sh      # Start Neo4j, sync deps, apply schema, launch Dagster

Open the Dagster UI at localhost:3000 and click Materialize All, or materialize assets individually in dependency order.

just check    # Full verification: mypy + ruff + pytest

See CONFIG.md for manual setup, MCP server configuration, and troubleshooting.

Architecture at a Glance

External Data Source
       │
  :Raw ──▶ :Stg ──▶ :Enr ──▶ :Fct
                                │
  :Dim (calendar, entities, groupings)

Each layer is a Neo4j node label. Nodes carry both a layer label and a domain label (:Enr:DataPoint, :Fct:Alert), enabling queries like MATCH (r:Enr:DataPoint) or MATCH (r:DataPoint) across all layers.

Three-layer code architecture with strict dependency direction:

  • Layer A (Data) — API wrapper and raw ingestion. The only layer with external I/O.
  • Layer B (Computation) — Pure functions for domain metrics. No imports from A or C. Testable with synthetic data alone.
  • Layer C (Persistence) — A custom Dagster IO Manager bridges Polars DataFrames to Neo4j. Assets return DataFrames and declare Cypher templates in metadata.

See ARCHITECTURE.md for the full technical design.
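The Layer C contract — assets return DataFrames and declare Cypher templates in metadata — might look roughly like this. The template text, label names, and helper are assumptions for illustration; the real templates live in the asset definitions:

```python
# Hypothetical Cypher template of the kind an asset could declare in its
# metadata: the IO manager would pass the DataFrame's rows as the $rows
# parameter, UNWIND them, and MERGE one node per row (idempotent upsert).

ENR_TEMPLATE = """
UNWIND $rows AS row
MERGE (n:Enr:DataPoint {id: row.id})
SET n += row
"""

def render_rows(records: list[dict]) -> dict:
    """Shape a list of row dicts into the $rows query parameter."""
    return {"rows": records}

print(render_rows([{"id": 1, "value": 3.2}]))
```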

What Gets Built

Dagster assets across 5 materialization layers:

  • Data pipeline: raw_* → stg_* → enr_* — raw ingestion, validation, enrichment with computed metrics
  • Dimensions: dim_calendar, dim_* — temporal backbone, entity dimensions, grouping dimensions, temporal events
  • Fact tables: fct_* — classified outputs derived from enriched data

Plus purge_graph (ops group) for memory-safe graph reset via APOC batches.
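A batched purge of the kind purge_graph describes could be expressed with apoc.periodic.iterate; the exact query and batch size below are assumptions, not the template's actual implementation:

```python
# Hypothetical APOC-batched delete: apoc.periodic.iterate runs the action
# in fixed-size transactions, so a full-graph reset never needs to hold
# every node in a single transaction's memory.

PURGE_QUERY = """
CALL apoc.periodic.iterate(
  'MATCH (n) RETURN n',
  'DETACH DELETE n',
  {batchSize: 10000}
)
"""

print("apoc.periodic.iterate" in PURGE_QUERY)
```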

Component            Tool
Language             Python 3.12+
Package management   uv
Task runner          just
Data processing      Polars
Orchestration        Dagster
Persistence          Neo4j
Notebooks            marimo
Quality              mypy, ruff, pytest + hypothesis

MCP Server Ecosystem

The platform includes six MCP servers that enable deep agent-component interaction:

Server     What It Enables
neo4j      Live graph exploration, Cypher query prototyping, data verification — the graph becomes a reasoning surface
dagster    Trigger materializations, inspect run status/logs/failures as structured data
serena     Semantic code navigation, symbol search, find all references
context7   Up-to-date library documentation for Dagster/Polars/Neo4j/marimo APIs
memory     Persistent knowledge graph across sessions
marimo     Interact with running notebook sessions

See MCP_GUIDE.md for detailed capabilities and configuration.

Starting a New Project

Quick Path

just init my_project_name

This strips the example domain and scaffolds a blank project with the correct patterns in place.

Manual Path

  1. Replace dataplatform/domain.py with your domain constants and types
  2. Replace dataplatform/metrics/ with your domain metrics (pure functions)
  3. Replace dataplatform/resources/ with your data source wrapper
  4. Update assets in dataplatform/assets/ with your Cypher templates
  5. Update dataplatform/graph/schema.py with your constraints
  6. Update tests to match your domain
  7. Run just check to verify everything passes

Recommended build order: dimensions first (dim/), then raw ingestion (raw/), then validation (stg/), then enrichment (enr/), then fact tables (fct/).
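Step 5's constraints typically amount to one uniqueness constraint per key. A hedged sketch of what schema.py might generate (label, property, and naming convention are illustrative):

```python
# Hypothetical constraint DDL of the kind dataplatform/graph/schema.py
# would apply. Uses Neo4j 5 syntax: CREATE CONSTRAINT ... IF NOT EXISTS
# FOR (n:Label) REQUIRE n.prop IS UNIQUE.

def unique_constraint(label: str, prop: str) -> str:
    """Render a named uniqueness constraint for one label/property pair."""
    return (
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )

print(unique_constraint("Dim", "date"))
```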

See DATA_MODELLING.md for guidance on graph data modelling.

Research Notebooks

Interactive notebooks built with marimo for exploring the graph and analyzing data.

just notebook                  # Open notebook editor at localhost:2718
just notebook-run file.py      # Run a single notebook as an app

See MARIMO_GUIDE.md for notebook capabilities and usage.

Documentation

Document            What It Covers
EXAMPLE.md          Included example domain — remove when starting your own project
ARCHITECTURE.md     Technical design — layers, IO manager, constraints, idempotency
CONFIG.md           Setup and configuration — prerequisites, manual setup, MCP servers, troubleshooting
DATA_MODELLING.md   Graph data modelling guide — medallion layers, Cypher templates, calendar integration
MCP_GUIDE.md        MCP ecosystem — server capabilities, development workflows, configuration
MARIMO_GUIDE.md     Notebook guide — marimo capabilities, Neo4j integration, visualization
CLAUDE.md           Agent guidelines — development workflow, verification, coding standards
