LSM-tree storage engine in Rust. Embedded library; provides keyed point reads, prefix and range scans, MVCC snapshots, compaction, and a block cache. No write-ahead log: durability is the caller's responsibility. Built for CoordiNode; usable standalone.
On-disk format version V5. V5 introduces a wire-format break for filter blocks (BuRR replaces Bloom); V3 and V4 databases are not readable by this version and vice versa. Versioning is single-monotonic: every breaking format change bumps to the next version with explicit migration notes.
cargo add coordinode-lsm-tree@5.7or add it to Cargo.toml directly:
[dependencies]
coordinode-lsm-tree = "5.7"use lsm_tree::{AbstractTree, Config, SeqNo, SequenceNumberCounter};
let seqno = SequenceNumberCounter::default();
// Point the engine at a directory on disk. Two counters: one hands out write
// sequence numbers, the other tracks the persisted watermark. The default build
// adds no compression or encryption (see the feature-flag table below to opt in).
let tree = Config::new("my_db", seqno.clone(), SequenceNumberCounter::default()).open()?;
tree.insert("key", "value", seqno.next());
// Read at the latest visible snapshot. MVCC: pass any past SeqNo to read as-of.
let value = tree.get("key", SeqNo::MAX)?;
assert_eq!(value.as_deref(), Some(b"value".as_ref()));
// No write-ahead log by design: call flush when you need durability (or layer
// your own WAL, see docs/external-wal.md).
tree.flush_active_memtable(0)?;- Point reads via
get/multi_get(batch-optimized). PinnableSlicefor zero-copy reads.BurrFilterAMQ filter (Bumped Ribbon Retrieval, Walzer & Dillinger 2022): ~1% memory overhead vs the information-theoretic minimum: ~30% smaller filter blocks than a same-FPR Bloom filter, or ~10× tighter FPR at the same memory budget. Used for both per-key and per-prefix membership checks.- Forward and reverse range / prefix iteration.
- Block cache with size cap.
- File-descriptor cache to bound
fopensyscalls.
WriteBatchwith seqno-grouped batch writes: caller-controlled atomic visibility.- Single deletion tombstones (
remove_weak). - Range tombstones (
delete_range/delete_prefix). - Merge operators for commutative LSM operations.
- Optional key-value separation (BlobTree) for large-value workloads with automatic garbage collection.
- Leveled, size-tiered, dynamic-leveled, and FIFO strategies.
- Intra-L0 compaction for overlapping runs.
- Major compaction (full force flush + merge).
- Optional compaction filters for custom logic during compactions.
- Merge-aware compaction resolves operands lazily.
- Block-based tables with optional compression (none / LZ4 / Zstd) and prefix truncation.
- Per-table data block size policy and per-table compression policy.
- Optional zstd dictionary compression, trained per-table or per-column for small (4-64 KiB) blocks and blob files.
- Optional columnar PAX block layout (
columnarfeature): per-row-group column-major storage of key / seqno / value-type / value sub-columns, for column-projection pushdown, vectorized predicate scans, per-column dictionaries, and positional delete-bitmap MVCC. - Optional block-level encryption at rest: AES-256-GCM, key supplied by caller.
- Optional per-block error-correcting parity (
page_eccfeature): Reed-Solomon shards or per-word SEC-DED after each block, with on-read self-healing, three-state verification, and a patrol scrub for latent bit-rot. - Optional per-table block hash indexes for faster point lookups [3].
- Optional partitioned block index & filters for better cache efficiency [1].
- Per-level filter/index block pinning configuration.
- Thread-safe
BTreeMap-like API. SequenceNumberGeneratortrait: pluggable seqno source.- Custom
UserComparatorfor non-lexicographic ordering. - MVCC: snapshot reads at a chosen
SeqNo. - Point-in-time recovery snapshots via
Tree::create_checkpoint: hard-link every live SST + blob file into a fresh directory in O(1) per file, zero extra disk until the source files compact away. Compaction continues during the checkpoint (deletions are deferred), and the resulting directory opens as an independent tree.
- Storage footprint and entry shape (
storage_stats): used / capacity / available bytes, item & table counts, average on-disk entry size, reclaimable-bytes estimate, and a coarse storage status. - Per-level and per-segment size / entry / access stats (
level_segment_stats) for tiering and placement decisions. - Approximate range size (
approximate_range_stats) and cardinality + selectivity (approximate_range_cardinality) for query planning, estimated from block-index offsets and per-block zone maps without reading data blocks. - Compaction-debt estimate (
compaction_debt): pending-compaction bytes above each level's target, a scheduler / tiering signal.
- 100% stable Rust, MSRV 1.92.
no_std+allocsupport: the core engine (read / write / compaction / recovery over the injectedFs) compiles withoutstd; std-only conveniences (threaded fan-out, system clock, the std filesystem backend) stay behind thestdfeature.- No FFI: zstd via
structured-zstd(pure-Rust), LZ4 vialz4_flex, AES viaaes-gcm. - Pluggable
Fstrait: back the engine on the standard filesystem, onio_uring, on an in-memoryMemFs, or on a custom implementation. - Pluggable
CompressionProviderfor third-party codecs.
A factual capability matrix, not a benchmark: pick by the capabilities your workload needs; for throughput / latency see Benchmarks.
| coordinode-lsm-tree | RocksDB | sled | redb | |
|---|---|---|---|---|
| Implementation | pure Rust | C++ (FFI bindings) | pure Rust | pure Rust |
| Data structure | LSM-tree | LSM-tree | log-structured | B+tree (copy-on-write) |
| MVCC snapshot reads | ✅ (read at any SeqNo) |
✅ (snapshots) | ✅ (transactions) | ✅ (read transactions) |
no_std + alloc |
✅ | no | no | no |
| Encryption at rest | ✅ built-in (AES-256-GCM) | wrapper / plugin | no | no |
| Per-block ECC + self-heal | ✅ | no | no | no |
| Columnar (PAX) blocks | ✅ | no | no | no |
| CDC surviving compaction | ✅ (data-block grain) | WAL tail only | no | no |
| Built-in durability / WAL | external (caller-owned) | ✅ | ✅ | ✅ |
Pluggable FS / io_uring |
✅ | no | no | no |
RocksDB is the closest peer in data model and feature breadth, but it is a C++
library reached over FFI; this engine targets the same LSM capabilities in pure
Rust: no C/C++ toolchain, no_std-capable, no FFI audit surface. sled and redb
are pure-Rust but different shapes: sled is log-structured and pre-1.0; redb is a
read-optimised B+tree. Fjall is a
higher-level key-value store built on fjall-rs/lsm-tree, the engine this crate
was forked from, so Fjall and this crate are siblings over a shared ancestor
(see Credits).
Reach for this engine when you want:
- a write-optimised embedded store (LSM beats a B+tree on write-heavy and ingest workloads);
- pure Rust, no C/C++ toolchain (cross-compilation,
no_stdtargets, reproducible builds, no FFI to audit); - integrity / compliance built in: checksums end to end, optional encryption at rest and per-block ECC with on-read self-healing;
- change-data-capture that survives compaction (
scan_since_seqno); - to own durability yourself (supply your own WAL, or accept flush-on-demand).
Look elsewhere when you need:
- millisecond tailing of in-flight, not-yet-flushed writes: there is no built-in WAL (pair with docs/external-wal.md, or use a WAL-backed engine);
- a read-mostly tiny dataset where a B+tree (redb / LMDB) has lower read amplification;
- a batteries-included database server with a query language: this is a storage engine, not a database.
Tree::scan_since_seqno(target) streams every change committed at or after a
sequence number as ScanSinceEvents (Insert, MergeOperand,
PointTombstone, RangeTombstone), in increasing seqno order. It is a
change-data-capture primitive: a downstream consumer (replica, Kafka connector,
Debezium-style pipeline) replays the events in order to reconstruct the source's
history. Superseded versions are preserved (no MVCC collapse) and tombstones are
exposed, so deletes replay faithfully.
With the runtime-toggleable seqno_in_index policy, each SST index entry
carries its data block's (seqno_min, seqno_max), and the scan skips any data
block whose seqno_max is below the target without reading it. On a
sparse-change workload (e.g. 1% of data changed since the last poll) this turns
an O(data) scan into O(changed blocks). Trees with a mix of seqno-bounded and
legacy SSTs are scanned transparently (legacy blocks fall back to a per-entry
filter), so the policy can be enabled on a live tree and takes effect as
compaction rewrites SSTs: no migration step, no format-version bump.
| Engine | CDC granularity | Survives compaction | Embeddable | Block-skip |
|---|---|---|---|---|
RocksDB GetUpdatesSince |
WAL events (lost after compaction) | no | yes | n/a |
| Pebble | SST file (64–128 MB) | yes | yes | no |
| CockroachDB changefeed | SST file | yes | no (distributed) | no |
| FoundationDB | per-event | yes | no (distributed) | n/a |
| coordinode-lsm-tree | data block (4–32 KB) | yes | yes | yes |
The trade-off: there is no write-ahead log, so we do not offer WAL-style
millisecond tailing of in-flight updates. For arbitrary historical-seqno queries
(not just "since X"), pair with Tree::create_checkpoint. To layer your own
durability on top of the engine, see the external-WAL integration contract in
docs/external-wal.md and its runnable worked example
(cargo run --example external_wal).
- Keys: up to 65,535 bytes (the on-disk encoding caps the key-length field at
u16). - Values: up to 4,294,967,295 bytes (
2³² − 1; the encoding caps the value-length field atu32). - Larger keys and values carry a proportional performance cost.
All optional. The default build (std + parallel) is the minimal core: no compression, no encryption, std filesystem, with the built-in parallel-compression executor. Every capability flag below is gated because it pulls in extra dependencies or runtime overhead. Turning std off (with alloc) selects the no_std build (see the platform note in Cargo.toml).
| Flag | Pulls in | Enable when |
|---|---|---|
lz4 |
lz4_flex |
Block compression wanted, decompression latency matters more than ratio. |
zstd |
structured-zstd (pure-Rust, no FFI) |
Block compression wanted, ratio matters more than raw decode speed (zstd trades some decode latency vs lz4 for a better ratio). Supports CompressionType::Zstd and dictionary-mode CompressionType::ZstdDict. The pure-Rust decode is competitive with the C reference (libzstd) on the engine's small (4-64 KiB) read-path blocks (see Benchmarks). |
columnar |
(pure code, alloc only) |
Columnar PAX SST block layout: each row-group stores its key / seqno / value-type / value sub-columns contiguously, for column-projection pushdown, vectorized predicate scans, per-column zstd dictionaries, and positional delete-bitmap MVCC. Enable for analytical or wide-row scans where reading a subset of columns dominates. |
encryption |
aes-gcm, rand_chacha |
AES-256-GCM block encryption at rest. Keys are caller-managed. |
page_ecc |
reed-solomon-simd |
Per-block error-correcting parity (Reed-Solomon shards or per-word SEC-DED) after each on-disk block, with on-read self-healing, three-state verification, and a patrol scrub. Enable for latent-bit-rot protection on long-lived cold data. |
crc32c |
crc32c |
Selects the hardware-accelerated CRC32C block checksum (ChecksumType::Crc32c) over the default XXH3. Enable when interop or CPU profile favours CRC32C. |
io-uring (linux only) |
io-uring |
I/O-bound workload on a modern Linux kernel: adds an io_uring Fs backend. |
io-uring-raw (linux only) |
syscalls |
Pure-Rust no_std io_uring Fs backend on raw Linux syscalls (no libc), for embedded or sandboxed Linux targets without std. |
bytes_1 |
bytes |
Consumer already speaks bytes::Bytes (tokio/hyper/tonic stack) and wants zero-copy interop with engine slices. |
parallel (default-on) |
rayon |
Built-in rayon-backed parallel block-compression executor. std-only; the CompactionSpawner trait lets a caller inject a custom executor without this flag. |
metrics |
(none) | Production observability or profiling. Compiles in atomic counters around block I/O, filter probes, compaction, and cache hit rates (tree.metrics()). Small but non-zero hot-path cost. |
ribbon-serde |
serde |
Snapshotting the internal RibbonFilterRepr for debugging or out-of-band transport. Not used by the on-disk format. |
CI runs db_bench on every push to main and on pull requests. Results from main are published to the benchmark dashboard. PRs regressing performance by more than 15% trigger an alert; more than 25% fails CI.
Flamegraphs are generated on every merge to main from instrumented db_bench runs and published under flamegraphs/<commit-sha>/flamegraph.svg on gh-pages.
Local Criterion microbenchmarks:
cargo bench --features lz4Local flamegraphs:
cd tools/db_bench
cargo run --release --features flamegraph -- \
--benchmark all --num 100000 --flamegraph
# Folded stacks: target/flamegraphs/all.folded
# Render: cargo install inferno && inferno-flamegraph target/flamegraphs/all.folded > flame.svg| Tool | Use |
|---|---|
tools/db_bench |
RocksDB-compatible benchmark suite, also drives the CI perf dashboard. |
tools/sst-dump |
Inspect / verify a single SST file out-of-band. Subcommands: verify (walk every block, check per-block XXH3, exit non-zero on corruption), properties (print the SST's stored metadata: id, key range, KV / tombstone counts, block counts, compression, timestamp), hex <offset> (raw hex dump of a region with optional Header decode; useful for inspecting a specific offset flagged by verify --verbose), index-dump (print TLI entries: end_key + offset + size + seqno per pointed-at block; useful for diagnosing range-read fan-out), dump (stream every KV entry to stdout with optional --from / --to / --max=N / --keys-only filters; full-index SSTs only), filter-stats (print BuRR filter sizing: section bytes, layer count, item count, approximate bits-per-key; single-block filters only, partitioned filters not yet supported), salvage <dest> (block-salvage a corrupt SST into a fresh file at <dest>: re-emit every block that passes its checksum, dropping and reporting the corrupt ones, so a single bad block costs only its own key range instead of the whole file). |
Every record is checksummed end to end, from the in-memory memtable, through each on-disk block, to the manifest, with optional Reed-Solomon ECC, self-healing scrub, and AAD-bound encryption at rest. Off by default, byte-identical when disabled; turned on, detection becomes recovery.
This is the tamper-evident, silent-corruption protection that clinical and regulated systems require: the integrity controls behind HIPAA §164.312(c), FDA 21 CFR Part 11, and ALCOA+.
See docs/data-integrity.md for the full integrity stack: block + per-KV checksums (including memtable-residence RAM bit-flip detection), the Page ECC spectrum, self-healing scrub, tamper-evident encryption, the five-layer manifest hardening surface, and the manifest recovery modes.
See docs/tight-space-compaction.md for the opt-in tight-space compaction: how a gated merge on a near-full disk is rewritten in key-range slices and reclaimed in place with hole punching, so a compaction completes on a disk far smaller than the data it rewrites.
See docs/INVARIANTS.md for the engine's load-bearing invariants grouped by subsystem (block layout, manifest, compaction, snapshot / seqno, range tombstones, recovery / ECC, columnar) — the rule, why it holds, and where it is enforced.
Originally created by Marvin Blum as part of fjall-rs/lsm-tree; this codebase carries the original copyright (Copyright (c) 2024-present, fjall-rs). The vendored Ribbon filter (src/table/filter/ribbon/) is by William Rågstad; see src/table/filter/ribbon/_vendored/ for the upstream license texts.
All source code is licensed under Apache-2.0. Each first-party .rs file carries an SPDX-License-Identifier: Apache-2.0 header alongside the original-author copyright and the maintainer copyright (Structured World Foundation). Contributions are accepted under the same license.
The vendored Ribbon filter (src/table/filter/ribbon/) keeps its upstream layout: it carries William Rågstad's per-module licensing commentary rather than per-file SPDX headers, plus the original LICENSE-APACHE and LICENSE-MIT preserved verbatim in src/table/filter/ribbon/_vendored/. The upstream crate is dual-licensed (MIT OR Apache-2.0); we redistribute the vendored copy only under the Apache-2.0 arm per Apache-2.0 §4.
Maintained by Structured World Foundation.
[1] https://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html
[2] https://github.com/facebook/rocksdb/wiki/BlobDB
[3] https://rocksdb.org/blog/2018/08/23/data-block-hash-index.html