# lexime-trie

A char-wise Double-Array Trie library for lexime. Zero dependencies.

Replaces trie-rs + bincode with a compact, cache-friendly trie that supports both dictionary (DoubleArray<char>) and romaji (DoubleArray<u8>) use cases.
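The double-array technique behind this layout can be illustrated with the classic base/check transition rule: a child state lives at `base[s] + code(label)`, and the move is valid only if `check` at that slot points back to the parent. The sketch below is a hand-built toy (the arrays, codes, and `step` helper are illustrative, not lexime-trie internals):

```rust
// Toy double-array for the single key "ab" with codes a=1, b=2.
// (Hand-built for illustration; lexime-trie constructs these arrays itself.)
const BASE: [i32; 4] = [0, 1, 0, 0];
const CHECK: [i32; 4] = [-1, 0, -1, 1]; // -1 marks an unused slot

// One transition: child = BASE[s] + code, valid iff CHECK[child] == s.
fn step(s: i32, code: i32) -> Option<i32> {
    let t = BASE[s as usize] + code;
    if (t as usize) < CHECK.len() && CHECK[t as usize] == s {
        Some(t)
    } else {
        None
    }
}

fn main() {
    // Walk "ab" from the root (state 0): 'a' -> state 1, 'b' -> state 3.
    let end = step(0, 1).and_then(|s| step(s, 2));
    assert_eq!(end, Some(3));
    // No 'b' edge leaves the root, so that walk fails immediately.
    assert_eq!(step(0, 2), None);
    println!("ok");
}
```

Because a transition is two array reads and a comparison, lookups stay O(m) in the key length with no pointer chasing.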

## Features

- 8 bytes/node (#[repr(C)]) — fits 8 nodes per cache line
- Char-wise indexing — 1.5–2× faster than byte-wise for Japanese text (based on crawdad benchmarks)
- Frequency-ordered CodeMapper — remaps labels to dense codes for compact arrays
- Terminal symbol approach — cleanly handles keys that are both exact matches and prefixes of other keys
- Sibling chain (SoA) — enables predictive search via DFS without increasing node size for other operations
- Fast serialization — binary format with LXTR magic header, ~5ms copy-load
- Zero-copy deserialization — DoubleArrayRef borrows nodes/siblings directly from mmap or a byte buffer
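The frequency-ordered mapping idea can be sketched in a few lines (a standalone illustration; `build_code_map` is not the crate's actual CodeMapper API): count label occurrences across all keys, then hand the smallest codes to the most frequent labels so transitions cluster near the start of the arrays.

```rust
use std::collections::HashMap;

// Sketch of a frequency-ordered code mapper (illustrative, not the crate's API):
// more frequent labels get smaller codes, keeping base/check arrays dense.
fn build_code_map(keys: &[&str]) -> HashMap<char, u32> {
    let mut freq: HashMap<char, usize> = HashMap::new();
    for key in keys {
        for c in key.chars() {
            *freq.entry(c).or_insert(0) += 1;
        }
    }
    let mut labels: Vec<(char, usize)> = freq.into_iter().collect();
    // Descending frequency, with the label itself as a deterministic tie-break.
    labels.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    labels
        .into_iter()
        .enumerate()
        .map(|(i, (c, _))| (c, i as u32 + 1)) // code 0 left free, e.g. for a terminal symbol
        .collect()
}

fn main() {
    let map = build_code_map(&["aa", "ab", "b"]);
    assert_eq!(map[&'a'], 1); // 'a' occurs 3 times -> smallest code
    assert_eq!(map[&'b'], 2); // 'b' occurs 2 times
    println!("ok");
}
```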

## Search Operations

| Operation | Description | Use Case |
|---|---|---|
| `exact_match` | O(m) key lookup | Dictionary lookup |
| `common_prefix_search` | All prefixes of a query | Lattice construction (Viterbi) |
| `predictive_search` | All keys starting with a prefix | Autocomplete / predict |
| `probe` | Key existence + children check (4-state) | Romaji input (`None`/`Prefix`/`Exact`/`ExactAndPrefix`) |
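The four probe states fall out of two independent facts about the queried key: whether a value is stored there and whether the node has children. A minimal sketch (the enum and `classify` helper are illustrative, not the crate's exact types):

```rust
// Illustrative 4-state probe result (not the crate's actual type).
#[derive(Debug, PartialEq)]
enum ProbeState {
    None,           // key is not in the trie and extends no entry
    Prefix,         // key is a strict prefix of some entry
    Exact,          // key is an entry and nothing extends it
    ExactAndPrefix, // key is an entry and a prefix of longer entries
}

// Derive the state from (stored value?, has children?).
fn classify(value: Option<u32>, has_children: bool) -> ProbeState {
    match (value, has_children) {
        (None, false) => ProbeState::None,
        (None, true) => ProbeState::Prefix,
        (Some(_), false) => ProbeState::Exact,
        (Some(_), true) => ProbeState::ExactAndPrefix,
    }
}

fn main() {
    // e.g. romaji "k" maps to nothing itself but prefixes "ka", "ki", ...
    assert_eq!(classify(None, true), ProbeState::Prefix);
    // e.g. a key that is both a complete entry and a prefix of longer ones.
    assert_eq!(classify(Some(7), true), ProbeState::ExactAndPrefix);
    println!("ok");
}
```

This is why romaji input needs the 4-state answer: a `Prefix` result means "keep typing", while `Exact` can commit immediately.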

## Usage

```rust
use lexime_trie::DoubleArray;

// Build from sorted keys (value_id = index)
let da = DoubleArray::<u8>::build(&[b"abc", b"abd", b"xyz"]);

// Exact match
assert_eq!(da.exact_match(b"abc"), Some(0));
assert_eq!(da.exact_match(b"zzz"), None);

// Common prefix search
for m in da.common_prefix_search(b"abcd") {
    println!("prefix len={}, value_id={}", m.len, m.value_id);
}

// Predictive search
for m in da.predictive_search(b"ab") {
    println!("key={:?}, value_id={}", m.key, m.value_id);
}

// Probe (romaji trie scenario)
let result = da.probe(b"ab");
// result.value: Option<u32>, result.has_children: bool

// Serialization
let bytes = da.as_bytes();
let da2 = DoubleArray::<u8>::from_bytes(&bytes).unwrap();
```

### Zero-copy deserialization

The input buffer must be aligned to at least 4 bytes. Buffers from mmap or page-aligned allocations satisfy this requirement.
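That precondition can be verified up front with a pointer-alignment check; the helper below is a standalone sketch, not part of the crate:

```rust
// Check the 4-byte alignment required by the zero-copy view.
fn is_aligned_to_u32(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % std::mem::align_of::<u32>() == 0
}

fn main() {
    // A Vec<u32> backing is 4-byte aligned by construction...
    let backing = vec![0u32; 4];
    let bytes: &[u8] =
        unsafe { std::slice::from_raw_parts(backing.as_ptr() as *const u8, 16) };
    assert!(is_aligned_to_u32(bytes));
    // ...but slicing one byte in breaks the alignment.
    assert!(!is_aligned_to_u32(&bytes[1..]));
    println!("ok");
}
```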

```rust
use lexime_trie::{DoubleArray, DoubleArrayRef};

let da = DoubleArray::<u8>::build(&[b"abc", b"abd", b"xyz"]);
let raw = da.as_bytes();

// In production, use mmap (page-aligned). For a quick demo, use Vec<u32>
// as backing storage to guarantee 4-byte alignment.
let mut backing = vec![0u32; (raw.len() + 3) / 4];
let bytes: &mut [u8] = unsafe {
    std::slice::from_raw_parts_mut(backing.as_mut_ptr() as *mut u8, raw.len())
};
bytes.copy_from_slice(&raw);

// Zero-copy: borrows nodes & siblings directly from the byte buffer
let da_ref = DoubleArrayRef::<u8>::from_bytes_ref(bytes).unwrap();
assert_eq!(da_ref.exact_match(b"abc"), Some(0));

// Convert to owned if needed
let da_owned = da_ref.to_owned();
```

### Char keys (dictionary)

```rust
use lexime_trie::DoubleArray;

let keys: Vec<Vec<char>> = vec![
    "あい".chars().collect(),
    "あう".chars().collect(),
    "かき".chars().collect(),
];
let da = DoubleArray::<char>::build(&keys);
assert!(da.exact_match(&"あい".chars().collect::<Vec<_>>()).is_some());
```

## Platform Requirements

- Little-endian only (x86_64, aarch64, etc.). The crate emits a compile_error! on big-endian targets.
- Serialization uses native LE layout for zero-copy performance; the binary format is not portable to big-endian platforms.
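Assuming the LXTR magic is stored as those four ASCII bytes (an assumption here, not something this README states), the native-equals-LE property the format relies on can be demonstrated directly:

```rust
fn main() {
    // Assumed: the magic header is the ASCII bytes "LXTR" read as a u32.
    let magic = u32::from_le_bytes(*b"LXTR");
    // Round-tripping through LE always recovers the bytes...
    assert_eq!(magic.to_le_bytes(), *b"LXTR");
    // ...and on the supported little-endian targets, native layout equals LE,
    // which is what makes zero-copy reads of serialized u32 fields valid.
    #[cfg(target_endian = "little")]
    assert_eq!(magic.to_ne_bytes(), magic.to_le_bytes());
    println!("ok");
}
```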

## Design

See SPEC.md for the full design document.

## License

MIT
