-
Notifications
You must be signed in to change notification settings - Fork 25
Description
Hey @sdd, thanks for the active development of this great package. I've implemented Chebyshev distance (max(|a_i - b_i|)), but it produces incorrect results with KdTree queries. The within() function misses points at the exact radius boundary. I suppose that is because the distance metric is ill-defined as per the current query logic. It would be great to get some input from you and possibly bring this metric to the package. On my fork, I have added some tests for the existing metrics that could be of benefit for the maintanability of the package, increasing the coverage. After the Euclidean and Manhattan metric, Chebychev is probably the one most used, so I'd be happy to contribute with a PR later.
Chebychev distance metric
use crate::float::kdtree::Axis;
use crate::traits::DistanceMetric;
pub struct Chebyshev {}
impl<A: Axis, const K: usize> DistanceMetric<A, K> for Chebyshev {
#[inline]
fn dist(a: &[A; K], b: &[A; K]) -> A {
a.iter()
.zip(b.iter())
.map(|(&a_val, &b_val)| (a_val - b_val).abs())
.fold(A::zero(), |acc, val| acc.max(val))
}
#[inline]
fn dist1(a: A, b: A) -> A {
(a - b).abs()
}
}Hypothesis
I have not looked too deep into the code, but my hypothesis is that the issue is that Kiddo's KdTree query logic uses the dist1 function for branch pruning, assuming that single-axis distance provides a lower bound for the full distance.
Lines 97 to 109 in 2056051
| pub trait DistanceMetric<A, const K: usize> { | |
| /// returns the distance between two K-d points, as measured | |
| /// by a particular distance metric | |
| fn dist(a: &[A; K], b: &[A; K]) -> A; | |
| /// returns the distance between two points along a single axis, | |
| /// as measured by a particular distance metric. | |
| /// | |
| /// (needs to be implemented as it is used by the NN query implementations | |
| /// to extend the minimum acceptable distance for a node when recursing | |
| /// back up the tree) | |
| fn dist1(a: A, b: A) -> A; | |
| } |
This assumption would fail for Chebyshev because it's a max-based metric (dist1 could be not a valid lower bound. The maximum difference could be on any axis.
Example
Trying to make this more graspable on an example: If point A = (0, 0) has query radius 1 and point B = (0, 5) where max(|0-0|, |0-5|) = 5 > 1 should be pruned. But dist1(A.x, B.x) = |0-0| = 0 < 1 keeps this branch incorrectly.
fn test_within_chebyshev_distance() {
let mut kdtree: KdTree<f32, 2> = KdTree::new();
let points = [
([0.0f32, 0.0f32], 0), // distance 0
([0.5f32, 0.5f32], 1), // Chebyshev: 0.5
([1.0f32, 0.0f32], 2), // Chebyshev: 1.0 <- would not be found but should be
([0.8f32, 0.9f32], 3), // Chebyshev: 0.9
];
for (point, index) in points { kdtree.add(&point, index); }
let query_point = [0.0f32, 0.0f32];
let radius = 1.0;
let results = kdtree.within::<Chebyshev>(&query_point, radius);
// Point at [1.0, 0.0] (index 2) has Chebyshev distance exactly 1.0 but is NOT found
let found_indices: Vec<u64> = results.iter().map(|r| r.item).collect();
assert!(found_indices.contains(&2)); // FAILS - point should be included!
}Expected: [0, 1, 2, 3] - Actual: [0, 1, 3]
(for long test implementation see https://github.com/cbueth/kiddo/blob/2d03b4fe304b990ec710802c66d79ba4729a24e7/src/float/distance.rs#L765)
Questions
So for knowing where to head to with this, hopefully bringing
- Is my hypothesis about
dist1being the root cause correct? - For adding the Chebyshev
DistanceMetric, which commit/tag/branch should I best work from before creating a PR? - If more complicated to solve, do you have an idea where I need to extend the current functionality?
- Has there been any work done towards max-based metrics?
Thanks for any guidance!
Carlson