Add hsm_mode: Half-Sample Mode for continuous data#984
Add hsm_mode: Half-Sample Mode for continuous data#984anurag-mds wants to merge 1 commit intoJuliaStats:masterfrom
Conversation
|
Was this PR AI-generated? |
|
The implementation, test and commits are mine. I reviewed existing StatsBase prs to match project conventions, and I used AI assistance only for wording clarity in the pr description and for general performance review and bugs made by me. it was not used for algorithm or code . since in ci pipeline documentation issues which I totally forgot was missing I am actively fixing that like using ceil(len/2) as per the definition. I am eager to explain any design or choices in detail |
I feel like the existing behavior (for floating-point numbers) is a bug. If your data are not integers, we must assume that they come from a continuous distribution and use the HSM algorithm or another estimator for continuous data. Or perhaps introduce a keyword argument: mode(x::AbstractVector{<:AbstractFloat}; discrete::Bool=false) =
discrete ? mode_discrete(x) : mode_hsm(x)
mode(x::AbstractVector) = mode_discrete(x)Here, I'm not a contributor to Distributions.jl, though (although I'd like to become a contributor; my complete PR with hyperbolic distributions seems to have joined its peers, unfortunately), that's just my opinion. |
|
I think using frequency-based mode for floats can be misleading. But making HSM the default for AbstractFloat might break cases where floats are actually categories or rounded values. Maybe a keyword like discrete=false or a dispatch mode(x, HSM()) works better. I can change the API like that if the maintainers think it’s right. |
|
@ForceBru Here's the evidence for why
As you can see Design:
Users can explicitly choose the right tool. No silent behavior changes. So are you okay with this separate function approach or shall we explore the keyword argument variant further ? |
|
All tests pass, edge cases handled, documentation added. @devmotion are there any remaining technical concerns with hsm_mode() or its API before approval? |
|
Thanks for the PR! I tend to agree that a keyword argument would make sense here. The mode is a statistical concept which can be estimated in various ways. The one currently used by The API could look like |
|
Thanks for your detailed review |
|
Apologies for the back-and-forth and the CI noise I'm currently away from my main development setup, but I’ve noted all changes precisely and will push a consolidated update shortly once I’m back. I’ll comment again once the updates are in. |
|
@nalimilan Does everything seems good? |
|
Thanks @nalimilan for the thorough review. I agree with the remaining points (doc wording, method naming, references, and test adjustments). I’m addressing them now and will push a final cleanup commit so that |
|
@nalimilan I have made necessary changes if anything to fix, modify or add do let me know! |
|
@nalimilan What do you think? |
| """ | ||
| function mode(a::AbstractArray{T}, r::UnitRange{T}; method::Symbol=:default) where T<:Integer | ||
| if method == :halfsample | ||
| return _hsm_mode(a) |
There was a problem hiding this comment.
This isn't correct, r shouldn't be ignored. Do you think you could implement it? Otherwise throw an error and document it.
test/scalarstats.jl
Outdated
| @test mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:default) == 4 | ||
| @test mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:halfsample) == 4.0 # Test halfsample with range | ||
| @test_throws ArgumentError mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:invalid) |
There was a problem hiding this comment.
Test with 2:4 to actually test that argument.
|
Everything Noted. |
|
Rebasing took more time Is everything good @nalimilan ? |
|
Is there anything left @nalimilan ? |
src/scalarstats.jl
Outdated
| filtered = sort([x for x in a if isfinite(x)]) | ||
| len = length(filtered) | ||
|
|
||
| len == 0 && throw(ArgumentError("mode is not defined for collections with no finite values")) |
There was a problem hiding this comment.
I can't see this restriction in the other mode implementation above. So maybe that is only a limitation of half-sample mode?
There was a problem hiding this comment.
Yes, it is
As you can see for Regular Frequency Mode:
Consider [Nan, Inf, -Inf]
Here,
- It can still count frequencies
- Nan appears once, Inf appears once
- Returns first one
Now, for HSM Mode:
Consider the same array,
Here,
- Filters out all non-finite values
- Nothing left to calculate
- MUST throw error
- Cannot find a cluster with no data!
There was a problem hiding this comment.
Then the error message should be more explicit:
| len == 0 && throw(ArgumentError("mode is not defined for collections with no finite values")) | |
| len == 0 && throw(ArgumentError("mode with `method=:halfsample` is not defined " * | |
| "for collections with no finite values")) |
|
I've addressed the final performance suggestions (sort! and LazyString). Ready for final review. |
|
Can you please review the updated changes now? |
nalimilan
left a comment
There was a problem hiding this comment.
Sorry for the delay. Looks almost ready to me.
@devmotion Do you have better ideas of a name for method=:frequency? The definition of the mode implies frequency whatever the estimation method, but I can't find a better name...
test/scalarstats.jl
Outdated
| @test mode([1.0, 1.1, 1.2, 5.0, 5.1], method=:halfsample) ≈ 1.1 atol=0.1 | ||
|
|
||
| # Robustness to outliers | ||
| @test abs(mode([1.0, 1.05, 1.1, 1.15, 100.0], method=:halfsample) - 1.075) < 0.2 |
There was a problem hiding this comment.
Better test the exact expected return value.
d13f8a4 to
95ab2f5
Compare
|
@nalimilan Done applied the suggested changes |

Summary
This PR adds hsm_mode(), an implementation of the half-sample mode (HSM) which is a robust estimator of the mode for continuous distributions.It is introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour while providing a statistically meaningful alternative for continuous data
This addresses and closes issue #957.
Motivation
StatsBase.mode() is frequency-based and works well for discrete data. For continuous distributions, however, samples are usually unique, which makes frequency counts unstable and highly variable in practiceIssue #957 documents this behaviour, particularly for heavy-tailed distributions, where mode() can show extreme variance.
This PR provides an estimator designed specifically for continuous data
Approach
hsm_mode() implements the standard half-sample method described in the literature:Non-finite values (NaN, Inf) are filtered
The data are sorted
The algorithm repeatedly selects the contiguous half-sample with the smallest width
Once ≤ 2 points remain, the midpoint of the final interval is returned
The midpoint may not be a sample value, but provides a stable estimate of the location of highest density.
Time complexity is dominated by sorting (O(n log n)); space complexity is O(n).
After sorting, the contraction loop operates on SubArray views to avoid allocations.
API Design
The estimator is exposed as a new function:
hsm_mode(x::AbstractVector{T}) where T<:Real
It is NOT added as an overload of mode() in order to:
avoid changing existing semantics
clearly distinguish frequency-based and density-based estimation
let users choose the appropriate method explicitly
The return type is an AbstractFloat, promoted from the input element type (e.g. integers → Float64, Float32 → Float32).
Testing and Documentation
Tests cover basic correctness, edge cases, robustness to outliers, handling of non-finite values, and type behavior. All tests pass.
The docstring explains intended use cases, compares with mode(), documents complexity, provides examples, and cites the relevant literature.
References
Robertson & Cryer (1974), JASA
Bickel & Fruehwirth (2006), CSDA
Notes
This PR is intentionally small and focused. Extensions such as weighted HSM or support for missing values are left for future work.
I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.
Images:






Feedback on naming or API placement is welcome.