Skip to content

Add hsm_mode: Half-Sample Mode for continuous data#984

Open
anurag-mds wants to merge 1 commit intoJuliaStats:masterfrom
anurag-mds:hsm-mode
Open

Add hsm_mode: Half-Sample Mode for continuous data#984
anurag-mds wants to merge 1 commit intoJuliaStats:masterfrom
anurag-mds:hsm-mode

Conversation

@anurag-mds
Copy link

Summary

This PR adds hsm_mode(), an implementation of the half-sample mode (HSM) which is a robust estimator of the mode for continuous distributions.

It is introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour while providing a statistically meaningful alternative for continuous data

This addresses and closes issue #957.


Motivation

StatsBase.mode() is frequency-based and works well for discrete data. For continuous distributions, however, samples are usually unique, which makes frequency counts unstable and highly variable in practice

Issue #957 documents this behaviour, particularly for heavy-tailed distributions, where mode() can show extreme variance.

This PR provides an estimator designed specifically for continuous data


Approach

hsm_mode() implements the standard half-sample method described in the literature:
  • Non-finite values (NaN, Inf) are filtered

  • The data are sorted

  • The algorithm repeatedly selects the contiguous half-sample with the smallest width

  • Once ≤ 2 points remain, the midpoint of the final interval is returned

  • The midpoint may not be a sample value, but provides a stable estimate of the location of highest density.

  • Time complexity is dominated by sorting (O(n log n)); space complexity is O(n).

  • After sorting, the contraction loop operates on SubArray views to avoid allocations.


API Design

The estimator is exposed as a new function:

hsm_mode(x::AbstractVector{T}) where T<:Real

It is NOT added as an overload of mode() in order to:

  • avoid changing existing semantics

  • clearly distinguish frequency-based and density-based estimation

  • let users choose the appropriate method explicitly

The return type is an AbstractFloat, promoted from the input element type (e.g. integers → Float64, Float32 → Float32).


Testing and Documentation

Tests cover basic correctness, edge cases, robustness to outliers, handling of non-finite values, and type behavior. All tests pass.

The docstring explains intended use cases, compares with mode(), documents complexity, provides examples, and cites the relevant literature.


References

Robertson & Cryer (1974), JASA

Bickel & Fruehwirth (2006), CSDA


Notes

This PR is intentionally small and focused. Extensions such as weighted HSM or support for missing values are left for future work.

I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.

Images:

image

image

image

image

image

image

Feedback on naming or API placement is welcome.

@devmotion
Copy link
Member

Was this PR AI-generated?

@anurag-mds anurag-mds closed this Dec 27, 2025
@anurag-mds anurag-mds reopened this Dec 27, 2025
@anurag-mds
Copy link
Author

The implementation, test and commits are mine. I reviewed existing StatsBase prs to match project conventions, and I used AI assistance only for wording clarity in the pr description and for general performance review and bugs made by me. it was not used for algorithm or code . since in ci pipeline documentation issues which I totally forgot was missing I am actively fixing that like using ceil(len/2) as per the definition. I am eager to explain any design or choices in detail

@ForceBru
Copy link

introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour

I feel like the existing behavior (for floating-point numbers) is a bug. If your data are not integers, we must assume that they come from a continuous distribution and use the HSM algorithm or another estimator for continuous data.

Or perhaps introduce a keyword argument:

mode(x::AbstractVector{<:AbstractFloat}; discrete::Bool=false) =
  discrete ? mode_discrete(x) : mode_hsm(x)

mode(x::AbstractVector) = mode_discrete(x)

Here, mode_discrete is what's currently called mode. The keyword argument lets one use the high-variance estimator for AbstractFloat.

I'm not a contributor to Distributions.jl, though (although I'd like to become a contributor; my complete PR with hyperbolic distributions seems to have joined its peers, unfortunately), that's just my opinion.

@anurag-mds
Copy link
Author

I think using frequency-based mode for floats can be misleading.

But making HSM the default for AbstractFloat might break cases where floats are actually categories or rounded values.

Maybe a keyword like discrete=false or a dispatch mode(x, HSM()) works better. I can change the API like that if the maintainers think it’s right.

@anurag-mds
Copy link
Author

anurag-mds commented Dec 28, 2025

@ForceBru
You have made an excellent point about API design consistency. You're right
that distinguishing frequency-based vs density-based estimation matters.

Here's the evidence for why hsm_mode() should remain separate:

image

As you can see
Insight: Frequency-based mode() counts collisions on continuous
data where samples are usually unique. The "most frequent" element becomes
arbitrary and order-dependent. HSM finds the actual density peak instead.

Design:

  • mode(): Frequency-based, correct for discrete data [keep as-is]
  • hsm_mode(): Density-based, essential for continuous data [separate]

Users can explicitly choose the right tool. No silent behavior changes.

So are you okay with this separate function approach or shall we explore the keyword argument variant further ?

@anurag-mds
Copy link
Author

All tests pass, edge cases handled, documentation added. @devmotion are there any remaining technical concerns with hsm_mode() or its API before approval?

@nalimilan
Copy link
Member

Thanks for the PR!

I tend to agree that a keyword argument would make sense here. The mode is a statistical concept which can be estimated in various ways. The one currently used by mode is the simplest, and of course it's bad for continuous samples unless you apply some binning. But mode(Normal()) returns zero already so the concept makes sense in general.

The API could look like mode(x, method=:halfsample).

@anurag-mds
Copy link
Author

Thanks for your detailed review
This is very helpful. I'll address the points you raised API via keyword, iterator support, middle , non-finite handling, test adjustments and fix the CI failures.

@anurag-mds
Copy link
Author

Apologies for the back-and-forth and the CI noise I'm currently away from my main development setup, but I’ve noted all changes precisely and will push a consolidated update shortly once I’m back. I’ll comment again once the updates are in.
Thanks again for the detailed guidance.

@anurag-mds anurag-mds marked this pull request as draft January 25, 2026 15:49
@anurag-mds anurag-mds marked this pull request as ready for review January 25, 2026 16:21
@anurag-mds
Copy link
Author

@nalimilan Does everything seems good?

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@anurag-mds
Copy link
Author

Thanks @nalimilan for the thorough review. I agree with the remaining points (doc wording, method naming, references, and test adjustments). I’m addressing them now and will push a final cleanup commit so that
everything is consolidated.

@anurag-mds anurag-mds marked this pull request as draft February 3, 2026 18:11
@anurag-mds anurag-mds marked this pull request as ready for review February 3, 2026 18:19
@anurag-mds
Copy link
Author

@nalimilan I have made necessary changes if anything to fix, modify or add do let me know!
Thanks a lot for the guidance

@anurag-mds
Copy link
Author

@nalimilan What do you think?

"""
function mode(a::AbstractArray{T}, r::UnitRange{T}; method::Symbol=:default) where T<:Integer
if method == :halfsample
return _hsm_mode(a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct, r shouldn't be ignored. Do you think you could implement it? Otherwise throw an error and document it.

Comment on lines +109 to +111
@test mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:default) == 4
@test mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:halfsample) == 4.0 # Test halfsample with range
@test_throws ArgumentError mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:invalid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test with 2:4 to actually test that argument.

@anurag-mds
Copy link
Author

Everything Noted.
I will make sure that all issues are properly addressed...

@anurag-mds anurag-mds marked this pull request as draft February 20, 2026 19:50
@anurag-mds anurag-mds marked this pull request as ready for review February 20, 2026 19:59
@anurag-mds
Copy link
Author

Rebasing took more time Is everything good @nalimilan ?

@anurag-mds
Copy link
Author

Is there anything left @nalimilan ?

filtered = sort([x for x in a if isfinite(x)])
len = length(filtered)

len == 0 && throw(ArgumentError("mode is not defined for collections with no finite values"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see this restriction in the other mode implementation above. So maybe that is only a limitation of half-sample mode?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is

As you can see for Regular Frequency Mode:

Consider [Nan, Inf, -Inf]
Here,

  1. It can still count frequencies
  2. Nan appears once, Inf appears once
  3. Returns first one

Now, for HSM Mode:

Consider the same array,
Here,

  1. Filters out all non-finite values
  2. Nothing left to calculate
  3. MUST throw error
  4. Cannot find a cluster with no data!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the error message should be more explicit:

Suggested change
len == 0 && throw(ArgumentError("mode is not defined for collections with no finite values"))
len == 0 && throw(ArgumentError("mode with `method=:halfsample` is not defined " *
"for collections with no finite values"))

@anurag-mds
Copy link
Author

I've addressed the final performance suggestions (sort! and LazyString). Ready for final review.
@nalimilan @devmotion

@anurag-mds
Copy link
Author

Can you please review the updated changes now?

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Looks almost ready to me.

@devmotion Do you have better ideas of a name for method=:frequency? The definition of the mode implies frequency whatever the estimation method, but I can't find a better name...

@test mode([1.0, 1.1, 1.2, 5.0, 5.1], method=:halfsample) ≈ 1.1 atol=0.1

# Robustness to outliers
@test abs(mode([1.0, 1.05, 1.1, 1.15, 100.0], method=:halfsample) - 1.075) < 0.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better test the exact expected return value.

@anurag-mds anurag-mds force-pushed the hsm-mode branch 8 times, most recently from d13f8a4 to 95ab2f5 Compare March 13, 2026 06:45
@anurag-mds
Copy link
Author

@nalimilan Done applied the suggested changes
Extremely sorry for such a noisy PR
But I got to know many standards and style julia follows

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants