proportionmap accepts iterators#855
Conversation
src/counts.jl
Outdated
| than the proportion of raw counts. | ||
| """ | ||
| proportionmap(x::AbstractArray) = _normalize_countmap(countmap(x), length(x)) | ||
| proportionmap(x) = _normalize_countmap(countmap(x), length(collect(x))) |
There was a problem hiding this comment.
Could we just count the total number of elements when building the countmap? It seems inefficient to materialize x only to obtain its length if we already iterate through it anyway.
There was a problem hiding this comment.
Something around sum(values(countmap(x))? But I think that's memory inefficient even though it doesn't iterate again.
There was a problem hiding this comment.
No, I thought counting directly inside of countmap. But probably sum(values, countmap(x)) would still be more efficient than using collect(x) if x is an iterator with a large number of elements.
There was a problem hiding this comment.
julia> @btime proportionmap(skipmissing(a)) 8.625 μs (27 allocations: 146.67 KiB) Dict{Int64, Float64} with 4 entries: 4 => 0.25 2 => 0.25 3 => 0.25 1 => 0.25
julia> @btime proportionmap(skipmissing(a)) 316.667 ns (9 allocations: 1.08 KiB) Dict{Int64, Float64} with 4 entries: 4 => 0.25 2 => 0.25 3 => 0.25 1 => 0.25
Looks like a significant improvement 🧐
| countm = Dict{eltype(x), Int}() | ||
| n = 0 | ||
| for y in x | ||
| countm[y] = get(countm, y, 0) + 1 | ||
| n += 1 | ||
| end |
There was a problem hiding this comment.
This reinvents countmap. Better make countmap allow iterators instead, so that both functions benefit.
There was a problem hiding this comment.
countmap already accepts iterators; I did that to keep a count of n while iterating.
There was a problem hiding this comment.
OK. The problem is that countmap uses different algorithms under the hood for performance. By using a Dict here, you lose the benefit of the fast radix sort and count sort algorithms.
I see two solutions:
- do
n = Base.IteratorSize(x) isa Union{HasLength, HasShape} ? length(x) : sum(values(countm)) - adjust all
_addcounts!methods to return the number of elements (this should be cheap so not a big deal if it's not used byaddcounts)
There was a problem hiding this comment.
I am looking to help get this across the line. Is this your first proposed solution?
function proportionmap(x)
countm = countmap(x)
n = Base.IteratorSize(x) isa Union{Base.HasLength, Base.HasShape} ? length(x) : sum(values(countm))
_normalize_countmap(countm, n)
end
closes #842.