MultiCell storage by Superharz · Pull Request #1008 · scikit-hep/boost-histogram

Superharz · 2025-05-26T11:31:05Z

This PR adds a multi weight storage type (called MultiWeight) to boost-histogram. It addresses #83 and is bases on a prototype provided by @HDembinski in boostorg/histogram#211.

It allows one to create a histogram that can store multiple independent weights per bin (the number of weights per bin has to be specified when creating the histogram).

Example

Create and fill a 1 dim histogram that stores 3 weights per bin:

import boost_histogram as bh
import numpy as np
x = np.array([1, 2])
weights = np.array([
        [1, 2, 3],
        [4, 5, 6]
    ])
h = bh.Histogram(bh.axis.Regular(5, 0, 5), storage = bh.storage.MultiWeight(3))
h.fill(x, sample = weights)

h[1] = [1, 2, 3]
h[2] = [4, 5, 6]

Status of this PR

The MultiWeight histograms can be created in python and they can be filled.
Pickle also works.

The buffer and view for the histograms has yet to be implemented.

Fixes #83

Superharz · 2025-05-26T14:42:12Z

Some numbers comparing a 8*8 2D MultiWeight histogram with 20k weights vs 20k normal (single weights) histograms:
Filling both with 100k random values takes about 1min 26s for a loop over all 20k single weights histograms vs 1.71s to fill the MultiWeight histogram with the same data.
All single weight histograms together take up about 51 MB of RAM on my machine vs the one MultiWeight histogram only taking up about 15.3 MB

Superharz · 2025-05-26T16:00:20Z

h.values() does not work yet.
However, the important part (for me) is to fill the histograms not to retrieve the data from them.
At the end, the data can be converted into a numpy array by a small loop over the histogram:

For a 8*8 2D histogram h with 20k weights:

a = np.zeros((8,8, 20000))
for i in range(8):
    for j in range(8):
        a[i,j] = np.array(h[i,j])

include/bh_python/fill.hpp

HDembinski · 2025-05-27T07:21:51Z

include/bh_python/multi_weight.hpp

+namespace histogram {
+
+template <class T>
+struct multi_weight_value : public boost::span<T> {


The naming is misleading. This inherits from boost::span, which is a view type, but the name X_value suggest it is a value type. A value type holds its data, while a view does not. So this is a multi_weight_span or multi_weight_view. Probably the latter.

I believe you need a multi_weight_value, a multi_weight_reference, and multi_weight_const_reference. You can make the multi_weight_reference derive from multi_weight_const_reference to avoid some duplication. The multi_weight_value should hold a copy of the data, while the references are views like your current 'multi_weight_value'.

You should try to make a multi_weight_base with common code for all of these, like the operators.

Okay, if I understand your suggestion right, my idea would be to have

multi_weight_value derived from a std::vector that adds assignment and sum operators

multi_weight_const_reference derived from a boost::span that does not define any operators to modify the data

multi_weight_reference derived from multi_weight_const_reference that adds the assignment and sum operators

Should I better use a std::valarray instead of std::vector?

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Therefore, it should be sufficient to rename multi_weight_value -> multi_weight_reference and implement a separate multi_weight_value class maybe based on std::vector

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Maybe. Usually, I had a reason why I did things in a certain way, but you might be right here. Implementing const_reference without the mutable methods and then inherit a reference from const_reference is cleaner, because then the methods that shouldn't be called aren't there. That being said, I can see the appeal of implementing only one kind of reference.

So I read my own comment vom 2022 again
boostorg/histogram#211 (comment)
and now I realize again why the multi_weight_value inherited from boost::span. That was not an oversight, this choice is explained there. Now I have to rethink this whole approach.

Maybe we have to implement a copy-on-write mechanism here to meet the requirements, I don't know.

I think that's why this got stuck, the design wasn't clear.

Thinking more about it, I think the approach we are developing is still fine. The storage holds all the weights for all cells in one large contiguous memory block. When you create a multi_weight_value, it has to return a copy, and that's fine, because the interfaces are generally designed to avoid copies, returning const_reference and reference where possible.

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Maybe. Usually, I had a reason why I did things in a certain way, but you might be right here. Implementing const_reference without the mutable methods and then inherit a reference from const_reference is cleaner, because then the methods that shouldn't be called aren't there. That being said, I can see the appeal of implementing only one kind of reference.

Something was bothering me about this and so I checked. And I was right, you cannot implement a const reference like that, look:

#include <boost/core/span.hpp> #include <vector> #include <cassert> using mutable_span = boost::span<double>; using const_span = const boost::span<double>; // not really const const_span foo(const_span x) { x[1] = 0; // oopsie return x; } int main() { std::vector<double> x = {1, 2, 3}; mutable_span y = foo(x); y[0] = 0; assert(x[0] == 0); // oopsie assert(x[1] == 0); // oopsie }

The const modifier on a value type (no &) doesn't prevent you from calling methods which mutate its contents. But even if that were true, there is no way in C++ to prevent initializing a mutable_span from a const_span, because you can also do:

const double x = 3; double y = x;

C++ implicitly assumes that all types which are not references (no &) are value types, meaning assignment creates a copy.

HDembinski · 2025-05-27T07:22:56Z

include/bh_python/multi_weight.hpp

+        return std::equal(this->begin(), this->end(), values.begin());
+    }
+    bool operator!=(const boost::span<T> values) const { return !operator==(values); }
+    void operator+=(const std::vector<T> values) {


Why is this not accepting a span as well?

I have tried to address this in b2cffcf.
However, I can not template += to accept a class S because this results in a huge compiler error message which says

/cvmfs/sft.cern.ch/lcg/views/LCG_107a/x86_64-el9-gcc14-opt/include/boost/histogram/detail/accumulator_traits.hpp:81:37: error: call of overloaded 'accumulator_traits_impl(boost::histogram::multi_weight_value<double>&, boost::histogram::detail::priority<2>)' is ambiguous 81 | decltype(accumulator_traits_impl(std::declval<T&>(), priority<2>{}));

Instead, I have now kept the overload of the operator+= but the vector version now also calls the span version to not duplicate code

However, this might be solved by switching to the split into multi_weight_value and multi_weight_reference as proposed in #1008 (comment).

You should get it to work with boost::span, because then it works with any contiguous memory container without a copy. Also it should be const std::vector<T>& values instead of const std::vector<T> values, the latter is passing by value so a copy is created even when the vector type matches exactly.

include/bh_python/multi_weight.hpp

HDembinski · 2025-05-27T08:06:04Z

Wow, that's quite an impressive patch. I really appreciate the benchmarks, which nicely confirm the expected benefits. I hope you have more time to implement the changes.

I suggest we work on the implementation here and backport it to boostorg/histogram later. In the end, both libraries should be in sync.

Superharz · 2025-05-27T08:10:43Z

Hi @HDembinski .
Thank you for reviewing this. How would you like me to address your comments? Should I do one commit per comment or fix everything and do one commit then?

HDembinski · 2025-05-27T08:13:56Z

Within a PR you don't need to do clean commits, I won't look at the commit history. Feel free to fix multiple issues in one commit.

HDembinski · 2025-05-27T08:45:39Z

By the way, once this feature is done, I suggest you present this at the next PyHEP https://indico.cern.ch/event/1515852/ or the one next year, depending on how quickly we get this done. It is a major feature, and you deserve recognition for implementing this. I left science, so I don't go to any of these workshops anymore.

Superharz · 2025-05-27T15:30:00Z

Hi @HDembinski,
I addressed the first batch of your comments.
For the remaining ones I need some input from your side on how to continue :)

HDembinski · 2025-06-22T12:42:10Z

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword. I see that I did that, too, in my demo. I guess it was the easiest hack to make it work, but that's breaking all kinds of interface contracts.

Superharz · 2025-06-22T12:44:44Z

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword.

I completely agree.
This was taken over from your original prototype and it worked, therefore I kept it.

HDembinski · 2025-06-22T12:54:17Z

However, it is kinda nice that it basically works with sample, so maybe it is enough to change the name of the storage to multi_count, so avoid having name "weight" in the name which gives the wrong idea.

HDembinski · 2025-06-22T12:55:26Z

@Superharz I considering moving development of this feature back to boostorg/histogram. I find it easier to develop C++ code there. I will merge your changes here to that branch.

Superharz · 2026-01-20T12:52:50Z

I am thinking of a better name for the storage. Some ideas:
* multi_sample

* multi_cell

* sub_cell

* split_cell
I think I like multi_cell best, thoughts?

Hi @HDembinski, I understand your disfavor of the MultiWeight name. Storing multiple weights per bin will still be my personal use case. But the name is indeed confusing through the sample keyword argument.

I have now renamed all MultiWeight and multi_weight occurrences to MultiCell and multi_cell as proposed by yourself.

henryiii · 2026-01-20T14:38:29Z

I'm curious, could we remap weight= in Python to sample in C++ for this storage? I'm fine with it as is, but just curious if that would work.

Superharz · 2026-01-20T15:22:56Z

I'm curious, could we remap weight= in Python to sample in C++ for this storage? I'm fine with it as is, but just curious if that would work.

I think the remapping should be possible. There already is a custom adaption for the fill function on the c++ binding side for the MultiCell storage. In there, we could simply use the weight parameter instead of the sample one as both are passed to the function.

However, IMHO, this does not fit the idea that this is a (sophisticated) python wrapper and should therefore align with the c++ implementation. Therefore, I would veto using the weight parameter on the python side while relying on sample parameter on the c++ side.

Maybe there could even be some unique use case for the weight parameter. I could only think of a global reweighting where one wants to scale every weight per event by the same factor. This common factor could then be passed as a genuine weight via the weight parameter.

henryiii · 2026-01-20T15:54:19Z

From a user's perspective, these are logically weights. They are not samples. So someone using boost-histogram would have to learn this peculiarity of implementation in order to use this feature, which they don't care about. Currently, these do the same thing:

normal.fill(3, weight=2)
multi.fill(3, sample=[2])

We are implementing this ourselves, rather than wrapping something upstream, so I think we have flexibility to make this weight, and leave the fact we had to use bh::sample instead of bh::weight as an implementation detail. Maybe eventually this could even be fixed upstream? I'd say it's up to @HDembinski, though.

HDembinski · 2026-01-20T16:38:15Z

Conceptually, we are splitting a bin into sub cells. The C++ code technically also allows users to use other accumulators, even though that is not implemented here. That's why naming it multi_weight is misleading.

How people use this storage is unaffected by the name, the documentation doesn't have to be changed.

This is the simplest solution to that cognitive dissonance.

However, I am also okay with keeping the name and implementing a special case here, so that the Python user can fill this storage with the weight keyword, even though internally in C++ sample is used. But then someone has to check and argue here that this choice is not introducing other inconsistencies in the future.

Eventually, I want to implement this cleanly in C++ and that switch should be fully transparent to a python user when it happens.

henryiii · 2026-01-20T18:23:08Z

I like the name MultiCell fine, I was more interested in the API for filling; weight=[...] fits with the API on all the other histograms better than using sample, which is an implementation detail. MultiWeight to me seems like it should behave more like the other storages with Weight in the name and track variances, which it does not.

So how about keeping the name MultiCell, but using the weight= keyword?

henryiii · 2026-01-20T18:24:42Z

argue here that this choice is not introducing other inconsistencies in the future.

That's why I think you should make the call here. From what you've said, it sounds like the detail that sample is used internally can be removed in the future, and we can cleanly swap to using it without a Python API change?

… histogram objects.

Superharz · 2026-01-21T09:38:29Z

I like the name MultiCell fine, I was more interested in the API for filling; weight=[...] fits with the API on all the other histograms better than using sample, which is an implementation detail. MultiWeight to me seems like it should behave more like the other storages with Weight in the name and track variances, which it does not.

So how about keeping the name MultiCell, but using the weight= keyword?

Hi @HDembinski and @henryiii, I have now implemented to use the weight keyword for MultiCell filling by simply switching the sample and weight input to the hist.fill() function.

This is only used on the high level fill function (the one in histogram.py), the low level hist._hist.fill() function still uses the sample keyword.

This allows the normal python user to use the weight keyword.

HDembinski · 2026-01-21T19:57:12Z

Looks good to me.

henryiii · 2026-01-21T23:07:35Z

@all-contributors please add @Superharz for code

allcontributors · 2026-01-21T23:07:45Z

@henryiii

I've put up a pull request to add @Superharz! 🎉

henryiii · 2026-01-21T23:08:12Z

Thanks!

HDembinski · 2026-01-22T13:27:26Z

Also from me, kudos to your endurance!

Superharz · 2026-01-22T14:52:02Z

As a last request, I would like to kindly ask for a new boost-histogram release. I have some collaborators who would use the new histograms and it would be convenient if they would not have to self compile the module.

github-actions bot added the needs changelog Might need a changelog entry label May 26, 2025

Superharz mentioned this pull request May 26, 2025

Support for multiple weight variations #83

Closed