Skip to content

MultiCell storage#1008

Merged
henryiii merged 49 commits intoscikit-hep:developfrom
Superharz:multi_weight_storage
Jan 21, 2026
Merged

MultiCell storage#1008
henryiii merged 49 commits intoscikit-hep:developfrom
Superharz:multi_weight_storage

Conversation

@Superharz
Copy link
Copy Markdown
Contributor

@Superharz Superharz commented May 26, 2025

This PR adds a multi weight storage type (called MultiWeight) to boost-histogram. It addresses #83 and is bases on a prototype provided by @HDembinski in boostorg/histogram#211.

It allows one to create a histogram that can store multiple independent weights per bin (the number of weights per bin has to be specified when creating the histogram).

Example

Create and fill a 1 dim histogram that stores 3 weights per bin:

import boost_histogram as bh
import numpy as np
x = np.array([1, 2])
weights = np.array([
        [1, 2, 3],
        [4, 5, 6]
    ])
h = bh.Histogram(bh.axis.Regular(5, 0, 5), storage = bh.storage.MultiWeight(3))
h.fill(x, sample = weights)

h[1] = [1, 2, 3]
h[2] = [4, 5, 6]

Status of this PR

The MultiWeight histograms can be created in python and they can be filled.
Pickle also works.

The buffer and view for the histograms has yet to be implemented.

Fixes #83

@github-actions github-actions bot added the needs changelog Might need a changelog entry label May 26, 2025
@Superharz
Copy link
Copy Markdown
Contributor Author

Some numbers comparing a 8*8 2D MultiWeight histogram with 20k weights vs 20k normal (single weights) histograms:
Filling both with 100k random values takes about 1min 26s for a loop over all 20k single weights histograms vs 1.71s to fill the MultiWeight histogram with the same data.
All single weight histograms together take up about 51 MB of RAM on my machine vs the one MultiWeight histogram only taking up about 15.3 MB

@Superharz
Copy link
Copy Markdown
Contributor Author

h.values() does not work yet.
However, the important part (for me) is to fill the histograms not to retrieve the data from them.
At the end, the data can be converted into a numpy array by a small loop over the histogram:

For a 8*8 2D histogram h with 20k weights:

a = np.zeros((8,8, 20000))
for i in range(8):
    for j in range(8):
        a[i,j] = np.array(h[i,j])

namespace histogram {

template <class T>
struct multi_weight_value : public boost::span<T> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is misleading. This inherits from boost::span, which is a view type, but the name X_value suggest it is a value type. A value type holds its data, while a view does not. So this is a multi_weight_span or multi_weight_view. Probably the latter.

Copy link
Copy Markdown
Member

@HDembinski HDembinski May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you need a multi_weight_value, a multi_weight_reference, and multi_weight_const_reference. You can make the multi_weight_reference derive from multi_weight_const_reference to avoid some duplication. The multi_weight_value should hold a copy of the data, while the references are views like your current 'multi_weight_value'.

Copy link
Copy Markdown
Member

@HDembinski HDembinski May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should try to make a multi_weight_base with common code for all of these, like the operators.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, if I understand your suggestion right, my idea would be to have

  1. multi_weight_value derived from a std::vector that adds assignment and sum operators
  2. multi_weight_const_reference derived from a boost::span that does not define any operators to modify the data
  3. multi_weight_reference derived from multi_weight_const_reference that adds the assignment and sum operators

Should I better use a std::valarray instead of std::vector?

Copy link
Copy Markdown
Contributor Author

@Superharz Superharz May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Therefore, it should be sufficient to rename multi_weight_value -> multi_weight_reference and implement a separate multi_weight_value class maybe based on std::vector

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Maybe. Usually, I had a reason why I did things in a certain way, but you might be right here. Implementing const_reference without the mutable methods and then inherit a reference from const_reference is cleaner, because then the methods that shouldn't be called aren't there. That being said, I can see the appeal of implementing only one kind of reference.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I read my own comment vom 2022 again
boostorg/histogram#211 (comment)
and now I realize again why the multi_weight_value inherited from boost::span. That was not an oversight, this choice is explained there. Now I have to rethink this whole approach.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we have to implement a copy-on-write mechanism here to meet the requirements, I don't know.

I think that's why this got stuck, the design wasn't clear.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about it, I think the approach we are developing is still fine. The storage holds all the weights for all cells in one large contiguous memory block. When you create a multi_weight_value, it has to return a copy, and that's fine, because the interfaces are generally designed to avoid copies, returning const_reference and reference where possible.

Copy link
Copy Markdown
Member

@HDembinski HDembinski Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Maybe. Usually, I had a reason why I did things in a certain way, but you might be right here. Implementing const_reference without the mutable methods and then inherit a reference from const_reference is cleaner, because then the methods that shouldn't be called aren't there. That being said, I can see the appeal of implementing only one kind of reference.

Something was bothering me about this and so I checked. And I was right, you cannot implement a const reference like that, look:

#include <boost/core/span.hpp>
#include <vector>
#include <cassert>

using mutable_span = boost::span<double>;
using const_span = const boost::span<double>; // not really const

const_span foo(const_span x) {
    x[1] = 0; // oopsie
    return x;
}

int main()
{
  std::vector<double> x = {1, 2, 3};
  
  mutable_span y = foo(x);

  y[0] = 0;

  assert(x[0] == 0); // oopsie
  assert(x[1] == 0); // oopsie
}

The const modifier on a value type (no &) doesn't prevent you from calling methods which mutate its contents. But even if that were true, there is no way in C++ to prevent initializing a mutable_span from a const_span, because you can also do:

const double x = 3;
double y = x;

C++ implicitly assumes that all types which are not references (no &) are value types, meaning assignment creates a copy.

return std::equal(this->begin(), this->end(), values.begin());
}
bool operator!=(const boost::span<T> values) const { return !operator==(values); }
void operator+=(const std::vector<T> values) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not accepting a span as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to address this in b2cffcf.
However, I can not template += to accept a class S because this results in a huge compiler error message which says

/cvmfs/sft.cern.ch/lcg/views/LCG_107a/x86_64-el9-gcc14-opt/include/boost/histogram/detail/accumulator_traits.hpp:81:37: error: call of overloaded 'accumulator_traits_impl(boost::histogram::multi_weight_value<double>&, boost::histogram::detail::priority<2>)' is ambiguous
   81 |     decltype(accumulator_traits_impl(std::declval<T&>(), priority<2>{}));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, I have now kept the overload of the operator+= but the vector version now also calls the span version to not duplicate code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this might be solved by switching to the split into multi_weight_value and multi_weight_reference as proposed in #1008 (comment).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should get it to work with boost::span, because then it works with any contiguous memory container without a copy. Also it should be const std::vector<T>& values instead of const std::vector<T> values, the latter is passing by value so a copy is created even when the vector type matches exactly.

@HDembinski
Copy link
Copy Markdown
Member

Wow, that's quite an impressive patch. I really appreciate the benchmarks, which nicely confirm the expected benefits. I hope you have more time to implement the changes.

I suggest we work on the implementation here and backport it to boostorg/histogram later. In the end, both libraries should be in sync.

@Superharz
Copy link
Copy Markdown
Contributor Author

Hi @HDembinski .
Thank you for reviewing this. How would you like me to address your comments? Should I do one commit per comment or fix everything and do one commit then?

@HDembinski
Copy link
Copy Markdown
Member

HDembinski commented May 27, 2025

Within a PR you don't need to do clean commits, I won't look at the commit history. Feel free to fix multiple issues in one commit.

@HDembinski
Copy link
Copy Markdown
Member

HDembinski commented May 27, 2025

By the way, once this feature is done, I suggest you present this at the next PyHEP https://indico.cern.ch/event/1515852/ or the one next year, depending on how quickly we get this done. It is a major feature, and you deserve recognition for implementing this. I left science, so I don't go to any of these workshops anymore.

@Superharz
Copy link
Copy Markdown
Contributor Author

Hi @HDembinski,
I addressed the first batch of your comments.
For the remaining ones I need some input from your side on how to continue :)

@HDembinski
Copy link
Copy Markdown
Member

HDembinski commented Jun 22, 2025

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword. I see that I did that, too, in my demo. I guess it was the easiest hack to make it work, but that's breaking all kinds of interface contracts.

@Superharz
Copy link
Copy Markdown
Contributor Author

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword.

I completely agree.
This was taken over from your original prototype and it worked, therefore I kept it.

@HDembinski
Copy link
Copy Markdown
Member

However, it is kinda nice that it basically works with sample, so maybe it is enough to change the name of the storage to multi_count, so avoid having name "weight" in the name which gives the wrong idea.

@HDembinski
Copy link
Copy Markdown
Member

@Superharz I considering moving development of this feature back to boostorg/histogram. I find it easier to develop C++ code there. I will merge your changes here to that branch.

@Superharz
Copy link
Copy Markdown
Contributor Author

I am thinking of a better name for the storage. Some ideas:

* multi_sample

* multi_cell

* sub_cell

* split_cell

I think I like multi_cell best, thoughts?

Hi @HDembinski, I understand your disfavor of the MultiWeight name. Storing multiple weights per bin will still be my personal use case. But the name is indeed confusing through the sample keyword argument.

I have now renamed all MultiWeight and multi_weight occurrences to MultiCell and multi_cell as proposed by yourself.

@henryiii henryiii changed the title Multi weight storage MultCell storage Jan 20, 2026
@henryiii
Copy link
Copy Markdown
Member

I'm curious, could we remap weight= in Python to sample in C++ for this storage? I'm fine with it as is, but just curious if that would work.

@Superharz
Copy link
Copy Markdown
Contributor Author

Superharz commented Jan 20, 2026

I'm curious, could we remap weight= in Python to sample in C++ for this storage? I'm fine with it as is, but just curious if that would work.

I think the remapping should be possible. There already is a custom adaption for the fill function on the c++ binding side for the MultiCell storage. In there, we could simply use the weight parameter instead of the sample one as both are passed to the function.

However, IMHO, this does not fit the idea that this is a (sophisticated) python wrapper and should therefore align with the c++ implementation. Therefore, I would veto using the weight parameter on the python side while relying on sample parameter on the c++ side.

Maybe there could even be some unique use case for the weight parameter. I could only think of a global reweighting where one wants to scale every weight per event by the same factor. This common factor could then be passed as a genuine weight via the weight parameter.

@henryiii
Copy link
Copy Markdown
Member

From a user's perspective, these are logically weights. They are not samples. So someone using boost-histogram would have to learn this peculiarity of implementation in order to use this feature, which they don't care about. Currently, these do the same thing:

normal.fill(3, weight=2)
multi.fill(3, sample=[2])

We are implementing this ourselves, rather than wrapping something upstream, so I think we have flexibility to make this weight, and leave the fact we had to use bh::sample instead of bh::weight as an implementation detail. Maybe eventually this could even be fixed upstream? I'd say it's up to @HDembinski, though.

@HDembinski
Copy link
Copy Markdown
Member

HDembinski commented Jan 20, 2026

Conceptually, we are splitting a bin into sub cells. The C++ code technically also allows users to use other accumulators, even though that is not implemented here. That's why naming it multi_weight is misleading.

How people use this storage is unaffected by the name, the documentation doesn't have to be changed.

This is the simplest solution to that cognitive dissonance.

However, I am also okay with keeping the name and implementing a special case here, so that the Python user can fill this storage with the weight keyword, even though internally in C++ sample is used. But then someone has to check and argue here that this choice is not introducing other inconsistencies in the future.

Eventually, I want to implement this cleanly in C++ and that switch should be fully transparent to a python user when it happens.

@henryiii henryiii changed the title MultCell storage MultiCell storage Jan 20, 2026
@henryiii
Copy link
Copy Markdown
Member

I like the name MultiCell fine, I was more interested in the API for filling; weight=[...] fits with the API on all the other histograms better than using sample, which is an implementation detail. MultiWeight to me seems like it should behave more like the other storages with Weight in the name and track variances, which it does not.

So how about keeping the name MultiCell, but using the weight= keyword?

@henryiii
Copy link
Copy Markdown
Member

argue here that this choice is not introducing other inconsistencies in the future.

That's why I think you should make the call here. From what you've said, it sounds like the detail that sample is used internally can be removed in the future, and we can cleanly swap to using it without a Python API change?

@Superharz
Copy link
Copy Markdown
Contributor Author

I like the name MultiCell fine, I was more interested in the API for filling; weight=[...] fits with the API on all the other histograms better than using sample, which is an implementation detail. MultiWeight to me seems like it should behave more like the other storages with Weight in the name and track variances, which it does not.

So how about keeping the name MultiCell, but using the weight= keyword?

Hi @HDembinski and @henryiii, I have now implemented to use the weight keyword for MultiCell filling by simply switching the sample and weight input to the hist.fill() function.

This is only used on the high level fill function (the one in histogram.py), the low level hist._hist.fill() function still uses the sample keyword.

This allows the normal python user to use the weight keyword.

@HDembinski
Copy link
Copy Markdown
Member

Looks good to me.

@henryiii
Copy link
Copy Markdown
Member

@all-contributors please add @Superharz for code

@allcontributors
Copy link
Copy Markdown
Contributor

@henryiii

I've put up a pull request to add @Superharz! 🎉

@henryiii henryiii merged commit fe1d87b into scikit-hep:develop Jan 21, 2026
21 of 23 checks passed
@henryiii
Copy link
Copy Markdown
Member

Thanks!

@HDembinski
Copy link
Copy Markdown
Member

Also from me, kudos to your endurance!

@Superharz
Copy link
Copy Markdown
Contributor Author

As a last request, I would like to kindly ask for a new boost-histogram release. I have some collaborators who would use the new histograms and it would be convenient if they would not have to self compile the module.

@Superharz Superharz deleted the multi_weight_storage branch January 25, 2026 13:19
@henryiii henryiii removed the needs changelog Might need a changelog entry label Feb 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for multiple weight variations

3 participants