Re-enable time-dependent z-scoring for Flow Matching by satwiksps · Pull Request #1752 · sbi-dev/sbi

satwiksps · 2026-02-03T19:27:38Z

Description

This PR re-introduces z-scoring for Flow Matching estimators using a time-dependent normalization approach and adds a Gaussian Baseline for improved training stability.

As discussed in #1623, standard z-scoring is problematic because the network input evolves from data to noise. This implementation provides two distinct normalization modes to handle this evolution while maintaining training stability.

Corrected Normalization Statistics:
Since we define $t=0$ as Data and $t=1$ as Noise, the statistics are handled based on the chosen mode:

Gaussian Baseline (gaussian_baseline=True): Normalizes inputs to $N(0, 1)$ across the entire path. The drift signal is handled by the hard-coded affine baseline.

$$\mu_t = (1 - t) \cdot \mu_{data}$$
$$\sigma_t = \sqrt{(1 - t)^2 \sigma_{data}^2 + t^2}$$

Variance Only (gaussian_baseline=False): Normalizes variance while preserving the raw data location at $t=0$. This ensures the network can still learn the drift signal when no baseline is used.

$$\mu_t = t \cdot \mu_{data}$$
$$\sigma_t = \sqrt{t^2 \sigma_{data}^2 + (1 - t)^2}$$
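
As a minimal sketch of the two modes above (not the PR's actual code: the names `time_dependent_stats` and `z_score` and the scalar, pure-Python form are assumptions for illustration), the statistics could be computed like this. The `eps` term mirrors the small epsilon that the PR adds to the variance.

```python
import math


def time_dependent_stats(t, mean_data, std_data, gaussian_baseline=True, eps=1e-5):
    """Return (mu_t, sigma_t) for the two modes in the PR description.

    Hypothetical helper for illustration; scalar inputs only.
    """
    if gaussian_baseline:
        # Mean shrinks from mean_data at t=0 (data) to 0 at t=1 (noise).
        mu_t = (1.0 - t) * mean_data
        var_t = (1.0 - t) ** 2 * std_data**2 + t**2
    else:
        # Variance-only mode, exactly as written in the description.
        mu_t = t * mean_data
        var_t = t**2 * std_data**2 + (1.0 - t) ** 2
    # eps keeps the standard deviation strictly positive.
    return mu_t, math.sqrt(var_t + eps)


def z_score(theta_t, t, mean_data, std_data, **kwargs):
    """Standardize theta_t with the time-dependent statistics."""
    mu_t, sigma_t = time_dependent_stats(t, mean_data, std_data, **kwargs)
    return (theta_t - mu_t) / sigma_t


print(time_dependent_stats(0.0, 100.0, 2.0))
print(time_dependent_stats(1.0, 100.0, 2.0, gaussian_baseline=False))
```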

Gaussian Baseline:
We implemented an affine vector field baseline (enabled by default). The network now learns the residual vector field with respect to the optimal Gaussian probability path, significantly improving convergence on shifted datasets.
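
To make the residual idea concrete, here is a sketch that assumes the data marginal is Gaussian, $\theta_0 \sim N(\mu, \sigma^2)$, and that the path is $\theta_t = (1-t)\,\theta_0 + t\,\epsilon$ with $\epsilon \sim N(0, 1)$. Under those assumptions the velocity $E[\epsilon - \theta_0 \mid \theta_t]$ is affine in $\theta_t$. The names `gaussian_baseline_velocity` and `velocity_with_residual` are hypothetical, and the PR's implementation may differ.

```python
def gaussian_baseline_velocity(theta_t, t, mu, sigma):
    """Affine velocity E[eps - theta_0 | theta_t] for Gaussian theta_0 (a sketch)."""
    mu_t = (1.0 - t) * mu                       # E[theta_t]
    var_t = (1.0 - t) ** 2 * sigma**2 + t**2    # Var[theta_t]
    cov = t - (1.0 - t) * sigma**2              # Cov[eps - theta_0, theta_t]
    return -mu + cov / var_t * (theta_t - mu_t)


def velocity_with_residual(theta_t, t, mu, sigma, residual_net):
    """Baseline plus a learned correction; residual_net is any callable (theta_t, t)."""
    return gaussian_baseline_velocity(theta_t, t, mu, sigma) + residual_net(theta_t, t)


def zero_net(theta_t, t):
    # Stand-in for a network that outputs zero: only the baseline remains.
    return 0.0


v0 = velocity_with_residual(3.0, 0.0, mu=100.0, sigma=2.0, residual_net=zero_net)
v1 = velocity_with_residual(0.5, 1.0, mu=100.0, sigma=2.0, residual_net=zero_net)
print(v0, v1)
```

At the endpoints this reduces to $-\theta_0$ at $t=0$ and $\epsilon - \mu$ at $t=1$, which is what the conditional expectation gives when one of the two variables is observed exactly.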

Related Issues/PRs

Closes Add back z-scoreing for flow matching #1623

Changes

sbi/neural_nets/net_builders/vector_field_nets.py: Updated build_vector_field_estimator to calculate training data statistics, accept the gaussian_baseline flag, and pass them to the estimator.
sbi/neural_nets/estimators/flowmatching_estimator.py:
- Buffer Management: Registered mean_1 and std_1 as buffers and expanded them to match input_shape to ensure compatibility with multi-dimensional data in CI.
- Split Logic: Implemented the split logic in forward() to support both Gaussian Baseline (residual learning) and Variance-only (signal preserving) modes.
- Numerical Stability: Added a small epsilon (1e-5) to variance calculations to prevent division-by-zero errors.
tests/linearGaussian_vector_field_test.py:
- Added test_fmpe_time_dependent_z_scoring_integration: Verifies statistics population, buffer registration, and forward pass shapes.
- Added test_fmpe_shifted_data_gaussian_baseline: Verifies that the Gaussian Baseline outperforms variance-only scaling on shifted data ($U[95, 105]$) with robust simulation counts ($N=2000$).

Verification

Shifted Data Benchmark: Confirmed that gaussian_baseline=True achieves lower validation loss and faster convergence than variance-only scaling on a shifted 1D prior ($U[95, 105]$).
Integration Tests: All new tests pass, confirming correct buffer registration and stability with z_score_x='independent'.
Benchmarks: I ran the sbi benchmarks locally (pytest --bm --bm-mode fmpe) to check for stability and performance. All 12 tests passed successfully.

codecov · 2026-02-03T19:59:57Z

Codecov Report

❌ Patch coverage is 75.00000% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.82%. Comparing base (42d89f3) to head (8e37f05).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...i/neural_nets/estimators/flowmatching_estimator.py | 74.07% | 14 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1752      +/-   ##
==========================================
- Coverage   87.88%   87.82%   -0.06%     
==========================================
  Files         140      140              
  Lines       12726    12777      +51     
==========================================
+ Hits        11184    11222      +38     
- Misses       1542     1555      +13

| Flag | Coverage Δ |
| --- | --- |
| fast | `82.71% <75.00%> (?)` |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| sbi/neural_nets/factory.py | `95.06% <ø> (+1.08%)` | ⬆️ |
| sbi/neural_nets/net_builders/vector_field_nets.py | `94.00% <100.00%> (+0.03%)` | ⬆️ |
| ...i/neural_nets/estimators/flowmatching_estimator.py | `85.58% <74.07%> (-11.09%)` | ⬇️ |

satwiksps · 2026-02-03T21:05:59Z

It seems tests/torchutils_test.py::TorchUtilsTest::test_searchsorted is consistently failing in the CI with an execnet.gateway_base.DumpError.

Since this failure is in torchutils_test.py (which I haven't touched) and appears to be a serialization issue with pytest-xdist masking a local assertion error, I believe it is unrelated to my changes in flowmatching_estimator.py?

The actual Flow Matching benchmarks and integration tests for this PR passed successfully, though.

janfb · 2026-02-05T15:51:34Z

> It seems tests/torchutils_test.py::TorchUtilsTest::test_searchsorted is consistently failing in the CI with an execnet.gateway_base.DumpError.
>
> Since this failure is in torchutils_test.py (which I haven't touched) and appears to be a serialization issue with pytest-xdist masking a local assertion error, I believe it is unrelated to my changes in flowmatching_estimator.py?
>
> The actual Flow Matching benchmarks and integration tests for this PR passed successfully, though.

Yes, this is unrelated and popped up here by chance or because of an unrelated change in a downstream package. I pushed a fix to this branch ✅

janfb · 2026-02-05T15:56:42Z

Thanks for working on this @satwiksps !

Overall, this looks exactly right. However, after reviewing the code and tracing through the flow matching implementation, I believe the z-scoring formula is inverted relative to the interpolation convention (quite confusing!).

The interpolation in the loss function is:

theta_t = (1 - t) * theta_data + t * theta_noise

So the expected input mean at each time is:

E[θ_t] = (1-t) * mean_data + t * 0 = (1-t) * mean_data

Current PR formula:

mu_t = t * mean_1
var_t = (t * std_1)² + (1 - t)²

This gives mu_t = 0 at t=0 and mu_t = mean_data at t=1 — exactly backwards.

Correct formula should be:

mu_t = (1 - t) * mean_1
var_t = ((1 - t) * std_1)² + t²

The formula only matches at t=0.5 and is maximally wrong at the boundaries.
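
The mismatch described here is easy to tabulate. The sketch below uses an illustrative data mean of 100 (an assumption, chosen to match the shifted-prior setting discussed in this thread) and compares the two mean formulas over a grid of times. The function names are hypothetical.

```python
mean_data = 100.0  # illustrative value, not taken from the PR code


def mean_interpolation(t):
    # E[theta_t] under theta_t = (1 - t) * theta_data + t * theta_noise
    return (1.0 - t) * mean_data


def mean_pr_formula(t):
    # mu_t = t * mean_1, the formula under review
    return t * mean_data


def gap(t):
    # Absolute difference between the two means: |1 - 2t| * |mean_data|
    return abs(mean_interpolation(t) - mean_pr_formula(t))


for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  E[theta_t]={mean_interpolation(t):6.1f}  "
          f"mu_t={mean_pr_formula(t):6.1f}  gap={gap(t):6.1f}")
```

The gap is zero only at t=0.5 and largest at t=0 and t=1. Whether that gap is a bug or intended is the question the rest of the thread works through.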

Note on zuko's sampling: I had to dig a bit but in zuko, NormalizingFlow.sample() uses transform.inv() which integrates backward (t1→t0), so training and sampling conventions do align — the issue is purely the z-scoring formula.

To verify this, I suggest the following test: the standard linear Gaussian test, but with a uniform prior between 95 and 100 and with data x_o centered at 100 (far from N(0,1)). With the inverted formula, C2ST should degrade significantly compared to no z-scoring, and with the correct formula it should be fixed (C2ST close to 0.5).

Can you confirm this (maybe I got confused with the integration directions after all)?

manuelgloeckler

Hey @satwiksps !

Thanks for the contribution! I compared against main, and as of now it performs, I guess, on average very similarly to before, if not a bit worse (although I think that's mostly fine for these tasks).

I wonder if it would make sense to improve the "preconditioning" a bit more (see comments).

janfb · 2026-02-06T05:41:23Z

> Hey @satwiksps !
>
> Thanks for the contribution! I compared against main, and as of now it performs, I guess, on average very similarly to before, if not a bit worse (although I think that's mostly fine for these tasks).
> I wonder if it would make sense to improve the "preconditioning" a bit more (see comments).

Thanks for adding the comparison to main. What could be happening here is that the benchmarking tasks are not discriminative w.r.t. z-scoring, no? I.e., we need a task that benefits from z-scoring.

janfb · 2026-02-06T07:17:44Z

Alright, I looked at it again and realized that my proposal was actually incorrect. The formulas I proposed would result in total normalization, i.e., "independent" z-scoring, where all time steps have zero mean after z-scoring and we lose valuable time-dependent information. Sorry @satwiksps, your formulas were actually correct!

What Manuel proposed is great: we z-score with respect to the Gaussian baseline, i.e., what one would expect if the posterior were actually Gaussian. Then the flow matching network only has to learn the residual from this ideal baseline (please correct me @manuelgloeckler if this intuition is inaccurate).

I tested this locally with the following setup:

Prior: BoxUniform([95, 105]), x_o=100
Simulator: x = theta + 0.5 * noise
Reference posterior: N(x_o, 0.5²I)
3000 simulations.
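
A minimal, dependency-free sketch of this setup in one dimension (the function names are hypothetical, and this does not use sbi's `BoxUniform` or any training code). The prior bounds lie ten noise standard deviations from x_o, so the truncation by the prior is ignored and the reference posterior is taken as N(x_o, 0.5²).

```python
import random
import statistics

rng = random.Random(0)
NOISE_STD = 0.5
X_O = 100.0


def sample_prior(n):
    # One-dimensional stand-in for BoxUniform([95, 105])
    return [rng.uniform(95.0, 105.0) for _ in range(n)]


def simulator(theta):
    # x = theta + 0.5 * noise
    return theta + NOISE_STD * rng.gauss(0.0, 1.0)


def sample_reference_posterior(n):
    # N(x_o, 0.5^2); prior truncation is neglected (bounds are 10 std away)
    return [rng.gauss(X_O, NOISE_STD) for _ in range(n)]


thetas = sample_prior(3000)
xs = [simulator(theta) for theta in thetas]
reference = sample_reference_posterior(3000)
print(statistics.mean(thetas), statistics.stdev(thetas))
print(statistics.mean(reference), statistics.stdev(reference))
```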

Results:

| Formula | C2ST | Description |
| --- | --- | --- |
| Gaussian | 0.631 | Gaussian baseline + residual learning |
| var_only | 0.772 | Variance scaling only |
| pr | 0.774 | PR's time-dependent z-scoring |
| static | 0.796 | Static mean subtraction |
| none | 0.865 | No z-scoring |
| independent z-scoring | 0.922 | "Correct" mean formula |
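
For readers unfamiliar with the metric: C2ST (classifier two-sample test) reports the accuracy of a classifier trained to tell posterior samples from reference samples, so 0.5 means indistinguishable and 1.0 means fully separable. The sketch below swaps in a leave-one-out 1-nearest-neighbour classifier on 1D samples to stay dependency-free. The numbers in this thread were not produced with this code, and `c2st_1nn` is a hypothetical name.

```python
import random


def c2st_1nn(samples_a, samples_b):
    """Leave-one-out 1-NN accuracy for 1D samples (0.5 = indistinguishable)."""
    labelled = sorted([(x, 0) for x in samples_a] + [(x, 1) for x in samples_b])
    n = len(labelled)
    correct = 0
    for i, (x, label) in enumerate(labelled):
        # In sorted 1D data the nearest neighbour is always adjacent.
        left = labelled[i - 1] if i > 0 else None
        right = labelled[i + 1] if i < n - 1 else None
        if left is None:
            neighbour = right
        elif right is None:
            neighbour = left
        else:
            neighbour = left if x - left[0] <= right[0] - x else right
        correct += neighbour[1] == label
    return correct / n


rng = random.Random(0)
ref = [rng.gauss(100.0, 0.5) for _ in range(1000)]
good = [rng.gauss(100.0, 0.5) for _ in range(1000)]  # same distribution
bad = [rng.gauss(103.0, 0.5) for _ in range(1000)]   # shifted by 6 std
print(c2st_1nn(good, ref), c2st_1nn(bad, ref))
```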

Thus, @satwiksps, I suggest you implement both options (your proposal and Manuel's), add the test above as a new z-scoring test, and confirm the results.
@manuelgloeckler I think it would be good to have both options, as the Gaussian baseline assumption can be suboptimal when the posterior is multi-modal or skewed?

manuelgloeckler · 2026-02-06T09:20:21Z

@janfb The preconditioning is with respect to the "prior", not the posterior (as this would require regression from x). I don't think it will "hurt" in almost any case: FM nets are initialized to output zero, so the freshly initialized network effectively samples from a mass-covering Gaussian approximation of the prior (and everything else needs to be learned).
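
The claim that a zero-output network already samples from a Gaussian approximation of the prior can be checked numerically. This sketch assumes the affine baseline velocity for a Gaussian with the prior's mean and standard deviation, and integrates the ODE backward from t=1 (noise) to t=0 (data) with explicit Euler steps. The function names are hypothetical, and the sbi/zuko solvers are not used.

```python
import math
import random
import statistics


def baseline_velocity(theta, t, mu, sigma):
    # Affine velocity of the Gaussian path theta_t = (1 - t) * theta_0 + t * eps
    var_t = (1.0 - t) ** 2 * sigma**2 + t**2
    return -mu + (t - (1.0 - t) * sigma**2) / var_t * (theta - (1.0 - t) * mu)


def sample_with_zero_network(eps, mu, sigma, n_steps=500):
    """Integrate d(theta)/dt = baseline + 0 from t=1 down to t=0 (explicit Euler)."""
    theta, dt = eps, 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        theta -= dt * baseline_velocity(theta, t, mu, sigma)
    return theta


# Mean and standard deviation of a U(95, 105) prior
mu, sigma = 100.0, 10.0 / math.sqrt(12.0)
rng = random.Random(0)
samples = [sample_with_zero_network(rng.gauss(0.0, 1.0), mu, sigma) for _ in range(1000)]
print(statistics.mean(samples), statistics.stdev(samples))
```

Up to Euler error, each noise draw `eps` is mapped to `mu + sigma * eps`, so the samples follow a Gaussian that matches the prior's first two moments.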

Nonetheless having an option to disable it is always good.

Agreed that the benchmark tests are not really sensitive to z-scoring, but as we usually enable z-scoring by default, it shouldn't hurt performance even if it's not necessary. But as said, the deviation is small enough to be fine (and might improve with the additional baseline).

janfb

Thanks for the update @satwiksps ! Looks good. I just have one crucial question on the standard z-scoring formulas again, please check 🙏

manuelgloeckler

Thanks for your contribution.

I think the formula is still a bit off (but it also was never very clearly defined by us anyway (: ).

I do have a minor suggestion on the mean/var buffers as well as the Gaussian baseline test, which should be addressed (see comments). Once this is done, we can merge it :)

Kind regards,
Manuel

janfb · 2026-03-04T11:15:38Z

Hi @manuelgloeckler and @satwiksps

I tested this locally using a linear Gaussian test with the prior shifted to U(95, 105). We have the option to just z-score the time vector or to additionally use the Gaussian baseline assumption. It turns out that the formula that works best is the one that normalizes the time vector to 0 across time (called zscore_true_marginal below).

I also tested the Gaussian baseline option, once with the formulas derived from the Flow Matching velocity objective (baseline_velocity), and once with the ones that Manuel proposed, which I believe come from the score matching Ansatz (baseline_position). Here we see that the former works a bit better.

zscore_true_marginal: C2ST = 0.514 +/- 0.011
baseline_velocity: C2ST = 0.529 +/- 0.014
baseline_position: C2ST = 0.570 +/- 0.036
zscore_initial_pr: C2ST = 0.640 +/- 0.011
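
As a sanity check on what "normalizing to 0 across time" means, this sketch assumes that zscore_true_marginal standardizes theta_t with the mean and variance of its marginal under theta_t = (1 - t) * theta_data + t * noise. That is an interpretation of the comment above, not a quote of the merged code, and the function names are hypothetical. With a U(95, 105) prior, the z-scored inputs should then have mean near 0 and standard deviation near 1 at every t.

```python
import math
import random
import statistics

rng = random.Random(0)
mean_data, std_data = 100.0, 10.0 / math.sqrt(12.0)  # moments of U(95, 105)


def z_score_true_marginal(theta_t, t):
    # Standardize with the exact marginal mean and std of theta_t
    mu_t = (1.0 - t) * mean_data
    sigma_t = math.sqrt((1.0 - t) ** 2 * std_data**2 + t**2)
    return (theta_t - mu_t) / sigma_t


def empirical_moments(t, n=20000):
    """Monte Carlo mean and std of the z-scored input at time t."""
    zs = []
    for _ in range(n):
        theta_data = rng.uniform(95.0, 105.0)
        noise = rng.gauss(0.0, 1.0)
        theta_t = (1.0 - t) * theta_data + t * noise
        zs.append(z_score_true_marginal(theta_t, t))
    return statistics.mean(zs), statistics.stdev(zs)


for t in (0.0, 0.5, 1.0):
    m, s = empirical_moments(t)
    print(f"t={t:.1f}  mean={m:+.3f}  std={s:.3f}")
```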

I added all of these as options in the internal code, along with a smoke test comparing them, but just for reference. In the next commit I will clean things up.

So, I suggest we go with zscore_true_marginal by default and offer the velocity Gaussian baseline as an additional option with default False.


manuelgloeckler

Thanks!

The implementation looks good. I am just a bit confused/concerned about the assumed integration direction. But I could be wrong there; can you point me to the part where this "switch" happens?

janfb · 2026-03-23T11:43:20Z

@manuelgloeckler thanks for the review and checking again the integration direction. It should now be fixed.

I will run slow tests again to make sure all is clean, and then we can merge / do you approve?

janfb

Slow tests are passing and formulas have been clarified. Thanks again @satwiksps for the initial hard work on this one! 👏 🚀

I will leave this final review and approval to @manuelgloeckler !

manuelgloeckler

Great, thanks all!

satwiksps marked this pull request as ready for review February 3, 2026 19:51

manuelgloeckler reviewed Feb 5, 2026

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py Outdated

manuelgloeckler reviewed Feb 5, 2026

janfb reviewed Feb 13, 2026

Comment thread tests/linearGaussian_vector_field_test.py Outdated

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py Outdated

satwiksps requested review from janfb and manuelgloeckler February 15, 2026 07:17

manuelgloeckler reviewed Feb 16, 2026

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py Outdated

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py Outdated

Comment thread tests/linearGaussian_vector_field_test.py Outdated

Comment thread tests/torchutils_test.py

manuelgloeckler requested changes Feb 16, 2026

satwiksps added 14 commits March 4, 2026 17:39

update vector field builder to pass z-scoring stats to estimator

ed671a5

implement time-dependent z-scoring logic in FlowMatchingEstimator

7fe5153

add integration test for FMPE time-dependent z-scoring

5c56dfb

trigger ci

8e0fffc

re-trigger ci after ready for review

e9e275e

again trigger the ci to pass the flaky test

73936ca

FlowMatchingEstimator changes for Gaussian Baseline

98302a5

update vector field builder for Gaussian Baseline

1925041

add new shifted data test to existing test file

c6e8e3a

fix the value error

ce06dd7

add epsilon to variance term

387a4e8

switch from structured to independent

5e00b63

fix formula

0211ba8

fix shape mismatch

3ad4139

janfb added 3 commits March 4, 2026 18:14

wip fixes

a2fa938

fix: z_score_x/y bug; Gaussian baseline and z-scoring formulas

8a7aefb

- fix bug with z_score_x vs y mapping in kwargs setup - fix formulas after empirical test with smoke tests

remove z-scoring options that were for comparison.

a3ea99d

janfb force-pushed the fm-z-scoring branch from 62a3df7 to a3ea99d Compare March 4, 2026 13:31

janfb added 2 commits March 4, 2026 15:17

add missing gaussian_baseline field

d3f183c

fix: z-score swap bug; fix tests

de505c8

- more wrong test appear because of previously silent kwargs failures.

janfb force-pushed the fm-z-scoring branch from 952b63b to de505c8 Compare March 4, 2026 15:42

manuelgloeckler reviewed Mar 12, 2026

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py

manuelgloeckler mentioned this pull request Mar 13, 2026

PyTorch 2.6.0 #1380

Closed

manuelgloeckler mentioned this pull request Mar 20, 2026

chore: prepare release #1718

Merged

3 tasks

janfb added 2 commits March 23, 2026 12:28

rename to mean_0 to match reverse integration

2f82d2a

merge main and resolve conflicts

d151097

janfb reviewed Mar 23, 2026

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py Outdated

Comment thread sbi/neural_nets/estimators/flowmatching_estimator.py

fix skipped vf iid potential test

8e37f05

manuelgloeckler approved these changes Mar 23, 2026

manuelgloeckler merged commit e483730 into sbi-dev:main Mar 23, 2026
9 checks passed
