perf(cholesky): add choleskyPartialBlocked() with panel-level Level-3 BLAS operations#2508
perf(cholesky): add choleskyPartialBlocked() with panel-level Level-3 BLAS operations#2508jashshah999 wants to merge 3 commits into
Conversation
choleskyPartialBlocked() replaces the column-by-column loop with a right-
looking panel algorithm:
for each panel of blockSize columns:
1. factor the bs×bs diagonal block with the careful scalar Cholesky
(preserves underconstrained-pivot handling)
2. triangular solve: S = R^{-T} B (Level-3 dtrsm equivalent)
3. rank-bs Schur update: C -= S'S (Level-3 dsyrk equivalent)
Level-3 operations have O(bs) arithmetic intensity vs O(1) for the Level-2
rank-1 updates in the scalar path, yielding substantially higher throughput
on any CPU with an optimised BLAS (OpenBLAS, MKL, Apple Accelerate).
The blocked path is only taken when nFrontal > blockSize; smaller matrices
fall through to choleskyPartial unchanged.
Adds four unit tests verifying correctness: fallback equality, factorization
invariant R'R = A for medium/large matrices, and correctness with a topleft
offset.
|
Cool, but I don’t see this new function being called anywhere. Are you planning to add more commits benchmarking a use case within GTSAM? |
Replace the call to choleskyPartial() in SymmetricBlockMatrix::choleskyPartial() with choleskyPartialBlocked(), which uses panel-level Level-3 BLAS operations for large frontal blocks (nFrontal > 64) and falls back to the original choleskyPartial for smaller sizes. This makes the blocked Cholesky the default elimination path for all HessianFactor elimination, ISAM2 clique factorization, and batch solvers. Verified against 10 test suites (testCholesky, testHessianFactor, testJacobianFactor, testGaussianISAM2, testGaussianBayesNet, testGaussianFactorGraph, testSymmetricBlockMatrix, testNonlinearOptimizer, testVisualISAM2, testGaussianFactorGraphB) — zero failures.
|
Thanks for the feedback! I've now wired The blocked version already falls back to the original Verified against 10 test suites — zero failures: |
|
Any perf numbers? E.g., before/after timeSfmBAL? |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new panel-blocked partial Cholesky factorization (choleskyPartialBlocked) intended to improve performance via Level-3 BLAS-style operations, and wires it into SymmetricBlockMatrix while adding unit tests for correctness.
Changes:
- Introduced
choleskyPartialBlocked()API + implementation using panel factor + TRSM + SYRK-style Schur updates. - Updated
SymmetricBlockMatrix::choleskyPartial()to call the blocked routine by default. - Added multiple unit tests covering fallback behavior, invariants,
topleftoffsets, and recovery structure.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| gtsam/base/tests/testCholesky.cpp | Adds new unit tests for the blocked algorithm plus an SPD helper. |
| gtsam/base/cholesky.h | Declares the new choleskyPartialBlocked() API and documents intended behavior. |
| gtsam/base/cholesky.cpp | Implements choleskyPartialBlocked() with panel factorization and trailing updates. |
| gtsam/base/SymmetricBlockMatrix.cpp | Switches internal partial Cholesky call to the new blocked implementation. |
| /* ************************************************************************* */ | ||
| bool choleskyPartialBlocked(Matrix& ABC, size_t nFrontal, size_t topleft, | ||
| size_t blockSize) { | ||
| if (nFrontal == 0) return true; |
There was a problem hiding this comment.
assert(blockSize > 0) is compiled out in release builds; if blockSize == 0, the loop for (size_t k = 0; k < nFrontal; k += blockSize) will never advance and can hang (infinite loop). Replace the assert with a runtime guard (e.g., return false or throw) to make the function safe in non-debug builds.
| if (nFrontal == 0) return true; | |
| if (nFrontal == 0) return true; | |
| if (blockSize == 0) return false; |
| DenseIndex topleft = variableColOffsets_[blockStart_]; | ||
| if (!gtsam::choleskyPartial(matrix_, offset(nFrontals) - topleft, topleft)) { | ||
| const size_t nF = offset(nFrontals) - topleft; | ||
| if (!gtsam::choleskyPartialBlocked(matrix_, nF, topleft)) { |
There was a problem hiding this comment.
offset(nFrontals) and topleft are DenseIndex (signed). Computing the difference and storing it in size_t can underflow/wrap if the difference is negative, producing a huge nF and leading to out-of-bounds behavior. Keep the computation in DenseIndex (or validate non-negativity) before converting to size_t.
| DenseIndex topleft = variableColOffsets_[blockStart_]; | ||
| if (!gtsam::choleskyPartial(matrix_, offset(nFrontals) - topleft, topleft)) { | ||
| const size_t nF = offset(nFrontals) - topleft; | ||
| if (!gtsam::choleskyPartialBlocked(matrix_, nF, topleft)) { |
There was a problem hiding this comment.
The PR description focuses on adding choleskyPartialBlocked() but the code also switches SymmetricBlockMatrix::choleskyPartial() to use it by default, which is a behavior/performance change for existing callers. Either call this out explicitly in the PR description (and any release notes, if applicable) or consider guarding this behind a configuration/flag if the intent is to keep existing behavior by default.
| static Matrix randomSPD(size_t n, unsigned seed = 42) { | ||
| srand(seed); | ||
| Matrix A = Matrix::Random(n, n); |
There was a problem hiding this comment.
Seeding via srand(seed) mutates global RNG state, which can couple this test file to other tests that also use std::rand/Eigen::Random in the same process. Prefer a local deterministic generator (or Eigen's random seed API, if available) to avoid global side effects and improve test isolation.
| 0., 0., 0., 0., 0., 0., 2.5776).finished(); | ||
|
|
||
| Matrix RSL(ABC); | ||
| choleskyPartialBlocked(RSL, 3, 0, 2); // blockSize=2 forces multiple panels |
There was a problem hiding this comment.
This test doesn't assert the return value of choleskyPartialBlocked. If the factorization fails and returns false, the subsequent invariant check may be harder to interpret. Store/EXPECT the returned bool (as done in other tests) so failures are reported clearly at the source.
| choleskyPartialBlocked(RSL, 3, 0, 2); // blockSize=2 forces multiple panels | |
| const bool success = | |
| choleskyPartialBlocked(RSL, 3, 0, 2); // blockSize=2 forces multiple panels | |
| EXPECT(success); |
|
I don't have the BAL datasets available on this machine to run timeSFMBAL. The honest benchmark I ran (standalone, plain Eigen without optimized BLAS) showed mixed results because Eigen's LLT is already internally blocked: The blocked path uses The real benefit would come with an optimized BLAS backend where |
|
Just following up. Happy to close this if the marginal gains without an optimized BLAS backend don't justify the added code. Let me know your preference. |
|
Yeah, if indeed Eigen is internally blocked, then I’d prefer having Eigen do the lifting - rather than adding mode code in GTSAM. |
|
Makes sense. Closing this since Eigen handles the blocking internally and the added code doesn't provide enough benefit to justify the maintenance. Thanks for the review. |
What this does
Adds
choleskyPartialBlocked()— a panel-blocked variant ofcholeskyPartial()that operates onblockSizecolumns at a time rather than one column at a time.Algorithm (right-looking blocked Cholesky):
For each panel of
bscolumns:bs x bsdiagonal block column-by-column usingcholeskyStep(preserves underconstrained-pivot handling)S = R^{-T} B— in-place on thebs x trailingpanel rowbsSchur update (SYRK):trailing -= S' * S— upper triangle onlySteps 2 and 3 are Level-3 BLAS operations (O(bs) arithmetic intensity) vs. the Level-2 rank-1 updates (O(1) arithmetic intensity) done per column in
choleskyPartial. With an optimised BLAS backend (OpenBLAS, MKL, Apple Accelerate) these map to multi-threadeddtrsm/dsyrkcalls.Key design decisions:
choleskyPartialwhennFrontal <= blockSize(no overhead for small matrices)assert(blockSize > 0)guards against degenerate inputzeroPivotThreshold,underconstrainedPrior) is preserved exactly via the existingcholeskyStepfunction;dpotrfcould replace the panel factor in step 1 for further gain but would lose rank-deficiency handling(Matrix&, nFrontal, topleft)signature with an optionalblockSizeparameter (default: 64)Benchmark (plain Eigen, no optimised BLAS)
On plain Eigen (no BLAS backend),
choleskyPartialusesEigen::LLTwhich is itself internally blocked, so the gain here is small and noisy. On a BLAS-backed build the blocked path benefits from multi-threadeddtrsm/dsyrkand the gap grows with problem size.Results are variable at these matrix sizes because the working set fits in cache and per-call overhead dominates. The primary intended gain is at
nFrontal > 256with a BLAS-backed build.Tests
Five new unit tests in
testCholesky.cpp:blockedSmall_fallback: exact equality withcholeskyPartialwhen falling back (blockSize > nFrontal)blockedInvariant_medium: verifiesR'R = A(frontal block) for a 32x32 matrix, blockSize=8blockedInvariant_large: same for a 128x128 matrix, blockSize=16blockedWithTopleft: correctness with a non-zerotopleftoffsetblockedRecovery: full invariant[R' 0; S' I][R S; 0 L] = [A B; B' C]on the same 7x7 matrix used by the existingcholeskyPartialtest, blockSize=2 to force multiple panels