Skip to content

perf(cholesky): add choleskyPartialBlocked() with panel-level Level-3 BLAS operations#2508

Closed
jashshah999 wants to merge 3 commits into
borglab:developfrom
jashshah999:perf/blocked-cholesky
Closed

perf(cholesky): add choleskyPartialBlocked() with panel-level Level-3 BLAS operations#2508
jashshah999 wants to merge 3 commits into
borglab:developfrom
jashshah999:perf/blocked-cholesky

Conversation

@jashshah999
Copy link
Copy Markdown
Contributor

@jashshah999 jashshah999 commented Apr 27, 2026

What this does

Adds choleskyPartialBlocked() — a panel-blocked variant of choleskyPartial() that operates on blockSize columns at a time rather than one column at a time.

Algorithm (right-looking blocked Cholesky):

For each panel of bs columns:

  1. Factor the bs x bs diagonal block column-by-column using choleskyStep (preserves underconstrained-pivot handling)
  2. Triangular solve (TRSM): S = R^{-T} B — in-place on the bs x trailing panel row
  3. Rank-bs Schur update (SYRK): trailing -= S' * S — upper triangle only

Steps 2 and 3 are Level-3 BLAS operations (O(bs) arithmetic intensity) vs. the Level-2 rank-1 updates (O(1) arithmetic intensity) done per column in choleskyPartial. With an optimised BLAS backend (OpenBLAS, MKL, Apple Accelerate) these map to multi-threaded dtrsm/dsyrk calls.

Key design decisions:

  • Falls back to choleskyPartial when nFrontal <= blockSize (no overhead for small matrices)
  • assert(blockSize > 0) guards against degenerate input
  • Underconstrained-pivot handling (zeroPivotThreshold, underconstrainedPrior) is preserved exactly via the existing choleskyStep function; dpotrf could replace the panel factor in step 1 for further gain but would lose rank-deficiency handling
  • API-compatible: same (Matrix&, nFrontal, topleft) signature with an optional blockSize parameter (default: 64)

Benchmark (plain Eigen, no optimised BLAS)

On plain Eigen (no BLAS backend), choleskyPartial uses Eigen::LLT which is itself internally blocked, so the gain here is small and noisy. On a BLAS-backed build the blocked path benefits from multi-threaded dtrsm/dsyrk and the gap grows with problem size.

     N    nF  scalar(ms)   bs=16(ms)   bs=64(ms)
    64    48       0.015       0.012       0.013
   128    96       0.054       0.060       0.080
   256   192       0.362       0.260       0.374
   512   384       1.712       1.830       1.853

Results are variable at these matrix sizes because the working set fits in cache and per-call overhead dominates. The primary intended gain is at nFrontal > 256 with a BLAS-backed build.

Tests

Five new unit tests in testCholesky.cpp:

  • blockedSmall_fallback: exact equality with choleskyPartial when falling back (blockSize > nFrontal)
  • blockedInvariant_medium: verifies R'R = A (frontal block) for a 32x32 matrix, blockSize=8
  • blockedInvariant_large: same for a 128x128 matrix, blockSize=16
  • blockedWithTopleft: correctness with a non-zero topleft offset
  • blockedRecovery: full invariant [R' 0; S' I][R S; 0 L] = [A B; B' C] on the same 7x7 matrix used by the existing choleskyPartial test, blockSize=2 to force multiple panels

choleskyPartialBlocked() replaces the column-by-column loop with a right-
looking panel algorithm:

  for each panel of blockSize columns:
    1. factor the bs×bs diagonal block with the careful scalar Cholesky
       (preserves underconstrained-pivot handling)
    2. triangular solve: S = R^{-T} B  (Level-3 dtrsm equivalent)
    3. rank-bs Schur update: C -= S'S  (Level-3 dsyrk equivalent)

Level-3 operations have O(bs) arithmetic intensity vs O(1) for the Level-2
rank-1 updates in the scalar path, yielding substantially higher throughput
on any CPU with an optimised BLAS (OpenBLAS, MKL, Apple Accelerate).
The blocked path is only taken when nFrontal > blockSize; smaller matrices
fall through to choleskyPartial unchanged.

Adds four unit tests verifying correctness: fallback equality, factorization
invariant R'R = A for medium/large matrices, and correctness with a topleft
offset.
@dellaert
Copy link
Copy Markdown
Member

Cool, but I don’t see this new function being called anywhere. Are you planning to add more commits benchmarking a use case within GTSAM?

Replace the call to choleskyPartial() in SymmetricBlockMatrix::choleskyPartial()
with choleskyPartialBlocked(), which uses panel-level Level-3 BLAS operations
for large frontal blocks (nFrontal > 64) and falls back to the original
choleskyPartial for smaller sizes.

This makes the blocked Cholesky the default elimination path for all
HessianFactor elimination, ISAM2 clique factorization, and batch solvers.

Verified against 10 test suites (testCholesky, testHessianFactor,
testJacobianFactor, testGaussianISAM2, testGaussianBayesNet,
testGaussianFactorGraph, testSymmetricBlockMatrix, testNonlinearOptimizer,
testVisualISAM2, testGaussianFactorGraphB) — zero failures.
@jashshah999
Copy link
Copy Markdown
Contributor Author

Thanks for the feedback! I've now wired choleskyPartialBlocked into SymmetricBlockMatrix::choleskyPartial() so it's used by all elimination paths (HessianFactor elimination, ISAM2 clique factorization, batch solvers).

The blocked version already falls back to the original choleskyPartial when nFrontal <= blockSize (default 64), so small cliques see no change. Larger frontal blocks get the panel-level TRSM + SYRK operations.

Verified against 10 test suites — zero failures:
testCholesky, testHessianFactor, testJacobianFactor, testGaussianISAM2, testGaussianBayesNet, testGaussianFactorGraph, testSymmetricBlockMatrix, testNonlinearOptimizer, testVisualISAM2, testGaussianFactorGraphB.

@dellaert
Copy link
Copy Markdown
Member

Any perf numbers? E.g., before/after timeSfmBAL?

@dellaert dellaert requested a review from Copilot April 28, 2026 17:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new panel-blocked partial Cholesky factorization (choleskyPartialBlocked) intended to improve performance via Level-3 BLAS-style operations, and wires it into SymmetricBlockMatrix while adding unit tests for correctness.

Changes:

  • Introduced choleskyPartialBlocked() API + implementation using panel factor + TRSM + SYRK-style Schur updates.
  • Updated SymmetricBlockMatrix::choleskyPartial() to call the blocked routine by default.
  • Added multiple unit tests covering fallback behavior, invariants, topleft offsets, and recovery structure.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
gtsam/base/tests/testCholesky.cpp Adds new unit tests for the blocked algorithm plus an SPD helper.
gtsam/base/cholesky.h Declares the new choleskyPartialBlocked() API and documents intended behavior.
gtsam/base/cholesky.cpp Implements choleskyPartialBlocked() with panel factorization and trailing updates.
gtsam/base/SymmetricBlockMatrix.cpp Switches internal partial Cholesky call to the new blocked implementation.

Comment thread gtsam/base/cholesky.cpp
/* ************************************************************************* */
bool choleskyPartialBlocked(Matrix& ABC, size_t nFrontal, size_t topleft,
size_t blockSize) {
if (nFrontal == 0) return true;
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert(blockSize > 0) is compiled out in release builds; if blockSize == 0, the loop for (size_t k = 0; k < nFrontal; k += blockSize) will never advance and can hang (infinite loop). Replace the assert with a runtime guard (e.g., return false or throw) to make the function safe in non-debug builds.

Suggested change
if (nFrontal == 0) return true;
if (nFrontal == 0) return true;
if (blockSize == 0) return false;

Copilot uses AI. Check for mistakes.
Comment on lines 85 to +87
DenseIndex topleft = variableColOffsets_[blockStart_];
if (!gtsam::choleskyPartial(matrix_, offset(nFrontals) - topleft, topleft)) {
const size_t nF = offset(nFrontals) - topleft;
if (!gtsam::choleskyPartialBlocked(matrix_, nF, topleft)) {
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset(nFrontals) and topleft are DenseIndex (signed). Computing the difference and storing it in size_t can underflow/wrap if the difference is negative, producing a huge nF and leading to out-of-bounds behavior. Keep the computation in DenseIndex (or validate non-negativity) before converting to size_t.

Copilot uses AI. Check for mistakes.
DenseIndex topleft = variableColOffsets_[blockStart_];
if (!gtsam::choleskyPartial(matrix_, offset(nFrontals) - topleft, topleft)) {
const size_t nF = offset(nFrontals) - topleft;
if (!gtsam::choleskyPartialBlocked(matrix_, nF, topleft)) {
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description focuses on adding choleskyPartialBlocked() but the code also switches SymmetricBlockMatrix::choleskyPartial() to use it by default, which is a behavior/performance change for existing callers. Either call this out explicitly in the PR description (and any release notes, if applicable) or consider guarding this behind a configuration/flag if the intent is to keep existing behavior by default.

Copilot uses AI. Check for mistakes.
Comment on lines +144 to +146
static Matrix randomSPD(size_t n, unsigned seed = 42) {
srand(seed);
Matrix A = Matrix::Random(n, n);
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeding via srand(seed) mutates global RNG state, which can couple this test file to other tests that also use std::rand/Eigen::Random in the same process. Prefer a local deterministic generator (or Eigen's random seed API, if available) to avoid global side effects and improve test isolation.

Copilot uses AI. Check for mistakes.
0., 0., 0., 0., 0., 0., 2.5776).finished();

Matrix RSL(ABC);
choleskyPartialBlocked(RSL, 3, 0, 2); // blockSize=2 forces multiple panels
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't assert the return value of choleskyPartialBlocked. If the factorization fails and returns false, the subsequent invariant check may be harder to interpret. Store/EXPECT the returned bool (as done in other tests) so failures are reported clearly at the source.

Suggested change
choleskyPartialBlocked(RSL, 3, 0, 2); // blockSize=2 forces multiple panels
const bool success =
choleskyPartialBlocked(RSL, 3, 0, 2); // blockSize=2 forces multiple panels
EXPECT(success);

Copilot uses AI. Check for mistakes.
@jashshah999
Copy link
Copy Markdown
Contributor Author

I don't have the BAL datasets available on this machine to run timeSFMBAL. The honest benchmark I ran (standalone, plain Eigen without optimized BLAS) showed mixed results because Eigen's LLT is already internally blocked:

     N    nF  scalar(ms)   bs=16(ms)   bs=64(ms)
    64    48       0.015       0.012       0.013
   128    96       0.054       0.060       0.080
   256   192       0.362       0.260       0.374
   512   384       1.712       1.830       1.853

The blocked path uses choleskyStep column-by-column for the panel diagonal (to preserve underconstrained-pivot handling), then Level-3 TRSM+SYRK for the trailing update. On plain Eigen, choleskyPartial's Eigen::LLT is already doing something similar internally, so the gain is small.

The real benefit would come with an optimized BLAS backend where rankUpdate and solveInPlace map to multi-threaded dsyrk/dtrsm. If you think the gain isn't sufficient without that, I understand — happy to close this if it's not pulling its weight.

@jashshah999
Copy link
Copy Markdown
Contributor Author

Just following up. Happy to close this if the marginal gains without an optimized BLAS backend don't justify the added code. Let me know your preference.

@dellaert
Copy link
Copy Markdown
Member

Yeah, if indeed Eigen is internally blocked, then I’d prefer having Eigen do the lifting - rather than adding mode code in GTSAM.

@dellaert dellaert closed this Apr 30, 2026
@jashshah999
Copy link
Copy Markdown
Contributor Author

Makes sense. Closing this since Eigen handles the blocking internally and the added code doesn't provide enough benefit to justify the maintenance. Thanks for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants