
feat: Phase 1 memory optimization for JavaScript SBOM generation#4585

Open
matthyx wants to merge 1 commit intoanchore:mainfrom
matthyx:memory-optimization-phase-1

Conversation

@matthyx commented Jan 30, 2026:

This commit implements Phase 1 of the memory optimization plan, focusing on quick wins that reduce memory allocations.

Changes:

  1. JavaScript Lock File Parsers:

    • yarn.lock: Use bufio.Scanner for line-by-line reading instead of
      loading the entire file and splitting it with regex.Split, reducing peak
      memory usage. Only the first 100 bytes are read to detect the version;
      v1 files are then parsed line-by-line directly from the file handle.
    • pnpm-lock.yaml: Replace string concatenation with strings.Join and
      use strings.Cut for efficient splitting, reducing string allocations.
    • Pre-allocate map sizes to reduce reallocations.
  2. Dependency Resolution:

    • Replace strset-based deduplication with efficient map-based approach
    • Use struct keys instead of string concatenation for relationship tracking
    • Eliminate temporary string allocations in deduplicate function

Note: A previous optimization to defer license scanner creation was removed
after review, as the scanner should always be set in the context before
cataloging begins.

Memory Impact:

  • Reduced memory usage for JavaScript SBOM generation
  • Peak memory reduction for large JS projects (10k+ packages)
  • Maintains all existing functionality and test coverage

All existing tests pass with these changes.

Related: Detailed analysis available at https://gist.github.com/matthyx/e500d3282876fd6a41064770eacc7229
Issue: #4586

kzantow (Contributor):

I'm a little confused about this change; there should already be a single global license scanner available in the context. Perhaps some changes are needed to pass the context further instead?

matthyx (Author):

fixed, sorry

@matthyx force-pushed the memory-optimization-phase-1 branch from 22460ae to 1b62af8 on January 30, 2026 at 17:39
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
@matthyx force-pushed the memory-optimization-phase-1 branch from 67c164b to 8ba824d on January 30, 2026 at 18:04
@kzantow (Contributor) left a review:

Thank you for the PR, @matthyx. Apologies for the delay in responding here, and apologies if my comments are off base, but it seems you may have taken a coding agent's suggestions and submitted them without verifying they have the desired effect. When it comes to performance optimization, the changes must be verified independently -- they don't always have the effect you think they will, and including many changes that don't address the same specific issue conflates each individual change with the overall result.

In order to accept these changes, I think each directly related set of changes should be separated into an individual pull request (for example: changing SplitN to Cut in multiple places), and each pull request should include metrics from tests you have run, plus sample input we can use ourselves to reproduce the improvement. I think you will find that some of these changes are actually marginally detrimental to performance, most of them aren't doing much, and a couple probably have a reasonably positive impact. Including the aforementioned performance metrics and inputs would let you tell which changes are helpful before submitting future PRs.

I left more detailed comments inline about some of the specific changes and some other observations.

sort.Strings(list)
return list
+ // use map for O(1) lookups without strset overhead
+ unique := make(map[string]struct{}, len(ss))
kzantow (Contributor):

This change might be good, but it's sacrificing readability for what is probably a small gain.

Tangentially, I question whether this needs to be sorted.
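For reference, the map-based form under discussion might look roughly like this (a sketch of the pattern, not Syft's actual function):

```go
package main

import (
	"fmt"
	"sort"
)

// deduplicate returns the unique strings in ss, sorted. Pre-sizing the
// map with len(ss) avoids rehashing on large inputs.
func deduplicate(ss ...string) []string {
	unique := make(map[string]struct{}, len(ss))
	for _, s := range ss {
		unique[s] = struct{}{}
	}
	list := make([]string, 0, len(unique))
	for s := range unique {
		list = append(list, s)
	}
	sort.Strings(list) // map iteration order is random, so sorting keeps output deterministic
	return list
}

func main() {
	fmt.Println(deduplicate("b", "a", "b")) // prints: [a b]
}
```

The sort is what makes the output deterministic despite Go's randomized map iteration, which is one answer to the question of whether it is needed at all.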

}

- seen := strset.New()
+ seen := make(map[pairKey]struct{})
kzantow (Contributor):

This looks like a good change -- we should avoid creating temporary strings like this as much as possible, but using the map[]struct{} is always cumbersome. This is a change we could accept, though I hope at some point we add a generic set. We have a library to share across our projects for this purpose, and added a suggestion here: anchore/go-collections#5
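The struct-key pattern being endorsed here can be sketched as follows (the `pairKey` field names are assumed for illustration):

```go
package main

import "fmt"

// pairKey identifies a relationship by its two endpoints, so membership
// checks need no temporary "from@to" string.
type pairKey struct {
	from, to string
}

func main() {
	seen := make(map[pairKey]struct{})
	edges := [][2]string{{"a", "b"}, {"a", "b"}, {"b", "c"}}
	unique := 0
	for _, e := range edges {
		k := pairKey{from: e[0], to: e[1]}
		if _, ok := seen[k]; ok {
			continue // duplicate relationship; no string was allocated to find out
		}
		seen[k] = struct{}{}
		unique++
	}
	fmt.Println(unique) // prints: 2
}
```

Comparable struct values hash directly as map keys, so the lookup allocates nothing, unlike building a concatenated string key per check.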

yarnPkgs, err = parseYarnV1LockFile(bufReader)
} else {
// For v2+, we need to load the entire file as YAML
data, err := io.ReadAll(bufReader)
kzantow (Contributor):

It looks like this should just be able to pass the bufReader to an io.NopCloser and not have to io.ReadAll, either

continue
}
- pkgKey := name + "@" + ver
+ pkgKey := strings.Join([]string{name, ver}, "@")
kzantow (Contributor):

This is the same less efficient change.

func ContextLicenseScanner(ctx context.Context) (Scanner, error) {
- 	if s, ok := ctx.Value(ctxKey).(Scanner); ok {
- 		return s, nil
+ 	s, ok := ctx.Value(ctxKey).(Scanner)
kzantow (Contributor):

There is still an unnecessary change here. There is a scanner created before cataloging happens, and that scanner is passed in the context to all users.

continue
}
- pkgKey := name + "@" + ver
+ pkgKey := strings.Join([]string{name, ver}, "@")
kzantow (Contributor):

This is still concatenating strings together and is less efficient -- up to half as efficient. This was my intuition, so I wrote a benchmark to verify this; I'll skip to the results:

% go test -bench=. -benchmem
...
Benchmark_strcat-12             24962109                41.53 ns/op            0 B/op          0 allocs/op
Benchmark_stringsJoin-12        18459672                62.24 ns/op           16 B/op          1 allocs/op

Test:

package main

import (
	"math/rand"
	"strings"
	"testing"
	"time"
)

func Benchmark_strcat(b *testing.B) {
	for b.Loop() {
		s1, s2 := samples()
		got := s1 + "/" + s2
		if len(got) > 100 {
			return
		}
	}
}

func Benchmark_stringsJoin(b *testing.B) {
	for b.Loop() {
		s1, s2 := samples()
		got := strings.Join([]string{s1, s2}, "/")
		if len(got) > 100 {
			return
		}

	}
}

var input = []string{
	"dsfgdsg",
	"34gg",
	"dfsgdfg",
	"asqwfdf",
	"qwefqwef",
	"agsah",
	"adgsadf",
	"wergerwg",
}

var ilen = len(input)
var r = rand.New(rand.NewSource(time.Now().UnixMilli()))

func samples() (s1, s2 string) {
	i := r.Intn(ilen)
	return input[i], input[(i+1)%ilen]
}

To be sure there wasn't too much being optimized out, I took the example further and it continued to show the concat form uses fewer allocations and less time.

continue
}
- pkgKey := name + "@" + ver
+ pkgKey := strings.Join([]string{name, ver}, "@")
kzantow (Contributor):

Another instance.

- var normalizedVersion = strings.SplitN(depVersion, "(", 2)[0]
- dependencies[depName] = normalizedVersion
+ // Use strings.Cut for more efficient splitting
+ if normalizedVersion, _, ok := strings.Cut(depVersion, "("); ok {
kzantow (Contributor):

All of these changes to Cut seem to have very limited impact; neither function allocates, and the change makes the code more verbose. Additionally, the Split(..)[0] pattern is used throughout Syft whereas the other form is not; without a demonstrated benefit here, I think they should not be changed.

var normalizedVersion = strings.SplitN(versionSpecifier, "(", 2)[0]
pkg.Dependencies[name] = normalizedVersion
// Use strings.Cut for more efficient splitting
if normalizedVersion, _, ok := strings.Cut(versionSpecifier, "("); ok {
kzantow (Contributor):

Another Cut change.

@matthyx (Author) commented Feb 13, 2026:

Thanks @kzantow for the detailed review and suggestions.
I will rework this to focus on hot paths with separate micro-benchmarks for each of them... but in 10 days or so, after my PTO.
Thanks for your time and understanding.

matthyx added a commit to matthyx/syft that referenced this pull request Feb 18, 2026
Use bufio.Scanner for line-by-line reading instead of loading the entire file
into memory. For large yarn v1 lockfiles, this significantly reduces peak
memory usage by avoiding reading the whole file and splitting it with regex.Split.

Changes:
- Add isYarnV1Lockfile() to peek at first 100 bytes to detect version
- Use bufio.Scanner in parseYarnV1LockFile() for streaming parsing
- Keep YAML-based parsing for v2+ lockfiles

Benchmarks:
- BenchmarkParseYarnV1LockFile: 37017 ns/op, 17481 B/op, 211 allocs/op
- BenchmarkParseYarnV1LockFile_Large (1000 pkgs): 3.46 ms/op, 927 KB, 13037 allocs/op
- BenchmarkParseYarnV1LockFile_VeryLarge (5000 pkgs): 17.6 ms/op, 5.09 MB, 65074 allocs/op

Related: PR anchore#4585 (closed)
matthyx added a commit to matthyx/syft that referenced this pull request Feb 18, 2026
Replace strset-based deduplication with an efficient map-based approach to avoid
external dependency overhead. Use struct keys instead of string concatenation
for relationship tracking, eliminating temporary string allocations.

Changes:
- Add pairKey struct to track relationship pairs without string concatenation
- Replace strset.New() with map[string]struct{} in deduplicate()
- Use pairKey struct in Resolve() to avoid string key construction
- Remove dependency on scylladb/go-set/strset

Benchmarks:
- BenchmarkDeduplicate_VeryLarge (5000 strings): 85566 ns/op, 237232 B/op, 27 allocs/op
- BenchmarkResolve_CraftedRelationships_VeryLarge (1000 pkgs): 857822 ns/op, 1.03 MB, 12063 allocs/op

Related: PR anchore#4585 (closed)