
feat: Phase 1 memory optimization for JavaScript SBOM generation#4585

Open
matthyx wants to merge 1 commit intoanchore:mainfrom
matthyx:memory-optimization-phase-1

Conversation

@matthyx commented Jan 30, 2026:

This commit implements Phase 1 of the memory optimization plan, focusing on quick wins that reduce memory allocations.

Changes:

  1. JavaScript Lock File Parsers:

    • yarn.lock: Use bufio.Scanner for line-by-line reading instead of
      loading the entire file and splitting it with regex.Split, reducing peak
      memory usage. Only the first 100 bytes are read to detect the version;
      v1 files are then parsed line-by-line directly from the file handle.
    • pnpm-lock.yaml: Replace string concatenation with strings.Join and
      use strings.Cut for efficient splitting, reducing string allocations.
    • Pre-allocate map sizes to reduce reallocations.
  2. Dependency Resolution:

    • Replace strset-based deduplication with efficient map-based approach
    • Use struct keys instead of string concatenation for relationship tracking
    • Eliminate temporary string allocations in deduplicate function

Note: A previous optimization to defer license scanner creation was removed
after review, as the scanner should always be set in the context before
cataloging begins.

Memory Impact:

  • Reduced memory usage for JavaScript SBOM generation
  • Peak memory reduction for large JS projects (10k+ packages)
  • Maintains all existing functionality and test coverage

All existing tests pass with these changes.

Related: Detailed analysis available at https://gist.github.com/matthyx/e500d3282876fd6a41064770eacc7229
Issue: #4586

kzantow (Contributor):

I'm a little confused about this change; there should already be a single global license scanner available in the context. Perhaps some changes are needed to pass the context further instead?

matthyx (Author):

fixed, sorry

@matthyx force-pushed the memory-optimization-phase-1 branch from 22460ae to 1b62af8 on January 30, 2026 at 17:39
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
@matthyx force-pushed the memory-optimization-phase-1 branch from 67c164b to 8ba824d on January 30, 2026 at 18:04
@kzantow (Contributor) left a review:

Thank you for the PR, @matthyx. Apologies for the delay in responding here, and apologies if my comments are off base, but it seems you may have taken a coding agent's suggestions and submitted them without verifying they have the desired effect. When it comes to performance optimization, the changes must be verified independently -- they don't always have the effect you think they will, and including many changes that don't address the same specific issue conflates each individual change with the overall result.

In order to accept these changes, I think each directly related set of changes should be separated into an individual pull request (for example: changing SplitN to Cut in multiple places), and each pull request should include metrics from tests you have run, plus sample input we can use ourselves to reproduce the improvement. I think you will find that some of these changes are actually marginally detrimental to performance, most of them aren't doing much, and a couple probably have a reasonably positive impact. Including the aforementioned performance metrics and inputs would let you tell which changes are helpful before submitting future PRs.

I left more detailed comments inline about some of the specific changes and some other observations.

sort.Strings(list)
return list
+ // use map for O(1) lookups without strset overhead
+ unique := make(map[string]struct{}, len(ss))
kzantow (Contributor):

This change might be good, but it's sacrificing readability for what is probably a small gain.

Tangentially, I question whether this needs to be sorted.
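For reference, the map-based form under discussion might look roughly like this (a sketch of the pattern, not Syft's actual function):

```go
package main

import (
	"fmt"
	"sort"
)

// deduplicate returns the unique strings in ss, sorted. Pre-sizing the
// map with len(ss) avoids rehashing on large inputs.
func deduplicate(ss ...string) []string {
	unique := make(map[string]struct{}, len(ss))
	for _, s := range ss {
		unique[s] = struct{}{}
	}
	list := make([]string, 0, len(unique))
	for s := range unique {
		list = append(list, s)
	}
	sort.Strings(list) // map iteration order is random, so sorting keeps output deterministic
	return list
}

func main() {
	fmt.Println(deduplicate("b", "a", "b")) // prints: [a b]
}
```

The sort is what makes the output deterministic despite Go's randomized map iteration, which is one answer to the question of whether it is needed at all.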

}

- seen := strset.New()
+ seen := make(map[pairKey]struct{})
kzantow (Contributor):

This looks like a good change -- we should avoid creating temporary strings like this as much as possible, but using the map[]struct{} is always cumbersome. This is a change we could accept, though I hope at some point we add a generic set. We have a library to share across our projects for this purpose, and added a suggestion here: anchore/go-collections#5
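The struct-key pattern being endorsed here can be sketched as follows (the `pairKey` field names are assumed for illustration):

```go
package main

import "fmt"

// pairKey identifies a relationship by its two endpoints, so membership
// checks need no temporary "from@to" string.
type pairKey struct {
	from, to string
}

func main() {
	seen := make(map[pairKey]struct{})
	edges := [][2]string{{"a", "b"}, {"a", "b"}, {"b", "c"}}
	unique := 0
	for _, e := range edges {
		k := pairKey{from: e[0], to: e[1]}
		if _, ok := seen[k]; ok {
			continue // duplicate relationship; no string was allocated to find out
		}
		seen[k] = struct{}{}
		unique++
	}
	fmt.Println(unique) // prints: 2
}
```

Comparable struct values hash directly as map keys, so the lookup allocates nothing, unlike building a concatenated string key per check.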

yarnPkgs, err = parseYarnV1LockFile(bufReader)
} else {
// For v2+, we need to load the entire file as YAML
data, err := io.ReadAll(bufReader)
kzantow (Contributor):

It looks like this should just be able to pass the bufReader to an io.NopCloser and not have to io.ReadAll, either

continue
}
- pkgKey := name + "@" + ver
+ pkgKey := strings.Join([]string{name, ver}, "@")
kzantow (Contributor):

This is the same less efficient change.

func ContextLicenseScanner(ctx context.Context) (Scanner, error) {
- 	if s, ok := ctx.Value(ctxKey).(Scanner); ok {
- 		return s, nil
+ 	s, ok := ctx.Value(ctxKey).(Scanner)
kzantow (Contributor):

There is still an unnecessary change here. There is a scanner created before cataloging happens, and that scanner is passed in the context to all users.

continue
}
- pkgKey := name + "@" + ver
+ pkgKey := strings.Join([]string{name, ver}, "@")
kzantow (Contributor):

This is still concatenating strings together and is less efficient -- up to half as efficient. This was my intuition, so I wrote a benchmark to verify this; I'll skip to the results:

% go test -bench=. -benchmem
...
Benchmark_strcat-12             24962109                41.53 ns/op            0 B/op          0 allocs/op
Benchmark_stringsJoin-12        18459672                62.24 ns/op           16 B/op          1 allocs/op

Test:

package main

import (
	"math/rand"
	"strings"
	"testing"
	"time"
)

func Benchmark_strcat(b *testing.B) {
	for b.Loop() {
		s1, s2 := samples()
		got := s1 + "/" + s2
		if len(got) > 100 {
			return
		}
	}
}

func Benchmark_stringsJoin(b *testing.B) {
	for b.Loop() {
		s1, s2 := samples()
		got := strings.Join([]string{s1, s2}, "/")
		if len(got) > 100 {
			return
		}

	}
}

var input = []string{
	"dsfgdsg",
	"34gg",
	"dfsgdfg",
	"asqwfdf",
	"qwefqwef",
	"agsah",
	"adgsadf",
	"wergerwg",
}

var ilen = len(input)
var r = rand.New(rand.NewSource(time.Now().UnixMilli()))

func samples() (s1, s2 string) {
	i := r.Intn(ilen)
	return input[i], input[(i+1)%ilen]
}

To be sure there wasn't too much being optimized out, I took the example further and it continued to show the concat form uses fewer allocations and less time.

continue
}
- pkgKey := name + "@" + ver
+ pkgKey := strings.Join([]string{name, ver}, "@")
kzantow (Contributor):

Another instance.

- var normalizedVersion = strings.SplitN(depVersion, "(", 2)[0]
- dependencies[depName] = normalizedVersion
+ // Use strings.Cut for more efficient splitting
+ if normalizedVersion, _, ok := strings.Cut(depVersion, "("); ok {
kzantow (Contributor):

All of these changes to Cut seem to have very limited impact; neither function allocates, and the change makes the code more verbose. Additionally, the Split(..)[0] pattern is used throughout Syft whereas the other form is not; without a demonstrated benefit here, I think they should not be changed.

var normalizedVersion = strings.SplitN(versionSpecifier, "(", 2)[0]
pkg.Dependencies[name] = normalizedVersion
// Use strings.Cut for more efficient splitting
if normalizedVersion, _, ok := strings.Cut(versionSpecifier, "("); ok {
kzantow (Contributor):

Another Cut change.

@matthyx (Author) commented Feb 13, 2026:

Thanks @kzantow for the detailed review and suggestions.
I will rework this to focus on hot paths with separate micro-benchmarks for each of them... but in 10 days or so, after my PTO.
Thanks for your time and understanding.

matthyx added a commit to matthyx/syft that referenced this pull request Feb 18, 2026
Use bufio.Scanner for line-by-line reading instead of loading the entire file
into memory. For large yarn v1 lockfiles, this significantly reduces peak
memory usage by avoiding reading the whole file and splitting it with regex.Split.

Changes:
- Add isYarnV1Lockfile() to peek at first 100 bytes to detect version
- Use bufio.Scanner in parseYarnV1LockFile() for streaming parsing
- Keep YAML-based parsing for v2+ lockfiles

Benchmarks:
- BenchmarkParseYarnV1LockFile: 37017 ns/op, 17481 B/op, 211 allocs/op
- BenchmarkParseYarnV1LockFile_Large (1000 pkgs): 3.46 ms/op, 927 KB, 13037 allocs/op
- BenchmarkParseYarnV1LockFile_VeryLarge (5000 pkgs): 17.6 ms/op, 5.09 MB, 65074 allocs/op

Related: PR anchore#4585 (closed)
matthyx added a commit to matthyx/syft that referenced this pull request Feb 18, 2026
Replace strset-based deduplication with an efficient map-based approach to avoid
external dependency overhead. Use struct keys instead of string concatenation
for relationship tracking, eliminating temporary string allocations.

Changes:
- Add pairKey struct to track relationship pairs without string concatenation
- Replace strset.New() with map[string]struct{} in deduplicate()
- Use pairKey struct in Resolve() to avoid string key construction
- Remove dependency on scylladb/go-set/strset

Benchmarks:
- BenchmarkDeduplicate_VeryLarge (5000 strings): 85566 ns/op, 237232 B/op, 27 allocs/op
- BenchmarkResolve_CraftedRelationships_VeryLarge (1000 pkgs): 857822 ns/op, 1.03 MB, 12063 allocs/op

Related: PR anchore#4585 (closed)