feat: Phase 1 memory optimization for JavaScript SBOM generation #4585

matthyx wants to merge 1 commit into anchore:main
Conversation
internal/licenses/context.go
I'm a little confused about this change, there should be a single global license scanner available in the context already, perhaps there needs to be some changes to pass context further instead?
Force-pushed from 22460ae to 1b62af8
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Force-pushed from 67c164b to 8ba824d
kzantow left a comment
Thank you for the PR, @matthyx. Apologies for the delayed response here, and apologies if my comments are off base, but it seems like you may have taken a coding agent's suggestions and submitted them without verifying they have the desired effect. When it comes to performance optimization, the changes must be verified independently -- they don't always have the effect you think they will, and bundling many changes that don't address the same specific issue conflates each individual change with the overall results.
In order to accept these changes, I think each directly related change should be separated into its own pull request (for example: changing SplitN to Cut in multiple places), and each pull request should include metrics from tests you have done, plus sample input we can use ourselves to demonstrate the improvement. I think you will find that some of these changes are actually marginally detrimental to performance, most aren't doing much, and a couple probably have a reasonably positive impact. Including the aforementioned performance metrics and inputs would let you tell which changes are helpful before submitting future PRs.
I left more detailed comments inline about some of the specific changes and some other observations.
```diff
 	sort.Strings(list)
 	return list
+	// use map for O(1) lookups without strset overhead
+	unique := make(map[string]struct{}, len(ss))
```
This change might be good, but it's sacrificing readability for what is probably a small gain.
Tangentially, I question whether this needs to be sorted.
```diff
 	}
-	seen := strset.New()
+	seen := make(map[pairKey]struct{})
```
This looks like a good change -- we should avoid creating temporary strings like this as much as possible, but using the map[]struct{} is always cumbersome. This is a change we could accept, though I hope at some point we add a generic set. We have a library to share across our projects for this purpose, and added a suggestion here: anchore/go-collections#5
```go
yarnPkgs, err = parseYarnV1LockFile(bufReader)
} else {
	// For v2+, we need to load the entire file as YAML
	data, err := io.ReadAll(bufReader)
```
It looks like you should be able to wrap bufReader in an io.NopCloser and pass it along, rather than calling io.ReadAll here.
```diff
 		continue
 	}
-	pkgKey := name + "@" + ver
+	pkgKey := strings.Join([]string{name, ver}, "@")
```
This is the same less efficient change.
```diff
 func ContextLicenseScanner(ctx context.Context) (Scanner, error) {
-	if s, ok := ctx.Value(ctxKey).(Scanner); ok {
-		return s, nil
+	s, ok := ctx.Value(ctxKey).(Scanner)
```
There is still an unnecessary change here. There is a scanner created before cataloging happens, and that scanner is passed in the context to all users.
```diff
 		continue
 	}
-	pkgKey := name + "@" + ver
+	pkgKey := strings.Join([]string{name, ver}, "@")
```
This is still concatenating strings together and is less efficient -- up to half as efficient. This was my intuition, so I wrote a benchmark to verify this; I'll skip to the results:
```
% go test -bench=. -benchmem
...
Benchmark_strcat-12        24962109    41.53 ns/op     0 B/op    0 allocs/op
Benchmark_stringsJoin-12   18459672    62.24 ns/op    16 B/op    1 allocs/op
```
Test:

```go
package main

import (
	"math/rand"
	"strings"
	"testing"
	"time"
)

func Benchmark_strcat(b *testing.B) {
	for b.Loop() {
		s1, s2 := samples()
		got := s1 + "/" + s2
		if len(got) > 100 {
			return
		}
	}
}

func Benchmark_stringsJoin(b *testing.B) {
	for b.Loop() {
		s1, s2 := samples()
		got := strings.Join([]string{s1, s2}, "/")
		if len(got) > 100 {
			return
		}
	}
}

var input = []string{
	"dsfgdsg",
	"34gg",
	"dfsgdfg",
	"asqwfdf",
	"qwefqwef",
	"agsah",
	"adgsadf",
	"wergerwg",
}

var ilen = len(input)
var r = rand.New(rand.NewSource(time.Now().UnixMilli()))

func samples() (s1, s2 string) {
	i := r.Intn(ilen)
	return input[i], input[(i+1)%ilen]
}
```
To be sure there wasn't too much being optimized out, I took the example further and it continued to show the concat form uses fewer allocations and less time.
```diff
 		continue
 	}
-	pkgKey := name + "@" + ver
+	pkgKey := strings.Join([]string{name, ver}, "@")
```
```diff
-	var normalizedVersion = strings.SplitN(depVersion, "(", 2)[0]
-	dependencies[depName] = normalizedVersion
+	// Use strings.Cut for more efficient splitting
+	if normalizedVersion, _, ok := strings.Cut(depVersion, "("); ok {
```
All of these changes to Cut seem to have very limited impact: neither function allocates, and this change makes the code more verbose. Additionally, the Split(..)[0] pattern is used throughout Syft whereas the other form is not; without a demonstrated benefit here, I think they should not be changed.
```diff
-	var normalizedVersion = strings.SplitN(versionSpecifier, "(", 2)[0]
-	pkg.Dependencies[name] = normalizedVersion
+	// Use strings.Cut for more efficient splitting
+	if normalizedVersion, _, ok := strings.Cut(versionSpecifier, "("); ok {
```
Thanks @kzantow for the detailed review and suggestions.
Use bufio.Scanner for line-by-line reading instead of loading the entire file into memory. For large yarn v1 lockfiles, this significantly reduces peak memory usage by avoiding reading and regex.Split-ing the entire file.

Changes:
- Add isYarnV1Lockfile() to peek at the first 100 bytes to detect the version
- Use bufio.Scanner in parseYarnV1LockFile() for streaming parsing
- Keep YAML-based parsing for v2+ lockfiles

Benchmarks:
- BenchmarkParseYarnV1LockFile: 37017 ns/op, 17481 B/op, 211 allocs/op
- BenchmarkParseYarnV1LockFile_Large (1000 pkgs): 3.46 ms/op, 927 KB, 13037 allocs/op
- BenchmarkParseYarnV1LockFile_VeryLarge (5000 pkgs): 17.6 ms/op, 5.09 MB, 65074 allocs/op

Related: PR anchore#4585 (closed)
Replace strset-based deduplication with efficient map-based approach to avoid
external dependency overhead. Use struct keys instead of string concatenation
for relationship tracking, eliminating temporary string allocations.
Changes:
- Add pairKey struct to track relationship pairs without string concatenation
- Replace strset.New() with map[string]struct{} in deduplicate()
- Use pairKey struct in Resolve() to avoid string key construction
- Remove dependency on scylladb/go-set/strset
Benchmarks:
- BenchmarkDeduplicate_VeryLarge (5000 strings): 85566 ns/op, 237232 B/op, 27 allocs/op
- BenchmarkResolve_CraftedRelationships_VeryLarge (1000 pkgs): 857822 ns/op, 1.03 MB, 12063 allocs/op
Related: PR anchore#4585 (closed)
This commit implements Phase 1 of the memory optimization plan, focusing on quick wins that reduce memory allocations.

Changes:

JavaScript lock file parsers:
- Parse yarn v1 lockfiles line by line instead of loading the entire file and using regex.Split, reducing peak memory usage. Only the first 100 bytes are loaded to detect the version; v1 files are then parsed line-by-line directly from the file handle.
- Use strings.Cut for efficient splitting, reducing string allocations.

Dependency resolution:
- Note: a previous optimization to defer license scanner creation was removed after review, as the scanner should always be set in the context before cataloging begins.

Memory impact:
- All existing tests pass with these changes.
Related: Detailed analysis available at https://gist.github.com/matthyx/e500d3282876fd6a41064770eacc7229
Issue: #4586