
[Performance] Optimize memory usage for large JavaScript SBOM generation (3-5GB → 1GB) #4586

@matthyx

Description

Problem Description

Syft consumes 3-5+ GB of RAM when generating SBOMs for large JavaScript applications (10,000+ packages). This makes Syft unusable for many real-world projects and causes OOM errors in CI/CD environments with limited memory.

Root Causes

Memory profiling (available in MEMORY_ANALYSIS.md) identified five primary issues:

  1. JavaScript Lock File Parsers - Load entire documents into memory without streaming (~150-300MB)
  2. Dependency Resolution - O(n²) complexity with extensive string operations (~70-130MB)
  3. Package ID Generation - Creates large string representations of metadata (~20-100MB)
  4. License Scanning - Initializes massive regex structures at startup (~8-15MB)
  5. File Indexing - Keeps entire file trees in memory (~7-15MB)

For a typical large JavaScript project, these direct allocations peak at 355-760MB+; GC pressure and runtime overhead then inflate total memory use to 3-5GB+.

See the complete analysis: MEMORY_ANALYSIS.md

Solution: 5-Phase Optimization Plan

Phase 1: Quick Wins ✅ (PR #4585)

Goal: Reduce memory by 15-20% (100-150MB)
Status: ✅ Submitted for review

  • Optimize string operations in JS parsers
  • Reduce string duplication in dependency resolution
  • Implement lazy license scanner initialization (see the sketch below)

Expected Impact: 100-150MB reduction
Tracking: PR #4585
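
A minimal sketch of the lazy license scanner initialization above, assuming a hypothetical scanner type and pattern list rather than Syft's actual internals: the regex index is compiled on first use via sync.Once, so runs that never scan licenses skip the startup cost entirely.

package license

import (
	"regexp"
	"sync"
)

// patterns stands in for the real (much larger) license pattern set.
var patterns = []string{
	`(?i)\bMIT License\b`,
	`(?i)\bApache License,? Version 2\.0\b`,
}

// scanner holds the expensive regex index; it is built lazily.
type scanner struct {
	once  sync.Once
	index []*regexp.Regexp
}

// Index compiles the regex index exactly once, on first use, instead of
// at process startup.
func (s *scanner) Index() []*regexp.Regexp {
	s.once.Do(func() {
		s.index = make([]*regexp.Regexp, 0, len(patterns))
		for _, p := range patterns {
			s.index = append(s.index, regexp.MustCompile(p))
		}
	})
	return s.index
}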


Phase 2: Parser Optimization 🚧

Goal: Reduce memory by 30-40% (200-300MB)
Status: Not started

  • Stream JSON parsing for package-lock.json (see the sketch below)
  • Optimize yarn.lock line-by-line parsing
  • Implement streaming YAML parsing for pnpm-lock.yaml

Expected Impact: 200-300MB reduction
Cumulative Impact: 40-50% total reduction
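
A sketch of what streaming the package-lock.json parse could look like, assuming the v2/v3 "packages" map and hypothetical types (not Syft's actual parser): json.Decoder tokens are used to decode one entry at a time, so the full document is never unmarshalled into memory at once.

package javascript

import (
	"encoding/json"
	"io"
)

// lockEntry holds only the fields needed per "packages" entry.
type lockEntry struct {
	Version   string `json:"version"`
	Resolved  string `json:"resolved"`
	Integrity string `json:"integrity"`
}

// streamPackages walks the top-level object token by token and decodes one
// "packages" entry at a time, calling visit for each.
func streamPackages(r io.Reader, visit func(path string, e lockEntry)) error {
	dec := json.NewDecoder(r)
	if _, err := dec.Token(); err != nil { // opening '{' of the document
		return err
	}
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			return err
		}
		if key, _ := keyTok.(string); key != "packages" {
			var skip json.RawMessage // other top-level fields are skipped
			if err := dec.Decode(&skip); err != nil {
				return err
			}
			continue
		}
		if _, err := dec.Token(); err != nil { // opening '{' of "packages"
			return err
		}
		for dec.More() {
			pathTok, err := dec.Token()
			if err != nil {
				return err
			}
			var e lockEntry
			if err := dec.Decode(&e); err != nil {
				return err
			}
			path, _ := pathTok.(string)
			visit(path, e)
		}
		if _, err := dec.Token(); err != nil { // closing '}' of "packages"
			return err
		}
	}
	return nil
}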


Phase 3: Dependency Resolution 🚧

Goal: Reduce memory by 15-25% (100-150MB)
Status: Not started

  • Implement incremental resolution (see the sketch below)
  • Optimize set operations throughout
  • Reduce memory pressure in concurrent processing

Expected Impact: 100-150MB reduction
Cumulative Impact: 55-65% total reduction
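
An illustrative sketch of the resolution change, with stand-in types rather than Syft's real ones: packages are indexed by name once, so each dependency edge becomes an O(1) lookup instead of a scan over all packages, and struct{}-valued maps serve as allocation-light sets.

package javascript

// pkg is a stand-in for a resolved package record.
type pkg struct {
	Name         string
	Version      string
	Dependencies []string // dependency names
}

// resolveEdges builds a name->package index once, then resolves every
// dependency with a map lookup, replacing the O(n²) rescan pattern.
func resolveEdges(pkgs []*pkg) map[*pkg][]*pkg {
	byName := make(map[string]*pkg, len(pkgs))
	for _, p := range pkgs {
		byName[p.Name] = p
	}

	edges := make(map[*pkg][]*pkg, len(pkgs))
	for _, p := range pkgs {
		seen := make(map[string]struct{}, len(p.Dependencies)) // zero-byte set values
		for _, dep := range p.Dependencies {
			if _, dup := seen[dep]; dup {
				continue // duplicate dependency entry, skip
			}
			seen[dep] = struct{}{}
			if target, ok := byName[dep]; ok {
				edges[p] = append(edges[p], target)
			}
		}
	}
	return edges
}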


Phase 4: ID Generation Optimization 🚧

Goal: Reduce memory by 10-20% (50-100MB)
Status: Not started

  • Implement selective metadata hashing (see the sketch below)
  • Optimize sorting to avoid metadata stringification
  • Cache ID computations where possible

Expected Impact: 50-100MB reduction
Cumulative Impact: 65-75% total reduction
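
A sketch of selective metadata hashing with hypothetical field names (not Syft's actual ID implementation): only the fields that distinguish a package are streamed into a hash, rather than stringifying the entire metadata struct.

package artifact

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
)

// identity lists the fields that distinguish a package; large metadata
// blobs are deliberately excluded from ID computation.
type identity struct {
	Name     string
	Version  string
	Type     string
	Location string
}

// packageID streams selected fields into a hash instead of building a large
// string representation of the full metadata in memory.
func packageID(id identity) string {
	h := sha256.New()
	for _, field := range []string{id.Name, id.Version, id.Type, id.Location} {
		io.WriteString(h, field)
		h.Write([]byte{0}) // separator so "a"+"bc" never collides with "ab"+"c"
	}
	return hex.EncodeToString(h.Sum(nil))[:16] // short, stable ID
}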


Phase 5: Advanced Optimizations 🚧

Goal: Final 10-15% reduction (50-100MB)
Status: Not started

  • Implement memory pooling for frequently used structures (see the sketch below)
  • Add chunked processing for large datasets
  • Add configuration options for memory limits

Expected Impact: 50-100MB reduction
Cumulative Impact: 75-80% total reduction
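
A sketch of the pooling and chunking ideas, assuming hypothetical call sites: a sync.Pool reuses scratch buffers across iterations, and input is processed in fixed-size batches so peak memory tracks the chunk size rather than the total input.

package javascript

import (
	"bytes"
	"sync"
)

// bufPool reuses scratch buffers instead of allocating one per package,
// reducing GC pressure on large scans.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// processInChunks handles items in fixed-size batches, lending each batch a
// pooled scratch buffer that is returned afterwards.
func processInChunks[T any](items []T, chunkSize int, handle func(chunk []T, scratch *bytes.Buffer)) {
	for start := 0; start < len(items); start += chunkSize {
		end := start + chunkSize
		if end > len(items) {
			end = len(items)
		}
		buf := bufPool.Get().(*bytes.Buffer)
		buf.Reset()
		handle(items[start:end], buf)
		bufPool.Put(buf)
	}
}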

Expected Overall Impact

Phase     Reduction    Cumulative
Phase 1   100-150MB    15-20% (✅ Done)
Phase 2   200-300MB    40-50%
Phase 3   100-150MB    55-65%
Phase 4   50-100MB     65-75%
Phase 5   50-100MB     75-80%

Final Goal: Reduce peak memory from 3-5+ GB to 600MB-1GB

Related PRs

  • Phase 1: PR #4585 (submitted for review)

Testing & Validation

Performance metrics to track:

  • Peak memory allocation (heap profiling)
  • Allocation rate (pprof)
  • GC pause times and frequency
  • Execution time (ensure no regression)

Test cases needed:

  • Small JS projects (<100 packages)
  • Medium JS projects (100-1,000 packages)
  • Large JS projects (1,000-10,000 packages)
  • Very large JS projects (10,000+ packages)

Benchmark commands:

# Run with memory profiling
go test -bench=. -benchmem -memprofile=mem.prof ./...

# Compare profiles
go tool pprof -base mem_before.prof mem_after.prof

# Visualize allocations
go tool pprof -http=:8080 mem.prof
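
A hypothetical baseline benchmark to pair with the commands above (the fixture path and package name are assumptions): it measures allocations for a whole-document unmarshal, giving a number the streaming work in Phase 2 can be compared against via -benchmem and pprof.

package javascript

import (
	"encoding/json"
	"os"
	"testing"
)

// BenchmarkPackageLockBaseline unmarshals an entire (hypothetical) large
// lock file per iteration; allocs/op and B/op form the baseline that a
// streaming parser should beat.
func BenchmarkPackageLockBaseline(b *testing.B) {
	data, err := os.ReadFile("testdata/package-lock.json")
	if err != nil {
		b.Skip("fixture not present:", err)
	}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var doc map[string]json.RawMessage
		if err := json.Unmarshal(data, &doc); err != nil {
			b.Fatal(err)
		}
	}
}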

Additional Considerations

  1. Configuration: Add memory limits and graceful degradation (see the sketch below)
  2. Metrics: Expose memory usage metrics for monitoring
  3. Documentation: Update docs with memory requirements
  4. Testing: Add regression tests for memory usage
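
One possible shape for the memory-limit configuration in item 1, using Go's soft memory limit (runtime/debug.SetMemoryLimit, Go 1.19+); the flag name and function are assumptions, not existing Syft options:

package config

import "runtime/debug"

// ApplyMemoryLimit wires a user-supplied limit (e.g. from a hypothetical
// --memory-limit-mb flag) to the runtime's soft memory limit, making the GC
// work harder as the limit is approached instead of letting the heap grow
// unbounded.
func ApplyMemoryLimit(limitMB int64) {
	if limitMB <= 0 {
		return // no limit configured
	}
	debug.SetMemoryLimit(limitMB << 20) // MB -> bytes
}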

Motivation

Large JavaScript applications are common in modern development. Syft should be able to handle these without requiring excessive resources. These optimizations will:

  • Make Syft viable for real-world projects
  • Reduce CI/CD costs (smaller memory requirements)
  • Prevent OOM errors in production
  • Improve overall user experience

References

  • Full analysis: See MEMORY_ANALYSIS.md in the codebase
  • Profiling data: Available in pprof_baseline/ directories
  • Original issue: High memory consumption for JS projects
