Refactor: Deduplicate schemas using hash-based storage #4319

Merged: kddejong merged 2 commits into aws-cloudformation:main from kddejong:fix/schema/dedup on Dec 10, 2025

@kddejong (Contributor) commented Dec 9, 2025

This commit refactors the schema management system to eliminate duplicate storage of identical resource schemas across regions by implementing a hash-based deduplication strategy.

Changes:

  • Modified schema generation to store unique schemas in resources/ folder
  • Updated provider files to reference schemas by hash instead of embedding
  • Added hash checking before schema loading to reduce redundant I/O

Storage Impact:

  • Reduced schema storage from 25MB to 15.9MB (~36% reduction)

Performance:

  • Schema loading optimized by checking hash before reading file content
  • Eliminates redundant schema loads when same schema exists across regions
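The hash check described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the `SchemaCache` class, the `<hash>.json` file layout, and the `file_reads` counter are all hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


class SchemaCache:
    """Hypothetical cache: reads each unique schema file at most once."""

    def __init__(self, resources_dir: Path) -> None:
        self.resources_dir = resources_dir
        self._by_hash: dict = {}
        self.file_reads = 0  # counts real disk reads, for illustration only

    def load(self, schema_hash: str) -> dict:
        # Check the hash first; only touch the file system on a cache miss.
        cached = self._by_hash.get(schema_hash)
        if cached is not None:
            return cached
        # Assumed layout: one JSON file per hash inside the resources folder.
        path = self.resources_dir / f"{schema_hash}.json"
        schema = json.loads(path.read_text(encoding="utf-8"))
        self.file_reads += 1
        self._by_hash[schema_hash] = schema
        return schema


def demo() -> int:
    """Load the same schema for two regions and return the disk read count."""
    with tempfile.TemporaryDirectory() as tmp:
        resources = Path(tmp)
        (resources / "abc123.json").write_text(
            json.dumps({"typeName": "AWS::S3::Bucket"}), encoding="utf-8"
        )
        # Both regions point at the same (made-up) hash.
        provider_map = {
            "us-east-1": {"AWS::S3::Bucket": "abc123"},
            "eu-west-1": {"AWS::S3::Bucket": "abc123"},
        }
        cache = SchemaCache(resources)
        for region_map in provider_map.values():
            for digest in region_map.values():
                cache.load(digest)
        return cache.file_reads
```

With two regions referencing the same hash, `demo()` performs a single file read; the second lookup is served from memory.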

Technical Details:

  • Schemas with identical content now stored once with hash-based filename
  • Provider files map resource types to schema hashes for each region
  • Maintains backward compatibility with existing schema lookup APIs
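To make the storage layout above concrete, here is a small in-memory sketch of content-addressed deduplication. The PR text does not say which hash function or JSON canonicalization is used, so SHA-256 over sorted-key JSON is an assumption, and `schema_hash` and `deduplicate` are hypothetical helper names.

```python
import hashlib
import json


def schema_hash(schema: dict) -> str:
    """Hash a schema's canonical JSON form (SHA-256 is an assumption)."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(schemas_by_region: dict) -> tuple:
    """Split {region: {resource_type: schema}} into two structures.

    Returns (store, providers), where store maps hash -> schema and
    providers maps region -> {resource_type: hash}.
    """
    store: dict = {}
    providers: dict = {}
    for region, schemas in schemas_by_region.items():
        mapping = providers.setdefault(region, {})
        for resource_type, schema in schemas.items():
            digest = schema_hash(schema)
            # Identical content yields the same digest, so it is stored once.
            store.setdefault(digest, schema)
            mapping[resource_type] = digest
    return store, providers


bucket = {"typeName": "AWS::S3::Bucket", "properties": {"BucketName": {}}}
store, providers = deduplicate(
    {
        "us-east-1": {"AWS::S3::Bucket": bucket},
        "eu-west-1": {"AWS::S3::Bucket": dict(bucket)},
    }
)
```

In this example `store` holds one entry while both regions in `providers` reference the same hash, which is where the size reduction comes from.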

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov bot commented Dec 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.48%. Comparing base (52b4975) to head (cd00e12).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4319      +/-   ##
==========================================
- Coverage   94.35%   93.48%   -0.88%     
==========================================
  Files         416      417       +1     
  Lines       14059    14130      +71     
  Branches     2787     2816      +29     
==========================================
- Hits        13266    13210      -56     
- Misses        445      573     +128     
+ Partials      348      347       -1     
Flag        Coverage Δ
unittests   93.47% <100.00%> (-0.88%) ⬇️

This commit refactors the schema management system to eliminate duplicate
storage of identical resource schemas across regions by implementing a
hash-based deduplication strategy.

Changes:
- Modified schema generation to store unique schemas in resources/ folder
- Updated provider files to reference schemas by hash instead of embedding
- Added hash checking before schema loading to reduce redundant I/O
- Fixed Custom:: resource type normalization for proper hash lookup
- Added type annotations to resolve mypy errors
- Updated ruff noqa comments for consistency with project standards

Storage Impact:
- Reduced schema storage from 25MB to 15.9MB (~36% reduction)
- Provider files: 25MB → 2.9MB (now contain hash references)
- Resources folder: 8KB → 13MB (deduplicated schema storage)
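As a quick consistency check on the storage figures above, the arithmetic can be reproduced in a few lines. The variable names are illustrative, and the 8KB the resources folder held before the change is treated as negligible.

```python
# Sizes in MB, taken from the storage impact figures in the commit message.
before_mb = 25.0
providers_after_mb = 2.9
resources_after_mb = 13.0

# Total after the change: provider files plus the deduplicated resources folder.
after_mb = providers_after_mb + resources_after_mb

# Relative reduction compared with the original 25MB.
reduction = (before_mb - after_mb) / before_mb

print(f"after: {after_mb:.1f}MB, reduction: {reduction:.1%}")
```

The two post-change components sum to the reported 15.9MB, and the relative saving works out to about 36%.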

Performance:
- Schema loading optimized by checking hash before reading file content
- Eliminates redundant schema loads when same schema exists across regions
- Test suite runtime impact: ~4-5 seconds slower (within acceptable range)

Technical Details:
- Schemas with identical content now stored once with hash-based filename
- Provider files map resource types to schema hashes for each region
- Custom:: resources normalized to AWS::CloudFormation::CustomResource
- Maintains backward compatibility with existing schema lookup APIs
- Replace region-specific schema files with hash-based storage system
- Schemas now stored in resources/ directory with hash-based filenames
- Region files map resource types to schema hashes for deduplication
- Update patch_schemas() method to work with new storage structure
- Remove obsolete _patch_region_schemas() method
- Clean up redundant module.json creation in _update_provider_schema
- Maintain patching functionality in --update-specs command
- Fix related test failures for new storage system
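The `Custom::` normalization listed above might be sketched as follows. The function names and the shape of the provider map are hypothetical; only the target type `AWS::CloudFormation::CustomResource` comes from the commit message.

```python
from typing import Optional

CUSTOM_PREFIX = "Custom::"
CUSTOM_RESOURCE_TYPE = "AWS::CloudFormation::CustomResource"


def normalize_resource_type(resource_type: str) -> str:
    """Map any Custom::* type onto the shared custom resource type."""
    if resource_type.startswith(CUSTOM_PREFIX):
        return CUSTOM_RESOURCE_TYPE
    return resource_type


def lookup_hash(provider_map: dict, resource_type: str) -> Optional[str]:
    """Look up a schema hash after normalizing the resource type.

    provider_map is assumed to map resource type -> schema hash for one region.
    """
    return provider_map.get(normalize_resource_type(resource_type))


# Made-up hash value, for illustration only.
region_map = {CUSTOM_RESOURCE_TYPE: "deadbeef"}
custom_hash = lookup_hash(region_map, "Custom::MyLambdaBackedThing")
```

Without the normalization step, a template type such as `Custom::MyLambdaBackedThing` would have no entry in the provider map and the hash lookup would miss.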

This reduces storage from ~56 duplicated files per region to shared
hash-based files, significantly reducing repository size while
maintaining all existing functionality.
@kddejong kddejong merged commit da3d01f into aws-cloudformation:main Dec 10, 2025
22 checks passed
@kddejong kddejong deleted the fix/schema/dedup branch December 10, 2025 19:39