Fix registry incremental build processing all providers #63769

Merged — kaxil merged 11 commits into apache:main on Mar 17, 2026
Conversation
The `--provider` flag was only passed to `extract_metadata.py` but not to `extract_parameters.py` or `extract_connections.py`. This caused incremental builds to scan all 99 providers and 1625 modules instead of just the requested one.
gopidesupavan approved these changes on Mar 17, 2026
The registry workflow was building the CI image from scratch every run (~24 min) because it lacked the BuildKit mount cache that ci-image-build.yml provides. Inlining `breeze ci-image build` with registry cache doesn't help because the Docker layer cache invalidates on every commit when the build context changes. Split into two jobs, following the established pattern used by ci-amd-arm.yml and update-constraints-on-push.yml:

- `build-ci-image`: calls ci-image-build.yml, which handles mount cache restore, ghcr.io login, registry cache, and image stashing
- `build-and-publish-registry`: restores the stashed image via the prepare_breeze_and_image action, then runs the rest unchanged
Force-pushed from 7cc1435 to 305e9a9
`extract_parameters.py` with `--provider` intentionally skips writing modules.json (only the targeted provider's parameters are extracted). The merge script assumed modules.json always existed, causing a FileNotFoundError during incremental builds. Handle a missing new_modules_path the same way a missing existing_modules_path is already handled: treat it as an empty list.
The prepare_breeze_and_image action loads the CI image from /mnt, which requires make_mnt_writeable.sh to run first. Each job gets a fresh runner, so the writeable /mnt from the build job doesn't carry over.
Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with --frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.
The prebuild script ran `uv run` without --project, causing uv to resolve the full workspace including samba → krb5 which needs libkrb5-dev (not installed on the CI runner).
… on S3

Eleventy pagination templates emit empty fallback JSON for every provider, even when only one provider's data was extracted. A plain `aws s3 sync` uploads those stubs and overwrites real connection/parameter data. Changes:

- Exclude per-provider connections.json and parameters.json from the main S3 sync during incremental builds, then selectively upload only the target provider's API files
- Filter connections early in extract_connections.py (before the loop) and support space-separated multi-provider IDs
- Suppress SCARF_ANALYTICS and DO_NOT_TRACK telemetry in CI
- Document the Eleventy pagination limitation in README and AGENTS.md
The previous exclude only covered connections.json and parameters.json, but modules.json and versions.json for non-target providers also contain incomplete data (no version info extracted) and would overwrite correct data on S3. Simplify to exclude the entire api/providers/* subtree and selectively upload only the target provider's directory.
Non-target provider pages are rebuilt without connection/parameter data (the version-specific extraction files don't exist locally). Without this exclude, the incremental build overwrites complete HTML pages on S3 with versions missing the connection builder section.
The providers listing page uses merged data (all providers) and must be updated during incremental builds — especially for new providers. AWS CLI --include after --exclude re-includes the specific file.
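The resulting argument ordering for the incremental sync can be sketched like this (bucket name and paths are placeholders; with the AWS CLI, filters apply in order, so a later `--include` re-includes a file matched by an earlier `--exclude`):

```python
def build_sync_commands(target: str) -> list[list[str]]:
    """Return the main sync plus the selective target-provider upload."""
    bucket = "s3://example-registry-bucket"  # placeholder bucket
    main_sync = [
        "aws", "s3", "sync", "_site/", f"{bucket}/",
        "--exclude", "api/providers/*",       # incomplete per-provider API data
        "--exclude", "providers/*",           # rebuilt pages missing extracted data
        "--include", "providers/index.html",  # re-include: listing uses merged data
    ]
    target_upload = [
        "aws", "s3", "sync",
        f"_site/api/providers/{target}/",
        f"{bucket}/api/providers/{target}/",
    ]
    return [main_sync, target_upload]
```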
imrichardwu pushed a commit to imrichardwu/airflow that referenced this pull request on Mar 18, 2026
imrichardwu pushed a commit to imrichardwu/airflow that referenced this pull request on Mar 18, 2026
fat-catTW pushed a commit to fat-catTW/airflow that referenced this pull request on Mar 22, 2026
techcodie pushed a commit to techcodie/airflow that referenced this pull request on Mar 23, 2026
abhijeets25012-tech pushed a commit to abhijeets25012-tech/airflow that referenced this pull request on Apr 9, 2026
The `--provider` flag in `breeze registry extract-data` was only passed to `extract_metadata.py` but not to `extract_parameters.py` or `extract_connections.py`. This caused incremental builds to correctly extract metadata for just the requested provider, but then scan all 99 providers and all 1625 modules for parameters and connections. Example from a CI run — metadata correctly scoped to common-ai, but parameters processed everything.

The fix forwards the `--provider` flag to all three extraction scripts. Both `extract_parameters.py` and `extract_connections.py` already support `--provider` — it just wasn't being passed through.