Fix registry incremental build processing all providers by kaxil · Pull Request #63769 · apache/airflow

kaxil · 2026-03-17T00:26:38Z

The --provider flag in breeze registry extract-data was only passed to extract_metadata.py but not to extract_parameters.py or extract_connections.py. This caused incremental builds to correctly extract metadata for just the requested provider, but then scan all 99 providers and all 1625 modules for parameters and connections.

Example from CI run — metadata correctly scoped to common-ai, but parameters processed everything:

Incremental mode: extracting provider(s) {'common-ai'}
Wrote 1 providers to /opt/airflow/registry/src/_data    ← correct

Found 99 provider.yaml files                            ← should be 1
Processed 1625 classes                                  ← should be ~3
Extracted 14514 total parameters                        ← way too many

The fix forwards the --provider flag to all three extraction scripts. Both extract_parameters.py and extract_connections.py already support --provider — it just wasn't being passed through.
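The forwarding described above can be sketched roughly as follows. The three script names come from this description; the `build_extract_commands` helper and the way commands are assembled are illustrative assumptions, not the actual breeze implementation.

```python
from __future__ import annotations

# Script names are taken from the PR description; everything else is hypothetical.
EXTRACT_SCRIPTS = (
    "extract_metadata.py",
    "extract_parameters.py",
    "extract_connections.py",
)


def build_extract_commands(provider: str | None = None) -> list[list[str]]:
    """Return one argv list per extraction script.

    When ``provider`` is given, ``--provider`` is appended to every command,
    not only to the metadata one (which was the bug).
    """
    commands = []
    for script in EXTRACT_SCRIPTS:
        cmd = ["python", script]
        if provider:
            cmd += ["--provider", provider]
        commands.append(cmd)
    return commands
```

Calling `build_extract_commands("common-ai")` yields three commands that all end in `--provider common-ai`; calling it with no argument leaves the flag off for a full build.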

The `--provider` flag was only passed to `extract_metadata.py` but not to `extract_parameters.py` or `extract_connections.py`. This caused incremental builds to scan all 99 providers and 1625 modules instead of just the requested one.

The registry workflow was building the CI image from scratch every run (~24 min) because it lacked the BuildKit mount cache that ci-image-build.yml provides. Inline `breeze ci-image build` with registry cache doesn't help because Docker layer cache invalidates on every commit when the build context changes.

Split into two jobs following the established pattern used by ci-amd-arm.yml and update-constraints-on-push.yml:

- `build-ci-image`: calls ci-image-build.yml, which handles mount cache restore, ghcr.io login, registry cache, and image stashing
- `build-and-publish-registry`: restores the stashed image via the prepare_breeze_and_image action, then runs the rest unchanged

extract_parameters.py with --provider intentionally skips writing modules.json (only the targeted provider's parameters are extracted). The merge script assumed modules.json always exists, causing a FileNotFoundError during incremental builds. Handle missing new_modules_path the same way missing existing_modules_path is already handled: treat it as an empty list.
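A minimal sketch of the "missing file means empty list" handling, assuming modules.json holds a JSON list. The `load_modules` name is hypothetical and the real merge script may structure this differently.

```python
import json
from pathlib import Path


def load_modules(path: Path) -> list:
    """Load a modules.json file, treating a missing file as an empty list.

    Assumption: the file, when present, contains a JSON list.
    """
    if not path.exists():
        # Incremental extraction skips writing modules.json, so absence is expected.
        return []
    return json.loads(path.read_text())
```

Applying the same loader to both the existing and the new modules path keeps the two cases symmetric, so the merge step never has to special-case incremental builds.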

The prepare_breeze_and_image action loads the CI image from /mnt, which requires make_mnt_writeable.sh to run first. Each job gets a fresh runner, so the writeable /mnt from the build job doesn't carry over.

Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with --frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.

The prebuild script ran `uv run` without --project, causing uv to resolve the full workspace including samba → krb5 which needs libkrb5-dev (not installed on the CI runner).

… on S3

Eleventy pagination templates emit empty fallback JSON for every provider, even when only one provider's data was extracted. A plain `aws s3 sync` uploads those stubs and overwrites real connection/parameter data.

Changes:

- Exclude per-provider connections.json and parameters.json from the main S3 sync during incremental builds, then selectively upload only the target provider's API files
- Filter connections early in extract_connections.py (before the loop) and support space-separated multi-provider IDs
- Suppress SCARF_ANALYTICS and DO_NOT_TRACK telemetry in CI
- Document the Eleventy pagination limitation in README and AGENTS.md
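The early filtering with space-separated provider IDs could look something like the sketch below. The function names and the `provider_id` key are assumptions made for illustration; the actual data shape in extract_connections.py is not shown in this PR description.

```python
from __future__ import annotations


def parse_provider_ids(raw: str) -> set[str]:
    """Split a space-separated --provider value into a set of IDs."""
    return set(raw.split())


def filter_connections(connections: list[dict], raw_provider: str | None) -> list[dict]:
    """Keep only connections of the requested providers.

    Filtering happens once, before any per-connection processing loop.
    With no provider given, everything is kept (full build).
    """
    if not raw_provider:
        return connections
    wanted = parse_provider_ids(raw_provider)
    # "provider_id" is a hypothetical key name.
    return [c for c in connections if c.get("provider_id") in wanted]
```

Filtering up front means the rest of the script only ever sees the targeted providers, rather than checking the flag inside the loop.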

The previous exclude only covered connections.json and parameters.json, but modules.json and versions.json for non-target providers also contain incomplete data (no version info extracted) and would overwrite correct data on S3. Simplify to exclude the entire api/providers/* subtree and selectively upload only the target provider's directory.

Non-target provider pages are rebuilt without connection/parameter data (the version-specific extraction files don't exist locally). Without this exclude, the incremental build overwrites complete HTML pages on S3 with versions missing the connection builder section.

The providers listing page uses merged data (all providers) and must be updated during incremental builds — especially for new providers. AWS CLI --include after --exclude re-includes the specific file.
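The exclude-then-include behaviour relies on the AWS CLI rule that all files are included by default and later filters take precedence over earlier ones. The sketch below mimics that rule with `fnmatch` to show why the listing page survives the exclude. It is not the CLI's implementation, and the patterns are hypothetical stand-ins for the ones used in the workflow.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns, in the order they would appear on the command line.
INCREMENTAL_FILTERS = [
    ("exclude", "api/providers/*"),
    ("exclude", "providers/*"),
    ("include", "providers/index.html"),
]


def is_synced(key: str, filters=INCREMENTAL_FILTERS) -> bool:
    """Return True if ``key`` would be uploaded under the given filters.

    Everything is included by default; the last matching filter wins.
    """
    included = True
    for action, pattern in filters:
        if fnmatchcase(key, pattern):
            included = action == "include"
    return included
```

Under these assumed patterns, `providers/index.html` is uploaded while per-provider pages and API files are skipped and must be uploaded selectively for the target provider.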


kaxil requested review from amoghrajesh, ashb, bugraoz93, choo121600, ephraimbuddy, gopidesupavan, jason810496, jedcunningham, jscheffl and potiuk as code owners March 17, 2026 00:26

boring-cyborg bot added area:dev-tools area:registry backport-to-v3-1-test labels Mar 17, 2026

github-project-automation bot added this to Airflow Registry Mar 17, 2026

github-project-automation bot moved this to Backlog in Airflow Registry Mar 17, 2026

kaxil removed the backport-to-v3-1-test label Mar 17, 2026

kaxil moved this from Backlog to In review in Airflow Registry Mar 17, 2026

gopidesupavan approved these changes Mar 17, 2026


kaxil added 2 commits March 17, 2026 01:07

Fix registry workflow failures due to workspace dependency resolution

b8bc2dd

kaxil force-pushed the fix-registry-incremental-provider-flag branch from 7cc1435 to 305e9a9 Compare March 17, 2026 01:42

kaxil added 8 commits March 17, 2026 01:43

Fix /mnt not writable when loading stashed CI image

f78dc02

The prepare_breeze_and_image action loads the CI image from /mnt, which requires make_mnt_writeable.sh to run first. Each job gets a fresh runner, so the writeable /mnt from the build job doesn't carry over.

Regenerate pnpm lockfile for workspace mode

374d89d

Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with --frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.

Scope prebuild uv resolution to dev/registry project

014a317

The prebuild script ran `uv run` without --project, causing uv to resolve the full workspace including samba → krb5 which needs libkrb5-dev (not installed on the CI runner).

Re-include providers/index.html in incremental S3 sync

fce97bc

The providers listing page uses merged data (all providers) and must be updated during incremental builds — especially for new providers. AWS CLI --include after --exclude re-includes the specific file.

kaxil merged commit 208eab4 into apache:main Mar 17, 2026
130 checks passed

kaxil deleted the fix-registry-incremental-provider-flag branch March 17, 2026 19:43

github-project-automation bot moved this from In review to Done in Airflow Registry Mar 17, 2026

kaxil merged 11 commits into apache:main from astronomer:fix-registry-incremental-provider-flag
