Fix registry incremental build processing all providers #63769

Merged
kaxil merged 11 commits into apache:main from astronomer:fix-registry-incremental-provider-flag on Mar 17, 2026

kaxil (Member) commented on Mar 17, 2026

The --provider flag in breeze registry extract-data was passed only to extract_metadata.py, not to extract_parameters.py or extract_connections.py. As a result, incremental builds extracted metadata for just the requested provider but then scanned all 99 providers and all 1625 modules for parameters and connections.

Example from CI run — metadata correctly scoped to common-ai, but parameters processed everything:

Incremental mode: extracting provider(s) {'common-ai'}
Wrote 1 providers to /opt/airflow/registry/src/_data    ← correct

Found 99 provider.yaml files                            ← should be 1
Processed 1625 classes                                  ← should be ~3
Extracted 14514 total parameters                        ← way too many

The fix forwards the --provider flag to all three extraction scripts. Both extract_parameters.py and extract_connections.py already support --provider — it just wasn't being passed through.
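
The forwarding described above can be sketched roughly as follows. This is an illustration only, not the actual breeze code: the helper name `build_extract_commands` and the bare `python <script>` invocation are assumptions. Only the three script names and the `--provider` flag come from this PR.

```python
from __future__ import annotations

# The three extraction scripts named in this PR.
EXTRACT_SCRIPTS = (
    "extract_metadata.py",
    "extract_parameters.py",
    "extract_connections.py",
)


def build_extract_commands(provider: str | None = None) -> list[list[str]]:
    """Return one argv list per extraction script (hypothetical helper).

    When ``provider`` is set, the same ``--provider`` value is appended to
    every command, so no script falls back to scanning all providers.
    """
    commands = []
    for script in EXTRACT_SCRIPTS:
        argv = ["python", script]
        if provider:
            argv += ["--provider", provider]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in build_extract_commands("common-ai"):
        print(" ".join(argv))
```

Building all three commands in one loop means a script cannot be left out of the scoping by accident, which is the class of bug this PR fixes.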

The `--provider` flag was only passed to `extract_metadata.py` but not
to `extract_parameters.py` or `extract_connections.py`. This caused
incremental builds to scan all 99 providers and 1625 modules instead
of just the requested one.
kaxil added 2 commits March 17, 2026 01:07
The registry workflow was building the CI image from scratch every run
(~24 min) because it lacked the BuildKit mount cache that
ci-image-build.yml provides. Inline `breeze ci-image build` with
registry cache doesn't help because Docker layer cache invalidates
on every commit when the build context changes.

Split into two jobs following the established pattern used by
ci-amd-arm.yml and update-constraints-on-push.yml:

- `build-ci-image`: calls ci-image-build.yml which handles mount cache
  restore, ghcr.io login, registry cache, and image stashing
- `build-and-publish-registry`: restores the stashed image via
  prepare_breeze_and_image action, then runs the rest unchanged
kaxil force-pushed the fix-registry-incremental-provider-flag branch from 7cc1435 to 305e9a9 on March 17, 2026 01:42
kaxil added 8 commits March 17, 2026 01:43
extract_parameters.py with --provider intentionally skips writing
modules.json (only the targeted provider's parameters are extracted).
The merge script assumed modules.json always exists, causing a
FileNotFoundError during incremental builds.

Handle missing new_modules_path the same way missing
existing_modules_path is already handled: treat it as an empty list.
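
A minimal sketch of the "missing file means empty list" handling described above. The names `new_modules_path` and `existing_modules_path` come from the commit message. The helper names, the assumption that modules.json is a top-level JSON list, and the `provider_id` key used for merging are hypothetical and may not match the real merge script.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_modules(path: Path) -> list[dict]:
    """Load a modules.json file, treating a missing file as an empty list."""
    if not path.exists():
        return []
    return json.loads(path.read_text())


def merge_modules(existing_modules_path: Path, new_modules_path: Path) -> list[dict]:
    """Merge new modules over existing ones (hypothetical merge rule).

    Assumption: each entry carries a "provider_id" key, and entries for any
    provider present in the new file replace that provider's old entries.
    """
    existing = load_modules(existing_modules_path)
    new = load_modules(new_modules_path)
    refreshed = {m["provider_id"] for m in new}
    kept = [m for m in existing if m["provider_id"] not in refreshed]
    return kept + new
```

With this shape, an incremental run that never wrote modules.json leaves the existing entries untouched instead of raising FileNotFoundError.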
The prepare_breeze_and_image action loads the CI image from /mnt, which
requires make_mnt_writeable.sh to run first. Each job gets a fresh
runner, so the writeable /mnt from the build job doesn't carry over.
Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm
processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with
--frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.
The prebuild script ran `uv run` without --project, causing uv to
resolve the full workspace including samba → krb5 which needs
libkrb5-dev (not installed on the CI runner).
… on S3

Eleventy pagination templates emit empty fallback JSON for every provider,
even when only one provider's data was extracted.  A plain `aws s3 sync`
uploads those stubs and overwrites real connection/parameter data.

Changes:
- Exclude per-provider connections.json and parameters.json from the main
  S3 sync during incremental builds, then selectively upload only the
  target provider's API files
- Filter connections early in extract_connections.py (before the loop)
  and support space-separated multi-provider IDs
- Suppress SCARF_ANALYTICS and DO_NOT_TRACK telemetry in CI
- Document the Eleventy pagination limitation in README and AGENTS.md
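
The early filtering with space-separated provider IDs might look roughly like this. It is a sketch under assumptions: the function names and the `provider_id` key on each connection record are hypothetical and not taken from extract_connections.py.

```python
from __future__ import annotations


def parse_provider_ids(raw: str | None) -> set[str]:
    """Split a space-separated --provider value into a set of IDs."""
    return set(raw.split()) if raw else set()


def filter_connections(connections: list[dict], raw_provider: str | None) -> list[dict]:
    """Keep only connections for the requested providers.

    An empty or missing value means a full build, so nothing is filtered.
    Assumption: each connection dict has a "provider_id" key.
    """
    wanted = parse_provider_ids(raw_provider)
    if not wanted:
        return connections
    return [c for c in connections if c["provider_id"] in wanted]
```

Filtering once, before the processing loop, keeps the loop body unchanged and avoids doing work for providers whose output would be discarded anyway.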
The previous exclude only covered connections.json and parameters.json,
but modules.json and versions.json for non-target providers also contain
incomplete data (no version info extracted) and would overwrite correct
data on S3.  Simplify to exclude the entire api/providers/* subtree and
selectively upload only the target provider's directory.
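
As a rough sketch of the two-step upload described above: one main sync that excludes the whole `api/providers/*` subtree, followed by one sync per target provider directory. The helper name, the local site directory, and the bucket URL are placeholders, and the `api/providers/<provider>` layout is inferred from the commit message rather than confirmed against the workflow.

```python
from __future__ import annotations


def build_sync_commands(site_dir: str, bucket_url: str, providers: list[str]) -> list[list[str]]:
    """Return argv lists for an incremental upload (hypothetical helper)."""
    # Main sync: everything except the per-provider API subtree.
    commands = [
        ["aws", "s3", "sync", site_dir, bucket_url, "--exclude", "api/providers/*"],
    ]
    # Selective syncs: only the directories of the providers that were extracted.
    for provider in providers:
        sub_path = f"api/providers/{provider}"
        commands.append(
            ["aws", "s3", "sync", f"{site_dir}/{sub_path}", f"{bucket_url}/{sub_path}"]
        )
    return commands


if __name__ == "__main__":
    # "_site" and the bucket URL are illustrative placeholders.
    for argv in build_sync_commands("_site", "s3://example-bucket", ["common-ai"]):
        print(" ".join(argv))
```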
Non-target provider pages are rebuilt without connection/parameter data
(the version-specific extraction files don't exist locally). Without
this exclude, the incremental build overwrites complete HTML pages on
S3 with versions missing the connection builder section.
The providers listing page uses merged data (all providers) and must
be updated during incremental builds — especially for new providers.
AWS CLI --include after --exclude re-includes the specific file.
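
The ordering rule relied on here (every file starts out included, and a later filter overrides an earlier one) can be modelled in a few lines. This is a simplified model of the AWS CLI behaviour, not its implementation, and the `providers/*` and `providers/index.html` patterns are assumed paths used for illustration.

```python
from __future__ import annotations

from fnmatch import fnmatchcase


def is_uploaded(key: str, filters: list[tuple[str, str]]) -> bool:
    """Decide whether a file would be synced under ordered filters.

    Every file starts out included. Filters are applied in order, and the
    last filter whose pattern matches the key decides the outcome.
    """
    included = True
    for action, pattern in filters:
        if fnmatchcase(key, pattern):
            included = action == "include"
    return included


# Assumed patterns: exclude all provider pages, then re-include the listing page.
FILTERS = [("exclude", "providers/*"), ("include", "providers/index.html")]

if __name__ == "__main__":
    for key in ("index.html", "providers/index.html", "providers/amazon/index.html"):
        print(key, is_uploaded(key, FILTERS))
```

Reversing the two filters would exclude the listing page again, because the broader `--exclude` would then be the last match.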
kaxil merged commit 208eab4 into apache:main on Mar 17, 2026
130 checks passed
kaxil deleted the fix-registry-incremental-provider-flag branch on March 17, 2026 19:43
github-project-automation bot moved this from In review to Done in Airflow Registry on Mar 17, 2026
imrichardwu pushed a commit to imrichardwu/airflow that referenced this pull request Mar 18, 2026
The `--provider` flag was only passed to `extract_metadata.py` but not
to `extract_parameters.py` or `extract_connections.py`. This caused
incremental builds to scan all 99 providers and 1625 modules instead
of just the requested one.

The registry workflow was building the CI image from scratch every run
(~24 min) because it lacked the BuildKit mount cache that
ci-image-build.yml provides. Inline `breeze ci-image build` with
registry cache doesn't help because Docker layer cache invalidates
on every commit when the build context changes.

Split into two jobs following the established pattern used by
ci-amd-arm.yml and update-constraints-on-push.yml:

- `build-ci-image`: calls ci-image-build.yml which handles mount cache
  restore, ghcr.io login, registry cache, and image stashing
- `build-and-publish-registry`: restores the stashed image via
  prepare_breeze_and_image action, then runs the rest unchanged

* Fix merge crash when incremental extract skips modules.json

extract_parameters.py with --provider intentionally skips writing
modules.json (only the targeted provider's parameters are extracted).
The merge script assumed modules.json always exists, causing a
FileNotFoundError during incremental builds.

Handle missing new_modules_path the same way missing
existing_modules_path is already handled: treat it as an empty list.

* Fix /mnt not writable when loading stashed CI image

The prepare_breeze_and_image action loads the CI image from /mnt, which
requires make_mnt_writeable.sh to run first. Each job gets a fresh
runner, so the writeable /mnt from the build job doesn't carry over.

* Regenerate pnpm lockfile for workspace mode

Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm
processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with
--frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.

* Scope prebuild uv resolution to dev/registry project

The prebuild script ran `uv run` without --project, causing uv to
resolve the full workspace including samba → krb5 which needs
libkrb5-dev (not installed on the CI runner).

Eleventy pagination templates emit empty fallback JSON for every provider,
even when only one provider's data was extracted.  A plain `aws s3 sync`
uploads those stubs and overwrites real connection/parameter data.

Changes:
- Exclude per-provider connections.json and parameters.json from the main
  S3 sync during incremental builds, then selectively upload only the
  target provider's API files
- Filter connections early in extract_connections.py (before the loop)
  and support space-separated multi-provider IDs
- Suppress SCARF_ANALYTICS and DO_NOT_TRACK telemetry in CI
- Document the Eleventy pagination limitation in README and AGENTS.md

* Exclude all per-provider API files during incremental S3 sync

The previous exclude only covered connections.json and parameters.json,
but modules.json and versions.json for non-target providers also contain
incomplete data (no version info extracted) and would overwrite correct
data on S3.  Simplify to exclude the entire api/providers/* subtree and
selectively upload only the target provider's directory.

* Also exclude provider HTML pages during incremental S3 sync

Non-target provider pages are rebuilt without connection/parameter data
(the version-specific extraction files don't exist locally). Without
this exclude, the incremental build overwrites complete HTML pages on
S3 with versions missing the connection builder section.

The providers listing page uses merged data (all providers) and must
be updated during incremental builds — especially for new providers.
AWS CLI --include after --exclude re-includes the specific file.
fat-catTW pushed a commit to fat-catTW/airflow that referenced this pull request Mar 22, 2026
techcodie pushed a commit to techcodie/airflow that referenced this pull request Mar 23, 2026
abhijeets25012-tech pushed a commit to abhijeets25012-tech/airflow that referenced this pull request Apr 9, 2026
Projects

Status: Done

2 participants