Skip to content

Support staging Python packages given via a URL#38348

Open
chamikaramj wants to merge 3 commits intoapache:masterfrom
chamikaramj:python_provider_staging
Open

Support staging Python packages given via a URL#38348
chamikaramj wants to merge 3 commits intoapache:masterfrom
chamikaramj:python_provider_staging

Conversation

@chamikaramj
Copy link
Copy Markdown
Contributor

For example, with this Beam YAML users can provide packages that include Python providers via a URL instead of having to provide a local package or a PyPI package.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the summary. You can try again by commenting /gemini summary.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Assigning reviewers:

R: @claudevdm for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@claudevdm
Copy link
Copy Markdown
Collaborator

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Python SDK stager to download remote packages specified in requirements files. A bug was identified in the requirements thinning logic where using local file paths instead of original URLs prevents the URLs from being removed from the thinned requirements file. Additionally, the filename extraction from URLs is fragile and should be improved to handle query parameters and avoid potential name collisions.

dep)
local_file_path = FileSystems.join(staging_temp_dir, last_component)
Stager._download_file(dep, local_file_path)
local_deps.append(local_file_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a bug in the requirements thinning logic. The Stager._remove_dependency_from_requirements method (called after this loop) uses the items in local_deps to identify which lines to remove from the original requirements file.

By appending the local temporary path (local_file_path) to local_deps instead of the original URL string (dep), the thinning logic will fail to match the URL line in the file. Consequently, the URL will remain in the thinned requirements file, causing the worker's pip to attempt to download it again, which defeats the purpose of staging the package locally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this analysis is correct. AFAICT Stager._remove_dependency_from_requirements is called one time specifically to remove the "apache-beam" package. Not as a way to trim the requirements file based on local deps.

tmp_requirements_filepath = Stager._remove_dependency_from_requirements(

Comment thread sdks/python/apache_beam/runners/portability/stager.py Outdated
@chamikaramj
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the Beam Python stager to support downloading remote packages listed in requirements files. It updates _extract_local_packages to parse URLs, download remote assets to a temporary directory, and include them in the staging process. Feedback highlights a missing urlparse import that would cause a runtime error and suggests improvements to handle filename collisions and unsupported URL schemes (like VCS) more gracefully.

staging_temp_dir = tempfile.mkdtemp(dir=temp_dir)
for line in fin:
dep = line.strip()
parsed_url = urlparse(dep)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The urlparse function is used here but it is not imported in this file. This will cause a NameError at runtime. Please add from urllib.parse import urlparse to the imports section of the file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect analysis (import is available at the top of the file).

Comment on lines +725 to +727
local_file_path = FileSystems.join(staging_temp_dir, last_component)
Stager._download_file(dep, local_file_path)
local_deps.append(local_file_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This implementation has two potential issues:

  1. Filename Collision: If multiple URLs have the same basename (e.g., .../v1/pkg.tgz and .../v2/pkg.tgz), they will overwrite each other in the temporary directory, and local_deps will contain the same path twice.
  2. Unsupported Schemes: FileSystems.open (used in _download_file) does not support VCS schemes like git+https. If such a URL is encountered, the staging process will crash.

Adding a unique prefix to the filename and wrapping the download in a try-except block to fall back to pypi_deps (letting pip handle it on the worker) makes this more robust.

Suggested change
local_file_path = FileSystems.join(staging_temp_dir, last_component)
Stager._download_file(dep, local_file_path)
local_deps.append(local_file_path)
local_file_path = FileSystems.join(
staging_temp_dir, f"{len(local_deps)}_{last_component}")
try:
Stager._download_file(dep, local_file_path)
local_deps.append(local_file_path)
except Exception as e:
_LOGGER.warning(
'Failed to download remote package %s, falling back to pip: %s',
dep,
e)
pypi_deps.append(dep)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants