Support staging Python packages given via a URL#38348
Support staging Python packages given via a URL#38348chamikaramj wants to merge 3 commits intoapache:masterfrom
Conversation
|
Warning Gemini encountered an error creating the summary. You can try again by commenting |
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
|
Assigning reviewers: R: @claudevdm for label python. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the Python SDK stager to download remote packages specified in requirements files. A bug was identified in the requirements thinning logic where using local file paths instead of original URLs prevents the URLs from being removed from the thinned requirements file. Additionally, the filename extraction from URLs is fragile and should be improved to handle query parameters and avoid potential name collisions.
| dep) | ||
| local_file_path = FileSystems.join(staging_temp_dir, last_component) | ||
| Stager._download_file(dep, local_file_path) | ||
| local_deps.append(local_file_path) |
There was a problem hiding this comment.
There is a bug in the requirements thinning logic. The Stager._remove_dependency_from_requirements method (called after this loop) uses the items in local_deps to identify which lines to remove from the original requirements file.
By appending the local temporary path (local_file_path) to local_deps instead of the original URL string (dep), the thinning logic will fail to match the URL line in the file. Consequently, the URL will remain in the thinned requirements file, causing the worker's pip to attempt to download it again, which defeats the purpose of staging the package locally.
There was a problem hiding this comment.
I don't think this analysis is correct. AFAICT Stager._remove_dependency_from_requirements is called one time specifically to remove the "apache-beam" package. Not as a way to trim the requirements file based on local deps.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request enhances the Beam Python stager to support downloading remote packages listed in requirements files. It updates _extract_local_packages to parse URLs, download remote assets to a temporary directory, and include them in the staging process. Feedback highlights a missing urlparse import that would cause a runtime error and suggests improvements to handle filename collisions and unsupported URL schemes (like VCS) more gracefully.
| staging_temp_dir = tempfile.mkdtemp(dir=temp_dir) | ||
| for line in fin: | ||
| dep = line.strip() | ||
| parsed_url = urlparse(dep) |
There was a problem hiding this comment.
Incorrect analysis (import is available at the top of the file).
| local_file_path = FileSystems.join(staging_temp_dir, last_component) | ||
| Stager._download_file(dep, local_file_path) | ||
| local_deps.append(local_file_path) |
There was a problem hiding this comment.
This implementation has two potential issues:
- Filename Collision: If multiple URLs have the same basename (e.g.,
.../v1/pkg.tgzand.../v2/pkg.tgz), they will overwrite each other in the temporary directory, andlocal_depswill contain the same path twice. - Unsupported Schemes:
FileSystems.open(used in_download_file) does not support VCS schemes likegit+https. If such a URL is encountered, the staging process will crash.
Adding a unique prefix to the filename and wrapping the download in a try-except block to fall back to pypi_deps (letting pip handle it on the worker) makes this more robust.
| local_file_path = FileSystems.join(staging_temp_dir, last_component) | |
| Stager._download_file(dep, local_file_path) | |
| local_deps.append(local_file_path) | |
| local_file_path = FileSystems.join( | |
| staging_temp_dir, f"{len(local_deps)}_{last_component}") | |
| try: | |
| Stager._download_file(dep, local_file_path) | |
| local_deps.append(local_file_path) | |
| except Exception as e: | |
| _LOGGER.warning( | |
| 'Failed to download remote package %s, falling back to pip: %s', | |
| dep, | |
| e) | |
| pypi_deps.append(dep) |
For example, with this Beam YAML users can provide packages that include Python providers via a URL instead of having to provide a local package or a PyPI package.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.