Skip to content

Implement download command for training job#8029

Merged
saanikaguptamicrosoft merged 13 commits into
Azure:foundry-training-devfrom
saanikaguptamicrosoft:saanika/downlod2
May 8, 2026
Merged

Implement download command for training job#8029
saanikaguptamicrosoft merged 13 commits into
Azure:foundry-training-devfrom
saanikaguptamicrosoft:saanika/downlod2

Conversation

@saanikaguptamicrosoft
Copy link
Copy Markdown
Collaborator

@saanikaguptamicrosoft saanikaguptamicrosoft commented May 4, 2026

Notes

  • Add download capability for job artifacts and named outputs. The download is only supported for jobs in terminal states: Completed, Canceled, Failed, Paused, NotResponding
  • High level flow
    • Named outputs → resolves the asset's SAS URI via the foundry API and shells out to azcopy for the transfer (handles large blobs and folder containers efficiently).
    • Default artifacts → calls AML history APIs to list artifacts, fetches each one's contentUri, and downloads them in parallel.
    • Downloads run with up to 8 concurrent workers. Retries on transient failures (HTTP 429, 5xx, transport errors) with exponential backoff.
    • Each artifact is written to a sibling .tmp file and atomically renamed on success — partial/interrupted downloads never leave corrupted files at the destination. Stale .tmp files from a prior interrupted run are swept on the next download.
    • Defends against path-traversal: file paths returned by the API are validated to stay inside the destination directory, so a malformed response can't write files outside it.
    • Enhancement - Added job name to download path (if not provided by user) so that we don't override an existing user file by mistake. In case of AML, the override happens silently.

Testing

Happy paths-

  • image
  • image
  • image
  • image
  • Re-running download to the same path overwrites cleanly

Unhappy paths-

  • image
  • image
  • Download interruption via Ctrl+C doesn't lead to download of corrupted or temporary files, the operation is atomic

Copy link
Copy Markdown
Member

@jongio jongio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid start on the download command. A few things to address before this is merge-ready:

Path safety - downloadOne joins server-provided paths directly into the destination directory without validating for traversal (../). Defense-in-depth: sanitize before writing.

Retry policy - IsRetryable returns true for every non-nil error (line 41), making the url.Error check on line 38 dead code. The status-code checks below (lines 43-48) are unreachable when err != nil. Combined with callers that always pass status=0, permanent failures get retried pointlessly.

Test coverage - 915 new lines with no tests. At minimum: retry classification, path sanitization, mode selection logic (default vs named vs all), and the tracking-endpoint extraction deserve unit tests.

Partial file cleanup - failed downloads can leave empty/partial files on disk. Consider writing to a temp file and renaming on success, or cleaning up on error.

Duplicate formatBytes - azcopy/runner.go already has this function. Consider extracting to a shared util or reusing.

Comment thread cli/azd/extensions/azure.ai.customtraining/internal/download/download.go Outdated
Comment thread cli/azd/extensions/azure.ai.customtraining/internal/download/download.go Outdated
Comment thread cli/azd/extensions/azure.ai.customtraining/internal/download/download.go Outdated
Comment thread cli/azd/extensions/azure.ai.customtraining/internal/cmd/job_download.go Outdated
Copy link
Copy Markdown
Contributor

@wbreza wbreza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #8029: Implement download command for training job

TL;DR: Adds �zd custom-training job download with three modes (default artifacts, single named output, all outputs). Includes retry logic with exponential backoff and parallel downloading across 7 files (+915/-1). Supplements @jongio's existing review with additional findings.

Note: @jongio already flagged path traversal, broken retry classification, zero test coverage, partial file cleanup, and duplicate ormatBytes. The findings below are net-new only.


🔴 Must Fix

1. HTTP response body never closed on error paths — pkg/client/download.go
All 5 API methods (GetModelVersion, GetModelCredentials, GetDatasetCredentials, ListRunArtifacts, GetRunArtifactContentInfo) call c.HandleError(resp) when status ≠ 200 but never close
esp.Body. This leaks TCP connections under sustained error conditions.
Fix: Add defer resp.Body.Close() immediately after the Do() call, before the status check.

2. Semaphore leak on nil artifact entries — internal/download/download.go ~line 71
When info == nil, the code acquires a semaphore slot (sem <- struct{}{}) then continues the loop without releasing it. After enough nil entries, the semaphore fills and the entire download hangs.
Fix: Move sem <- struct{}{} after the nil check, or release the slot in the nil path.

3. Unvalidated tracking endpoint URL (SSRF risk) — internal/cmd/job_download.go ~line 328, pkg/client/download.go ~line 62
�xtractTrackingEndpoint extracts a URL from untrusted server JSON and uses it directly to construct API requests. A malicious or compromised job response could redirect artifact downloads to an attacker-controlled server.
Fix: Validate the URL scheme is https:// and the host matches expected Azure domains (e.g., *.api.azureml.ms).


🟡 Should Fix

4. Nil dereference when API returns (nil, nil) — internal/cmd/job_download.go ~lines 168, 190, 209, 227
Pattern: �ar modelVer *models.ModelVersion → retry closure sets it on success → code dereferences without nil guard. If the API returns (nil, nil), the next access panics. Appears in 4 places (modelVer, creds ×2, history).
Fix: Add if modelVer == nil { return fmt.Errorf("...") } after each retry block.

5. Missing nil guard on pagination response — internal/cmd/job_download.go ~line 223
If ListRunArtifacts returns (nil, nil), accessing page.Value panics.
Fix: Guard with if page == nil { break } before appending.

6. Credential leakage in error messages — internal/download/download.go ~line 106
Failed download responses (up to 1024 bytes) are included verbatim in error messages. If the server echoes SAS tokens or credentials in error responses, these propagate to user-visible output.
Fix: Limit error details to status code; don't include raw response body.

7. Unbounded download size — internal/download/download.go ~line 103
io.Copy(f, resp.Body) streams directly to disk without Content-Length validation. A malicious server could exhaust disk space.
Fix: Validate Content-Length header or wrap
esp.Body with io.LimitedReader.

8. Race condition on duplicate artifact paths — internal/download/download.go ~line 78
Multiple goroutines can race to os.Create the same file path if the server returns duplicate artifact paths.
Fix: Deduplicate artifact paths before the download loop, or use temp files with atomic rename.

9. Client initialization boilerplate duplicated — internal/cmd/job_download.go ~line 134

ewDownloadClient() duplicates the exact credential/client setup from job_delete.go and job_cancel.go.
Fix: Extract a shared createAuthenticatedClient(ctx) helper.


🟢 Nitpick

10. Parallelism default (8) defined in two places — job_download.go line 305 and download.go line 56. Define once as a shared constant.

11. User-unfriendly error formatting — job_download.go ~line 32. Terminal states printed with %v renders as [Completed Failed ...] instead of a readable list.

12. Debug logging may expose full URLs — pkg/client/download.go lines 53, 77. When debugBody is on, full request URLs including tokens are printed.


Overall: solid foundation — needs one more pass

The download command is well-structured with good use of parallel processing and retry logic. The must-fix items (resource leaks, semaphore deadlock, SSRF) are crash/security risks that should be resolved before merge, along with @jongio's existing feedback.

Copy link
Copy Markdown
Member

@jongio jongio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addresses all my previous feedback. The retry logic is properly scoped now (context cancellation, transport errors only), path traversal is blocked via safeJoin with filepath.Abs comparison, partial downloads use tmp+rename, formatBytes is shared, and there's good test coverage on the critical paths. Clean work.

@saanikaguptamicrosoft
Copy link
Copy Markdown
Collaborator Author

saanikaguptamicrosoft commented May 8, 2026

@wbreza -

  1. HTTP body never closed — Already handled: every method has defer resp.Body.Close() immediately after the successful Do(req)/doDataPlane(...), before the status check (e.g., GetModelVersion line 28, ListRunArtifacts line 130). No leak.
  2. Semaphore leak on nil entries — The nil/empty check continues before sem <- struct{}{} is reached, so the slot is never acquired in that branch. No leak.
  3. SSRF via tracking endpoint — The endpoint is read from Azure APIs, and an allowlist creates an ongoing maintenance burden. Deferring for now.
  4. Nil deref on (nil, nil) returns — Added a defensive nil guard for modelVer. The creds sites already go through extractSasURI, which is nil-safe.
  5. Nil page on pagination — Added if page == nil { break } before appending.
  6. Credential leakage in error body — Done: error now returns download <path>: HTTP <code> with no response body.
  7. Unbounded download size — Won't apply: contentURI is from an Azure-issued SAS, large model checkpoints (multi-GB) are legitimate, and the temp-file path already cleans up disk-full failures. Don't want to cap with an arbitrary value.
  8. Race on duplicate paths — Done: dedupe by path before launching workers; subsequent duplicates are reported as skipped.
  9. Duplicated client init — Pattern is in 9 places, not 3. Refactoring all of them in this PR risks unrelated regressions; deferring for now.
  10. Parallelism default in two places — Done: extracted download.DefaultParallelism.
  11. %v on terminal states — Done: now formatted via strings.Join(..., ", ").
  12. Debug logging may expose URLs with tokens — The two [DEBUG] prints log only the AML history-service URLs (no SAS tokens; auth is in headers, SAS URIs are in response bodies which we don't log). Could you point at a specific line where a SAS-bearing URL is printed?

@saanikaguptamicrosoft saanikaguptamicrosoft merged commit 8de9139 into Azure:foundry-training-dev May 8, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants