Implement download command for training job#8029
Conversation
jongio
left a comment
There was a problem hiding this comment.
Solid start on the download command. A few things to address before this is merge-ready:
Path safety - downloadOne joins server-provided paths directly into the destination directory without validating for traversal (../). Defense-in-depth: sanitize before writing.
Retry policy - IsRetryable returns true for every non-nil error (line 41), making the url.Error check on line 38 dead code. The status-code checks below (lines 43-48) are unreachable when err != nil. Combined with callers that always pass status=0, permanent failures get retried pointlessly.
Test coverage - 915 new lines with no tests. At minimum: retry classification, path sanitization, mode selection logic (default vs named vs all), and the tracking-endpoint extraction deserve unit tests.
Partial file cleanup - failed downloads can leave empty/partial files on disk. Consider writing to a temp file and renaming on success, or cleaning up on error.
Duplicate formatBytes - azcopy/runner.go already has this function. Consider extracting to a shared util or reusing.
wbreza
left a comment
There was a problem hiding this comment.
Code Review — PR #8029: Implement download command for training job
TL;DR: Adds �zd custom-training job download with three modes (default artifacts, single named output, all outputs). Includes retry logic with exponential backoff and parallel downloading across 7 files (+915/-1). Supplements @jongio's existing review with additional findings.
Note: @jongio already flagged path traversal, broken retry classification, zero test coverage, partial file cleanup, and duplicate ormatBytes. The findings below are net-new only.
🔴 Must Fix
1. HTTP response body never closed on error paths — pkg/client/download.go
All 5 API methods (GetModelVersion, GetModelCredentials, GetDatasetCredentials, ListRunArtifacts, GetRunArtifactContentInfo) call c.HandleError(resp) when status ≠ 200 but never close
esp.Body. This leaks TCP connections under sustained error conditions.
Fix: Add defer resp.Body.Close() immediately after the Do() call, before the status check.
2. Semaphore leak on nil artifact entries — internal/download/download.go ~line 71
When info == nil, the code acquires a semaphore slot (sem <- struct{}{}) then continues the loop without releasing it. After enough nil entries, the semaphore fills and the entire download hangs.
Fix: Move sem <- struct{}{} after the nil check, or release the slot in the nil path.
3. Unvalidated tracking endpoint URL (SSRF risk) — internal/cmd/job_download.go ~line 328, pkg/client/download.go ~line 62
�xtractTrackingEndpoint extracts a URL from untrusted server JSON and uses it directly to construct API requests. A malicious or compromised job response could redirect artifact downloads to an attacker-controlled server.
Fix: Validate the URL scheme is https:// and the host matches expected Azure domains (e.g., *.api.azureml.ms).
🟡 Should Fix
4. Nil dereference when API returns (nil, nil) — internal/cmd/job_download.go ~lines 168, 190, 209, 227
Pattern: �ar modelVer *models.ModelVersion → retry closure sets it on success → code dereferences without nil guard. If the API returns (nil, nil), the next access panics. Appears in 4 places (modelVer, creds ×2, history).
Fix: Add if modelVer == nil { return fmt.Errorf("...") } after each retry block.
5. Missing nil guard on pagination response — internal/cmd/job_download.go ~line 223
If ListRunArtifacts returns (nil, nil), accessing page.Value panics.
Fix: Guard with if page == nil { break } before appending.
6. Credential leakage in error messages — internal/download/download.go ~line 106
Failed download responses (up to 1024 bytes) are included verbatim in error messages. If the server echoes SAS tokens or credentials in error responses, these propagate to user-visible output.
Fix: Limit error details to status code; don't include raw response body.
7. Unbounded download size — internal/download/download.go ~line 103
io.Copy(f, resp.Body) streams directly to disk without Content-Length validation. A malicious server could exhaust disk space.
Fix: Validate Content-Length header or wrap
esp.Body with io.LimitedReader.
8. Race condition on duplicate artifact paths — internal/download/download.go ~line 78
Multiple goroutines can race to os.Create the same file path if the server returns duplicate artifact paths.
Fix: Deduplicate artifact paths before the download loop, or use temp files with atomic rename.
9. Client initialization boilerplate duplicated — internal/cmd/job_download.go ~line 134
ewDownloadClient() duplicates the exact credential/client setup from job_delete.go and job_cancel.go.
Fix: Extract a shared createAuthenticatedClient(ctx) helper.
🟢 Nitpick
10. Parallelism default (8) defined in two places — job_download.go line 305 and download.go line 56. Define once as a shared constant.
11. User-unfriendly error formatting — job_download.go ~line 32. Terminal states printed with %v renders as [Completed Failed ...] instead of a readable list.
12. Debug logging may expose full URLs — pkg/client/download.go lines 53, 77. When debugBody is on, full request URLs including tokens are printed.
Overall: solid foundation — needs one more pass
The download command is well-structured with good use of parallel processing and retry logic. The must-fix items (resource leaks, semaphore deadlock, SSRF) are crash/security risks that should be resolved before merge, along with @jongio's existing feedback.
jongio
left a comment
There was a problem hiding this comment.
Addresses all my previous feedback. The retry logic is properly scoped now (context cancellation, transport errors only), path traversal is blocked via safeJoin with filepath.Abs comparison, partial downloads use tmp+rename, formatBytes is shared, and there's good test coverage on the critical paths. Clean work.
|
@wbreza -
|
8de9139
into
Azure:foundry-training-dev
Notes
Testing
Happy paths-
Re-running download to the same path overwrites cleanly
Unhappy paths-
Download interruption via
Ctrl+Cdoesn't lead to download of corrupted or temporary files, the operation is atomic