Implement stream command for training job#7939
Merged
saanikaguptamicrosoft merged 11 commits intoApr 29, 2026
Merged
Conversation
…d instead of offset to avoid abrupt line breaks and match with AML experience
…, add header every time for better readability
wbreza
reviewed
Apr 28, 2026
Contributor
wbreza
left a comment
There was a problem hiding this comment.
Code Review: PR #7939 — Implement stream command for training job
TL;DR
Adds a job stream subcommand to the azure.ai.customtraining extension that streams real-time training job logs via polling with sigmoid-based adaptive intervals. Good feature with smart polling design, but has a potential build issue, no tests, and several error handling gaps.
🔴 Critical (1)
1. Undefined rootFlags — potential build failure
- File:
internal/cmd/job_stream.go(line 88) - Issue: References
rootFlags.DebugbutrootFlagsis not defined in this file. If it's not a package-level global in another file in this package (e.g.,root.go), this won't compile. - Suggested Fix: Define
rootFlagsinroot.go(matchingazure.ai.modelspattern) or pass the debug flag through parameters.
🟠 High (2)
2. Zero test coverage for 507 lines of new code
- Files: All 7 new/modified files
- Issue: No
_test.gofiles for any new code. Key untested paths: polling loop, sigmoid interval,filterLogFiles,parseTrackingEndpoint,GetBlobContent, retry logic, context cancellation. - Suggested Fix: Add
stream_service_test.gowith table-driven tests for the streaming lifecycle, retry thresholds, sigmoid boundaries, URL parsing, and log filtering. Usetestify/mockfor client dependencies.
3. Swallowed errors — silent data loss in log streaming
- File:
stream_service.go(lines 343-346, 382) - Issue: (a)
flushLogsdiscards all errors:_, _, _ = s.pollAndPrintLogs(...). (b) Blob read errors are silently skipped withcontinue. Users get incomplete logs with no warning. - Suggested Fix: Log warnings on error so users know logs may be incomplete:
// flushLogs
_, _, err := s.pollAndPrintLogs(ctx, trackingEndpoint, jobName, processedLines, multiFile)
if err != nil {
fmt.Fprintf(os.Stderr, "Warning: failed to flush final logs: %v\n", err)
}🟡 Medium (7)
4. SSRF risk — SAS URIs used without domain validation
- File:
pkg/client/blob.go - Issue:
GetBlobContentmakes HTTP requests to SAS URIs from the API response without validating the domain. A compromised response could redirect to attacker-controlled endpoints. - Suggested Fix: Validate the parsed URL hostname ends with
.blob.core.windows.net(or sovereign cloud variants).
5. Full blob re-downloaded on every poll
- Files:
blob.go,stream_service.go - Issue: Every poll downloads entire blob and re-splits all lines, printing only new ones. O(n²) network I/O. Consider HTTP Range headers to fetch only new content.
6. Context cancellation not honored during sleep
- File:
stream_service.go - Issue:
time.Sleep()doesn't respond to context cancellation. Ctrl+C may take up to 60s. Useselect { case <-ctx.Done(): ... case <-time.After(...): }instead.
7. No polling loop timeout
- File:
stream_service.go - Issue: Bare
for {}loop runs indefinitely for stuck jobs. Addcontext.WithTimeout.
8. Missing HTTP request timeouts
- Files:
blob.go,history.go - Issue: No explicit request-level timeout on HTTP calls. Individual requests can hang indefinitely.
9. SAS tokens in debug logs
- Files:
client.go,history.go - Issue: Debug logging prints full URLs with potential SAS tokens in query params. Redact sensitive parameters before logging.
10. 1MB blob limit may truncate logs silently
- File:
blob.go - Issue:
io.LimitReader(resp.Body, 1<<20)with no truncation warning. Verbose training logs may be cut off.
🟢 Low (8)
- Missing structured logging (stderr is adequate for CLI, but trace IDs would help debugging)
StreamJobLogsat 79 lines — consider extracting helper methods- Bare error returns at
job_stream.go:93andhistory.go:503— wrap withfmt.Errorf - Sigmoid formula hardcodes
60.0— usemaxInterval.Seconds()for maintainability - Retry counter mixes job-status and log-streaming errors into one
consecutiveErrs - Job name validation only checks empty — consider
MarkFlagRequiredand length limits - Blob client error handling inconsistent with history client (
HandleError()) processedLinesmap not synchronized — document single-threaded constraint
✅ What Looks Good
- Sigmoid adaptive polling — smart approach to reduce API load for long-running jobs
- Terminal state detection — properly skips streaming for completed/failed/canceled jobs
- Defensive nil checks — prevents panics on missing data
- 1MB blob limit — prevents memory exhaustion
- No breaking changes — all changes are additive
Summary
| Priority | Count |
|---|---|
| 🔴 Critical | 1 |
| 🟠 High | 2 |
| 🟡 Medium | 7 |
| 🟢 Low | 8 |
| Total | 18 |
achauhan-scc
approved these changes
Apr 29, 2026
Collaborator
Author
4cbb9c0
into
Azure:foundry-training-dev
1 of 2 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Approach
Log file selection — Matches the Azure ML SDK pattern:
user_logs/std_log[\D]*[0]*(?:_ps)?\.txt(Common Runtime user logs — stdout/stderr from the training script)azureml-logs/[\d]{2}.+\.txt(legacy compute targets)Polling & incremental output — Each poll cycle:
GET /jobs/{name}to check job status and extract the Tracking service endpointGET /history/v1.0/{workspace}/runs/{runId}/details(AML History API) to get log file SAS URIsPoll interval — Sigmoid curve from 2s → 60s based on elapsed wall-clock time, matching the Azure ML SDK:
This keeps early streaming responsive (2s) while reducing API load for long-running jobs.
Multi-file handling — For distributed/multi-node jobs with multiple log files:
Terminal state handling — If the job is already completed when stream starts, it skips directly to the execution summary without downloading logs. If the job completes mid-stream, one final flush poll captures remaining output before printing the summary.
Execution summary — On completion, prints RunId, Status, and Studio Web View URL.
Testing
azd x buildgo test ./internal/service/ ./pkg/client/ -v -count=1Help command

Missing required field

nameRunning stream for a job already in terminal state

Streaming of running job
