Implement stream command for training job by saanikaguptamicrosoft · Pull Request #7939 · Azure/azure-dev

saanikaguptamicrosoft · 2026-04-28T15:14:23Z

Approach

Log file selection — Matches the Azure ML SDK pattern:

Primary: user_logs/std_log[\D]*[0]*(?:_ps)?\.txt (Common Runtime user logs — stdout/stderr from the training script)
Fallback: azureml-logs/[\d]{2}.+\.txt (legacy compute targets)
Activity/system logs are excluded — they use non-append-only blobs that cause streaming artifacts
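
The selection rule above can be sketched in Go roughly as follows. This is a minimal illustration and not the PR's actual filterLogFiles: the helper name selectLogFiles, the unanchored matching, and the sample file names are assumptions; only the two patterns come from the description.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Patterns copied from the PR description. Matching is left unanchored here,
// which is an assumption of this sketch.
var (
	primaryLogRe  = regexp.MustCompile(`user_logs/std_log[\D]*[0]*(?:_ps)?\.txt`)
	fallbackLogRe = regexp.MustCompile(`azureml-logs/[\d]{2}.+\.txt`)
)

// selectLogFiles is a hypothetical helper. It returns the names matching the
// primary pattern, or the fallback matches when no primary file exists.
// The result is sorted alphabetically.
func selectLogFiles(names []string) []string {
	var primary, fallback []string
	for _, n := range names {
		switch {
		case primaryLogRe.MatchString(n):
			primary = append(primary, n)
		case fallbackLogRe.MatchString(n):
			fallback = append(fallback, n)
		}
	}
	selected := primary
	if len(selected) == 0 {
		selected = fallback
	}
	sort.Strings(selected)
	return selected
}

func main() {
	names := []string{
		"system_logs/lifecycler.log",
		"user_logs/std_log.txt",
		"azureml-logs/70_driver_log.txt",
	}
	fmt.Println(selectLogFiles(names))
}
```

With these patterns a name such as user_logs/std_log_process_1.txt is not selected, because only the zero-suffixed (primary instance) file satisfies the `[0]*` part before `.txt`.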

Polling & incremental output — Each poll cycle:

GET /jobs/{name} to check job status and extract the Tracking service endpoint
GET /history/v1.0/{workspace}/runs/{runId}/details (AML History API) to get log file SAS URIs
Full download of each matched log file via SAS URI
Line-count delta: skip already-printed lines, print only new ones
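
The line-count delta in the last step could look something like the sketch below. The helper name newLines and its signature are hypothetical; the PR tracks counts per file in a processedLines map, which is omitted here for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// newLines is a hypothetical helper. Given the full content of a log file and
// the number of lines already printed, it returns the lines not yet printed
// and the updated line count.
func newLines(content string, printed int) ([]string, int) {
	if content == "" {
		return nil, printed
	}
	// Drop one trailing newline so it does not produce an empty last line.
	lines := strings.Split(strings.TrimSuffix(content, "\n"), "\n")
	if printed >= len(lines) {
		return nil, printed
	}
	return lines[printed:], len(lines)
}

func main() {
	printed := 0
	// Two simulated polls of the same growing file.
	for _, content := range []string{"epoch 1\nepoch 2\n", "epoch 1\nepoch 2\nepoch 3\n"} {
		var fresh []string
		fresh, printed = newLines(content, printed)
		for _, line := range fresh {
			fmt.Println(line)
		}
	}
}
```

Note that this sketch counts a trailing partial line (no newline yet) as a complete line; an implementation that wants to avoid printing half-written lines would need to hold that last line back until it is terminated.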

Poll interval — Sigmoid curve from 2s → 60s based on elapsed wall-clock time, matching the Azure ML SDK:

This keeps early streaming responsive (2s) while reducing API load for long-running jobs.

Multi-file handling — For distributed/multi-node jobs with multiple log files:

Files are processed in alphabetical order
A multiFile flag latches true once >1 file is seen, enabling per-file headers on every poll for clarity
Single-file jobs get a clean, header-once experience

Terminal state handling — If the job is already completed when stream starts, it skips directly to the execution summary without downloading logs. If the job completes mid-stream, one final flush poll captures remaining output before printing the summary.

Terminal states — Completed, Failed, Canceled, NotResponding, Paused → final log flush + execution summary
Active states — NotStarted, Queued, Preparing, Provisioning, Starting, Running, Finalizing → continue polling
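
The state split above can be expressed as a simple lookup. This is a minimal sketch; the name isTerminal and the treatment of unknown statuses as non-terminal are assumptions.

```go
package main

import "fmt"

// terminalStates lists the statuses the PR description treats as terminal.
var terminalStates = map[string]bool{
	"Completed":     true,
	"Failed":        true,
	"Canceled":      true,
	"NotResponding": true,
	"Paused":        true,
}

// isTerminal reports whether polling should stop for the given job status.
// Unknown statuses are treated as active, which is an assumption of this sketch.
func isTerminal(status string) bool {
	return terminalStates[status]
}

func main() {
	for _, s := range []string{"Queued", "Running", "Finalizing", "Completed", "Paused"} {
		fmt.Printf("%s terminal=%v\n", s, isTerminal(s))
	}
}
```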

Execution summary — On completion, prints RunId, Status, and Studio Web View URL.

Testing

azd x build
UTs: go test ./internal/service/ ./pkg/client/ -v -count=1

Help command

Missing required field name

Running stream for a job already in terminal state

Streaming of running job

wbreza

Code Review: PR #7939 — Implement stream command for training job

TL;DR

Adds a job stream subcommand to the azure.ai.customtraining extension that streams real-time training job logs via polling with sigmoid-based adaptive intervals. Good feature with smart polling design, but has a potential build issue, no tests, and several error handling gaps.

🔴 Critical (1)

1. Undefined `rootFlags` — potential build failure

File: internal/cmd/job_stream.go (line 88)
Issue: References rootFlags.Debug but rootFlags is not defined in this file. If it's not a package-level global in another file in this package (e.g., root.go), this won't compile.
Suggested Fix: Define rootFlags in root.go (matching azure.ai.models pattern) or pass the debug flag through parameters.

🟠 High (2)

2. Zero test coverage for 507 lines of new code

Files: All 7 new/modified files
Issue: No _test.go files for any new code. Key untested paths: polling loop, sigmoid interval, filterLogFiles, parseTrackingEndpoint, GetBlobContent, retry logic, context cancellation.
Suggested Fix: Add stream_service_test.go with table-driven tests for the streaming lifecycle, retry thresholds, sigmoid boundaries, URL parsing, and log filtering. Use testify/mock for client dependencies.

3. Swallowed errors — silent data loss in log streaming

File: stream_service.go (lines 343-346, 382)
Issue: (a) flushLogs discards all errors: _, _, _ = s.pollAndPrintLogs(...). (b) Blob read errors are silently skipped with continue. Users get incomplete logs with no warning.
Suggested Fix: Log warnings on error so users know logs may be incomplete:

// flushLogs
_, _, err := s.pollAndPrintLogs(ctx, trackingEndpoint, jobName, processedLines, multiFile)
if err != nil {
    fmt.Fprintf(os.Stderr, "Warning: failed to flush final logs: %v\n", err)
}

🟡 Medium (7)

4. SSRF risk — SAS URIs used without domain validation

File: pkg/client/blob.go
Issue: GetBlobContent makes HTTP requests to SAS URIs from the API response without validating the domain. A compromised response could redirect to attacker-controlled endpoints.
Suggested Fix: Validate the parsed URL hostname ends with .blob.core.windows.net (or sovereign cloud variants).
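
A hostname check along these lines might look like the sketch below. The function name validateBlobURL and the exact suffix list are assumptions; the list would need to match the clouds the extension actually supports.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// allowedBlobSuffixes is an illustrative allow-list (public cloud plus two
// sovereign clouds). Treat the exact contents as an assumption.
var allowedBlobSuffixes = []string{
	".blob.core.windows.net",
	".blob.core.chinacloudapi.cn",
	".blob.core.usgovcloudapi.net",
}

// validateBlobURL is a hypothetical helper that rejects URLs that are not
// HTTPS or whose host does not end in an allowed blob storage suffix.
func validateBlobURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse blob URL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("unexpected scheme %q", u.Scheme)
	}
	host := strings.ToLower(u.Hostname())
	for _, suffix := range allowedBlobSuffixes {
		if strings.HasSuffix(host, suffix) {
			return nil
		}
	}
	return fmt.Errorf("unexpected blob host %q", host)
}

func main() {
	fmt.Println(validateBlobURL("https://acct.blob.core.windows.net/c/log.txt?sig=x"))
	fmt.Println(validateBlobURL("https://blob.core.windows.net.evil.example/log.txt"))
}
```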

5. Full blob re-downloaded on every poll

Files: blob.go, stream_service.go
Issue: Every poll downloads entire blob and re-splits all lines, printing only new ones. O(n²) network I/O. Consider HTTP Range headers to fetch only new content.

6. Context cancellation not honored during sleep

File: stream_service.go
Issue: time.Sleep() doesn't respond to context cancellation. Ctrl+C may take up to 60s. Use select { case <-ctx.Done(): ... case <-time.After(...): } instead.

7. No polling loop timeout

File: stream_service.go
Issue: Bare for {} loop runs indefinitely for stuck jobs. Add context.WithTimeout.

8. Missing HTTP request timeouts

Files: blob.go, history.go
Issue: No explicit request-level timeout on HTTP calls. Individual requests can hang indefinitely.

9. SAS tokens in debug logs

Files: client.go, history.go
Issue: Debug logging prints full URLs with potential SAS tokens in query params. Redact sensitive parameters before logging.
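
One way to redact before logging is sketched below. The helper name redactSAS is hypothetical, and it masks only the sig query parameter, which carries the SAS signature; other parameters may also deserve masking depending on policy.

```go
package main

import (
	"fmt"
	"net/url"
)

// redactSAS is a hypothetical helper that replaces the SAS signature in a URL
// with a placeholder so the URL can be written to debug logs.
func redactSAS(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable URL>"
	}
	q := u.Query()
	if _, ok := q["sig"]; ok {
		q.Set("sig", "REDACTED")
	}
	// Encode sorts parameters by key, so their order may differ from the input.
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	fmt.Println(redactSAS("https://acct.blob.core.windows.net/c/log.txt?sv=2021&sig=secret"))
}
```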

10. 1MB blob limit may truncate logs silently

File: blob.go
Issue: io.LimitReader(resp.Body, 1<<20) with no truncation warning. Verbose training logs may be cut off.

🟢 Low (8)

Missing structured logging (stderr is adequate for CLI, but trace IDs would help debugging)
StreamJobLogs at 79 lines — consider extracting helper methods
Bare error returns at job_stream.go:93 and history.go:503 — wrap with fmt.Errorf
Sigmoid formula hardcodes 60.0 — use maxInterval.Seconds() for maintainability
Retry counter mixes job-status and log-streaming errors into one consecutiveErrs
Job name validation only checks empty — consider MarkFlagRequired and length limits
Blob client error handling inconsistent with history client (HandleError())
processedLines map not synchronized — document single-threaded constraint

✅ What Looks Good

Sigmoid adaptive polling — smart approach to reduce API load for long-running jobs
Terminal state detection — properly skips streaming for completed/failed/canceled jobs
Defensive nil checks — prevents panics on missing data
1MB blob limit — prevents memory exhaustion
No breaking changes — all changes are additive

Summary

Priority	Count
🔴 Critical	1
🟠 High	2
🟡 Medium	7
🟢 Low	8
Total	18

saanikaguptamicrosoft · 2026-04-29T05:19:16Z

1. Undefined rootFlags — potential build failure

File: internal/cmd/job_stream.go (line 88)

Issue: References rootFlags.Debug but rootFlags is not defined in this file. If it's not a package-level global in another file in this package (e.g., root.go), this won't compile.

Suggested Fix: Define rootFlags in root.go (matching azure.ai.models pattern) or pass the debug flag through parameters.

False positive, build is successful and root.go already defines rootFlags

2. Zero test coverage for 507 lines of new code

Files: All 7 new/modified files

Issue: No _test.go files for any new code. Key untested paths: polling loop, sigmoid interval, filterLogFiles, parseTrackingEndpoint, GetBlobContent, retry logic, context cancellation.

Suggested Fix: Add stream_service_test.go with table-driven tests for the streaming lifecycle, retry thresholds, sigmoid boundaries, URL parsing, and log filtering. Use testify/mock for client dependencies.

Added tests

3. Swallowed errors — silent data loss in log streaming

File: stream_service.go (lines 343-346, 382)

Issue: (a) flushLogs discards all errors: _, _, _ = s.pollAndPrintLogs(...). (b) Blob read errors are silently skipped with continue. Users get incomplete logs with no warning.

Suggested Fix: Log warnings on error so users know logs may be incomplete:
// flushLogs
_, _, err := s.pollAndPrintLogs(ctx, trackingEndpoint, jobName, processedLines, multiFile)
if err != nil {
    fmt.Fprintf(os.Stderr, "Warning: failed to flush final logs: %v\n", err)
}

Makes sense, updated

4. SSRF risk — SAS URIs used without domain validation

File: pkg/client/blob.go

Issue: GetBlobContent makes HTTP requests to SAS URIs from the API response without validating the domain. A compromised response could redirect to attacker-controlled endpoints.

Suggested Fix: Validate the parsed URL hostname ends with .blob.core.windows.net (or sovereign cloud variants).

Risk is low since URIs come from authenticated Azure API. Don't want to hard-code any validations here.

5. Full blob re-downloaded on every poll

Files: blob.go, stream_service.go

Issue: Every poll downloads entire blob and re-splits all lines, printing only new ones. O(n²) network I/O. Consider HTTP Range headers to fetch only new content.

6. Context cancellation not honored during sleep

File: stream_service.go

Issue: time.Sleep() doesn't respond to context cancellation. Ctrl+C may take up to 60s. Use select { case <-ctx.Done(): ... case <-time.After(...): } instead.

Good point, added

7. No polling loop timeout

File: stream_service.go

Issue: Bare for {} loop runs indefinitely for stuck jobs. Add context.WithTimeout.

Training jobs can run for hours/days. A hard timeout would kill legitimate streams. User has Ctrl+C.

8. Missing HTTP request timeouts

Files: blob.go, history.go

Issue: No explicit request-level timeout on HTTP calls. Individual requests can hang indefinitely.

Already handled

9. SAS tokens in debug logs

Files: client.go, history.go

Issue: Debug logging prints full URLs with potential SAS tokens in query params. Redact sensitive parameters before logging.

SAS URIs aren't logged. Debug prints are only on the AML history API URL (bearer auth, no SAS). blob.go makes its own request without debug logging.

10. 1MB blob limit may truncate logs silently

File: blob.go

Issue: io.LimitReader(resp.Body, 1<<20) with no truncation warning. Verbose training logs may be cut off.

Unlikely in practice for training logs, and hard to detect cleanly with LimitReader. Low priority.

saanikaguptamicrosoft added 5 commits April 28, 2026 15:46

Initial changes for stream command

2cbaeb1

Wrap debug logs

254cfa0

Add regex to only log specific files as in AML | Use line count method instead of offset to avoid abrupt line breaks and match with AML experience

29aa9f9

Skip download for jobs already in terminal state

2592919

Use Sigmoid for Poll interval - to reduce API load for multi-hour jobs

2401276

saanikaguptamicrosoft requested review from JeffreyCA, achauhan-scc, hemarina, kingernupur, rabollin, rajeshkamal5050, tg-msft, trangevi, vhvb1989, wbreza and weikanglim as code owners April 28, 2026 15:14

microsoft-github-policy-service Bot assigned saanikaguptamicrosoft Apr 28, 2026

saanikaguptamicrosoft added 2 commits April 28, 2026 20:55

Nit: Refactor - rename logs.go to blob.go

0d76722

Add header for file - enhancement from AML - for multi file streaming, add header every time for better readability

da62d5c

wbreza reviewed Apr 28, 2026

achauhan-scc approved these changes Apr 29, 2026

saanikaguptamicrosoft added 4 commits April 29, 2026 10:00

Log warnings on error so users know logs may be incomplete

2e36334

Add UTs

6d355db

Nit

636c97c

Honor context cancellation during sleep

5b9865b

saanikaguptamicrosoft merged commit 4cbb9c0 into Azure:foundry-training-dev Apr 29, 2026
1 of 2 checks passed

saanikaguptamicrosoft mentioned this pull request May 18, 2026

Add azure.ai.training extension #8130

Open

saanikaguptamicrosoft merged 11 commits into Azure:foundry-training-dev from saanikaguptamicrosoft:saanika/stream2

3 participants
