Add azure.ai.training extension#8130
Conversation
* adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands
…to show
- Make job name optional in YAML; auto-generate {adj}_{noun}_{suffix} (matching AML SDK)
- Fix buildProjectEndpoint to use services.ai.azure.com (not cognitiveservices.azure.com)
- Rename 'job get' to 'job show' to match models/finetune extensions
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * integrating with API
…rt (Azure#7203) - Add --skip-token flag for pagination with next-page UX message - Add --tag and --properties flags for server-side filtering - Add --include-archived flag for listViewType control - Add SystemData (createdBy, createdAt) to job list output - Update doDataPlane() to support variadic query params Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… code, and input resolution (Azure#7205) - Rename job create command to job submit for consistency with finetune extension - Add resolver interfaces: ComputeResolver, CodeResolver, InputResolver - Add JobResolver orchestrator that resolves all references in JobDefinition - Wire resolver into submit flow before buildJobResource() - Stub implementations guide users to provide full ARM IDs / remote URIs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Custom training (Azure#7180) * adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * integrating with API * adding -e -s override * fixing asset resolution * custom training: enhance job show, fix asset resolution, add full resource config support - Enhanced job show with rich output: run history, metrics, artifacts, timing, compute info - Added client APIs for run history, metrics, and artifacts endpoints - Fixed dataset version field: json:dataType -> json:type - Fixed input/output mode mapping: ro_mount -> ReadOnlyMount, rw_mount -> ReadWriteMount - Added full resource config support: instanceType, shmSize, dockerArgs, properties - Added ResourceDefinition YAML struct with AISuperComputer properties pass-through - Backward compatible: flat instance_count still works Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * custom training: add spinner progress to job show command Show animated spinner with progress text while fetching job details. Updates text as each parallel fetch (run history, metrics, artifacts) completes, showing remaining items until all data is loaded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…zure#7892) This reverts commit 5216202.
…oint flags over stored env values (Azure#8093)
jongio
left a comment
There was a problem hiding this comment.
New extension for Azure AI Foundry training jobs. Solid foundation overall - clean command tree, good offline validation, safe download handling (path traversal protection, atomic writes). A few items worth addressing before merge:
TenantId auth bug: init.go uses Subscription.TenantId (resource tenant) from PromptSubscription() instead of Subscription.UserTenantId (user access tenant). The azure.ai.finetune extension already handles this correctly. Multi-tenant/guest users will hit auth failures with the current code.
CODEOWNERS scope: This PR adds the training extension CODEOWNERS line (expected), but also adds @jongio and @wbreza to ~15 unrelated ownership lines (installer, ext/, .github/, eng/, schemas/, etc.). Those changes should be a separate PR to keep review scope clean.
No PR description: The PR body is empty. For a 14K+ line new extension, some context on the design, what training jobs are, and how this relates to the existing finetune extension would help reviewers.
…ections unmarshal error
…multi-tenant users
…onal tags parameter
jongio
left a comment
There was a problem hiding this comment.
Addresses my prior feedback - TenantId fix, go version bump, SDK v2, and semver pin all look correct. One remaining item: the PR body is still empty. For a 14K-line new extension, even a brief summary (what training jobs are, how this relates to the finetune extension, key design decisions) would help reviewers and show up in the git log.
… config and CI lint + release pipeline
c276ad8 to
09cc098
Compare
…→any, CutPrefix, drop loopvar capture)
|
Test comment |
wbreza
left a comment
There was a problem hiding this comment.
Code review — supplementing prior reviews
Thanks for the substantial work here, @saanikaguptamicrosoft. Below are findings that aren't already covered by Copilot's or @jongio's earlier comments. Verified that the UserTenantId fix, go 1.26.1 bump, semver pin, and CODEOWNERS scope are all in place. ✅
🟧 High
1. --debug flag prints HTTP request URLs & bodies to stdout — risk of leaking SAS URIs / sensitive payloads
pkg/client/client.go:91–103. SetDebugBody() is called from production commands (job_submit.go:93, job_stream.go, job_download.go, job_connect_ssh.go) when --debug is passed. Logs include URLs (which for dataset blob/SAS operations may include SAS tokens) and request bodies that flow through GetDatasetCredentials / GetModelCredentials. Repo guidance forbids logging credentials/tokens.
Fix: Redact sasUri, Authorization, sig, signature, and known credential fields before printing; or move this behind a separate developer-only env var and document that it may expose tokens.
🟨 Medium
2. SSH proxy stdin goroutine intentionally untracked
internal/cmd/job_ssh_proxy.go:167–182. The upstream stdin → websocket goroutine is not added to the sync.WaitGroup and blocks on os.Stdin.Read(). The comment acknowledges the Windows limitation and relies on process exit. Works for the short-lived ProxyCommand model, but wg.Wait() at line 211 cannot guarantee clean shutdown on graceful cancellation.
Fix (optional): Close os.Stdin from the cancellation watcher to force Read() to return, or document the lifecycle expectation more visibly.
3. azcopy runner silently drops malformed JSON and parse errors
internal/azcopy/runner.go:167 (JSON unmarshal) and :175–177 (ParseInt). On azcopy output format drift, progress reporting silently degrades to "0 bytes" instead of surfacing the problem.
Fix: Log a one-time warning on first unparseable line; preserve raw text for diagnostics.
4. Nil-guard / pattern duplication on job.Properties.Services access
internal/service/stream_service.go:100, internal/cmd/job_download.go:350 (extractTrackingEndpoint), internal/cmd/job_show_services.go:45. Three inline reimplementations of "extract service endpoint" without consistent nil guards. Sparse API responses can omit Properties or Services.
Fix: Centralize as internal/utils/ExtractServiceEndpoint(job, name) with proper nil checks; replace the three call sites.
5. Critical untested code paths
Production files with zero test coverage: internal/service/upload_service.go (dedup + zombie recovery + sentinel tagging), internal/service/hash.go (deterministic directory hashing — drives dedup correctness), pkg/client/download.go (artifact/output downloads). The team's existing test files (resolver_test.go, job_validator_test.go, job_show_services_test.go) demonstrate solid patterns with fakes — this is a coverage gap on the most failure-prone code.
Fix (recommended before merge): Unit tests for hash.go (determinism + symlink + cross-platform path normalization) and upload_service.go (hash collision fallback, missing SAS URI, azcopy failure).
6. PR description is still empty
Reiterating @jongio's request — for a 14K-line new first-party extension, the empty description hurts git log archaeology, future reviewers, and release notes. A brief summary of what training jobs are, how this relates to azure.ai.finetune, key design decisions, and a pointer to the included Design/ docs would help significantly.
7. PR description references job get, but the actual command is job show
internal/cmd/job.go:32–42 registers newJobShowCommand() (defined in job_get.go). There is no newJobGetCommand. Commit history shows a rename from job get → job show. The PR description and screenshots still reference job get.
Fix: Update PR description and any docs (including content in Design/command-job-guide.md) to match the shipped command names.
🟦 Low
extension.yamlis missing an example for themetadatacapability (other first-party extensions list examples for every declared capability).proxyEndpointPattern(job_ssh_proxy.go) allows%; if any downstream consumer URL-decodes the endpoint, this could enable host/scheme bypass. Quick audit ofbuildTunnelWSURLrecommended.internal/azcopy/runner.gosourceArglacks defense-in-depth shell-metachar validation (#nosec G204only coversazcopyPath).AdditionallyAllowedTenants: []string{"*"}ininit.go:86andjob_ssh_proxy.go:265is intentional for multi-tenant/guest support but should carry an inline security comment explaining the wide scope.- Missing godoc on exported symbols in
pkg/client/andpkg/models/per repo standards. - ~140 KB of
Design/markdown shipped in the extension source tree (design.md75 KB,execution-plan.md40 KB,command-job-guide.md25 KB). Will drift from code without a sync mechanism. Consider moving to repo-leveldocs/or wiki, or trimming to aREADMElink. - Client init boilerplate duplicated across ~15 command files. Extract
newDefaultJobClient(ctx)helper to centralize credential/endpoint/client construction.
✅ Verified clean / resolved
Subscription.UserTenantIdcorrectly used atinit.go:105.- CODEOWNERS scope creep: only the training-extension line added — no co-owner additions to unrelated paths.
go.modisgo 1.26.1with semver-pinned azd dependencyv1.25.1, aligned withextension.yaml'srequiredAzdVersion.safeJoinindownload.gocorrectly prevents zip-slip.resolver_test.go,job_validator_test.go,job_show_services_test.go,init_template_test.goare solid examples.- All 9 required CI checks passing.
Posted by Copilot CLI on behalf of the reviewer.
Notes
Testing
Other operations like validate, download, stream, delete, cancel are also supported.