|
| 1 | +# Deep Investigation: Binlog Comparison |
| 2 | + |
| 3 | +When a test **passes on the target branch but fails on a PR**, comparing MSBuild binlogs from both runs reveals the exact difference in task parameters without guessing. |
| 4 | + |
| 5 | +## When to Use This Pattern |
| 6 | + |
| 7 | +- Test assertion compares "expected vs actual" build outputs (e.g., CSC args, reference lists) |
| 8 | +- A build succeeds on one branch but fails on another with different MSBuild behavior |
| 9 | +- You need to find which MSBuild property/item change caused a specific task to behave differently |
| 10 | + |
| 11 | +## The Pattern: Delegate to Subagents |
| 12 | + |
| 13 | +> ⚠️ **Do NOT download, load, and parse binlogs in the main conversation context.** This burns 10+ turns on mechanical work. Delegate to subagents instead. |
| 14 | +
|
| 15 | +### Step 1: Identify the two work items to compare |
| 16 | + |
| 17 | +Use `Get-CIStatus.ps1` to find the failing Helix job + work item, then find a corresponding passing build (recent PR merged to the target branch, or a CI run on that branch). |
| 18 | + |
| 19 | +**Finding Helix job IDs from build artifacts (binlogs to find binlogs):** |
| 20 | +When the failing work item's Helix job ID isn't visible (e.g., canceled jobs, or finding a matching job from a passing build), the IDs are inside the build's `SendToHelix.binlog`: |
| 21 | + |
| 22 | +1. Download the build artifact with `az`: |
| 23 | + ``` |
| 24 | + az pipelines runs artifact list --run-id $buildId --org "https://dev.azure.com/dnceng-public" -p public --query "[].name" -o tsv |
| 25 | + az pipelines runs artifact download --run-id $buildId --artifact-name "TestBuild_linux_x64" --path "$env:TEMP\artifact" --org "https://dev.azure.com/dnceng-public" -p public |
| 26 | + ``` |
| 27 | +2. Load the `SendToHelix.binlog` and search for `Sent Helix Job` to find the GUIDs. |
| 28 | +3. Query each Helix job GUID with the CI script: |
| 29 | + ``` |
| 30 | + ./scripts/Get-CIStatus.ps1 -HelixJob "{GUID}" -FindBinlogs |
| 31 | + ``` |
| 32 | + |
| 33 | +**For Helix work item binlogs (the common case):** |
| 34 | +The CI script shows binlog URLs directly when you query a specific work item: |
| 35 | +``` |
| 36 | +./scripts/Get-CIStatus.ps1 -HelixJob "{JOB_ID}" -WorkItem "{WORK_ITEM}" |
| 37 | +# Output includes: 🔬 msbuild.binlog: https://helix...blob.core.windows.net/... |
| 38 | +``` |
| 39 | + |
| 40 | +### Step 2: Dispatch parallel subagents for extraction |
| 41 | + |
| 42 | +Launch two `task` subagents (can run in parallel), each with a prompt like: |
| 43 | + |
| 44 | +``` |
| 45 | +Download the msbuild.binlog from Helix job {JOB_ID} work item {WORK_ITEM}. |
| 46 | +Use the CI skill script to get the artifact URL: |
| 47 | + ./scripts/Get-CIStatus.ps1 -HelixJob "{JOB_ID}" -WorkItem "{WORK_ITEM}" |
| 48 | +Download the binlog, load it, find the {TASK_NAME} task, and extract CommandLineArguments. |
| 49 | +Normalize paths (see table below) and sort args. |
| 50 | +Parse into individual args using regex: (?:"[^"]+"|/[^\s]+|[^\s]+) |
| 51 | +Report the total arg count prominently. |
| 52 | +``` |
| 53 | + |
| 54 | +**Important:** When diffing, look for **extra or missing args** (different count), not value differences in existing args. A Debug/Release difference in `/define:` is expected noise — an extra `/analyzerconfig:` or `/reference:` arg is the real signal. |
| 55 | + |
| 56 | +### Step 3: Diff the results |
| 57 | + |
| 58 | +With two normalized arg lists, `Compare-Object` instantly reveals the difference. |
| 59 | + |
| 60 | +## Common Binlog Search Patterns |
| 61 | + |
| 62 | +When investigating binlogs, these search query patterns are most useful: |
| 63 | + |
| 64 | +- Search for a property: `analysislevel` |
| 65 | +- Search within a target: `under($target AddGlobalAnalyzerConfigForPackage_MicrosoftCodeAnalysisNetAnalyzers)` |
| 66 | +- Find all properties matching a pattern: `GlobalAnalyzerConfig` |
| 67 | + |
| 68 | +## Path Normalization |
| 69 | + |
| 70 | +Helix work items run on different machines with different paths. Normalize before comparing: |
| 71 | + |
| 72 | +| Pattern | Replacement | Example | |
| 73 | +|---------|-------------|---------| |
| 74 | +| `/datadisks/disk1/work/[A-F0-9]{8}` | `{W}` | Helix work directory (Linux) | |
| 75 | +| `C:\h\w\[A-F0-9]{8}` | `{W}` | Helix work directory (Windows) | |
| 76 | +| `Program-[a-f0-9]{64}` | `Program-{H}` | Runfile content hash | |
| 77 | +| `dotnetSdkTests\.[a-zA-Z0-9]+` | `dotnetSdkTests.{T}` | Temp test directory | |
| 78 | + |
| 79 | +### After normalizing paths, focus on structural differences |
| 80 | + |
| 81 | +> ⚠️ **Ignore value-only differences in existing args** (e.g., Debug vs Release in `/define:`, different hash paths). These are expected configuration differences. Focus on **extra or missing args** — a different arg count indicates a real build behavior change. |
| 82 | +
|
| 83 | +## Example: CscArguments Investigation |
| 84 | + |
| 85 | +A merge PR (release/10.0.3xx → main) had 208 CSC args vs 207 on main. The diff: |
| 86 | + |
| 87 | +``` |
| 88 | +FAIL-ONLY: /analyzerconfig:{W}/p/d/sdk/11.0.100-ci/Sdks/Microsoft.NET.Sdk/analyzers/build/config/analysislevel_11_default.globalconfig |
| 89 | +``` |
| 90 | + |
| 91 | +### What the binlog properties showed |
| 92 | + |
| 93 | +Both builds had identical property resolution: |
| 94 | +- `EffectiveAnalysisLevel = 11.0` |
| 95 | +- `_GlobalAnalyzerConfigFileName = analysislevel_11_default.globalconfig` |
| 96 | +- `_GlobalAnalyzerConfigFile = .../config/analysislevel_11_default.globalconfig` |
| 97 | + |
| 98 | +### The actual root cause |
| 99 | + |
| 100 | +The `AddGlobalAnalyzerConfigForPackage` target has an `Exists()` condition: |
| 101 | +```xml |
| 102 | +<ItemGroup Condition="Exists('$(_GlobalAnalyzerConfigFile_...)')"> |
| 103 | + <EditorConfigFiles Include="$(_GlobalAnalyzerConfigFile_...)" /> |
| 104 | +</ItemGroup> |
| 105 | +``` |
| 106 | + |
| 107 | +The merge's SDK layout **shipped** `analysislevel_11_default.globalconfig` on disk (from a newer roslyn-analyzers that flowed from 10.0.3xx), while main's SDK didn't have that file yet. Same property values, different files on disk = different build behavior. |
| 108 | + |
| 109 | +### Lesson learned |
| 110 | + |
| 111 | +Same MSBuild property resolution + different files on disk = different build behavior. Always check what's actually in the SDK layout, not just what the targets compute. |
| 112 | + |
| 113 | +## Anti-Patterns |
| 114 | + |
| 115 | +> ❌ **Don't manually split/parse CSC command lines in the main conversation.** CSC args have quoted paths, spaces, and complex structure. Regex parsing in PowerShell is fragile and burns turns on trial-and-error. Use a subagent. |
| 116 | +
|
| 117 | +> ❌ **Don't assume the MSBuild property diff explains the behavior diff.** Two branches can compute identical property values but produce different outputs because of different files on disk, different NuGet packages, or different task assemblies. Compare the actual task invocation. |
| 118 | +
|
| 119 | +> ❌ **Don't load large binlogs and browse them interactively in main context.** Use targeted searches rather than browsing interactively. Get in, get the data, get out. |
0 commit comments