Conversation
- add dedicated CI workflow for migration tests on Windows/macOS - workflow triggers on migration-related file changes only
- remove GO_MIN_VERSION and check_go_version scripts - go.mod already enforces minimum version (go 1.25) - fixes make build on Windows
6e98b61 to
2993b92
Compare
873c03e to
1f30611
Compare
fixes migration panic on Windows when upgrading from v0.37 to v0.38 by reading the entire config file into memory before performing atomic operations. this avoids file locking issues on Windows where open files cannot be renamed. also fixes: - TestRepoDir to set USERPROFILE on Windows (not just HOME) - CLI migration tests to sanitize directory names (remove colons) minimal fix that solves the "panic: error can't be dealt with transactionally: Access is denied" error without adding unnecessary platform-specific complexity.
1f30611 to
c5946ed
Compare
the CLI tests need the built ipfs binary to be in PATH
2c96dc5 to
9a8528a
Compare
replaces platform-specific signal handling with ipfs shutdown command which works consistently across all platforms including Windows
tests running in parallel with t.Parallel() were interfering with each other through global PATH modifications via os.Setenv(). this caused tests to download real migration binaries instead of using mocks, leading to Windows failures due to path separator issues in external tools. now each test builds its own custom PATH and passes it explicitly to commands, preventing interference between parallel tests.
- add .exe extension to mock migration binaries on Windows - handle repo lock file properly in mock migration binary - ensure lock is created and removed to prevent conflicts
49e08c3 to
7e00ac6
Compare
7e00ac6 to
e339ede
Compare
- check close error in Abort() before attempting removal - leave temp file on rename failure for debugging (like fs-repo-15-to-16) - improves consistency with external migration implementations
The repo migrate command was calling cctx.Context() which has a hidden side effect: it lazily constructs the IPFS node by calling GetNode(), which opens the repository and acquires repo.lock. When migrations then tried to acquire the same lock, it failed with "lock is already held by us" because go4.org/lock tracks locks per-process in a global map. The fix uses req.Context instead, which is a plain context.Context with no side effects. This provides what migrations need (cancellation handling) without triggering node construction or repo opening. Context types explained: - req.Context: Standard Go context for request lifetime, cancellation, and timeouts. No side effects. - cctx.Context(): Kubo-specific method that lazily constructs the full IPFS node (opens repo, acquires lock, initializes subsystems). Returns the node's internal context. Why req.Context is correct here: - Migrations work on raw filesystem (only need ConfigRoot path) - Command has SetDoesNotUseRepo(true) - doesn't need running node - Migrations handle their own locking via lockfile.Lock() - Need cancellation support but not node lifecycle The bug only appeared with embedded migrations (v16+) because they run in-process. External migrations (pre-v16) were separate processes, so each had isolated state. Sequential migrations (forward then backward) in the same process exposed this latent double-lock issue. Also adds repo.lock acquisition to RunEmbeddedMigrations to prevent concurrent migration access, and removes the now-unnecessary daemon lock check from the migrate command handler.
e339ede to
6cb5f8a
Compare
| // Use req.Context instead of cctx.Context() to avoid opening the repo before migrations run, | ||
| // which would acquire the lock that migrations need | ||
| err = migrations.RunHybridMigrations(req.Context, targetVersion, cctx.ConfigRoot, allowDowngrade) |
There was a problem hiding this comment.
Writing it down here as "Interesting Findings For Posterity^WLLM" :-)
The Two Contexts Explained
(1) req.Context - The Request Context (Standard Go Context)
What it is:
- A standard context.Context from Go's stdlib
- Created when the CLI command is invoked
- Carries request-scoped values like:
- Cancellation signals (Ctrl+C)
- Deadlines/timeouts
- Request-specific metadata
- Lives for the duration of the command execution
Where it comes from:
// From go-ipfs-cmds library
type Request struct {
Context context.Context // ← This is req.Context
Command *Command
Path []string
Arguments []string
Options OptMap
// ...
}Purpose: Standard Go context for cancellation, timeouts, and request lifetime management.
(2) cctx.Context() - The Node Context (Lazy-Loading)
What it is:
- A method on Kubo's custom commands.Context struct
- Lazily constructs an IPFS node if it doesn't exist
- Returns the node's internal context
The implementation (from commands/context.go:98-107):
func (c *Context) Context() context.Context {
n, err := c.GetNode() // ← CONSTRUCTS NODE if not cached!
if err != nil {
log.Debug("error getting node: ", err)
return context.Background()
}
return n.Context() // Returns the node's context
}What GetNode() does (lines 45-54):
func (c *Context) GetNode() (*core.IpfsNode, error) {
var err error
if c.node == nil {
if c.ConstructNode == nil {
return nil, errors.New("nil ConstructNode function")
}
c.node, err = c.ConstructNode() // ← CONSTRUCTS THE NODE
}
return c.node, err
}What ConstructNode does (from start.go:122-124):
ConstructNode: func() (n *core.IpfsNode, err error) {
r, err := fsrepo.Open(repoPath) // ← OPENS REPO AND ACQUIRES LOCK!
// ...
n, err = core.NewNode(ctx, &core.BuildCfg{Repo: r})
return n, nil
}Purpose: Provides access to the fully initialized IPFS node with all its subsystems (networking, DHT, bitswap, datastore, etc.).
Why This Caused the Bug
The Call Chain (BEFORE the fix):
ipfs repo migrate
↓
Run() handler (repo.go:404)
↓
migrations.RunHybridMigrations(cctx.Context(), ...) ← OLD CODE
↓
cctx.Context() (context.go:99)
↓
GetNode() (context.go:100)
↓
ConstructNode() (start.go:117)
↓
fsrepo.Open() (start.go:122) ← ACQUIRES LOCK #1
↓
lockfile.Lock(repoPath, "repo.lock")
✓ Lock acquired, added to go4.org/lock's global map
↓
migrations.RunEmbeddedMigrations()
↓
lockfile.Lock(repoPath, "repo.lock") ← TRIES TO ACQUIRE LOCK #2
✗ ERROR: "lock is already held by us" (same process, global map check fails)
The Problem:
- cctx.Context() has a side effect: It opens the repository
- Opening the repo acquires the lock (fsrepo.go:165)
- Migrations need the lock for themselves
- Same process, same lock path → go4.org/lock's global locked map reports "already locked"
Why The Fix Works
The Call Chain (AFTER the fix):
ipfs repo migrate
↓
Run() handler (repo.go:404)
↓
migrations.RunHybridMigrations(req.Context, ...) ← NEW CODE
↓
req.Context is just a plain context.Context
✓ No side effects, no node construction, no repo opening
↓
migrations.RunEmbeddedMigrations()
↓
lockfile.Lock(repoPath, "repo.lock") ← FIRST AND ONLY LOCK ATTEMPT
✓ Lock acquired successfully
✓ Migrations run
✓ Lock released
daemon.go was incorrectly using cctx.Context() in two critical places: 1. Line 337: migrations call - cctx.Context() triggers GetNode() which opens the repo and acquires repo.lock BEFORE migrations run, causing "lock is already held by us" errors when migrations try to lock 2. Line 390: autoconf client.Start() - uses context for HTTP timeouts and background updater lifecycle, doesn't need node construction Both now use req.Context (plain Go context) which provides: - request lifetime and cancellation - no side effects (doesn't construct node or open repo) - correct lifecycle for HTTP requests and background goroutines
three fixes based on code review: 1. atomic backup creation in WithBackup() - prevents partial backup files that could corrupt config during RevertBackup() if interrupted 2. proper error handling in atomicfile.Abort() - captures both close and remove errors instead of silently discarding remove errors 3. complete Windows path sanitization - handles all invalid filename characters (<>:"/\|?*) not just : and / all migration tests pass
move lock acquisition to right after CheckIpfsDir() to prevent any concurrent access as early as possible in the migration process
- consistent error handling in atomicfile.Close() - simplify PATH manipulation with slices.DeleteFunc - add unit tests for atomicfile package
- add atomicfile cleanup tests (multiple aborts, temp file verification) - add migration cleanup tests (successful/failed migrations, backup persistence) - add concurrent daemon migration test - add conversion failure cleanup test - use testify assertions for cleaner test code
| // Acquire lock once for all embedded migrations to prevent concurrent access | ||
| lk, err := lockfile.Lock(ipfsDir, "repo.lock") |
There was a problem hiding this comment.
ℹ️ we need to add this here, since we no longer use implicit locks provided by cctx.Context() (it was protecting via lock "by sheer luck", this is explicitly creating lock here, which is better/safer)
|
We have end-to-end tests on all green on CI on all three platforms, but to be extra sure I've run manual tests: Manual Windows smoke-testJust to be extra safe, I've manually tested binary with fix here: And then swapped Migration finished correctly on Windows and v0.38 build started fine. Manual Linux smoke-testCreated fresh repo ( $ ipfs daemon 130 ...indows-migration-panic
Initializing daemon...
Kubo version: 0.38.0
Repo version: 18
System version: amd64/linux
Golang version: go1.25.1 X:nodwarf5
Kubo repository at /tmp/tmp.OTlsNtYqTO has version 15 and needs to be migrated to version 18.
Run migrations now? [y/N] y
Starting hybrid migration from version 15 to 18
Using hybrid migration strategy: external to v16, then embedded
Phase 1: External migration from v15 to v16
Looking for suitable migration binaries.
Need 1 migrations, downloading.
Downloading migration: fs-repo-15-to-16...
Fetching with HTTP: "https://trustless-gateway.link/ipfs/QmRzRGJEjYDfbHHaALnHBuhzzrkXGdwcPMrgd5fgM7hqbe/fs-repo-15-to-16/versions"
Fetching with HTTP: "https://trustless-gateway.link/ipfs/QmRzRGJEjYDfbHHaALnHBuhzzrkXGdwcPMrgd5fgM7hqbe/fs-repo-15-to-16/v1.0.1/fs-repo-15-to-16_v1.0.1_linux-amd64.tar.gz"
Downloaded and unpacked migration: /tmp/migrations223617285/fs-repo-15-to-16 (v1.0.1)
Running migration fs-repo-15-to-16 ...
=> Running: /tmp/migrations223617285/fs-repo-15-to-16 -path=/tmp/tmp.OTlsNtYqTO -verbose=true
applying 15-to-16 repo migration
locking repo at "/tmp/tmp.OTlsNtYqTO"
- verifying version is '15'
> Upgrading config to new format
updated version file
Migration 15 to 16 succeeded
Success: fs-repo migrated to version 16.
Phase 2: Embedded migration from v16 to v18
Looking for embedded migrations.
Running embedded migration fs-repo-16-to-17...
applying 16-to-17 repo migration
> Upgrading config to use AutoConf system
updated version file
Migration 16-to-17 succeeded
Embedded migration fs-repo-16-to-17 completed successfully
Running embedded migration fs-repo-17-to-18...
applying 17-to-18 repo migration
> Migrating Provider and Reprovider configuration to unified Provide configuration
updated version file
Migration 17-to-18 succeeded
Embedded migration fs-repo-17-to-18 completed successfully
Success: fs-repo migrated to version 18 using embedded migrations.
Hybrid migration completed successfully: v15 → v18
PeerID: 12D3KooWLTLvxDNsA8KdbATivkp9qoJzycxkwDLQKXdvRLLvfZX1
Swarm listening on 127.0.0.1:4001 (TCP+UDP)
...
RPC API server listening on /ip4/127.0.0.1/tcp/5001
WebUI: http://127.0.0.1:5001/webui
Gateway server listening on /ip4/127.0.0.1/tcp/8080
Daemon is ready
^C
Received interrupt signal, shutting down...
(Hit ctrl-c again to force-shutdown the daemon.)Merging and starting release dance for v0.38.1 with this fix. |
* test: add migration tests for Windows and macOS - add dedicated CI workflow for migration tests on Windows/macOS - workflow triggers on migration-related file changes only * build: remove redundant go version checks - remove GO_MIN_VERSION and check_go_version scripts - go.mod already enforces minimum version (go 1.25) - fixes make build on Windows * fix: windows migration panic by reading config into memory fixes migration panic on Windows when upgrading from v0.37 to v0.38 by reading the entire config file into memory before performing atomic operations. this avoids file locking issues on Windows where open files cannot be renamed. also fixes: - TestRepoDir to set USERPROFILE on Windows (not just HOME) - CLI migration tests to sanitize directory names (remove colons) minimal fix that solves the "panic: error can't be dealt with transactionally: Access is denied" error without adding unnecessary platform-specific complexity. * fix: set PATH for CLI migration tests in CI the CLI tests need the built ipfs binary to be in PATH * fix: use ipfs shutdown for graceful daemon termination in tests replaces platform-specific signal handling with ipfs shutdown command which works consistently across all platforms including Windows * fix: isolate PATH modifications in parallel migration tests tests running in parallel with t.Parallel() were interfering with each other through global PATH modifications via os.Setenv(). this caused tests to download real migration binaries instead of using mocks, leading to Windows failures due to path separator issues in external tools. now each test builds its own custom PATH and passes it explicitly to commands, preventing interference between parallel tests. * chore: improve error messages in WithBackup * fix: Windows CI migration test failures - add .exe extension to mock migration binaries on Windows - handle repo lock file properly in mock migration binary - ensure lock is created and removed to prevent conflicts * refactor: align atomicfile error handling with fs-repo-migrations - check close error in Abort() before attempting removal - leave temp file on rename failure for debugging (like fs-repo-15-to-16) - improves consistency with external migration implementations * fix: use req.Context in repo migrate to avoid double-lock The repo migrate command was calling cctx.Context() which has a hidden side effect: it lazily constructs the IPFS node by calling GetNode(), which opens the repository and acquires repo.lock. When migrations then tried to acquire the same lock, it failed with "lock is already held by us" because go4.org/lock tracks locks per-process in a global map. The fix uses req.Context instead, which is a plain context.Context with no side effects. This provides what migrations need (cancellation handling) without triggering node construction or repo opening. Context types explained: - req.Context: Standard Go context for request lifetime, cancellation, and timeouts. No side effects. - cctx.Context(): Kubo-specific method that lazily constructs the full IPFS node (opens repo, acquires lock, initializes subsystems). Returns the node's internal context. Why req.Context is correct here: - Migrations work on raw filesystem (only need ConfigRoot path) - Command has SetDoesNotUseRepo(true) - doesn't need running node - Migrations handle their own locking via lockfile.Lock() - Need cancellation support but not node lifecycle The bug only appeared with embedded migrations (v16+) because they run in-process. External migrations (pre-v16) were separate processes, so each had isolated state. Sequential migrations (forward then backward) in the same process exposed this latent double-lock issue. Also adds repo.lock acquisition to RunEmbeddedMigrations to prevent concurrent migration access, and removes the now-unnecessary daemon lock check from the migrate command handler. * fix: use req.Context for migrations and autoconf in daemon startup daemon.go was incorrectly using cctx.Context() in two critical places: 1. Line 337: migrations call - cctx.Context() triggers GetNode() which opens the repo and acquires repo.lock BEFORE migrations run, causing "lock is already held by us" errors when migrations try to lock 2. Line 390: autoconf client.Start() - uses context for HTTP timeouts and background updater lifecycle, doesn't need node construction Both now use req.Context (plain Go context) which provides: - request lifetime and cancellation - no side effects (doesn't construct node or open repo) - correct lifecycle for HTTP requests and background goroutines (cherry picked from commit f4834e7)
* test: add migration tests for Windows and macOS - add dedicated CI workflow for migration tests on Windows/macOS - workflow triggers on migration-related file changes only * build: remove redundant go version checks - remove GO_MIN_VERSION and check_go_version scripts - go.mod already enforces minimum version (go 1.25) - fixes make build on Windows * fix: windows migration panic by reading config into memory fixes migration panic on Windows when upgrading from v0.37 to v0.38 by reading the entire config file into memory before performing atomic operations. this avoids file locking issues on Windows where open files cannot be renamed. also fixes: - TestRepoDir to set USERPROFILE on Windows (not just HOME) - CLI migration tests to sanitize directory names (remove colons) minimal fix that solves the "panic: error can't be dealt with transactionally: Access is denied" error without adding unnecessary platform-specific complexity. * fix: set PATH for CLI migration tests in CI the CLI tests need the built ipfs binary to be in PATH * fix: use ipfs shutdown for graceful daemon termination in tests replaces platform-specific signal handling with ipfs shutdown command which works consistently across all platforms including Windows * fix: isolate PATH modifications in parallel migration tests tests running in parallel with t.Parallel() were interfering with each other through global PATH modifications via os.Setenv(). this caused tests to download real migration binaries instead of using mocks, leading to Windows failures due to path separator issues in external tools. now each test builds its own custom PATH and passes it explicitly to commands, preventing interference between parallel tests. * chore: improve error messages in WithBackup * fix: Windows CI migration test failures - add .exe extension to mock migration binaries on Windows - handle repo lock file properly in mock migration binary - ensure lock is created and removed to prevent conflicts * refactor: align atomicfile error handling with fs-repo-migrations - check close error in Abort() before attempting removal - leave temp file on rename failure for debugging (like fs-repo-15-to-16) - improves consistency with external migration implementations * fix: use req.Context in repo migrate to avoid double-lock The repo migrate command was calling cctx.Context() which has a hidden side effect: it lazily constructs the IPFS node by calling GetNode(), which opens the repository and acquires repo.lock. When migrations then tried to acquire the same lock, it failed with "lock is already held by us" because go4.org/lock tracks locks per-process in a global map. The fix uses req.Context instead, which is a plain context.Context with no side effects. This provides what migrations need (cancellation handling) without triggering node construction or repo opening. Context types explained: - req.Context: Standard Go context for request lifetime, cancellation, and timeouts. No side effects. - cctx.Context(): Kubo-specific method that lazily constructs the full IPFS node (opens repo, acquires lock, initializes subsystems). Returns the node's internal context. Why req.Context is correct here: - Migrations work on raw filesystem (only need ConfigRoot path) - Command has SetDoesNotUseRepo(true) - doesn't need running node - Migrations handle their own locking via lockfile.Lock() - Need cancellation support but not node lifecycle The bug only appeared with embedded migrations (v16+) because they run in-process. External migrations (pre-v16) were separate processes, so each had isolated state. Sequential migrations (forward then backward) in the same process exposed this latent double-lock issue. Also adds repo.lock acquisition to RunEmbeddedMigrations to prevent concurrent migration access, and removes the now-unnecessary daemon lock check from the migrate command handler. * fix: use req.Context for migrations and autoconf in daemon startup daemon.go was incorrectly using cctx.Context() in two critical places: 1. Line 337: migrations call - cctx.Context() triggers GetNode() which opens the repo and acquires repo.lock BEFORE migrations run, causing "lock is already held by us" errors when migrations try to lock 2. Line 390: autoconf client.Start() - uses context for HTTP timeouts and background updater lifecycle, doesn't need node construction Both now use req.Context (plain Go context) which provides: - request lifetime and cancellation - no side effects (doesn't construct node or open repo) - correct lifecycle for HTTP requests and background goroutines (cherry picked from commit f4834e7)
What?
Why?
LockedByOtherProcessworks differently on Windows - it leaves behind a lock file that persists due toFILE_FLAG_DELETE_ON_CLOSEasync behaviorcctx.Context()instead ofreq.Context– thecctxone has side-effect of lazy-loading node, and as a side-effect, also creating a repo.lock. This, together with theFILE_FLAG_DELETE_ON_CLOSEbehavior of Windows broke Windows.