Skip to content

fix: migrations for Windows#11010

Merged
lidel merged 21 commits intomasterfrom
fix/windows-migration-panic
Oct 8, 2025
Merged

fix: migrations for Windows#11010
lidel merged 21 commits intomasterfrom
fix/windows-migration-panic

Conversation

@lidel
Copy link
Member

@lidel lidel commented Oct 3, 2025

What?

  • Aiming to fix 0.38.0: migration fails on Windows #11009:
    • add dedicated CI workflow for migration tests on Windows/macOS
    • workflow triggers on migration-related file changes only
    • confirm CI fails on Windows
    • fix Windows
    • add regression tests for lock-related behaviors

Why?

  • We did not test migrations on Windows, tests only run on linux
    • Fix: Need to test migrations on all platforms
  • LockedByOtherProcess works differently on Windows - it leaves behind a lock file that persists due to FILE_FLAG_DELETE_ON_CLOSE async behavior
    • Fix: Migrations used cctx.Context() instead of req.Context – the cctx one has side-effect of lazy-loading node, and as a side-effect, also creating a repo.lock. This, together with the FILE_FLAG_DELETE_ON_CLOSE behavior of Windows broke Windows.
    • See explainer inline.

lidel added 2 commits October 3, 2025 02:48
- add dedicated CI workflow for migration tests on Windows/macOS
- workflow triggers on migration-related file changes only
- remove GO_MIN_VERSION and check_go_version scripts
- go.mod already enforces minimum version (go 1.25)
- fixes make build on Windows
@lidel lidel force-pushed the fix/windows-migration-panic branch from 6e98b61 to 2993b92 Compare October 3, 2025 00:56
@lidel lidel changed the title test: add migration tests for Windows and macOS fix: migrations for Windows Oct 3, 2025
@lidel lidel force-pushed the fix/windows-migration-panic branch 6 times, most recently from 873c03e to 1f30611 Compare October 3, 2025 04:22
fixes migration panic on Windows when upgrading from v0.37 to v0.38
by reading the entire config file into memory before performing atomic
operations. this avoids file locking issues on Windows where open files
cannot be renamed.

also fixes:
- TestRepoDir to set USERPROFILE on Windows (not just HOME)
- CLI migration tests to sanitize directory names (remove colons)

minimal fix that solves the "panic: error can't be dealt with
transactionally: Access is denied" error without adding unnecessary
platform-specific complexity.
@lidel lidel force-pushed the fix/windows-migration-panic branch from 1f30611 to c5946ed Compare October 3, 2025 04:34
the CLI tests need the built ipfs binary to be in PATH
@lidel lidel force-pushed the fix/windows-migration-panic branch from 2c96dc5 to 9a8528a Compare October 3, 2025 05:11
lidel added 4 commits October 3, 2025 15:04
replaces platform-specific signal handling with ipfs shutdown command
which works consistently across all platforms including Windows
tests running in parallel with t.Parallel() were interfering with each
other through global PATH modifications via os.Setenv(). this caused
tests to download real migration binaries instead of using mocks,
leading to Windows failures due to path separator issues in external tools.

now each test builds its own custom PATH and passes it explicitly to
commands, preventing interference between parallel tests.
- add .exe extension to mock migration binaries on Windows
- handle repo lock file properly in mock migration binary
- ensure lock is created and removed to prevent conflicts
@lidel lidel force-pushed the fix/windows-migration-panic branch 11 times, most recently from 49e08c3 to 7e00ac6 Compare October 5, 2025 21:37
@lidel lidel force-pushed the fix/windows-migration-panic branch from 7e00ac6 to e339ede Compare October 5, 2025 22:07
@lidel lidel self-assigned this Oct 6, 2025
@lidel lidel mentioned this pull request Oct 6, 2025
49 tasks
lidel added 3 commits October 6, 2025 18:08
- check close error in Abort() before attempting removal
- leave temp file on rename failure for debugging (like fs-repo-15-to-16)
- improves consistency with external migration implementations
The repo migrate command was calling cctx.Context() which has a hidden
side effect: it lazily constructs the IPFS node by calling GetNode(),
which opens the repository and acquires repo.lock. When migrations then
tried to acquire the same lock, it failed with "lock is already held by us"
because go4.org/lock tracks locks per-process in a global map.

The fix uses req.Context instead, which is a plain context.Context with
no side effects. This provides what migrations need (cancellation handling)
without triggering node construction or repo opening.

Context types explained:
- req.Context: Standard Go context for request lifetime, cancellation,
  and timeouts. No side effects.
- cctx.Context(): Kubo-specific method that lazily constructs the full
  IPFS node (opens repo, acquires lock, initializes subsystems). Returns
  the node's internal context.

Why req.Context is correct here:
- Migrations work on raw filesystem (only need ConfigRoot path)
- Command has SetDoesNotUseRepo(true) - doesn't need running node
- Migrations handle their own locking via lockfile.Lock()
- Need cancellation support but not node lifecycle

The bug only appeared with embedded migrations (v16+) because they run
in-process. External migrations (pre-v16) were separate processes, so
each had isolated state. Sequential migrations (forward then backward)
in the same process exposed this latent double-lock issue.

Also adds repo.lock acquisition to RunEmbeddedMigrations to prevent
concurrent migration access, and removes the now-unnecessary daemon
lock check from the migrate command handler.
@lidel lidel force-pushed the fix/windows-migration-panic branch from e339ede to 6cb5f8a Compare October 7, 2025 19:19
Comment on lines +429 to +431
// Use req.Context instead of cctx.Context() to avoid opening the repo before migrations run,
// which would acquire the lock that migrations need
err = migrations.RunHybridMigrations(req.Context, targetVersion, cctx.ConfigRoot, allowDowngrade)
Copy link
Member Author

@lidel lidel Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing it down here as "Interesting Findings For Posterity^WLLM" :-)

The Two Contexts Explained

(1) req.Context - The Request Context (Standard Go Context)

What it is:

  • A standard context.Context from Go's stdlib
  • Created when the CLI command is invoked
  • Carries request-scoped values like:
    • Cancellation signals (Ctrl+C)
    • Deadlines/timeouts
    • Request-specific metadata
  • Lives for the duration of the command execution

Where it comes from:

  // From go-ipfs-cmds library
  type Request struct {
      Context context.Context  // ← This is req.Context
      Command *Command
      Path    []string
      Arguments []string
      Options OptMap
      // ...
  }

Purpose: Standard Go context for cancellation, timeouts, and request lifetime management.

(2) cctx.Context() - The Node Context (Lazy-Loading)

What it is:

  • A method on Kubo's custom commands.Context struct
  • Lazily constructs an IPFS node if it doesn't exist
  • Returns the node's internal context

The implementation (from commands/context.go:98-107):

  func (c *Context) Context() context.Context {
      n, err := c.GetNode()  // ← CONSTRUCTS NODE if not cached!
      if err != nil {
          log.Debug("error getting node: ", err)
          return context.Background()
      }
      return n.Context()  // Returns the node's context
  }

What GetNode() does (lines 45-54):

  func (c *Context) GetNode() (*core.IpfsNode, error) {
      var err error
      if c.node == nil {
          if c.ConstructNode == nil {
              return nil, errors.New("nil ConstructNode function")
          }
          c.node, err = c.ConstructNode()  // ← CONSTRUCTS THE NODE
      }
      return c.node, err
  }

What ConstructNode does (from start.go:122-124):

ConstructNode: func() (n *core.IpfsNode, err error) {
    r, err := fsrepo.Open(repoPath)  // ← OPENS REPO AND ACQUIRES LOCK!
    // ...
    n, err = core.NewNode(ctx, &core.BuildCfg{Repo: r})
    return n, nil
}

Purpose: Provides access to the fully initialized IPFS node with all its subsystems (networking, DHT, bitswap, datastore, etc.).

Why This Caused the Bug

The Call Chain (BEFORE the fix):

  ipfs repo migrate
    ↓
  Run() handler (repo.go:404)
    ↓
  migrations.RunHybridMigrations(cctx.Context(), ...)  ← OLD CODE
    ↓
  cctx.Context() (context.go:99)
    ↓
  GetNode() (context.go:100)
    ↓
  ConstructNode() (start.go:117)
    ↓
  fsrepo.Open() (start.go:122)  ← ACQUIRES LOCK #1
    ↓
  lockfile.Lock(repoPath, "repo.lock")
    ✓ Lock acquired, added to go4.org/lock's global map
    ↓
  migrations.RunEmbeddedMigrations()
    ↓
  lockfile.Lock(repoPath, "repo.lock")  ← TRIES TO ACQUIRE LOCK #2
    ✗ ERROR: "lock is already held by us" (same process, global map check fails)

The Problem:

  1. cctx.Context() has a side effect: It opens the repository
  2. Opening the repo acquires the lock (fsrepo.go:165)
  3. Migrations need the lock for themselves
  4. Same process, same lock path → go4.org/lock's global locked map reports "already locked"

Why The Fix Works

The Call Chain (AFTER the fix):

  ipfs repo migrate
    ↓
  Run() handler (repo.go:404)
    ↓
  migrations.RunHybridMigrations(req.Context, ...)  ← NEW CODE
    ↓
  req.Context is just a plain context.Context
    ✓ No side effects, no node construction, no repo opening
    ↓
  migrations.RunEmbeddedMigrations()
    ↓
  lockfile.Lock(repoPath, "repo.lock")  ← FIRST AND ONLY LOCK ATTEMPT
    ✓ Lock acquired successfully
    ✓ Migrations run
    ✓ Lock released

lidel added 8 commits October 7, 2025 21:51
daemon.go was incorrectly using cctx.Context() in two critical places:

1. Line 337: migrations call - cctx.Context() triggers GetNode() which
   opens the repo and acquires repo.lock BEFORE migrations run, causing
   "lock is already held by us" errors when migrations try to lock

2. Line 390: autoconf client.Start() - uses context for HTTP timeouts
   and background updater lifecycle, doesn't need node construction

Both now use req.Context (plain Go context) which provides:
- request lifetime and cancellation
- no side effects (doesn't construct node or open repo)
- correct lifecycle for HTTP requests and background goroutines
three fixes based on code review:

1. atomic backup creation in WithBackup() - prevents partial backup files
   that could corrupt config during RevertBackup() if interrupted

2. proper error handling in atomicfile.Abort() - captures both close and
   remove errors instead of silently discarding remove errors

3. complete Windows path sanitization - handles all invalid filename
   characters (<>:"/\|?*) not just : and /

all migration tests pass
move lock acquisition to right after CheckIpfsDir() to prevent any
concurrent access as early as possible in the migration process
- consistent error handling in atomicfile.Close()
- simplify PATH manipulation with slices.DeleteFunc
- add unit tests for atomicfile package
- add atomicfile cleanup tests (multiple aborts, temp file verification)
- add migration cleanup tests (successful/failed migrations, backup persistence)
- add concurrent daemon migration test
- add conversion failure cleanup test
- use testify assertions for cleaner test code
Comment on lines +113 to +114
// Acquire lock once for all embedded migrations to prevent concurrent access
lk, err := lockfile.Lock(ipfsDir, "repo.lock")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ we need to add this here, since we no longer use implicit locks provided by cctx.Context() (it was protecting via lock "by sheer luck", this is explicitly creating lock here, which is better/safer)

@lidel lidel marked this pull request as ready for review October 8, 2025 01:16
@lidel lidel requested a review from a team as a code owner October 8, 2025 01:16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@lidel
Copy link
Member Author

lidel commented Oct 8, 2025

We have end-to-end tests on all green on CI on all three platforms, but to be extra sure I've run manual tests:

Manual Windows smoke-test

Just to be extra safe, I've manually tested binary with fix here:

GOOS=windows GOARCH=amd64 go build -o cmd/ipfs/ipfs.exe ./cmd/ipfs

And then swapped ipfs.exe in failing repo from ipfs/ipfs-desktop#2998 and the error from #11009 is gone.

Migration finished correctly on Windows and v0.38 build started fine.

Manual Linux smoke-test

Created fresh repo (ipfs init) with kubo_v0.24.0_linux-amd64.tar.gz (produces repo verison 15), then started ipfs daemon with binary from this PR. It worked fine and executed both legacy external and then modern embedded migrations:

$ ipfs daemon                                                                                                                                                                        130 ...indows-migration-panic
Initializing daemon...
Kubo version: 0.38.0
Repo version: 18
System version: amd64/linux
Golang version: go1.25.1 X:nodwarf5
Kubo repository at /tmp/tmp.OTlsNtYqTO has version 15 and needs to be migrated to version 18.
Run migrations now? [y/N] y
Starting hybrid migration from version 15 to 18
Using hybrid migration strategy: external to v16, then embedded
Phase 1: External migration from v15 to v16
Looking for suitable migration binaries.
Need 1 migrations, downloading.
Downloading migration: fs-repo-15-to-16...
Fetching with HTTP: "https://trustless-gateway.link/ipfs/QmRzRGJEjYDfbHHaALnHBuhzzrkXGdwcPMrgd5fgM7hqbe/fs-repo-15-to-16/versions"
Fetching with HTTP: "https://trustless-gateway.link/ipfs/QmRzRGJEjYDfbHHaALnHBuhzzrkXGdwcPMrgd5fgM7hqbe/fs-repo-15-to-16/v1.0.1/fs-repo-15-to-16_v1.0.1_linux-amd64.tar.gz"
Downloaded and unpacked migration: /tmp/migrations223617285/fs-repo-15-to-16 (v1.0.1)
Running migration fs-repo-15-to-16 ...
  => Running: /tmp/migrations223617285/fs-repo-15-to-16 -path=/tmp/tmp.OTlsNtYqTO -verbose=true
applying 15-to-16 repo migration
locking repo at "/tmp/tmp.OTlsNtYqTO"
  - verifying version is '15'
> Upgrading config to new format
updated version file
Migration 15 to 16 succeeded
Success: fs-repo migrated to version 16.
Phase 2: Embedded migration from v16 to v18
Looking for embedded migrations.
Running embedded migration fs-repo-16-to-17...
applying 16-to-17 repo migration
> Upgrading config to use AutoConf system
updated version file
Migration 16-to-17 succeeded
Embedded migration fs-repo-16-to-17 completed successfully
Running embedded migration fs-repo-17-to-18...
applying 17-to-18 repo migration
> Migrating Provider and Reprovider configuration to unified Provide configuration
updated version file
Migration 17-to-18 succeeded
Embedded migration fs-repo-17-to-18 completed successfully
Success: fs-repo migrated to version 18 using embedded migrations.
Hybrid migration completed successfully: v15 → v18
PeerID: 12D3KooWLTLvxDNsA8KdbATivkp9qoJzycxkwDLQKXdvRLLvfZX1
Swarm listening on 127.0.0.1:4001 (TCP+UDP)

...

RPC API server listening on /ip4/127.0.0.1/tcp/5001
WebUI: http://127.0.0.1:5001/webui
Gateway server listening on /ip4/127.0.0.1/tcp/8080
Daemon is ready
^C
Received interrupt signal, shutting down...
(Hit ctrl-c again to force-shutdown the daemon.)

Merging and starting release dance for v0.38.1 with this fix.

@lidel lidel merged commit f4834e7 into master Oct 8, 2025
20 checks passed
@lidel lidel deleted the fix/windows-migration-panic branch October 8, 2025 16:02
lidel added a commit that referenced this pull request Oct 8, 2025
* test: add migration tests for Windows and macOS

- add dedicated CI workflow for migration tests on Windows/macOS
- workflow triggers on migration-related file changes only

* build: remove redundant go version checks

- remove GO_MIN_VERSION and check_go_version scripts
- go.mod already enforces minimum version (go 1.25)
- fixes make build on Windows

* fix: windows migration panic by reading config into memory

fixes migration panic on Windows when upgrading from v0.37 to v0.38
by reading the entire config file into memory before performing atomic
operations. this avoids file locking issues on Windows where open files
cannot be renamed.

also fixes:
- TestRepoDir to set USERPROFILE on Windows (not just HOME)
- CLI migration tests to sanitize directory names (remove colons)

minimal fix that solves the "panic: error can't be dealt with
transactionally: Access is denied" error without adding unnecessary
platform-specific complexity.

* fix: set PATH for CLI migration tests in CI

the CLI tests need the built ipfs binary to be in PATH

* fix: use ipfs shutdown for graceful daemon termination in tests

replaces platform-specific signal handling with ipfs shutdown command
which works consistently across all platforms including Windows

* fix: isolate PATH modifications in parallel migration tests

tests running in parallel with t.Parallel() were interfering with each
other through global PATH modifications via os.Setenv(). this caused
tests to download real migration binaries instead of using mocks,
leading to Windows failures due to path separator issues in external tools.

now each test builds its own custom PATH and passes it explicitly to
commands, preventing interference between parallel tests.

* chore: improve error messages in WithBackup

* fix: Windows CI migration test failures

- add .exe extension to mock migration binaries on Windows
- handle repo lock file properly in mock migration binary
- ensure lock is created and removed to prevent conflicts

* refactor: align atomicfile error handling with fs-repo-migrations

- check close error in Abort() before attempting removal
- leave temp file on rename failure for debugging (like fs-repo-15-to-16)
- improves consistency with external migration implementations

* fix: use req.Context in repo migrate to avoid double-lock

The repo migrate command was calling cctx.Context() which has a hidden
side effect: it lazily constructs the IPFS node by calling GetNode(),
which opens the repository and acquires repo.lock. When migrations then
tried to acquire the same lock, it failed with "lock is already held by us"
because go4.org/lock tracks locks per-process in a global map.

The fix uses req.Context instead, which is a plain context.Context with
no side effects. This provides what migrations need (cancellation handling)
without triggering node construction or repo opening.

Context types explained:
- req.Context: Standard Go context for request lifetime, cancellation,
  and timeouts. No side effects.
- cctx.Context(): Kubo-specific method that lazily constructs the full
  IPFS node (opens repo, acquires lock, initializes subsystems). Returns
  the node's internal context.

Why req.Context is correct here:
- Migrations work on raw filesystem (only need ConfigRoot path)
- Command has SetDoesNotUseRepo(true) - doesn't need running node
- Migrations handle their own locking via lockfile.Lock()
- Need cancellation support but not node lifecycle

The bug only appeared with embedded migrations (v16+) because they run
in-process. External migrations (pre-v16) were separate processes, so
each had isolated state. Sequential migrations (forward then backward)
in the same process exposed this latent double-lock issue.

Also adds repo.lock acquisition to RunEmbeddedMigrations to prevent
concurrent migration access, and removes the now-unnecessary daemon
lock check from the migrate command handler.

* fix: use req.Context for migrations and autoconf in daemon startup

daemon.go was incorrectly using cctx.Context() in two critical places:

1. Line 337: migrations call - cctx.Context() triggers GetNode() which
   opens the repo and acquires repo.lock BEFORE migrations run, causing
   "lock is already held by us" errors when migrations try to lock

2. Line 390: autoconf client.Start() - uses context for HTTP timeouts
   and background updater lifecycle, doesn't need node construction

Both now use req.Context (plain Go context) which provides:
- request lifetime and cancellation
- no side effects (doesn't construct node or open repo)
- correct lifecycle for HTTP requests and background goroutines

(cherry picked from commit f4834e7)
lidel added a commit that referenced this pull request Oct 8, 2025
* test: add migration tests for Windows and macOS

- add dedicated CI workflow for migration tests on Windows/macOS
- workflow triggers on migration-related file changes only

* build: remove redundant go version checks

- remove GO_MIN_VERSION and check_go_version scripts
- go.mod already enforces minimum version (go 1.25)
- fixes make build on Windows

* fix: windows migration panic by reading config into memory

fixes migration panic on Windows when upgrading from v0.37 to v0.38
by reading the entire config file into memory before performing atomic
operations. this avoids file locking issues on Windows where open files
cannot be renamed.

also fixes:
- TestRepoDir to set USERPROFILE on Windows (not just HOME)
- CLI migration tests to sanitize directory names (remove colons)

minimal fix that solves the "panic: error can't be dealt with
transactionally: Access is denied" error without adding unnecessary
platform-specific complexity.

* fix: set PATH for CLI migration tests in CI

the CLI tests need the built ipfs binary to be in PATH

* fix: use ipfs shutdown for graceful daemon termination in tests

replaces platform-specific signal handling with ipfs shutdown command
which works consistently across all platforms including Windows

* fix: isolate PATH modifications in parallel migration tests

tests running in parallel with t.Parallel() were interfering with each
other through global PATH modifications via os.Setenv(). this caused
tests to download real migration binaries instead of using mocks,
leading to Windows failures due to path separator issues in external tools.

now each test builds its own custom PATH and passes it explicitly to
commands, preventing interference between parallel tests.

* chore: improve error messages in WithBackup

* fix: Windows CI migration test failures

- add .exe extension to mock migration binaries on Windows
- handle repo lock file properly in mock migration binary
- ensure lock is created and removed to prevent conflicts

* refactor: align atomicfile error handling with fs-repo-migrations

- check close error in Abort() before attempting removal
- leave temp file on rename failure for debugging (like fs-repo-15-to-16)
- improves consistency with external migration implementations

* fix: use req.Context in repo migrate to avoid double-lock

The repo migrate command was calling cctx.Context() which has a hidden
side effect: it lazily constructs the IPFS node by calling GetNode(),
which opens the repository and acquires repo.lock. When migrations then
tried to acquire the same lock, it failed with "lock is already held by us"
because go4.org/lock tracks locks per-process in a global map.

The fix uses req.Context instead, which is a plain context.Context with
no side effects. This provides what migrations need (cancellation handling)
without triggering node construction or repo opening.

Context types explained:
- req.Context: Standard Go context for request lifetime, cancellation,
  and timeouts. No side effects.
- cctx.Context(): Kubo-specific method that lazily constructs the full
  IPFS node (opens repo, acquires lock, initializes subsystems). Returns
  the node's internal context.

Why req.Context is correct here:
- Migrations work on raw filesystem (only need ConfigRoot path)
- Command has SetDoesNotUseRepo(true) - doesn't need running node
- Migrations handle their own locking via lockfile.Lock()
- Need cancellation support but not node lifecycle

The bug only appeared with embedded migrations (v16+) because they run
in-process. External migrations (pre-v16) were separate processes, so
each had isolated state. Sequential migrations (forward then backward)
in the same process exposed this latent double-lock issue.

Also adds repo.lock acquisition to RunEmbeddedMigrations to prevent
concurrent migration access, and removes the now-unnecessary daemon
lock check from the migrate command handler.

* fix: use req.Context for migrations and autoconf in daemon startup

daemon.go was incorrectly using cctx.Context() in two critical places:

1. Line 337: migrations call - cctx.Context() triggers GetNode() which
   opens the repo and acquires repo.lock BEFORE migrations run, causing
   "lock is already held by us" errors when migrations try to lock

2. Line 390: autoconf client.Start() - uses context for HTTP timeouts
   and background updater lifecycle, doesn't need node construction

Both now use req.Context (plain Go context) which provides:
- request lifetime and cancellation
- no side effects (doesn't construct node or open repo)
- correct lifecycle for HTTP requests and background goroutines

(cherry picked from commit f4834e7)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0.38.0: migration fails on Windows

2 participants