Conversation
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps:
Please note that the RPMs should be used only in a testing environment.
@jbaublitz Please rebase when you get the chance.

@jbaublitz Should we require the cryptsetup binary directly in cmd.rs instead of just indirectly through
mulkieran
left a comment
Please address the two comments at the bottom, i.e., rebase and add test dependencies.
This adds an upgrade and downgrade API that allows atomically converting from read to write locks and vice versa. This commit also applies the usage of this API in reencryption to allow listing pools during the actual reencryption operation.
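The key constraint of a read-to-write upgrade is admission: the upgrading thread still holds its own read guard, so the write lock can be handed over atomically only when that thread is the sole remaining reader and no writer is active. A minimal sketch of that admission rule (the names `upgrade_may_proceed`, `readers_on_uuid`, and `writer_held` are illustrative, not the actual lock.rs API):

```rust
// Hypothetical model of the upgrade admission check for a per-UUID
// read/write lock. The upgrader's own read guard counts as one reader,
// so an upgrade is grantable exactly when it is the only reader left
// and no writer currently holds the lock.
fn upgrade_may_proceed(readers_on_uuid: usize, writer_held: bool) -> bool {
    !writer_held && readers_on_uuid == 1
}

fn main() {
    // Sole reader, no writer: the upgrade can proceed atomically.
    assert!(upgrade_may_proceed(1, false));
    // Another reader still holds a guard: the upgrader must wait.
    assert!(!upgrade_may_proceed(2, false));
    // A writer is active: wait until it releases.
    assert!(!upgrade_may_proceed(1, true));
}
```

Queueing the upgrade rather than dropping and re-acquiring the lock is what lets reencryption keep listing pools without opening a race window between release and re-acquisition.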
42b20f1 to
573a13c
Compare
@coderabbitai review

✅ Actions performed: Review triggered.

Walkthrough

This PR implements comprehensive support for online pool encryption, decryption, and reencryption operations in stratisd, primarily targeting metadata version 2. Changes include new Engine/Pool trait methods, DBus interface extensions for the pool_3_9 service, extensive cryptsetup bindings integration, lock system upgrades for read-to-write transitions, and updates to CI workflows and dependencies.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client as DBus Client
participant DBus as DBus Interface
participant Pool as Pool (Trait)
participant Backstore as Backstore
participant CryptHandle as CryptHandle
participant Cryptsetup as cryptsetup CLI
Client->>DBus: encrypt_pool_method()
DBus->>Pool: start_encrypt_pool()
Pool->>Backstore: prepare_encrypt()
Backstore->>CryptHandle: setup_encrypt()
CryptHandle->>Cryptsetup: cryptsetup luksFormat
Cryptsetup-->>CryptHandle: LUKS header created
CryptHandle-->>Backstore: (slot, key)
Backstore-->>Pool: CreateAction::Created()
Pool-->>DBus: (true, slot, key_info)
DBus->>Pool: do_encrypt_pool()
Pool->>Backstore: do_encrypt()
Backstore->>Cryptsetup: cryptsetup reencrypt --encrypt
Cryptsetup-->>Backstore: encryption in progress
Backstore-->>Pool: StratisResult<()>
Pool-->>DBus: (true, 0, success)
DBus->>Pool: finish_encrypt_pool()
Pool->>Backstore: finish_encrypt()
Backstore->>CryptHandle: finalize encryption
CryptHandle-->>Backstore: success
Backstore-->>Pool: StratisResult<()>
Pool-->>DBus: (true, 0, success)
DBus->>Client: Signals: encrypted, keyring info
```

```mermaid
sequenceDiagram
participant Client as DBus Client
participant DBus as DBus Interface
participant Pool as Pool (Trait)
participant LockSystem as Lock System
participant Backstore as Backstore
participant Cryptsetup as cryptsetup CLI
Client->>DBus: reencrypt_pool_method()
DBus->>LockSystem: get_pool_read_guard()
LockSystem-->>DBus: SomeLockReadGuard
DBus->>LockSystem: upgrade(read_guard)
LockSystem->>LockSystem: WaitType::Upgrade queued
LockSystem-->>DBus: SomeLockWriteGuard
DBus->>Pool: start_reencrypt_pool()
Pool->>Backstore: prepare_reencrypt()
Backstore->>Cryptsetup: cryptsetup luksAddKey
Cryptsetup-->>Backstore: new keyslot allocated
Backstore-->>Pool: Vec<(slot, key, new_slot)>
Pool-->>DBus: CreateAction::Created()
DBus->>Pool: do_reencrypt_pool()
Pool->>Cryptsetup: cryptsetup reencrypt --resume-only
Cryptsetup-->>Pool: reencryption progresses
Pool-->>DBus: StratisResult<()>
DBus->>Pool: finish_reencrypt_pool()
Pool->>Pool: last_reencrypt = Utc::now()
Pool->>Pool: write metadata
Pool-->>DBus: ReencryptedDevice
DBus->>Client: Signal: last_reencrypted_timestamp
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ 5 of 5 passed.
Actionable comments posted: 16
🧹 Nitpick comments (2)
src/engine/strat_engine/thinpool/thinpool.rs (1)

2964-2965: Add a non-zero offset regression case here.

This test still exercises only `set_device(new_device, Sectors(0), OffsetDirection::Forwards)`. The new behavior is the offset adjustment itself, so a same-device/non-zero-offset case would cover the critical path that this PR is introducing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/engine/strat_engine/thinpool/thinpool.rs` around lines 2964 - 2965, add a regression test that exercises the offset-adjustment path by calling pool.set_device with a non-zero offset and with the same device (i.e., reuse new_device) instead of Sectors(0); specifically, replace the current Sectors(0) call with a non-zero Sectors value (e.g., Sectors(>0)) in the set_device invocation on pool and assert the resulting device/offset state (or behavior) expected by the new implementation so the same-device/non-zero-offset branch is covered; reference set_device, Sectors, OffsetDirection::Forwards, pool and new_device when locating where to add this case.

src/engine/strat_engine/pool/v1.rs (1)

2152-2167: Assert `last_reencrypt()` after the restart.

This test proves reencryption survives `stop_pool()`/`start_pool()`, but it never checks the new `last_reencrypt` state that `record()` and `setup()` now persist. A regression that drops the timestamp on write/read would still pass here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/engine/strat_engine/pool/v1.rs` around lines 2152 - 2167, the test currently restarts the pool but never verifies the persisted last_reencrypt timestamp; after the restart (after test_async!(engine.start_pool(...)).unwrap()), read pool.last_reencrypt() and assert it matches the previously observed value (e.g. capture key_info or call pool.last_reencrypt() before stop_pool and compare that to pool.last_reencrypt() after start_pool), ensuring record()/setup() persistence is validated; use the pool API (pool.last_reencrypt()) and appropriate Option handling (is_some() / equals) to make the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/dbus/pool/pool_3_9/mod.rs`:
- Around line 336-358: The older PoolR0–PoolR8 interfaces are exposing the
`encrypted` property as immutable so clients cache stale state; update each
PoolR0..PoolR8 interface declaration for the `encrypted` property to include
emits_changed_signal = "true" (so property change notifications are emitted like
PoolR9) — locate the `encrypted` property declarations in the PoolR0..PoolR8
interface blocks and add emits_changed_signal = "true"; alternatively, if you
prefer not to change older interfaces, stop registering PoolR0–PoolR8 for this
object path and only register the PoolR9 interface (affecting where
encrypt_pool_method / reencrypt_pool_method / decrypt_pool_method are exposed)
so clients see live state via PoolR9 only.
In `@src/engine/strat_engine/backstore/backstore/v1.rs`:
- Around line 832-845: The reencrypt function currently zips blockdevs() with
key_info which silently truncates if lengths differ; before iterating, count the
blockdevs() (e.g. via self.blockdevs().map(|(_,_,bd)| bd).count() or collect
into a vec) and compare to key_info.len(), and if they differ return a
StratisResult::Err (use an appropriate error variant/message) so the function
fails fast; then iterate using the collected blockdev list and key_info to call
bd.do_reencrypt(...) as before. Include reference to prepare_reencrypt() in the
error message or comment to indicate stale key_info mismatch.
In `@src/engine/strat_engine/backstore/backstore/v2.rs`:
- Around line 1307-1315: The bug is that self.shift_alloc_offset(...) is called
before checking the result of the thinpool remap, so allocs are moved even if
get_devno_from_path() or thinpool.set_device() fails; change the sequence in the
function(s) using thinpool.suspend()/thinpool.resume() and set_device_res (e.g.,
the block around thinpool.suspend(), get_devno_from_path(...).and_then(|devno|
thinpool.set_device(...)), thinpool.resume(), self.shift_alloc_offset(...),
set_device_res) so that you only call self.shift_alloc_offset(offset,
offset_direction) after set_device_res? succeeds (i.e., after
unwrapping/returning on error), leaving allocs untouched on failure; apply the
same fix to the analogous block around lines where shift_alloc_offset is invoked
at 1409-1416.
In `@src/engine/strat_engine/backstore/blockdev/v1.rs`:
- Around line 410-415: The code in setup_reencrypt uses
underlying_device.crypt_handle().expect(...) which panics; change this to return
a StratisError wrapped in StratisResult when the crypt handle is missing (do not
panic). Specifically, in the method setup_reencrypt (and the similar block
around the other occurrence), check underlying_device.crypt_handle() and if None
return an appropriate StratisError (e.g., an InvalidState or NotEncrypted error)
instead of calling expect, then call crypt_handle.setup_reencrypt() and
propagate its StratisResult; update signatures/returns to preserve
StratisResult<(u32, SizedKeyMemory, u32)> and ensure error construction uses the
project’s StratisError constructors.
In `@src/engine/strat_engine/cmd.rs`:
- Around line 583-615: The three functions run_encrypt, run_reencrypt, and
run_decrypt currently call Command::new(CRYPTSETUP) which triggers PATH lookup;
change each to use the verified absolute path by calling
Command::new(get_executable(CRYPTSETUP)) (or
Command::new(get_executable(CRYPTSETUP).as_os_str())) so the cached, verified
executable is used consistently with other commands; keep the existing args and
then call execute_cmd(&mut cmd) as before.
In `@src/engine/strat_engine/crypt/shared.rs`:
- Around line 1100-1183: handle_setup_reencrypt currently always creates new
keyslots and appends to token JSON, causing duplicate mutations on retry; change
it to detect an existing reencryption context before mutating: for each token
slot (use device.token_handle().json_get and the token_contents Value used in
set_up_reencryption_token), inspect the token JSON for an existing
TOKEN_KEYSLOTS_KEY entry or a custom reencryption marker and, if present, parse
and reuse the already-created keyslot(s) instead of calling
device.keyslot_handle().add_by_key; similarly, before creating
single_new_keyslot, check token_contents or call get_keyslot_number to see if a
reencrypt keyslot already exists and skip creation and json_set when so,
ensuring set_up_reencryption_token only writes when it actually added a new
keyslot.
In `@src/engine/strat_engine/engine.rs`:
- Around line 111-113: The call to validate_key_descs(ei.key_descs()) is being
executed before the idempotency/existing-pool branch and can cause
otherwise-idempotent create_pool*() calls to fail; move the key-description
validation so it runs only when the code takes the new-pool initialization
branch (i.e., after you determine that a new pool will be created and not when
create_pool_idempotent_or_err() returns Identity). Locate the validate_key_descs
invocation(s) (the ones using encryption_info and the calls around lines
referenced previously) and relocate them into the new-pool path/branch that
performs actual pool initialization; ensure create_pool_idempotent_or_err()
still short-circuits without running validation for the existing-pool/idempotent
case.
In `@src/engine/strat_engine/keys.rs`:
- Around line 35-41: Update the incomplete doc comment for validate_key_descs to
a full sentence (e.g., "Validate that all key descriptions exist in the
persistent keyring."). In validate_key_descs, stop using the try operator alone:
call search_key_persistent(key_desc) and match its
StratisResult<Option<KeySerial>>; if it returns Ok(Some(_)) continue, if
Ok(None) convert that into an Err(StratisResult) that indicates the specific
missing key (use the appropriate StratisError variant or create a clear error
message including the KeyDescription), and if Err(_) propagate the error as
before. This ensures missing keys are treated as validation failures rather than
being ignored.
In `@src/engine/strat_engine/pool/v2.rs`:
- Line 701: The call to search_key_persistent(key_description) currently
discards its Result<Option<_>> so missing key descriptions (Ok(None)) are not
rejected; change the callers in this file (the call at
search_key_persistent(key_description)? and the similar calls in the 719-753
range) to inspect the returned Option and return an appropriate error when it is
None (i.e., treat Ok(None) as a preflight failure), e.g., propagate Err(...) or
convert None into a descriptive error before proceeding into the mutating path
so that missing keys are rejected at precheck time.
- Around line 1256-1296: The pool transition writes must persist a recovery
state before any on-disk LUKS/header or DM topology changes: modify
start_encrypt_pool and the corresponding decrypt path (the code that calls
do_decrypt_pool) to call write_metadata (or set a "transitioning" encryption
state in metadata) and flush it before invoking backstore.prepare_encrypt or
Backstore::do_decrypt_pool (and before any call to prepare_encrypt/
do_decrypt_pool in backstore); ensure the metadata records enough info to detect
an incomplete transition and that failure paths roll back or leave the metadata
in a recoverable state so finish_encrypt_pool (and its decrypt counterpart) can
clear the transition and write final metadata after the backstore operations
succeed.
In `@src/engine/strat_engine/serde_structs.rs`:
- Around line 96-104: The function serialize_date_time currently unwraps the
Option with expect and will panic if called with None (fragile because it
depends on skip_serializing_if elsewhere); update serialize_date_time to handle
None defensively by matching or if-let on timestamp and calling
serializer.serialize_i64(...) when Some(ts.timestamp()) and
serializer.serialize_none() (or serializer.serialize_unit()) when None, keeping
the same return type Result<S::Ok, S::Error>; modify the function body
(referencing serialize_date_time and its signature) so it no longer uses expect
and safely serializes None values.
In `@src/engine/strat_engine/thinpool/dm_structs.rs`:
- Around line 122-135: The Backwards branch currently computes
params.start_offset - offset without checking bounds; update the code paths that
construct LinearDevTargetParams::Flakey (and the analogous Linear branch using
FlakeyTargetParams::new and the Forwards/Backwards match) to validate that
params.start_offset >= offset when OffsetDirection::Backwards and return an
error (or otherwise fail fast) instead of building an invalid table; locate the
match on offset_direction around LinearDevTargetParams::Flakey /
FlakeyTargetParams::new and add the guard using params.start_offset and offset,
returning an appropriate Err result if the check fails.
In `@src/engine/strat_engine/thinpool/thinpool.rs`:
- Around line 1964-2010: Precompute and validate all segment shifts by creating
local copies of the segment start values and running apply_offset on those
copies (e.g., iterate self.segments.mdv_segments, .data_segments,
.meta_segments, .meta_spare_segments into temp collections and call apply_offset
on the temps) before calling thin_pool.set_meta_table, thin_pool.set_data_table,
or self.mdv.set_table; only after set_meta_table/set_data_table/mdv.set_table
all succeed, replace the actual segment collections with the precomputed shifted
copies and then set self.backstore_device, so that failures in mdv.set_table or
set_data_table do not leave live DM tables partially updated while segment
vectors or backstore_device remain stale.
- Around line 1938-1940: The early return in set_device() incorrectly treats
same backstore_device alone as idempotent; change the condition to also compare
offset and offset_direction so the function only returns Ok(false) when
backstore_device == self.backstore_device AND offset == self.backstore_offset
AND offset_direction == self.backstore_offset_direction (or the actual field
names used for stored offset/direction). Update the if that currently checks
only backstore_device to include those additional equality checks so offset-only
remaps are not turned into no-ops.
In `@src/engine/structures/lock.rs`:
- Around line 355-360: The Upgrade branch deadlocks when multiple readers on the
same UUID call upgrade() because each upgrader's held read guard counts toward
read_locked; fix by tracking per-UUID pending upgraders and excluding them from
the blocking reader count: add a field like upgrade_waiters: HashMap<Uuid,
usize> (increment in upgrade(), decrement when the upgrade wait ends), then in
the WaitType::Upgrade check compute effective_readers =
read_locked.get(uuid).unwrap_or(&0).saturating_sub(upgrade_waiters.get(uuid).unwrap_or(&0))
and use effective_readers > 1 (or == 0/1 depending on current logic) instead of
raw read_locked; also ensure wake/cleanup paths update upgrade_waiters
consistently so only non-upgrader readers block upgrades.
- Around line 667-688: The upgrade wake path currently calls add_write_lock
which revalidates the waiter as WaitType::SomeWrite and causes a panic in
woken_or_new; instead, add a dedicated add_upgrade_lock method (e.g.,
add_upgrade_lock(&mut self, uuid: U, idx: u64)) that inserts the uuid into
write_locked and calls pre_acquire_assertion(&WaitType::Upgrade(uuid), idx), and
replace the call to add_write_lock(...) in the Upgrade wake branch (the branch
using WaitType::Upgrade and self.1 .1) with add_upgrade_lock(self.1 .1, self.2)
so the woken entry stays validated as Upgrade rather than SomeWrite.
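The upgrade-waiter accounting described in the lock.rs comment above can be sketched in isolation. This is a model of the suggested fix, not the actual lock.rs code; `effective_readers` and its parameters are illustrative names:

```rust
// Sketch (not the actual lock.rs accounting): readers that are queued
// to upgrade are subtracted from the raw per-UUID reader count, so
// their own still-held read guards do not block an upgrade from being
// granted. saturating_sub guards against transient over-counting.
fn effective_readers(raw_readers: usize, upgrade_waiters: usize) -> usize {
    raw_readers.saturating_sub(upgrade_waiters)
}

fn main() {
    // Two readers on one UUID, both queued to upgrade: the raw count (2)
    // would deadlock each against the other; the effective count is 0.
    assert_eq!(effective_readers(2, 2), 0);
    // One ordinary reader plus one upgrader: the upgrader still waits.
    assert_eq!(effective_readers(2, 1), 1);
}
```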
---
Nitpick comments:
In `@src/engine/strat_engine/pool/v1.rs`:
- Around line 2152-2167: The test currently restarts the pool but never verifies
the persisted last_reencrypt timestamp; after the restart (after
test_async!(engine.start_pool(...)).unwrap()), read pool.last_reencrypt() and
assert it matches the previously observed value (e.g. capture key_info or call
pool.last_reencrypt() before stop_pool and compare that to pool.last_reencrypt()
after start_pool), ensuring record()/setup() persistence is validated; use the
pool API (pool.last_reencrypt()) and appropriate Option handling (is_some() /
equals) to make the assertion.
In `@src/engine/strat_engine/thinpool/thinpool.rs`:
- Around line 2964-2965: Add a regression test that exercises the
offset-adjustment path by calling pool.set_device with a non-zero offset and
with the same device (i.e., reuse new_device) instead of Sectors(0);
specifically, replace the current Sectors(0) call with a non-zero Sectors value
(e.g., Sectors(>0)) in the set_device invocation on pool and assert the
resulting device/offset state (or behavior) expected by the new implementation
so the same-device/non-zero-offset branch is covered; reference set_device,
Sectors, OffsetDirection::Forwards, pool and new_device when locating where to
add this case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 19524bcb-a651-4b50-9cba-6531040d5189
⛔ Files ignored due to path filters (1)

`Cargo.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (36)

- .github/workflows/main.yml
- .github/workflows/ubuntu.yml
- .github/workflows/valgrind.yml
- Cargo.toml
- src/dbus/pool/pool_3_9/methods.rs
- src/dbus/pool/pool_3_9/mod.rs
- src/dbus/pool/pool_3_9/props.rs
- src/dbus/util.rs
- src/engine/engine.rs
- src/engine/mod.rs
- src/engine/sim_engine/engine.rs
- src/engine/sim_engine/mod.rs
- src/engine/sim_engine/pool.rs
- src/engine/sim_engine/shared.rs
- src/engine/strat_engine/backstore/backstore/v1.rs
- src/engine/strat_engine/backstore/backstore/v2.rs
- src/engine/strat_engine/backstore/blockdev/v1.rs
- src/engine/strat_engine/backstore/devices.rs
- src/engine/strat_engine/backstore/mod.rs
- src/engine/strat_engine/cmd.rs
- src/engine/strat_engine/crypt/handle/v1.rs
- src/engine/strat_engine/crypt/handle/v2.rs
- src/engine/strat_engine/crypt/shared.rs
- src/engine/strat_engine/engine.rs
- src/engine/strat_engine/keys.rs
- src/engine/strat_engine/pool/dispatch.rs
- src/engine/strat_engine/pool/v1.rs
- src/engine/strat_engine/pool/v2.rs
- src/engine/strat_engine/serde_structs.rs
- src/engine/strat_engine/thinpool/dm_structs.rs
- src/engine/strat_engine/thinpool/filesystem.rs
- src/engine/strat_engine/thinpool/thinpool.rs
- src/engine/structures/lock.rs
- src/engine/types/actions.rs
- src/engine/types/keys.rs
- src/engine/types/mod.rs
```rust
    pub fn reencrypt(
        &self,
        pool_uuid: PoolUuid,
        key_info: Vec<(u32, SizedKeyMemory, u32)>,
    ) -> StratisResult<()> {
        for (bd, (slot, key, new_slot)) in self
            .blockdevs()
            .into_iter()
            .map(|(_, _, bd)| bd)
            .zip(key_info)
        {
            bd.do_reencrypt(pool_uuid, slot, key, new_slot)?;
        }
        Ok(())
```
Fail fast when the reencryption tuple count doesn't match the blockdev count.

`zip` truncates silently. If `key_info` is stale or the device set changes between `prepare_reencrypt()` and `reencrypt()`, this returns `Ok(())` after reencrypting only a prefix of the pool.
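The hazard is easy to demonstrate in isolation; `Iterator::zip` simply stops at the shorter side without signaling a mismatch:

```rust
// zip never reports a length mismatch: it yields pairs until the
// shorter iterator is exhausted and then stops silently.
fn zipped_len(n_blockdevs: usize, n_key_info: usize) -> usize {
    (0..n_blockdevs).zip(0..n_key_info).count()
}

fn main() {
    // Three blockdevs but key info for only one: two devices would be
    // skipped without any error, which is exactly the hazard flagged here.
    assert_eq!(zipped_len(3, 1), 1);
    assert_eq!(zipped_len(2, 2), 2);
}
```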
Suggested fix

```diff
 pub fn reencrypt(
     &self,
     pool_uuid: PoolUuid,
     key_info: Vec<(u32, SizedKeyMemory, u32)>,
 ) -> StratisResult<()> {
-    for (bd, (slot, key, new_slot)) in self
-        .blockdevs()
+    let blockdevs = self.blockdevs();
+    if blockdevs.len() != key_info.len() {
+        return Err(StratisError::Msg(format!(
+            "Expected reencryption info for {} block devices, got {}",
+            blockdevs.len(),
+            key_info.len()
+        )));
+    }
+
+    for (bd, (slot, key, new_slot)) in blockdevs
         .into_iter()
         .map(|(_, _, bd)| bd)
         .zip(key_info)
     {
         bd.do_reencrypt(pool_uuid, slot, key, new_slot)?;
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
pub fn reencrypt(
    &self,
    pool_uuid: PoolUuid,
    key_info: Vec<(u32, SizedKeyMemory, u32)>,
) -> StratisResult<()> {
    let blockdevs = self.blockdevs();
    if blockdevs.len() != key_info.len() {
        return Err(StratisError::Msg(format!(
            "Expected reencryption info for {} block devices, got {}",
            blockdevs.len(),
            key_info.len()
        )));
    }
    for (bd, (slot, key, new_slot)) in blockdevs
        .into_iter()
        .map(|(_, _, bd)| bd)
        .zip(key_info)
    {
        bd.do_reencrypt(pool_uuid, slot, key, new_slot)?;
    }
    Ok(())
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/engine/strat_engine/backstore/backstore/v1.rs` around lines 832 - 845,
The reencrypt function currently zips blockdevs() with key_info which silently
truncates if lengths differ; before iterating, count the blockdevs() (e.g. via
self.blockdevs().map(|(_,_,bd)| bd).count() or collect into a vec) and compare
to key_info.len(), and if they differ return a StratisResult::Err (use an
appropriate error variant/message) so the function fails fast; then iterate
using the collected blockdev list and key_info to call bd.do_reencrypt(...) as
before. Include reference to prepare_reencrypt() in the error message or comment
to indicate stale key_info mismatch.
Ok. Let's do this one. The error hypothesized seems very unlikely, but the extra error return is very little additional trouble. If we're convinced the lengths must be the same, an assertion is the appropriate thing.
@mulkieran, agreed — since prepare_reencrypt and reencrypt are always called sequentially while holding the write lock (no blockdevs can be added or removed in between), the lengths are guaranteed to match. A mismatch would be a programming bug, not a recoverable runtime condition, so `assert_eq!` is more appropriate here:

```rust
pub fn reencrypt(
    &self,
    pool_uuid: PoolUuid,
    key_info: Vec<(u32, SizedKeyMemory, u32)>,
) -> StratisResult<()> {
    let blockdevs = self.blockdevs();
    assert_eq!(
        blockdevs.len(),
        key_info.len(),
        "reencryption key info length ({}) does not match blockdev count ({}); \
         key_info from prepare_reencrypt() must correspond 1-to-1 with blockdevs()",
        key_info.len(),
        blockdevs.len(),
    );
    for (bd, (slot, key, new_slot)) in blockdevs
        .into_iter()
        .map(|(_, _, bd)| bd)
        .zip(key_info)
    {
        bd.do_reencrypt(pool_uuid, slot, key, new_slot)?;
    }
    Ok(())
}
```

The same pattern applies to the identical reencrypt method in src/engine/strat_engine/backstore/backstore/v2.rs — worth updating that one consistently.
🧠 Learnings used
Learnt from: jbaublitz
Repo: stratis-storage/stratisd PR: 3651
File: src/dbus_api/pool/pool_3_9/methods.rs:173-176
Timestamp: 2025-08-20T02:37:22.477Z
Learning: The encrypt_pool, reencrypt_pool, and decrypt_pool methods in src/dbus_api/pool/pool_3_9/methods.rs have intentionally different lock handling patterns due to different workflow requirements, and this is not an inconsistency that needs to be fixed.
Learnt from: jbaublitz
Repo: stratis-storage/stratisd PR: 3898
File: src/engine/sim_engine/pool.rs:934-948
Timestamp: 2025-09-05T14:32:42.528Z
Learning: In src/engine/sim_engine/pool.rs, the start_encrypt_pool method intentionally creates a SizedKeyMemory with 1 byte allocated but 0 length reported - this is to avoid NULL pointer issues while simulating an empty key slice in the sim engine.
@jbaublitz Code Rabbit makes a good point here, either an assertion or an error return depending on what we believe seems like the correct course.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```rust
        thinpool.suspend()?;
        let set_device_res = get_devno_from_path(
            &PathBuf::from(DEVICEMAPPER_PATH)
                .join(format_crypt_backstore_name(&pool_uuid).to_string()),
        )
        .and_then(|devno| thinpool.set_device(devno, offset, offset_direction));
        thinpool.resume()?;
        self.shift_alloc_offset(offset, offset_direction);
        set_device_res?;
```
Only shift the allocation map after the thinpool remap succeeds.

Both methods call `shift_alloc_offset()` before checking `set_device_res`. If `get_devno_from_path()` or `thinpool.set_device()` fails, the function returns an error with `self.cap_device.allocs` already moved into the new coordinate space while the thinpool still points at the old device.
Proposed fix

```diff
     thinpool.resume()?;
-    self.shift_alloc_offset(offset, offset_direction);
     set_device_res?;
+    self.shift_alloc_offset(offset, offset_direction);
     Ok((sector_size, key_info))
@@
     let set_device_res =
         get_devno_from_path(&PathBuf::from(DEVICEMAPPER_PATH).join(dm_name.to_string()))
             .and_then(|devno| thinpool.set_device(devno, offset, direction));
     thinpool.resume()?;
-    self.shift_alloc_offset(offset, direction);
     set_device_res?;
+    self.shift_alloc_offset(offset, direction);
     self.cap_device.enc = None;
     Ok(())
```

Also applies to: 1409-1416
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/engine/strat_engine/backstore/backstore/v2.rs` around lines 1307 - 1315,
The bug is that self.shift_alloc_offset(...) is called before checking the
result of the thinpool remap, so allocs are moved even if get_devno_from_path()
or thinpool.set_device() fails; change the sequence in the function(s) using
thinpool.suspend()/thinpool.resume() and set_device_res (e.g., the block around
thinpool.suspend(), get_devno_from_path(...).and_then(|devno|
thinpool.set_device(...)), thinpool.resume(), self.shift_alloc_offset(...),
set_device_res) so that you only call self.shift_alloc_offset(offset,
offset_direction) after set_device_res? succeeds (i.e., after
unwrapping/returning on error), leaving allocs untouched on failure; apply the
same fix to the analogous block around lines where shift_alloc_offset is invoked
at 1409-1416.
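The pattern behind this fix is general: run the fallible step, propagate its error, and only then apply the irreversible in-memory mutation. A toy sketch of the ordering (the `Backstore` struct and `remap` function here are illustrative stand-ins, not the stratisd types):

```rust
// Toy model: shift the allocation offset only after the fallible
// device remap has succeeded, so a failed remap leaves state untouched.
struct Backstore {
    alloc_offset: u64,
}

fn remap(
    bs: &mut Backstore,
    set_device_res: Result<(), String>,
    offset: u64,
) -> Result<(), String> {
    set_device_res?; // bail out first; allocs stay in the old coordinate space
    bs.alloc_offset += offset; // mutate only on success
    Ok(())
}

fn main() {
    let mut bs = Backstore { alloc_offset: 0 };
    assert!(remap(&mut bs, Err("set_device failed".into()), 100).is_err());
    assert_eq!(bs.alloc_offset, 0); // untouched on failure
    assert!(remap(&mut bs, Ok(()), 100).is_ok());
    assert_eq!(bs.alloc_offset, 100);
}
```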
```rust
                    match offset_direction {
                        OffsetDirection::Forwards => params.start_offset + offset,
                        OffsetDirection::Backwards => params.start_offset - offset,
                    },
                ))
            }
            LinearDevTargetParams::Flakey(ref params) => {
                let feature_args = params.feature_args.iter().cloned().collect::<Vec<_>>();
                LinearDevTargetParams::Flakey(FlakeyTargetParams::new(
                    device,
                    match offset_direction {
                        OffsetDirection::Forwards => params.start_offset + offset,
                        OffsetDirection::Backwards => params.start_offset - offset,
                    },
```
Guard the backwards offset adjustment.

`OffsetDirection::Backwards` subtracts `offset` from every segment start without checking that the segment actually begins that far into the device. A larger shift here will build a table with the wrong physical offsets instead of failing fast.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/engine/strat_engine/thinpool/dm_structs.rs` around lines 122 - 135, The
Backwards branch currently computes params.start_offset - offset without
checking bounds; update the code paths that construct
LinearDevTargetParams::Flakey (and the analogous Linear branch using
FlakeyTargetParams::new and the Forwards/Backwards match) to validate that
params.start_offset >= offset when OffsetDirection::Backwards and return an
error (or otherwise fail fast) instead of building an invalid table; locate the
match on offset_direction around LinearDevTargetParams::Flakey /
FlakeyTargetParams::new and add the guard using params.start_offset and offset,
returning an appropriate Err result if the check fails.
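A hedged sketch of the suggested guard, using plain `u64` sector counts and a `String` error (the real code would use the crate's `Sectors` type and `StratisError`):

```rust
// Apply a forwards/backwards offset to a segment start, failing fast
// instead of underflowing when a backwards shift exceeds the start.
fn apply_offset(start: u64, offset: u64, backwards: bool) -> Result<u64, String> {
    if backwards {
        start
            .checked_sub(offset)
            .ok_or_else(|| format!("segment start {start} precedes backwards offset {offset}"))
    } else {
        start
            .checked_add(offset)
            .ok_or_else(|| "offset addition overflowed".to_string())
    }
}

fn main() {
    assert_eq!(apply_offset(2048, 512, true), Ok(1536));
    assert_eq!(apply_offset(2048, 512, false), Ok(2560));
    assert!(apply_offset(256, 512, true).is_err()); // would underflow
}
```

`checked_sub` makes the invalid-table case unrepresentable rather than relying on debug-build overflow panics.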
@jbaublitz Maybe set_target_device function could just be migrated into set_device method and it and fn apply_offset direction could handle these subtractions uniformly.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```rust
        if backstore_device == self.backstore_device {
            return Ok(false);
        }
```
Don't treat same-device/non-zero-offset calls as idempotent.

Line 1938 ignores `offset` and `offset_direction`, so `set_device()` becomes a no-op whenever the DM device stays the same. That defeats the new contract for offset-only remaps and can break online reencryption flows where the upper device is unchanged but the data start moves.
🛠️ Minimal fix

```diff
-        if backstore_device == self.backstore_device {
+        if backstore_device == self.backstore_device && offset == Sectors(0) {
             return Ok(false);
         }
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
        if backstore_device == self.backstore_device && offset == Sectors(0) {
            return Ok(false);
        }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/engine/strat_engine/thinpool/thinpool.rs` around lines 1938 - 1940, The
early return in set_device() incorrectly treats same backstore_device alone as
idempotent; change the condition to also compare offset and offset_direction so
the function only returns Ok(false) when backstore_device ==
self.backstore_device AND offset == self.backstore_offset AND offset_direction
== self.backstore_offset_direction (or the actual field names used for stored
offset/direction). Update the if that currently checks only backstore_device to
include those additional equality checks so offset-only remaps are not turned
into no-ops.
@jbaublitz I think Code Rabbit's code suggestion is a good one.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
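The corrected early-return guard can be modeled in isolation. The helper below is hypothetical, reducing the condition to its two inputs:

```rust
// Illustrative model of the corrected guard: a call is a no-op only when the
// device is unchanged AND no offset shift was requested.
fn is_noop(same_device: bool, offset_sectors: u64) -> bool {
    same_device && offset_sectors == 0
}

fn main() {
    assert!(is_noop(true, 0));
    // An offset-only remap on the same device must not be skipped.
    assert!(!is_noop(true, 512));
    assert!(!is_noop(false, 0));
}
```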
```rust
self.thin_pool.set_meta_table(get_dm(), meta_table)?;
if let Err(e) = self.thin_pool.set_data_table(get_dm(), data_table) {
    match self.thin_pool.set_meta_table(get_dm(), original_meta.table) {
        Ok(_) => return Err(StratisError::from(e)),
        Err(rollback_e) => {
            return Err(StratisError::RollbackError {
                causal_error: Box::new(StratisError::from(e)),
                rollback_error: Box::new(StratisError::from(rollback_e)),
                level: ActionAvailability::NoPoolChanges,
            })
        }
    }
}
if let Err(e) = self.mdv.set_table(mdv_table) {
    if let Err(rollback_e) = self.thin_pool.set_meta_table(get_dm(), original_meta.table) {
        return Err(StratisError::RollbackError {
            causal_error: Box::new(e),
            rollback_error: Box::new(StratisError::from(rollback_e)),
            level: ActionAvailability::NoPoolChanges,
        });
    }
    match self.thin_pool.set_data_table(get_dm(), original_data.table) {
        Ok(_) => return Err(e),
        Err(rollback_e) => {
            return Err(StratisError::RollbackError {
                causal_error: Box::new(e),
                rollback_error: Box::new(StratisError::from(rollback_e)),
                level: ActionAvailability::NoPoolChanges,
            })
        }
    }
}

// …

for (start, _) in self.segments.mdv_segments.iter_mut() {
    apply_offset(start, offset, offset_direction)?;
}
for (start, _) in self.segments.data_segments.iter_mut() {
    apply_offset(start, offset, offset_direction)?;
}
for (start, _) in self.segments.meta_segments.iter_mut() {
    apply_offset(start, offset, offset_direction)?;
}
for (start, _) in self.segments.meta_spare_segments.iter_mut() {
    apply_offset(start, offset, offset_direction)?;
}

// …

self.backstore_device = backstore_device;
```
Validate segment shifts before touching live DM tables.
apply_offset() can still fail after Lines 1964-1977 have already rewritten the meta/data/MDV tables. On that path the caller gets a plain error even though live mappings have changed, self.backstore_device is still stale, and some segment vectors may already be partially updated. The next check() will then hit the invariant at Lines 1230-1235. Please precompute shifted segment copies first, then commit them only after all table updates succeed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/engine/strat_engine/thinpool/thinpool.rs` around lines 1964 - 2010,
Precompute and validate all segment shifts by creating local copies of the
segment start values and running apply_offset on those copies (e.g., iterate
self.segments.mdv_segments, .data_segments, .meta_segments, .meta_spare_segments
into temp collections and call apply_offset on the temps) before calling
thin_pool.set_meta_table, thin_pool.set_data_table, or self.mdv.set_table; only
after set_meta_table/set_data_table/mdv.set_table all succeed, replace the
actual segment collections with the precomputed shifted copies and then set
self.backstore_device, so that failures in mdv.set_table or set_data_table do
not leave live DM tables partially updated while segment vectors or
backstore_device remain stale.
@coderabbitai Your pre-computation suggestion is probably a good one.
@jbaublitz Let's take Code Rabbit's advice on this one.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
@mulkieran, agreed — here's the concrete fix. The idea is to clone all four segment vectors and validate all the offset arithmetic before any DM table is touched. Only after every set_meta_table / set_data_table / set_table call succeeds do we swap in the pre-computed segment vectors and update self.backstore_device.
/// Set the device on all DM devices
pub fn set_device(
&mut self,
backstore_device: Device,
offset: Sectors,
offset_direction: OffsetDirection,
) -> StratisResult<bool> {
fn apply_offset(
start: &mut Sectors,
offset: Sectors,
offset_direction: OffsetDirection,
) -> StratisResult<()> {
match offset_direction {
OffsetDirection::Forwards => {
*start = start.checked_add(offset).ok_or_else(|| {
StratisError::Msg(format!(
"Allocation shift would overflow integer, start: {start}, offset: {offset}"
))
})?;
}
OffsetDirection::Backwards => {
*start = Sectors(start.checked_sub(*offset).ok_or_else(|| {
StratisError::Msg(format!(
"Allocation shift would underflow integer, start: {start}, offset: {offset}"
))
})?);
}
}
Ok(())
}
if backstore_device == self.backstore_device {
return Ok(false);
}
// Pre-compute all shifted segment copies and validate arithmetic BEFORE
// touching any live DM table. A failure here is a clean error with no
// side-effects.
let mut new_mdv_segments = self.segments.mdv_segments.clone();
let mut new_data_segments = self.segments.data_segments.clone();
let mut new_meta_segments = self.segments.meta_segments.clone();
let mut new_meta_spare_segments = self.segments.meta_spare_segments.clone();
for (start, _) in new_mdv_segments.iter_mut() {
apply_offset(start, offset, offset_direction)?;
}
for (start, _) in new_data_segments.iter_mut() {
apply_offset(start, offset, offset_direction)?;
}
for (start, _) in new_meta_segments.iter_mut() {
apply_offset(start, offset, offset_direction)?;
}
for (start, _) in new_meta_spare_segments.iter_mut() {
apply_offset(start, offset, offset_direction)?;
}
let original_meta = self.thin_pool.meta_dev().table().clone();
let original_data = self.thin_pool.data_dev().table().clone();
let meta_table = linear_table::set_target_device(
self.thin_pool.meta_dev().table(),
backstore_device,
offset,
offset_direction,
);
let data_table = linear_table::set_target_device(
self.thin_pool.data_dev().table(),
backstore_device,
offset,
offset_direction,
);
let mdv_table = linear_table::set_target_device(
self.mdv.device().table(),
backstore_device,
offset,
offset_direction,
);
self.thin_pool.set_meta_table(get_dm(), meta_table)?;
if let Err(e) = self.thin_pool.set_data_table(get_dm(), data_table) {
match self.thin_pool.set_meta_table(get_dm(), original_meta.table) {
Ok(_) => return Err(StratisError::from(e)),
Err(rollback_e) => {
return Err(StratisError::RollbackError {
causal_error: Box::new(StratisError::from(e)),
rollback_error: Box::new(StratisError::from(rollback_e)),
level: ActionAvailability::NoPoolChanges,
})
}
}
}
if let Err(e) = self.mdv.set_table(mdv_table) {
if let Err(rollback_e) = self.thin_pool.set_meta_table(get_dm(), original_meta.table) {
return Err(StratisError::RollbackError {
causal_error: Box::new(e),
rollback_error: Box::new(StratisError::from(rollback_e)),
level: ActionAvailability::NoPoolChanges,
});
}
match self.thin_pool.set_data_table(get_dm(), original_data.table) {
Ok(_) => return Err(e),
Err(rollback_e) => {
return Err(StratisError::RollbackError {
causal_error: Box::new(e),
rollback_error: Box::new(StratisError::from(rollback_e)),
level: ActionAvailability::NoPoolChanges,
})
}
}
}
// All DM table updates succeeded; commit the pre-computed segment
// vectors and the new backstore device atomically from the caller's
// perspective.
self.segments.mdv_segments = new_mdv_segments;
self.segments.data_segments = new_data_segments;
self.segments.meta_segments = new_meta_segments;
self.segments.meta_spare_segments = new_meta_spare_segments;
self.backstore_device = backstore_device;
Ok(true)
}

Key changes:

- All four segment vectors are cloned and every `apply_offset` call is executed before any DM table is touched — arithmetic errors are surfaced cleanly with no side-effects.
- The original four `for … apply_offset` loops at the bottom of the function are removed.
- After all three DM table updates succeed, the pre-computed vectors replace the originals and `self.backstore_device` is updated in one logical step.
🧠 Learnings used
Learnt from: mulkieran
Repo: stratis-storage/stratisd PR: 3953
File: src/engine/strat_engine/backstore/blockdev/v1.rs:410-415
Timestamp: 2026-03-20T20:00:36.372Z
Learning: In `src/engine/strat_engine/backstore/blockdev/v1.rs`, the `setup_reencrypt` and `do_reencrypt` methods intentionally use `.expect("Checked that pool is encrypted")` on `crypt_handle()` rather than returning a `StratisError`. The crypt handle is obtained programmatically and all block devices in the pool are guaranteed to be encrypted before these methods are called, making the `expect` a valid programmer assertion rather than a recoverable runtime error.
Learnt from: mulkieran
Repo: stratis-storage/stratisd PR: 3953
File: src/engine/strat_engine/keys.rs:35-41
Timestamp: 2026-03-20T23:01:09.032Z
Learning: In Rust code that uses `Result<Option<T>>`, do not rely on the `?` operator to handle `Ok(None)` implicitly. `?` will treat `Ok(None)` as a successful value and may incorrectly skip required error handling. Instead, explicitly match on `Option` (e.g., `match res? { Some(v) => ..., None => return Err(/* missing key */) }`) so missing-key cases fail fast when they should.
Learnt from: jbaublitz
Repo: stratis-storage/stratisd PR: 3651
File: src/dbus_api/pool/pool_3_9/methods.rs:173-176
Timestamp: 2025-08-20T02:37:22.477Z
Learning: The encrypt_pool, reencrypt_pool, and decrypt_pool methods in src/dbus_api/pool/pool_3_9/methods.rs have intentionally different lock handling patterns due to different workflow requirements, and this is not an inconsistency that needs to be fixed.
```rust
WaitType::Upgrade(uuid) => {
    self.read_locked.get(uuid).unwrap_or(&0) > &1
        || self.write_locked.contains(uuid)
        || self.all_read_locked > 0
        || self.all_write_locked
}
```
Concurrent upgrades on the same UUID will deadlock.
Each upgrade keeps its read guard alive while waiting, but this branch only proceeds once the per-UUID read count drops to 1. If two readers on the same pool both call upgrade(), both futures park with the count stuck at 2 and no task can make progress.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/engine/structures/lock.rs` around lines 355 - 360, The Upgrade branch
deadlocks when multiple readers on the same UUID call upgrade() because each
upgrader's held read guard counts toward read_locked; fix by tracking per-UUID
pending upgraders and excluding them from the blocking reader count: add a field
like upgrade_waiters: HashMap<Uuid, usize> (increment in upgrade(), decrement
when the upgrade wait ends), then in the WaitType::Upgrade check compute
effective_readers =
read_locked.get(uuid).unwrap_or(&0).saturating_sub(upgrade_waiters.get(uuid).unwrap_or(&0))
and use effective_readers > 1 (or == 0/1 depending on current logic) instead of
raw read_locked; also ensure wake/cleanup paths update upgrade_waiters
consistently so only non-upgrader readers block upgrades.
@coderabbitai I believe that this should induce a hang when two re-encryption operations on the same pool are scheduled to happen, but I can not induce a hang.
🧩 Analysis chain
Let me check whether the higher-level start_reencrypt_pool logic prevents two concurrent upgrade attempts from both reaching the lock layer.
🏁 Script executed:

```shell
#!/bin/bash
# Check start_reencrypt_pool implementations in v1 and v2 pool files
rg -n "start_reencrypt_pool" --type rust -A 20
```

Length of output: 12296
@mulkieran, you're right that a hang cannot be induced, and here's the reason why: the method's lock sequence prevents two upgraders from ever co-existing.
Looking at `reencrypt_pool_method` (lines 156–165 in `src/dbus_api/pool/pool_3_9/methods.rs`), the full sequence is:
1. Acquire a write guard (`engine.get_mut_pool(...)`)
2. Call `start_reencrypt_pool()` (write guard held)
3. Downgrade to a read guard (`guard.downgrade()`)
4. Call `do_reencrypt_pool(...)` (read guard held — the long-running step)
5. Upgrade back to a write guard (`upgrade_pool(guard)`)
6. Call `finish_reencrypt_pool(...)`
The key is step 1. A second concurrent request for the same pool UUID will block at step 1 waiting for the write lock — it never holds a read guard of its own during this time. By the time it eventually acquires the write lock and downgrades to a read guard, Task A has already fully completed and released. So when upgrade_pool is called at step 5, read_locked[uuid] is always exactly 1 (only the caller's own downgraded read guard), satisfying > &1 == false, and the upgrade proceeds without waiting.
The deadlock scenario I flagged — two tasks simultaneously holding read guards and both calling upgrade() — would require two tasks to each independently get through step 3 at the same time, which the serializing write lock at step 1 prevents entirely. My original comment overstated the risk for this specific usage pattern.
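The admission rule under discussion can be modeled as a standalone predicate. This is a simplified, hypothetical sketch (not the real `LockRecord`) of the upgrade branch of `should_wait`:

```rust
// Toy model: an upgrade must wait unless the caller's own read guard is the
// sole per-UUID reader and no writer holds or spans the lock.
fn upgrade_should_wait(
    per_uuid_readers: usize,
    uuid_write_locked: bool,
    all_read_locked: usize,
    all_write_locked: bool,
) -> bool {
    // The upgrader still holds its own read guard, so a count of exactly 1
    // means "no other reader"; anything else forces the upgrade to wait.
    per_uuid_readers > 1 || uuid_write_locked || all_read_locked > 0 || all_write_locked
}

fn main() {
    // Serialized flow: the caller's downgraded guard is the sole reader,
    // so the upgrade proceeds immediately.
    assert!(!upgrade_should_wait(1, false, 0, false));
    // Two readers on the same UUID both waiting is the hypothesized deadlock
    // state; the serializing write lock at step 1 prevents it from arising.
    assert!(upgrade_should_wait(2, false, 0, false));
}
```

Under the serialized entry point described above, only the first assertion's state can actually occur at upgrade time.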
🧠 Learnings used
Learnt from: jbaublitz
Repo: stratis-storage/stratisd PR: 3651
File: src/dbus_api/pool/pool_3_9/methods.rs:173-176
Timestamp: 2025-08-20T02:37:22.477Z
Learning: The encrypt_pool, reencrypt_pool, and decrypt_pool methods in src/dbus_api/pool/pool_3_9/methods.rs have intentionally different lock handling patterns due to different workflow requirements, and this is not an inconsistency that needs to be fixed.
Learnt from: mulkieran
Repo: stratis-storage/stratisd PR: 3953
File: src/engine/strat_engine/backstore/blockdev/v1.rs:410-415
Timestamp: 2026-03-20T20:00:36.372Z
Learning: In `src/engine/strat_engine/backstore/blockdev/v1.rs`, the `setup_reencrypt` and `do_reencrypt` methods intentionally use `.expect("Checked that pool is encrypted")` on `crypt_handle()` rather than returning a `StratisError`. The crypt handle is obtained programmatically and all block devices in the pool are guaranteed to be encrypted before these methods are called, making the `expect` a valid programmer assertion rather than a recoverable runtime error.
Learnt from: mulkieran
Repo: stratis-storage/stratisd PR: 3953
File: src/engine/strat_engine/keys.rs:35-41
Timestamp: 2026-03-20T23:01:09.032Z
Learning: In Rust code that uses `Result<Option<T>>`, do not rely on the `?` operator to handle `Ok(None)` implicitly. `?` will treat `Ok(None)` as a successful value and may incorrectly skip required error handling. Instead, explicitly match on `Option` (e.g., `match res? { Some(v) => ..., None => return Err(/* missing key */) }`) so missing-key cases fail fast when they should.
@jbaublitz I can not induce a deadlock, but I think this is worth checking into.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```rust
let wait_type = WaitType::Upgrade(self.1 .1);
let poll = if lock_record.should_wait(&wait_type, self.2) {
    lock_record.add_waiter(
        &AtomicBool::new(true),
        wait_type,
        cxt.waker().clone(),
        self.2,
    );
    Poll::Pending
} else {
    lock_record.add_write_lock(self.1 .1, Some(self.2));
    let (_, rf) = unsafe { inner.get().as_mut() }
        .expect("cannot be null")
        .get_mut_by_uuid(self.1 .1)
        .expect("Checked above");
    Poll::Ready(SomeLockWriteGuard(
        Arc::clone(&self.1 .0),
        self.1 .1,
        self.1 .2.clone(),
        rf,
        true,
    ))
```
Don't revalidate a woken upgrade as SomeWrite.
A queued upgrade is recorded in woken as WaitType::Upgrade, but this path goes through add_write_lock(..., Some(idx)), which validates it as WaitType::SomeWrite. The first contended upgrade that gets woken will hit the assert_eq! in woken_or_new() and panic.
Suggested fix

```diff
- lock_record.add_write_lock(self.1 .1, Some(self.2));
+ lock_record.add_upgrade_lock(self.1 .1, self.2);
```

```rust
fn add_upgrade_lock(&mut self, uuid: U, idx: u64) {
    self.write_locked.insert(uuid);
    self.pre_acquire_assertion(&WaitType::Upgrade(uuid), idx);
    trace!("Lock record after acquisition: {self}");
}
```
Verify each finding against the current code and only fix it if needed.
In `@src/engine/structures/lock.rs` around lines 667 - 688, The upgrade wake path
currently calls add_write_lock which revalidates the waiter as
WaitType::SomeWrite and causes a panic in woken_or_new; instead, add a dedicated
add_upgrade_lock method (e.g., add_upgrade_lock(&mut self, uuid: U, idx: u64))
that inserts the uuid into write_locked and calls
pre_acquire_assertion(&WaitType::Upgrade(uuid), idx), and replace the call to
add_write_lock(...) in the Upgrade wake branch (the branch using
WaitType::Upgrade and self.1 .1) with add_upgrade_lock(self.1 .1, self.2) so the
woken entry stays validated as Upgrade rather than SomeWrite.
@coderabbitai Which invocation instance of woken_or_new() do you think could panic?
@mulkieran, the panic would occur in the woken_or_new() call made transitively through add_write_lock, on a contended upgrade (one that was first put to sleep, then woken). The call chain is:
```text
Upgrade::poll (else branch)
  → add_write_lock(self.1.1, Some(self.2))
    → pre_acquire_assertion(&WaitType::SomeWrite(uuid), idx)
      → woken_or_new(Some(&WaitType::SomeWrite(uuid)), idx)
        // ↑ panics here via assert_eq!
```
Here's the reasoning step by step:
1. The `Upgrade` future has to wait (there's a conflicting read lock from another task), so it gets pushed to `waiting` as `WaitType::Upgrade(uuid)`.
2. The conflict is released; `wake()` moves it from `waiting` to `woken` — stored as `WaitType::Upgrade(uuid)`.
3. On the next `poll()`, `should_wait()` returns `false` because `woken.contains_key(&idx)` is true, so execution falls into the else (acquire) branch.
4. That branch calls `add_write_lock(self.1.1, Some(self.2))`, which internally calls `pre_acquire_assertion(&WaitType::SomeWrite(uuid), idx)`.
5. `pre_acquire_assertion` calls `woken_or_new(Some(&WaitType::SomeWrite(uuid)), idx)`.
6. Inside `woken_or_new`, `woken.contains_key(&idx)` is `true`, so it removes the entry (which is `WaitType::Upgrade(uuid)`) and asserts:

```rust
assert_eq!(Some(&WaitType::Upgrade(uuid)), Some(&WaitType::SomeWrite(uuid)));
// ^^^ PANIC: mismatch
```
On a non-contended first-pass acquisition (idx never in woken), woken_or_new returns early without reaching the assert_eq!, so no panic occurs there. The bug only surfaces when an upgrade has to wait and is then woken.
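The mismatch can be reproduced in miniature. The enum and function below are stand-ins (hypothetical names, not the actual lock module):

```rust
use std::panic;

#[derive(Debug, PartialEq)]
enum WaitType {
    Upgrade(u32),
    SomeWrite(u32),
}

// Stand-in for woken_or_new's check: the queued entry must match the wait
// type the acquiring path claims.
fn revalidate(stored: WaitType, claimed: WaitType) {
    assert_eq!(stored, claimed);
}

fn main() {
    // Matching types pass, as on an ordinary write acquisition.
    revalidate(WaitType::SomeWrite(1), WaitType::SomeWrite(1));
    // An entry queued as Upgrade but revalidated as SomeWrite panics, as in
    // the contended-upgrade case described above.
    let result = panic::catch_unwind(|| revalidate(WaitType::Upgrade(1), WaitType::SomeWrite(1)));
    assert!(result.is_err());
}
```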
🧠 Learnings used
Learnt from: mulkieran
Repo: stratis-storage/stratisd PR: 3953
File: src/engine/strat_engine/keys.rs:35-41
Timestamp: 2026-03-20T23:01:09.032Z
Learning: In Rust code that uses `Result<Option<T>>`, do not rely on the `?` operator to handle `Ok(None)` implicitly. `?` will treat `Ok(None)` as a successful value and may incorrectly skip required error handling. Instead, explicitly match on `Option` (e.g., `match res? { Some(v) => ..., None => return Err(/* missing key */) }`) so missing-key cases fail fast when they should.
Learnt from: mulkieran
Repo: stratis-storage/stratisd PR: 3953
File: src/engine/strat_engine/backstore/blockdev/v1.rs:410-415
Timestamp: 2026-03-20T20:00:36.372Z
Learning: In `src/engine/strat_engine/backstore/blockdev/v1.rs`, the `setup_reencrypt` and `do_reencrypt` methods intentionally use `.expect("Checked that pool is encrypted")` on `crypt_handle()` rather than returning a `StratisError`. The crypt handle is obtained programmatically and all block devices in the pool are guaranteed to be encrypted before these methods are called, making the `expect` a valid programmer assertion rather than a recoverable runtime error.
Learnt from: jbaublitz
Repo: stratis-storage/stratisd PR: 3651
File: src/dbus_api/pool/pool_3_9/methods.rs:173-176
Timestamp: 2025-08-20T02:37:22.477Z
Learning: The encrypt_pool, reencrypt_pool, and decrypt_pool methods in src/dbus_api/pool/pool_3_9/methods.rs have intentionally different lock handling patterns due to different workflow requirements, and this is not an inconsistency that needs to be fixed.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```rust
return (
    default_return,
    DbusErrorEnum::ERROR as u16,
    "No unlock methods provided".to_string(),
```
This reads a little oddly as an error string. Could you try something like "No encryption methods provided"?
```rust
    search_key(keyring_id, key_desc.to_system_string())
}

// …

/// Validate that all key descriptions
```
This is just a fragmentary sentence...
```rust
fn serialize_date_time<S>(
    timestamp: &Option<DateTime<Utc>>,
    serializer: S,
) -> Result<S::Ok, S::Error>
where
    S: Serializer,
{
    serializer.serialize_i64(timestamp.expect("is some").timestamp())
}
```
The fact that it is logically entailed by our code justifies the expect. It is just the name that makes it seem more general. The function is private to this module.
```rust
pub fn reencrypt(
    &self,
    pool_uuid: PoolUuid,
    key_info: Vec<(u32, SizedKeyMemory, u32)>,
) -> StratisResult<()> {
    for (bd, (slot, key, new_slot)) in self
        .blockdevs()
        .into_iter()
        .map(|(_, _, bd)| bd)
        .zip(key_info)
    {
        bd.do_reencrypt(pool_uuid, slot, key, new_slot)?;
    }
    Ok(())
```
Ok. Let's do this one. The error hypothesized seems very unlikely, but the extra error return is very little additional trouble. If we're convinced the lengths must be the same, an assertion is the appropriate thing.
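The concern here rests on a property of `Iterator::zip`: it silently truncates to the shorter input, so a blockdev/key_info length mismatch would silently skip devices rather than fail. A small illustrative sketch (hypothetical helper, not stratisd code):

```rust
// zip stops at the shorter of the two iterators; no error, no panic.
fn paired_count(blockdevs: &[&str], key_info: &[(u32, u32)]) -> usize {
    blockdevs.iter().zip(key_info.iter()).count()
}

fn main() {
    let bds = ["bd0", "bd1", "bd2"];
    let keys = [(0, 1), (2, 3)]; // one entry short
    // Only two pairs are produced; "bd2" is silently skipped.
    assert_eq!(paired_count(&bds, &keys), 2);
    // An up-front length assertion surfaces the broken invariant instead
    // of letting the loop quietly under-iterate.
    assert!(bds.len() != keys.len());
}
```

This is why an explicit assertion (or error return) on the lengths makes the invariant visible rather than relying on `zip`'s truncation.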
573a13c to 2d1689f Compare
Follow up on #3898
Closes #3961
Closes #3597
Summary by CodeRabbit
New Features

- `last_reencrypted_timestamp` property to track encryption status changes on pools.

Dependencies