Update LinuxIntelRdt to align with runtime-spec#3523
Conversation
Signed-off-by: tommady <tommady@users.noreply.github.com>
Signed-off-by: tommady <tommady@users.noreply.github.com>
Signed-off-by: tommady <tommady@users.noreply.github.com>
Signed-off-by: tommady <tommady@users.noreply.github.com>
…sctrl_subdirectory Signed-off-by: tommady <tommady@users.noreply.github.com>
Signed-off-by: tommady <tommady@users.noreply.github.com>
|
Hi @nayuta723 |
|
@tommady
|
|
Hey @nayuta723
Yes, absolutely. We should definitely get the
I am sorry, I don't have L3 CAT resctrl supported hardware right now. |
There was a problem hiding this comment.
Since the scope of this PR seems quite broad, I’ve left comments on the high-level points for now. For intelRdt, even excluding enable_monitoring, we still need changes to align with v1.3.0 and additional tests. This will likely make the PR quite large, so I think it might be a good idea to split it.
Also, as an unreleased side effect, some changes to seccomp have slipped in. It might be better to cut a release before we proceed with reviewing this PR.
Lastly, since you mentioned that real-world testing hasn't been done yet, I'd like to see more rigorous tests added to the PR context. Also, it looks like the tests are being skipped in CI as well, so we might need to start by setting up a working environment. Could you check how runc handles this to see what they do?
| Ok(compare_lines(&combined, &existing)) | ||
| } | ||
|
|
||
| /// Retrieves the schemata data to be written to the resctrl filesystem. |
There was a problem hiding this comment.
If both l3CacheSchema and memBwSchema are set, runtimes MUST write the values to the schemata file in that sub-directory discussed in closID. The runtimes MUST write l3CacheSchema first and memBwSchema last.
If either l3CacheSchema or memBwSchema is set, runtimes MUST write the value to the schemata file in the that sub-directory discussed in closID.
If schemata field is set, runtimes MUST write the value to the schemata file in the that sub-directory discussed in closID. If also l3CacheSchema or memBwSchema is set the value of schemata field must be written last, after the values from l3CacheSchema and memBwSchema has been written.
Please fix this, as it behaves differently from what is described above.
There was a problem hiding this comment.
Good catch!
I misunderstood the specification regarding how the legacy fields should interact with the new schemata list.
Currently, get_schemata_data returns early if the schemata list is present, completely ignoring l3CacheSchema and memBwSchema.
I will update the logic to combine all of them in the strict order mandated by the spec (L3 first, then MB, and finally the generic schemata list elements).
https://specs.opencontainers.org/runtime-spec/config-linux/?v=v1.3.0
If schemata field is set, runtimes MUST write the value to the schemata file in the that sub-directory discussed in closID.
If also l3CacheSchema or memBwSchema is set the value of schemata field must be written last, after the values from l3CacheSchema and memBwSchema has been written.
There was a problem hiding this comment.
This configuration is out of scope for this PR, so I'd like to handle it in a separate PR.
There was a problem hiding this comment.
Agreed, I will revert the changes.
| pid_file: self.pid_file.to_owned(), | ||
| }; | ||
|
|
||
| let (init_pid, need_to_clean_up_intel_rdt_dir) = |
There was a problem hiding this comment.
What is the motivation behind changing the return value here?
There was a problem hiding this comment.
My motivation for changing the return value to pass intel_rdt_dir and intel_rdt_monitoring_dir and storing these paths directly in state.json was to closely mirror how runc implements this.
runc explicitly stores both IntelRdtPath and IntelRdtMonPath directly in its container state object, and relies on those saved paths when it needs to clean up the directories during teardown.
You can see runc storing the paths in the state here:
https://github.com/opencontainers/runc/blob/main/libcontainer/container_linux.go#L79
| } | ||
|
|
||
| if let Some(path) = container.intel_rdt_dir() { | ||
| if let Err(e) = delete_resctrl_subdirectory(path) { |
There was a problem hiding this comment.
Since it throws an error when the file doesn't exist, it might be better not to return an error for ENOENT.
There was a problem hiding this comment.
I will update the logic to handle the ErrorKind::NotFound case gracefully so that if the directory is already gone, it silently succeeds instead of throwing an error during teardown.
| /// This function includes a safety check (`parent == dir`) to ensure that only directories | ||
| /// immediately under the resctrl mount point are deleted, preventing accidental deletion | ||
| /// of nested kernel files or other sensitive directories. | ||
| pub fn delete_resctrl_subdirectory(path: &Path) -> Result<()> { |
There was a problem hiding this comment.
There doesn't seem to be a strong motivation to switch to path here.
| pub fn delete_resctrl_subdirectory(path: &Path) -> Result<()> { | |
| pub fn delete_resctrl_subdirectory(id: &str) -> Result<()> { |
There was a problem hiding this comment.
Like the above.
#3523 (comment)
runc explicitly stores both the IntelRdtPath and IntelRdtMonPath directly in its linuxContainer state structure.
Reconstructing the exact paths dynamically during container deletion is complex (it requires re-parsing the spec for closID, container_id, and enable_monitoring), and storing the literal paths ensures the teardown phase always knows exactly which directories it created and needs to remove, matching the reference implementation's behavior.
| pub use_systemd: bool, | ||
| // Specifies if the Intel RDT subdirectory needs be cleaned up. | ||
| pub clean_up_intel_rdt_subdirectory: Option<bool>, | ||
| // Specifies the Intel RDT subdirectory path that needs to be cleaned up. |
There was a problem hiding this comment.
The logic here implies that this path exists when a deletion is required. Could you rename it to something that makes this intent clearer? Also, what was the motivation behind using a path instead of a bool as the state's responsibility?
There was a problem hiding this comment.
To answer the second part first regarding the motivation for using a path instead of a boolean flag:
This was done to match runc's approach. runc stores the explicit filesystem paths (IntelRdtPath and IntelRdtMonPath) directly in its linuxContainer state structure.
Since youki aims for compatibility, saving the explicit paths avoids complex dynamic reconstruction of those paths during container deletion (which requires re-evaluating the spec, closID, etc.).
Regarding the naming: I just followed how runc named them (intel_rdt_path and intel_rdt_mon_path).
| } | ||
| } | ||
|
|
||
| let mut config_opt = None; |
There was a problem hiding this comment.
What is the motivation behind changing config_opt? Personally, I don't see the necessity for this change.
There was a problem hiding this comment.
The motivation behind introducing config_opt was simply to decouple the Intel RDT cleanup logic from the YoukiConfig processing block.
I wanted to group the RDT teardown operations clearly before the cgroup and hooks teardown, similar to runc's teardown order.
| @@ -33,16 +35,46 @@ fn create_spec( | |||
|
|
|||
| pub fn test_intel_rdt() -> TestResult { | |||
There was a problem hiding this comment.
Since the scope of this PR is strictly for v1.3.0 support, could you please add a note to the PR context describing the other behavioral changes (outside of enable_monitoring) and what the ideal behavior for v1.3.0 should look like?
There was a problem hiding this comment.
Looking at the code, it seems necessary to include support for schemata.
| @@ -10,11 +10,13 @@ fn create_spec( | |||
| maybe_l3_cache: Option<&str>, | |||
| maybe_mem_bw: Option<&str>, | |||
| maybe_clos_id: Option<&str>, | |||
There was a problem hiding this comment.
Regarding maybe_clos_id: Option<&str>, do we actually need this arg in create_spec under this context?
There was a problem hiding this comment.
You're right, I will remove the maybe_clos_id argument from create_spec to simplify the test setup.
|
@utam0k |
There are no blockers at the moment. Could you please open the PR for the new release? |
I understand the concern regarding the scope of the PR and the changes to ContainerState. To address this and ensure existing deployments are unaffected, I will put all the new OCI v1.3.0 intel_rdt features (including enable_monitoring, the schemata list handling, and the path caching in ContainerState which aligns us with runc) behind a new Cargo feature flag (e.g., OCI_v1_3_0). WDYT? Regarding the real-world testing and CI environment, I investigated how runc handles this since it's the reference implementation. It turns out that runc intentionally does not run any Intel RDT integration tests in CI or against a real kernel (you can check their tests/integration/ folder, there are zero RDT tests). Instead, runc solely relies on unit tests that mock the /sys/fs/resctrl filesystem by creating a temporary directory (t.TempDir()) and writing plain text files to simulate the kernel interface (e.g., in libcontainer/intelrdt/util_test.go). This is because the Linux kernel strictly enforces physical hardware capabilities (via MSRs) when interacting with the real /sys/fs/resctrl. If the host doesn't physically have Xeon L3 CAT or MBA hardware (which GitHub Actions VMs and most consumer machines do not), the kernel outright rejects the hardcoded L3 or MB schema writes, causing integration tests to fail. That being said, I completely agree we need real-world validation for a change this large. Once the review is complete and the test cases and code changes are all approved, I am willing to manually provision a supported bare-metal instance such as a GCP Intel C3, AMD C3D, or Google Axion ARM instance to prove the implementation works end-to-end against a real kernel. Alternatively, if you agree that we should follow runc's approach of mocking the filesystem entirely for unit testing instead of relying on integration tests, I am happy to implement that as well ( to save my money lol ). |
Description
update the RDT (Resource DesAllocation Technology) implementation to align with runtime-spec v1.3.0
Type of Change
Testing
Related Issues
Fixes #3405
Additional Context