NRI Resctrl Plugin: Events, Counters, Retries, and Cleanup
Overview
The NRI resctrl plugin manages per-pod resctrl groups under /sys/fs/resctrl
and assigns container tasks into the appropriate pod group. It emits events to report pod group state and reconciliation progress.
Event Model
- AddOrUpdate
- Payload:
{ pod_uid, group_state, total_containers, reconciled_containers }
group_state
:Exists(path)
: resctrl pod group exists atpath
Failed
: group creation failed (e.g., ENOSPC/RMID exhaustion)
- Removed
- Payload:
{ pod_uid }
- Emitted when a pod is removed; the plugin deletes its resctrl group (best effort).
Events are emitted on:
- Initial synchronize: one AddOrUpdate
per pod (initially reporting total_containers = 0
and reconciled_containers = 0
), followed by an event whenever either counter changes.
- Pod sandbox lifecycle events (RUN_POD_SANDBOX
, REMOVE_POD_SANDBOX
).
- Container create/update/remove transitions that change the counters or the pod group state.
Per-Pod Counters
total_containers
: number of known containers for the podreconciled_containers
: number of containers whose PIDs have been assigned to the pod group
Counters increment whenever a container becomes known, increment again when it reconciles successfully, and decrement on removal.
Retries
retry_group_creation(pod_uid)
- Attempts to create the failed group again
- On success, transitions
group_state
toExists(path)
and emits AddOrUpdate - On capacity error (ENOSPC), returns
Error::Capacity
and emits no event retry_container_reconcile(container_id)
- Re-runs PID assignment for a specific container
- Emits AddOrUpdate only if the container transitions to Reconciled (improving counts)
retry_all_once()
- Attempts a single pass across all failed pods and partial containers
- Stops group-creation retries on the first capacity error encountered in this pass
Cleanup Behavior
- On startup synchronize, when
cleanup_on_start=true
and resctrl is mounted: - Removes only resctrl groups at the root that start with the configured
group_prefix
- Removes only top-level
mon_groups
under the resctrl root that start with the prefix - Does not traverse into per-group
mon_groups
- Emits no pod events for cleanup-only activity
Testing and CI
- Mocked unit/integration tests (default):
- Use an in-memory filesystem provider and test PID source for determinism
- Cover startup cleanup, synchronize counts, late-container reconcile, sandbox lifecycle, capacity failures, and retry flows
- Hardware/KIND E2E (optional):
- Run on an EC2 runner with a KIND cluster configured for NRI
- Validates that groups are created, tasks are assigned, counts improve, and cleanup occurs on pod removal
- Tests are
#[ignore]
by default; CI enables them withcargo test -- --ignored
- CI wiring uses existing
.github/workflows/test-resctrl.yaml
to build, run mocked tests on GitHub-hosted runners, and execute the KIND + hardware jobs when the EC2 runner is available.
Test Inventory and Deltas
src/lib.rs
mocked tests (existing) cover auto-mount true/false, startup cleanup, retry flows, and now verify PID assignment vialist_group_tasks
in addition to event sequencing.tests/integration_test.rs::test_plugin_full_flow
(new) drives a live KIND cluster: pre/post registration pods,kubectl debug
ephemeral container addition, and pod removal cleanup.src/lib.rs::test_cleanup_on_start_removes_only_prefix
(existing) covers startup cleanup with a mocked filesystem.tests/integration_test.rs::test_startup_cleanup_e2e
(new) validatescleanup_on_start
behavior against the real resctrl filesystem.tests/integration_test.rs::test_capacity_retry_e2e
(new) exercises RMID exhaustion, retry flows, and PID verification on hardware.
Runbooks
Local (mocked) tests
cargo test -p nri-resctrl-plugin -- --nocapture
Hardware E2E on a resctrl-capable host
1) Ensure /sys/fs/resctrl
is mountable and you have CAP_SYS_ADMIN
2) Create a KIND cluster with NRI enabled and host mounts:
kind create cluster --config - <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraMounts:
- hostPath: /tmp/nri
containerPath: /var/run/nri
- hostPath: /tmp/nri-conf
containerPath: /etc/nri/conf.d
- hostPath: /tmp/nri-plugins
containerPath: /opt/nri/plugins
containerdConfigPatches:
- |-
[plugins."io.containerd.nri.v1.nri"]
disable = false
disable_connections = false
socket_path = "/var/run/nri/nri.sock"
plugin_config_path = "/etc/nri/conf.d"
plugin_path = "/opt/nri/plugins"
EOF
3) Build and run integration tests against the KIND NRI socket:
export NRI_SOCKET_PATH=/tmp/nri/nri.sock
cargo test -p nri-resctrl-plugin --test integration_test -- --ignored --nocapture
The test test_plugin_full_flow
creates pods before and after plugin registration, exercises an ephemeral container via kubectl debug
(when available), and deletes pods to validate Removed
events and cleanup.