NRI Resctrl Plugin: Events, Counters, Retries, and Cleanup
Overview
The NRI resctrl plugin manages per-pod resctrl groups under /sys/fs/resctrl and assigns container tasks into the appropriate pod group. It emits events to report pod group state and reconciliation progress.
Event Model
- AddOrUpdate
- Payload:
{ pod_uid, group_state, total_containers, reconciled_containers } group_state:Exists(path): resctrl pod group exists atpathFailed: group creation failed (e.g., ENOSPC/RMID exhaustion)
- Removed
- Payload:
{ pod_uid } - Emitted when a pod is removed; the plugin deletes its resctrl group (best effort).
Events are emitted on:
- Initial synchronize: one AddOrUpdate per pod (initially reporting total_containers = 0 and reconciled_containers = 0), followed by an event whenever either counter changes.
- Pod sandbox lifecycle events (RUN_POD_SANDBOX, REMOVE_POD_SANDBOX).
- Container create/update/remove transitions that change the counters or the pod group state.
Per-Pod Counters
total_containers: number of known containers for the podreconciled_containers: number of containers whose PIDs have been assigned to the pod group
Counters increment whenever a container becomes known, increment again when it reconciles successfully, and decrement on removal.
Retries
retry_group_creation(pod_uid)- Attempts to create the failed group again
- On success, transitions
group_statetoExists(path)and emits AddOrUpdate - On capacity error (ENOSPC), returns
Error::Capacityand emits no event retry_container_reconcile(container_id)- Re-runs PID assignment for a specific container
- Emits AddOrUpdate only if the container transitions to Reconciled (improving counts)
retry_all_once()- Attempts a single pass across all failed pods and partial containers
- Stops group-creation retries on the first capacity error encountered in this pass
Cleanup Behavior
- On startup synchronize, when
cleanup_on_start=trueand resctrl is mounted: - Removes only resctrl groups at the root that start with the configured
group_prefix - Removes only top-level
mon_groupsunder the resctrl root that start with the prefix - Does not traverse into per-group
mon_groups - Emits no pod events for cleanup-only activity
Testing and CI
- Mocked unit/integration tests (default):
- Use an in-memory filesystem provider and test PID source for determinism
- Cover startup cleanup, synchronize counts, late-container reconcile, sandbox lifecycle, capacity failures, and retry flows
- Hardware/KIND E2E (optional):
- Run on an EC2 runner with a KIND cluster configured for NRI
- Validates that groups are created, tasks are assigned, counts improve, and cleanup occurs on pod removal
- Tests are
#[ignore]by default; CI enables them withcargo test -- --ignored - CI wiring uses the orchestrator
.github/workflows/ci.yaml, which invokes the reusable workflow.github/workflows/reusable-test-resctrl.yamlto run mocked tests on GitHub-hosted runners and execute the hardware/K3s jobs on EC2 when enabled.
Test Inventory and Deltas
src/lib.rsmocked tests (existing) cover auto-mount true/false, startup cleanup, retry flows, and now verify PID assignment vialist_group_tasksin addition to event sequencing.tests/integration_test.rs::test_plugin_full_flow(new) drives a live KIND cluster: pre/post registration pods,kubectl debugephemeral container addition, and pod removal cleanup.src/lib.rs::test_cleanup_on_start_removes_only_prefix(existing) covers startup cleanup with a mocked filesystem.tests/integration_test.rs::test_startup_cleanup_e2e(new) validatescleanup_on_startbehavior against the real resctrl filesystem.tests/integration_test.rs::test_capacity_retry_e2e(new) exercises RMID exhaustion, retry flows, and PID verification on hardware.
Runbooks
Local (mocked) tests
cargo test -p nri-resctrl-plugin -- --nocapture
Hardware E2E on a resctrl-capable host
1) Ensure /sys/fs/resctrl is mountable and you have CAP_SYS_ADMIN
2) Create a KIND cluster with NRI enabled and host mounts:
kind create cluster --config - <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraMounts:
- hostPath: /tmp/nri
containerPath: /var/run/nri
- hostPath: /tmp/nri-conf
containerPath: /etc/nri/conf.d
- hostPath: /tmp/nri-plugins
containerPath: /opt/nri/plugins
containerdConfigPatches:
- |-
[plugins."io.containerd.nri.v1.nri"]
disable = false
disable_connections = false
socket_path = "/var/run/nri/nri.sock"
plugin_config_path = "/etc/nri/conf.d"
plugin_path = "/opt/nri/plugins"
EOF
3) Build and run integration tests against the KIND NRI socket:
export NRI_SOCKET_PATH=/tmp/nri/nri.sock
cargo test -p nri-resctrl-plugin --test integration_test -- --ignored --nocapture
The test test_plugin_full_flow creates pods before and after plugin registration, exercises an ephemeral container via kubectl debug (when available), and deletes pods to validate Removed events and cleanup.