NRI Resctrl Plugin: Events, Counters, Retries, and Cleanup

Overview

The NRI resctrl plugin manages per-pod resctrl groups under /sys/fs/resctrl and assigns container tasks into the appropriate pod group. It emits events to report pod group state and reconciliation progress.

Event Model

  • AddOrUpdate
    • Payload: { pod_uid, group_state, total_containers, reconciled_containers }
    • group_state:
      • Exists(path): resctrl pod group exists at path
      • Failed: group creation failed (e.g., ENOSPC/RMID exhaustion)
  • Removed
    • Payload: { pod_uid }
    • Emitted when a pod is removed; the plugin deletes its resctrl group (best effort).
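
As a rough illustration, the event model above could be expressed with Rust types along these lines. This is a sketch, not the plugin's actual definitions: the type names (PodEvent, AddOrUpdate, GroupState), the use of PathBuf and usize, and the helper is_fully_reconciled are assumptions made for the example; only the field and variant names come from the payload description.

```rust
use std::path::PathBuf;

/// State of a pod's resctrl group, mirroring the documented group_state values.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GroupState {
    /// The resctrl pod group exists at the given path.
    Exists(PathBuf),
    /// Group creation failed (for example ENOSPC / RMID exhaustion).
    Failed,
}

/// Payload of an AddOrUpdate event (field types are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AddOrUpdate {
    pub pod_uid: String,
    pub group_state: GroupState,
    pub total_containers: usize,
    pub reconciled_containers: usize,
}

/// The two documented event kinds.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PodEvent {
    AddOrUpdate(AddOrUpdate),
    Removed { pod_uid: String },
}

/// Hypothetical consumer-side helper: true when the group exists and every
/// known container has been assigned to it.
pub fn is_fully_reconciled(event: &PodEvent) -> bool {
    match event {
        PodEvent::AddOrUpdate(update) => {
            matches!(update.group_state, GroupState::Exists(_))
                && update.reconciled_containers == update.total_containers
        }
        PodEvent::Removed { .. } => false,
    }
}
```

A consumer could use a check like is_fully_reconciled to decide when a pod's monitoring data is complete, treating Failed or partially reconciled pods as candidates for the retry calls described later.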

Events are emitted on:

  • Initial synchronize: one AddOrUpdate per pod (initially reporting total_containers = 0 and reconciled_containers = 0), followed by an event whenever either counter changes.
  • Pod sandbox lifecycle events (RUN_POD_SANDBOX, REMOVE_POD_SANDBOX).
  • Container create/update/remove transitions that change the counters or the pod group state.

Per-Pod Counters

  • total_containers: number of known containers for the pod
  • reconciled_containers: number of containers whose PIDs have been assigned to the pod group

total_containers increments when a container becomes known, and reconciled_containers increments when that container reconciles successfully. Removing a container decrements the counters it contributed to.
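
The counter transitions can be sketched as a small state holder. The struct and method names below are hypothetical, and the rule that reconciled_containers only decrements for a container that had actually been reconciled is an assumption of this example rather than documented plugin behavior.

```rust
/// Hypothetical per-pod bookkeeping; field names mirror the event payload.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct PodCounters {
    pub total_containers: usize,
    pub reconciled_containers: usize,
}

impl PodCounters {
    /// A container became known for this pod.
    pub fn container_known(&mut self) {
        self.total_containers += 1;
    }

    /// A known container's PIDs were assigned to the pod group.
    /// Guarded so reconciled never exceeds total.
    pub fn container_reconciled(&mut self) {
        if self.reconciled_containers < self.total_containers {
            self.reconciled_containers += 1;
        }
    }

    /// A container was removed. Assumption: reconciled only drops when the
    /// removed container had been reconciled.
    pub fn container_removed(&mut self, was_reconciled: bool) {
        self.total_containers = self.total_containers.saturating_sub(1);
        if was_reconciled {
            self.reconciled_containers = self.reconciled_containers.saturating_sub(1);
        }
    }
}
```

Using saturating_sub keeps the counters at zero if a removal arrives for a container that was never counted, which avoids an underflow panic in debug builds.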

Retries

  • retry_group_creation(pod_uid)
    • Attempts to create the failed group again
    • On success, transitions group_state to Exists(path) and emits AddOrUpdate
    • On capacity error (ENOSPC), returns Error::Capacity and emits no event
  • retry_container_reconcile(container_id)
    • Re-runs PID assignment for a specific container
    • Emits AddOrUpdate only if the container transitions to Reconciled (improving counts)
  • retry_all_once()
    • Attempts a single pass across all failed pods and partial containers
    • Stops group-creation retries on the first capacity error encountered in this pass
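
The stop-on-first-capacity-error rule for group-creation retries might look like the following sketch. It covers only the group-creation half of a pass; the function name retry_failed_groups_once, the closure parameter, and the Other variant are hypothetical, and continuing past non-capacity errors is an assumption of this example.

```rust
/// Error type for the sketch. `Capacity` matches the documented
/// Error::Capacity; `Other` is a hypothetical catch-all.
#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    Capacity,
    Other(String),
}

/// Runs one pass over pods whose group creation previously failed.
/// `try_create` stands in for the real group-creation call.
/// Returns the pod UIDs whose groups were created during this pass.
pub fn retry_failed_groups_once<F>(failed_pods: &[&str], mut try_create: F) -> Vec<String>
where
    F: FnMut(&str) -> Result<(), Error>,
{
    let mut created = Vec::new();
    for &pod_uid in failed_pods {
        match try_create(pod_uid) {
            Ok(()) => created.push(pod_uid.to_string()),
            // Capacity is exhausted: further creations in this pass would
            // fail the same way, so stop trying.
            Err(Error::Capacity) => break,
            // Assumption: other errors are specific to one pod, so move on.
            Err(Error::Other(_)) => continue,
        }
    }
    created
}
```

Breaking early avoids hammering the kernel with mkdir calls that are expected to fail while RMIDs are exhausted; a later pass can pick up the remaining pods once capacity frees up.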

Cleanup Behavior

  • On startup synchronize, when cleanup_on_start=true and resctrl is mounted:
    • Removes only resctrl groups at the root that start with the configured group_prefix
    • Removes only top-level mon_groups under the resctrl root that start with the prefix
    • Does not traverse into per-group mon_groups
    • Emits no pod events for cleanup-only activity
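
The selection rule for startup cleanup can be illustrated with a pure function over directory entry names. This is a sketch under assumptions: cleanup_targets and its parameters are hypothetical, the caller is assumed to have already listed the resctrl root and its top-level mon_groups directory, and refusing to match on an empty prefix is a safety choice of this example, not documented behavior.

```rust
/// Returns paths (relative to the resctrl root) that startup cleanup would
/// remove: root-level groups and top-level mon_groups entries whose names
/// start with `group_prefix`. Per-group mon_groups are never inspected.
pub fn cleanup_targets(
    root_entries: &[&str],
    root_mon_groups: &[&str],
    group_prefix: &str,
) -> Vec<String> {
    // Assumption: an empty prefix would match every entry, so treat it as
    // "nothing to clean" instead of removing unrelated groups.
    if group_prefix.is_empty() {
        return Vec::new();
    }

    let mut targets = Vec::new();
    for &name in root_entries {
        if name.starts_with(group_prefix) {
            targets.push(name.to_string());
        }
    }
    for &name in root_mon_groups {
        if name.starts_with(group_prefix) {
            targets.push(format!("mon_groups/{name}"));
        }
    }
    targets
}
```

Keeping the selection separate from the actual directory removal makes the prefix rule easy to check against an in-memory listing, in the same spirit as the mocked filesystem tests described below.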

Testing and CI

  • Mocked unit/integration tests (default):
    • Use an in-memory filesystem provider and test PID source for determinism
    • Cover startup cleanup, synchronize counts, late-container reconcile, sandbox lifecycle, capacity failures, and retry flows
  • Hardware/KIND E2E (optional):
    • Run on an EC2 runner with a KIND cluster configured for NRI
    • Validates that groups are created, tasks are assigned, counts improve, and cleanup occurs on pod removal
    • Tests are #[ignore] by default; CI enables them with cargo test -- --ignored
  • CI wiring uses existing .github/workflows/test-resctrl.yaml to build, run mocked tests on GitHub-hosted runners, and execute the KIND + hardware jobs when the EC2 runner is available.
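
For readers unfamiliar with the #[ignore] gating mentioned above, a hardware-only test might be shaped roughly like this. The test name, the helper socket_from_env, and the skip-when-unset behavior are hypothetical; only the NRI_SOCKET_PATH variable and the use of #[ignore] come from this document.

```rust
use std::path::PathBuf;

/// Hypothetical helper: interprets the value of NRI_SOCKET_PATH.
/// Returns None when the variable is unset or blank so the caller can skip.
pub fn socket_from_env(value: Option<&str>) -> Option<PathBuf> {
    match value {
        Some(v) if !v.trim().is_empty() => Some(PathBuf::from(v)),
        _ => None,
    }
}

// Skipped by a plain `cargo test`; runs only with `cargo test -- --ignored`.
#[test]
#[ignore = "requires a KIND cluster with NRI and a resctrl-capable host"]
fn hardware_flow_placeholder() {
    let raw = std::env::var("NRI_SOCKET_PATH").ok();
    let Some(socket) = socket_from_env(raw.as_deref()) else {
        eprintln!("NRI_SOCKET_PATH not set; skipping hardware test");
        return;
    };
    assert!(socket.exists(), "NRI socket not found at {}", socket.display());
}
```

The reason string on #[ignore] shows up in the test runner's output, which helps explain why the test was skipped on a developer machine.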

Test Inventory and Deltas

  • src/lib.rs mocked tests (existing) cover auto-mount true/false, startup cleanup, retry flows, and now verify PID assignment via list_group_tasks in addition to event sequencing.
  • tests/integration_test.rs::test_plugin_full_flow (new) drives a live KIND cluster: pre/post registration pods, kubectl debug ephemeral container addition, and pod removal cleanup.
  • src/lib.rs::test_cleanup_on_start_removes_only_prefix (existing) covers startup cleanup with a mocked filesystem.
  • tests/integration_test.rs::test_startup_cleanup_e2e (new) validates cleanup_on_start behavior against the real resctrl filesystem.
  • tests/integration_test.rs::test_capacity_retry_e2e (new) exercises RMID exhaustion, retry flows, and PID verification on hardware.

Runbooks

Local (mocked) tests

cargo test -p nri-resctrl-plugin -- --nocapture

Hardware E2E on a resctrl-capable host

1) Ensure /sys/fs/resctrl is mountable and you have CAP_SYS_ADMIN.

2) Create a KIND cluster with NRI enabled and host mounts:

kind create cluster --config - <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /tmp/nri
    containerPath: /var/run/nri
  - hostPath: /tmp/nri-conf
    containerPath: /etc/nri/conf.d
  - hostPath: /tmp/nri-plugins
    containerPath: /opt/nri/plugins
containerdConfigPatches:
- |-
  [plugins."io.containerd.nri.v1.nri"]
    disable = false
    disable_connections = false
    socket_path = "/var/run/nri/nri.sock"
    plugin_config_path = "/etc/nri/conf.d"
    plugin_path = "/opt/nri/plugins"
EOF

3) Build and run integration tests against the KIND NRI socket:

export NRI_SOCKET_PATH=/tmp/nri/nri.sock
cargo test -p nri-resctrl-plugin --test integration_test -- --ignored --nocapture

The test test_plugin_full_flow creates pods before and after plugin registration, exercises an ephemeral container via kubectl debug (when available), and deletes pods to validate Removed events and cleanup.