2025-02-18: Collecting Intel CMT Measurements
Context
We need to collect Intel CMT (Cache Monitoring Technology) measurements at millisecond granularity for containers in a cloud-native environment.
Decision
We will build a kernel module that interacts directly with Intel RDT MSRs (Model Specific Registers) to configure and read CMT measurements.
Rationale
We considered three approaches for collecting CMT measurements:
- Using Linux perf counters: After investigating the Intel CMT-CAT repository and relevant Linux kernel code, we found that although the Intel software repository seemed to support perf counters for CMT, the Linux kernel did not actually implement this. Therefore, using perf was not a viable option.
- Using the resctrl filesystem interface: The Linux kernel's resctrl subsystem provides a filesystem-based interface for configuring Intel RDT and reading measurements. However, this approach has several drawbacks (see the sketch after this list):
  - Collecting measurements at millisecond granularity through the filesystem interface for all containers would be complex and potentially inefficient due to the overhead of system calls.
  - Resctrl is based on tasks and processes rather than containers. To use resctrl, we would need to build a system to monitor container lifecycle events and configure resctrl accordingly, which would add complexity and potential gaps in measurement.
- Building a kernel module to interact with MSRs directly: This approach offers several advantages:
  - By interacting with MSRs directly, the kernel module can read CMT information with very low overhead, without the layers of the filesystem interface.
  - The kernel module can probe container lifecycle tracepoints to allocate RMIDs (Resource Monitoring IDs) and assign them to containers automatically.
  - This approach enables a cloud-native solution that seamlessly measures containers as they are created.
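To make the syscall-overhead point concrete, here is a minimal userspace sketch of what polling resctrl looks like. The `mon_groups` layout follows the kernel's resctrl documentation; the group name `container-1234` is a hypothetical per-container monitoring group, not something our system defines. Each sample needs at least one open/read/close per container per L3 domain.

```c
/*
 * Sketch (assumed layout per the kernel resctrl docs, illustrative group
 * name): read L3 occupancy for one monitoring group via the resctrl
 * filesystem. Every sample is a set of filesystem syscalls.
 */
#include <stdio.h>
#include <stdlib.h>

static long long read_llc_occupancy_bytes(const char *group)
{
	char path[256], buf[64];
	long long bytes = -1;
	FILE *f;

	/* One monitoring group per container; one file per L3 cache domain. */
	snprintf(path, sizeof(path),
		 "/sys/fs/resctrl/mon_groups/%s/mon_data/mon_L3_00/llc_occupancy",
		 group);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		bytes = atoll(buf);
	fclose(f);
	return bytes;
}

int main(void)
{
	/* "container-1234" is a hypothetical monitoring group name. */
	printf("llc_occupancy: %lld bytes\n",
	       read_llc_occupancy_bytes("container-1234"));
	return 0;
}
```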
Given these considerations, we chose to build a kernel module that interacts with Intel RDT MSRs directly. This approach provides the best performance, flexibility, and compatibility with a cloud-native container environment.
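For illustration, below is a minimal sketch of how a kernel module can read L3 occupancy for one RMID directly from the RDT MSRs. This is not our actual module: the MSR addresses and bit layouts are taken from the Intel SDM, and RMID allocation, writing IA32_PQR_ASSOC for monitored tasks, and pinning the read to a CPU in the target L3 domain are all omitted.

```c
/*
 * Minimal sketch: read L3 cache occupancy for one RMID straight from the
 * RDT MSRs (addresses and bit layouts per the Intel SDM). Not our actual
 * module: RMID management, IA32_PQR_ASSOC assignment on context switch,
 * and per-socket CPU pinning are omitted.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/bits.h>
#include <linux/types.h>
#include <asm/msr.h>

#define MSR_IA32_QM_EVTSEL	0xC8D	/* event ID [7:0], RMID [41:32] */
#define MSR_IA32_QM_CTR		0xC8E	/* data [61:0], U bit 62, E bit 63 */
#define QM_EVTID_L3_OCCUPANCY	0x1

/* Returns occupancy in "upscale factor" units (factor reported by CPUID 0xF). */
static int read_llc_occupancy(u32 rmid, u64 *out)
{
	u64 val;

	wrmsrl(MSR_IA32_QM_EVTSEL, ((u64)rmid << 32) | QM_EVTID_L3_OCCUPANCY);
	rdmsrl(MSR_IA32_QM_CTR, val);
	if (val & (BIT_ULL(63) | BIT_ULL(62)))	/* error / data unavailable */
		return -EIO;
	*out = val & GENMASK_ULL(61, 0);
	return 0;
}

static int __init cmt_sketch_init(void)
{
	u64 occ;

	if (!read_llc_occupancy(0, &occ))
		pr_info("cmt_sketch: RMID 0 occupancy = %llu units\n", occ);
	return 0;
}

static void __exit cmt_sketch_exit(void)
{
}

module_init(cmt_sketch_init);
module_exit(cmt_sketch_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Sketch: read Intel CMT occupancy via RDT MSRs");
```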
Consequences
Building a kernel module for CMT measurement has the following consequences:
- We will need to maintain the kernel module code and ensure compatibility with different Linux kernel versions.
- Users will need to load the kernel module to enable CMT measurement collection.
- We will have tight integration with container lifecycle events, enabling seamless measurement of containers.
- We can achieve low-overhead, millisecond-granularity measurement collection, meeting our performance requirements.
Status
Accepted
Appendix A: intel-cmt-cat summary
The `perf_monitoring.c` file:
- Checks if perf is available by checking if `/proc/sys/kernel/perf_event_paranoid` exists.
- Checks if RDT exists by reading `/sys/devices/intel_cqm/type` - if it exists, its value (as integer) is the perf `type` field.
- Traverses `/sys/devices/intel_cqm/events` for events `llc_occupancy`, `local_bytes`, `total_bytes` - their value is parsed to get the `config` field of the perf struct.
- The same file with extension `.scale` is used to read a `double` `scale`.
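To illustrate the flow summarized above, here is a condensed userspace sketch of reading those sysfs values into a `struct perf_event_attr`. This is our own rendering of the logic, not the intel-cmt-cat source; error handling is minimal.

```c
/* Illustrative sketch (not the intel-cmt-cat code itself): read the PMU
 * type and one event's config and scale from sysfs into a perf_event_attr. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <linux/perf_event.h>

/* Read the first line of a sysfs file into buf; return 0 on success. */
static int read_sysfs(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

static int setup_cqm_event(const char *event, struct perf_event_attr *attr,
			   double *scale)
{
	char path[256], buf[128], *eq;

	/* /sys/devices/intel_cqm/type holds the perf "type" field. */
	if (read_sysfs("/sys/devices/intel_cqm/type", buf, sizeof(buf)))
		return -1;
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = (unsigned int)strtoul(buf, NULL, 10);

	/* The event file (e.g. "llc_occupancy") contains "event=<n>";
	 * everything after the first '=' becomes attr->config. */
	snprintf(path, sizeof(path), "/sys/devices/intel_cqm/events/%s", event);
	if (read_sysfs(path, buf, sizeof(buf)) || !(eq = strchr(buf, '=')))
		return -1;
	attr->config = strtoull(eq + 1, NULL, 0);

	/* The same name with a ".scale" suffix holds a double scale factor. */
	snprintf(path, sizeof(path), "/sys/devices/intel_cqm/events/%s.scale",
		 event);
	if (read_sysfs(path, buf, sizeof(buf)))
		return -1;
	*scale = strtod(buf, NULL);
	return 0;
}

int main(void)
{
	struct perf_event_attr attr;
	double scale;

	if (setup_cqm_event("llc_occupancy", &attr, &scale) == 0)
		printf("type=%u config=%llu scale=%g\n", attr.type,
		       (unsigned long long)attr.config, scale);
	return 0;
}
```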
Mentions of using the perf command line
Here are references from the web for monitoring RDT using perf. However, note that the perf support discussed in these references was not present in the Linux kernel whose code we checked (6.13.2); the intel_cqm perf interface they describe appears to have been removed from the kernel in favor of resctrl.
A 2017 forum post was able to view events with `perf stat` as events: `intel_cqm/llc_occupancy`, `intel_cqm/llc_local_bytes/`, `intel_cqm_total_bytes/` (the last value seems to have a typo replacing `/` with `_`).
An Intel/Kanaka Juvva presentation at LinuxCon 2015 shows per-application memory bandwidth monitoring with perf (slide 11). Two perf events are exported to userland:
- LOCAL_BW: `perf stat -e intel_cqm/llc_local_bw/ -a "my_application"`
- TOTAL_BW: `perf stat -e intel_cqm/llc_total_bw/ -a "my_application"`
A 2016 Kanaka Juvva presentation:
- further mentions LLC Occupancy
- shows memory bandwidth benchmark results
- shows more process-based CLI examples, by PID:
  - LLC_OCCUPANCY: `perf stat -e intel_cqm/llc_occupancy/ -p "pid of my_application"`
- discusses cgroups-based measurements. This might have been before the switch from cgroups to resctrl.
Appendix B: A journey through intel-cmt-cat
The intel-cmt-cat repo documentation suggests perf can read CMT data as well (table 5 in README).
In this section, we look into how intel-cmt-cat uses perf, and document its usage so we can support that alongside the other counters.
We start with the `pqos` CLI tool. Its command line parameters set up calls into the library in `lib/`:
- `main` calls `selfn_monitor_cores` on the `-m` command line option.
- `parse_monitor_cores` parses the `-m` command line option.
- `parse_monitor_group` parses a string from the command line to a list of cores or pids, and calls `grp_add` on each.
- `grp_add` allocates a `struct mon_group` called `new_grp` on the stack, then adds the core/pid/channel/etc. to the group using `grp_set_*`, and then appends it to a global variable `sel_monitor_group`.
- later, `main` calls `monitor_setup`.
- `monitor_setup` calls the library API depending on the type of monitor. For cores, it calls `pqos_mon_start_cores`.
Going into the library:
- `pqos_mon_start_cores` calls `pqos_mon_start_cores_ext` (which also has an opt parameter).
- `pqos_mon_start_cores_ext` checks input validity and then makes an `API_CALL(mon_start_cores...)`.
- `API_CALL` is a macro that accesses a virtual table of monitoring operations called `api` in `api.c` (a sketch of this dispatch pattern follows this list).
- This `api` variable is initialized in `api_init` to either the OS interface or the MSR interface (these are mentioned in the repo's README).
- In the OS interface, the `mon_start_cores` function pointer is initialized to point to `os_mon_start_cores`.
- This `os_mon_start_cores` validates the input and the available monitoring capabilities, ensures the monitoring hadn't already started, and calls `os_mon_start_events`.
- `os_mon_start_events`:
  - runs `perf_mon_is_event_supported` on every event, and if so, calls `perf_mon_start`.
  - otherwise, checks `resctrl_mon_is_event_supported` and if so performs `resctrl_mon_start`.
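For readers unfamiliar with the pattern, here is a tiny self-contained sketch of the function-pointer dispatch that an `API_CALL`-style macro implements. The names, signatures, and stub functions are illustrative, not the repo's actual definitions.

```c
/* Tiny sketch of vtable dispatch behind an API_CALL-style macro
 * (illustrative names and signatures, not the intel-cmt-cat code). */
#include <stdio.h>

struct monitor_ops {
	int (*mon_start_cores)(unsigned int core);
};

static int os_mon_start_cores_stub(unsigned int core)
{
	printf("OS interface: start monitoring core %u\n", core);
	return 0;
}

static int msr_mon_start_cores_stub(unsigned int core)
{
	printf("MSR interface: start monitoring core %u\n", core);
	return 0;
}

static const struct monitor_ops os_ops  = { .mon_start_cores = os_mon_start_cores_stub };
static const struct monitor_ops msr_ops = { .mon_start_cores = msr_mon_start_cores_stub };

static const struct monitor_ops *api;	/* selected once at init time */

#define API_CALL(fn, ...) (api->fn(__VA_ARGS__))

int main(void)
{
	api = &os_ops;	/* an api_init-style function would pick OS or MSR */
	(void)msr_ops;
	return API_CALL(mon_start_cores, 3);
}
```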
Let's explore the flow that checks perf for supported events:
- `perf_mon_is_event_supported` calls `get_supported_event`.
- `get_supported_event` looks up the event in a global `events_tab` (sketched below).
  - the first event in `events_tab` is `llc_occupancy`.
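A sketch of what such a table lookup looks like; the struct fields and entries here are assumptions for illustration, not the actual intel-cmt-cat definitions.

```c
/* Illustrative events_tab-style lookup (assumed fields and entries). */
#include <stdio.h>
#include <string.h>

struct monitor_event {
	const char *name;	/* matches the sysfs event file name */
	int supported;		/* filled in during initialization */
};

static struct monitor_event events_tab[] = {
	{ "llc_occupancy", 0 },
	{ "local_bytes",   0 },
	{ "total_bytes",   0 },
};

static struct monitor_event *get_supported_event(const char *name)
{
	for (size_t i = 0; i < sizeof(events_tab) / sizeof(events_tab[0]); i++)
		if (strcmp(events_tab[i].name, name) == 0)
			return &events_tab[i];
	return NULL;
}

int main(void)
{
	struct monitor_event *ev = get_supported_event("llc_occupancy");
	printf("found: %s\n", ev ? ev->name : "(none)");
	return 0;
}
```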
Initialization of perf monitoring in `perf_mon_init`:
- if `/proc/sys/kernel/perf_event_paranoid` exists, it enables the PMU events (cycles, instructions, IPC, LLC misses, LLC references).
- `set_arch_event_attrs` sets the `attr` field on PMU events. The `attr` field is a `struct perf_event_attr` (from the Linux API).
- `set_mon_type` reads `/sys/devices/intel_cqm/type` as an integer into the global variable `os_mon_type`. This int is then used in the perf attr as its `type` field in `set_rdt_event_attrs`.
- `set_mon_events` then traverses the directory `/sys/devices/intel_cqm/events`.
  - For each file, it tries to find an entry in `events_tab` whose `name` field is the same as the file name.
  - For every match, it calls `set_rdt_event_attrs`.

`set_rdt_event_attrs`:
- reads the file
- assumes the contents contain a `=`, discards everything before the first `=` and parses the rest as an integer. This will be `attrs.config`.
- reads another file: the same filename with a `.scale` suffix
- parses it as a double. This will be the event's `scale`.
So far, we covered initialization and checking event availability. Now let's see how the library configures the kernel to start monitoring:
- `perf_event_open` makes the syscall via `syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags)`.
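To tie the pieces together, here is a generic perf usage sketch showing how an attr like the one built above is passed to `perf_event_open` and read back. This is not intel-cmt-cat's exact code; the placeholder type/config count CPU cycles, whereas the RDT path would use the values parsed from sysfs.

```c
/*
 * Generic perf sketch: open a counter with perf_event_open and read its raw
 * value. Placeholder type/config (CPU cycles); the RDT path would substitute
 * the attr filled from /sys/devices/intel_cqm.
 */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			    int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr = { 0 };
	uint64_t count;
	int fd;

	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;		/* placeholder */
	attr.config = PERF_COUNT_HW_CPU_CYCLES;	/* placeholder */

	/* pid = -1, cpu = 0: system-wide on CPU 0, the way per-core events
	 * are attached; requires perf_event_paranoid permissions. */
	fd = (int)perf_event_open(&attr, -1, 0, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	usleep(1000);	/* let the counter accumulate for ~1 ms */
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("raw count: %llu\n", (unsigned long long)count);
	close(fd);
	return 0;
}
```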