GPU components
accelerator-nvidia-clock
: Monitors NVIDIA GPU clock events of all GPUs, such as HW Slowdown events.
accelerator-nvidia-clock-speed
: Tracks the per-GPU clock speed.
accelerator-nvidia-ecc
: Tracks the NVIDIA per-GPU ECC errors.
accelerator-nvidia-error
: Tracks NVIDIA GPU errors in real time via SMI queries -- such errors likely require a host restart.
accelerator-nvidia-error-sxid
: Tracks the NVIDIA GPU SXid errors by scanning dmesg -- see fabric manager documentation.
accelerator-nvidia-error-xid
: Tracks the NVIDIA GPU Xid errors by scanning dmesg and using the NVIDIA Management Library (NVML) -- see Xid messages.
accelerator-nvidia-fabric-manager
: Tracks the fabric manager version and whether it is active.
accelerator-nvidia-infiniband
: Monitors the InfiniBand status of the system. Optional, enabled if the host has NVIDIA GPUs.
accelerator-nvidia-info
: Serves relatively static information about the NVIDIA accelerators (e.g., GPU product names).
accelerator-nvidia-memory
: Monitors the NVIDIA per-GPU memory usage.
accelerator-nvidia-gpm
: Monitors the NVIDIA per-GPU GPM metrics.
accelerator-nvidia-nvlink
: Monitors the NVIDIA per-GPU NVLink devices.
accelerator-nvidia-peermem
: Monitors the peermem module status. Optional, enabled if the host has NVIDIA GPUs.
accelerator-nvidia-power
: Tracks the NVIDIA per-GPU power usage.
accelerator-nvidia-processes
: Tracks the NVIDIA per-GPU processes.
accelerator-nvidia-temperature
: Tracks the NVIDIA per-GPU temperatures.
accelerator-nvidia-utilization
: Tracks the NVIDIA per-GPU utilization.
General Hardware components
cpu
: Tracks the combined usage of all CPUs (not per-CPU).
disk
: Tracks the disk usage of all the mount points specified in the configuration.
memory
: Tracks the memory usage of the host.
network-latency
: Tracks global network connectivity statistics.
power-supply
: Tracks the power supply/usage on the host.
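
To illustrate what "combined usage of all CPUs (not per-CPU)" can mean in practice, here is a minimal sketch that derives a single usage percentage from the aggregate `cpu` line of two Linux `/proc/stat` snapshots. It is not taken from the component's implementation. The function names `parse_aggregate_cpu` and `combined_cpu_usage_percent` are hypothetical, and counting `iowait` as idle time is an assumption made for this example.

```python
from typing import Tuple


def parse_aggregate_cpu(proc_stat_text: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line.

    Hypothetical helper: parses the text of /proc/stat, not a live file.
    """
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu";
        # per-CPU lines are labelled "cpu0", "cpu1", and so on.
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            # Assumption: treat iowait as idle time.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            # guest and guest_nice are already counted in user and nice,
            # so only the first eight fields are summed.
            total = sum(values[:8])
            return idle, total
    raise ValueError("no aggregate 'cpu' line found")


def combined_cpu_usage_percent(before: str, after: str) -> float:
    """Usage across all CPUs between two /proc/stat snapshots, in percent."""
    idle0, total0 = parse_aggregate_cpu(before)
    idle1, total1 = parse_aggregate_cpu(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - (idle1 - idle0) / total_delta)
```

On a Linux host, the two snapshots would typically be the contents of `/proc/stat` read a short interval apart; the per-CPU lines are ignored on purpose.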
System components
info
: Provides static information about the host (e.g., labels, IDs).
os
: Queries the host OS information (e.g., kernel version).
systemd
: Tracks the systemd state and unit files.
dmesg
: Scans and watches dmesg outputs for errors, as specified in the configuration (e.g., regex matching of NVIDIA GPU errors).
file-descriptor
: Tracks the number of file descriptors used on the host.
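
As a rough illustration of the regex-based dmesg scanning mentioned for the `dmesg` component, the sketch below picks NVIDIA Xid events out of kernel log lines. It is not the component's actual matcher or configuration format. The names `XidEvent`, `XID_PATTERN`, and `scan_for_xid` are hypothetical, and the pattern assumes the line shape the NVIDIA driver commonly logs (`NVRM: Xid (PCI:...): <code>, ...`).

```python
import re
from typing import Iterable, Iterator, NamedTuple


class XidEvent(NamedTuple):
    pci_address: str
    xid: int
    detail: str


# Assumed line shape (may vary by driver version):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),?\s*(?P<detail>.*)"
)


def scan_for_xid(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield one XidEvent per log line that matches XID_PATTERN."""
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            yield XidEvent(
                pci_address=match.group("pci"),
                xid=int(match.group("xid")),
                detail=match.group("detail").strip(),
            )
```

The same function works on any iterable of lines, for example the captured output of `dmesg` split on newlines, so it can be tried against saved logs without a GPU present.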
Misc. components
containerd-pod
: Tracks the current pods from the containerd CRI.
k8s-pod
: Tracks the current pods from the kubelet read-only port.
docker-container
: Tracks the current containers from the docker runtime.
tailscale
: Tracks the tailscale state (e.g., version) if available.