The Missing GPU Management for AI

GPUd automates monitoring, diagnostics, and issue identification for GPUs

Simplify GPU machine management

One simple tool to monitor your GPU/CPU machines and run workloads.

Why GPUd

GPUd is designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and the NVIDIA ecosystem.

First-class GPU support

GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.

Easy to run at scale

GPUd is a self-contained binary that runs with a low footprint on any machine.

Production grade

GPUd is used in Lepton AI's production infrastructure.

Key features

Monitor metrics
Monitors critical GPU and GPU fabric metrics (power, temperature).
Report status
Reports GPU and GPU fabric status (nvidia-smi parser, error checking).
Detect errors
Detects critical GPU and GPU fabric errors (dmesg, hardware slowdown, NVML Xid events, DCGM).
Monitor system
Monitors overall system metrics (CPU, memory, disk).
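
To make the dmesg-based error detection listed above more concrete, here is a minimal Python sketch of how Xid events might be pulled out of kernel log text. It is not GPUd's implementation. The regular expression assumes the common "NVRM: Xid (PCI:<address>): <code>, <detail>" line shape written by the NVIDIA driver, and the names XID_PATTERN, XidEvent, parse_xid_line, find_xid_events, and SAMPLE_LOG are hypothetical, introduced only for this illustration. The sample log lines are made up for the example.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed shape of an NVIDIA driver Xid line in the kernel log, for example:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix and the trailing detail text are treated as optional.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<device>[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+)(?:,\s*(?P<detail>.*))?"
)


class XidEvent(NamedTuple):
    """One Xid event extracted from a kernel log line (hypothetical type)."""
    device: str   # PCI address as printed by the driver
    code: int     # numeric Xid code
    detail: str   # remaining free-form text, possibly empty


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line contains an Xid message, else None."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    detail = (match.group("detail") or "").strip()
    return XidEvent(match.group("device"), int(match.group("code")), detail)


def find_xid_events(log_text: str) -> List[XidEvent]:
    """Scan multi-line log text and collect every Xid event found."""
    events = []
    for line in log_text.splitlines():
        event = parse_xid_line(line)
        if event is not None:
            events.append(event)
    return events


# Illustrative, made-up kernel log excerpt (not real output from any machine).
SAMPLE_LOG = """\
[  100.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[  101.123456] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
[  102.654321] NVRM: Xid (PCI:0000:af:00): 48, pid=5678, An uncorrectable double bit error (DBE) has been detected
"""

if __name__ == "__main__":
    for event in find_xid_events(SAMPLE_LOG):
        print(f"device={event.device} xid={event.code} detail={event.detail}")
```

In practice the log text would come from the kernel ring buffer (for example, the output of the dmesg command) rather than a hard-coded string, and a monitoring tool would map each Xid code to a severity before raising an alert.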

Built by AI and Cloud Experts

Caffe, PyTorch, ONNX, etcd, Kubernetes (k8s)