The Missing GPU Management for AI

GPUd automates monitoring, diagnostics, and issue identification for GPUs.

Simplify GPU machine management

One simple tool to monitor your GPU/CPU machines and run workloads.

[Diagram: Collect (Machines, GPU Machines) → Process / Transform (Metrics, Structured Data) → Control Plane (lepton.ai) (Fully Managed Machines)]

GPUd is designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and the NVIDIA ecosystem.

GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.

GPUd is a self-contained binary that runs with a low footprint on any machine.

GPUd is used in Lepton AI's production infrastructure.

Monitor metrics
Monitors critical GPU and GPU fabric metrics (power, temperature).

Report status
Reports GPU and GPU fabric status (nvidia-smi parser, error checking).

Detect errors
Detects critical GPU and GPU fabric errors (dmesg, hardware slowdown, NVML Xid event, DCGM).

Monitor system
Monitors overall system metrics (CPU, memory, disk).
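
To make the status-reporting and error-detection items above more concrete, here is a minimal Python sketch of the two ideas: turning nvidia-smi CSV output into structured records, and scanning kernel log text for NVIDIA Xid events. It is an illustration only, not GPUd's actual implementation. The names `GpuSample`, `parse_gpu_csv`, and `find_xid_events` are hypothetical, and the sample strings are made-up data shaped like typical `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output and `NVRM: Xid` kernel log lines.

```python
import csv
import io
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GpuSample:
    """One row of GPU metrics (hypothetical structure, for illustration)."""
    index: int
    name: str
    temperature_c: Optional[float]
    power_w: Optional[float]


def _to_float(field: str) -> Optional[float]:
    # nvidia-smi may print placeholders such as "[N/A]" for unsupported fields.
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_csv(text: str) -> List[GpuSample]:
    """Parse CSV rows with columns: index, name, temperature.gpu, power.draw."""
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, name, temp, power = (field.strip() for field in row)
        samples.append(GpuSample(int(index), name, _to_float(temp), _to_float(power)))
    return samples


# The NVIDIA kernel driver logs Xid errors as "NVRM: Xid (<PCI device>): <code>, ...".
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_text: str) -> List[Tuple[str, int]]:
    """Return (device, xid_code) pairs found in kernel log text."""
    return [(m.group("device"), int(m.group("code")))
            for m in XID_PATTERN.finditer(log_text)]


# Made-up sample inputs; on a real machine these would come from
# running nvidia-smi and dmesg.
SAMPLE_SMI = (
    "0, NVIDIA A100-SXM4-80GB, 41, 68.52\n"
    "1, NVIDIA A100-SXM4-80GB, 43, [N/A]\n"
)
SAMPLE_DMESG = (
    "[  100.000001] usb 1-1: new high-speed USB device\n"
    "[  123.456789] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, "
    "GPU has fallen off the bus.\n"
)

if __name__ == "__main__":
    for sample in parse_gpu_csv(SAMPLE_SMI):
        print(sample)
    for device, code in find_xid_events(SAMPLE_DMESG):
        print(f"Xid {code} on {device}")
```

Keeping the parsers as pure functions over text means they can be exercised with captured output, without a GPU present. A real collector would also need to handle driver absence, permissions on the kernel log, and the other sources listed above (NVML, DCGM), none of which this sketch covers.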