Optimizing GPU Clusters for Generative AI Model Training: A Comprehensive Guide
Training generative AI models requires clusters of expensive, cutting-edge hardware such as H100 GPUs and fast storage, interconnected through multi-network topologies involving InfiniBand links, switches, transceivers, and Ethernet connections. High-performance computing (HPC) and AI cloud services offer these specialized clusters, but they come with substantial capital commitments, and according to together.ai, not all clusters are created equal.
Introduction to GPU Cluster Testing
Reliability of GPU clusters varies significantly, with issues ranging from minor to critical. For instance, Meta reported that during their 54-day training run of the Llama 3.1 model, GPU issues accounted for 58.7% of all unexpected problems. Together AI, serving many AI startups and Fortune 500 companies, has developed a robust validation framework to ensure hardware quality before deployment.
The Process of Testing Clusters at Together AI
The goal of acceptance testing is to ensure that hardware infrastructure meets specified requirements and delivers the reliability and performance necessary for demanding AI/ML workloads.
1. Preparation and Configuration
The initial phase involves configuring new hardware in a GPU cluster environment that mimics end-use scenarios. This includes installing NVIDIA drivers, OFED drivers for InfiniBand, CUDA, NCCL, and HPC-X, and configuring the SLURM cluster and PCI settings for performance.
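A quick sanity check after this phase is to confirm that the command-line tools shipped with each component are on the PATH. The following is a minimal sketch, not Together AI's tooling: the mapping of components to tools (nvidia-smi, ofed_info, nvcc, sinfo) is an assumption about a typical install, and the helper name is hypothetical.

```python
import shutil

# Assumed mapping of component -> a CLI tool that usually ships with it.
# Adjust for the actual installation; these names are illustrative.
REQUIRED_TOOLS = {
    "NVIDIA driver": "nvidia-smi",
    "OFED": "ofed_info",
    "CUDA toolkit": "nvcc",
    "SLURM": "sinfo",
}


def missing_components(which=shutil.which, required=REQUIRED_TOOLS):
    """Return the components whose command-line tool is not found on PATH.

    `which` is injectable so the check can be exercised without the tools.
    """
    return sorted(name for name, tool in required.items() if which(tool) is None)
```

Calling missing_components() on a freshly configured node returns an empty list when every tool is found. It only proves the tools are present, not that versions are compatible.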
2. GPU Validation
Validation begins with ensuring the GPU type and count match expectations. Stress testing tools like DCGM Diagnostics and gpu-burn are used to measure power consumption and temperature under load. These tests help identify issues like NVML driver mismatches or “GPU fell off the bus” errors.
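The type and count check can be sketched by parsing nvidia-smi's CSV query output. This is an illustrative example rather than the framework described in the article: the expected GPU name, count, and temperature limit are assumptions supplied by the caller, and the function names are hypothetical.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: name, temperature, power.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def _to_float(value):
    """Convert a CSV field to float; fields such as '[N/A]' become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse the CSV output of QUERY into a list of per-GPU dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        name, temp, power = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "temp_c": _to_float(temp), "power_w": _to_float(power)})
    return gpus


def validate_gpus(gpus, expected_name, expected_count, max_temp_c):
    """Return a list of human-readable problems; empty means the node passed."""
    problems = []
    if len(gpus) != expected_count:
        problems.append(f"expected {expected_count} GPUs, found {len(gpus)}")
    for index, gpu in enumerate(gpus):
        if expected_name not in gpu["name"]:
            problems.append(f"GPU {index}: unexpected type {gpu['name']!r}")
        if gpu["temp_c"] is not None and gpu["temp_c"] > max_temp_c:
            problems.append(f"GPU {index}: {gpu['temp_c']:.0f} C exceeds {max_temp_c} C")
    return problems


def query_gpus():
    """Run nvidia-smi on this host and return the parsed GPU list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Sampling query_gpus() while a stress tool such as gpu-burn is running gives the under-load temperature and power readings. A GPU that has dropped off the bus typically shows up as a count mismatch.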
3. NVLink and NVSwitch Validation
After individual GPU validation, tools like NCCL tests and nvbandwidth measure GPU-to-GPU communication over NVLink. These tests help diagnose problems like a bad NVSwitch or down NVLinks.
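One way to turn an NCCL test run into a pass/fail signal is to extract the average bus bandwidth that nccl-tests prints in its summary and compare it with a floor. The sketch below assumes the summary contains a line of the form "# Avg bus bandwidth : <number>" (in GB/s). The floor value is an assumption to be set per hardware generation, and the function names are hypothetical.

```python
import re

# Matches the summary line assumed to be printed at the end of an nccl-tests run.
AVG_BUSBW = re.compile(r"Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def avg_bus_bandwidth(nccl_output):
    """Return the average bus bandwidth (GB/s) reported in nccl-tests output."""
    match = AVG_BUSBW.search(nccl_output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found in output")
    return float(match.group(1))


def meets_floor(nccl_output, min_gb_per_s):
    """True when the reported average bus bandwidth reaches the given floor."""
    return avg_bus_bandwidth(nccl_output) >= min_gb_per_s
```

A node whose result falls well below its identically configured peers is a candidate for a bad NVSwitch or down NVLinks.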
4. Network Validation
For distributed training, the network configuration is validated on the InfiniBand or RoCE fabric that connects the nodes. Tools like ibping, ib_read_bw, ib_write_bw, and NCCL tests are used to check connectivity and measure bandwidth. Good results in these tests indicate that the cluster will perform well for distributed training workloads.
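When judging multi-node NCCL results, it helps to know how the reported bus bandwidth relates to the measured time. As documented for nccl-tests, an all-reduce's bus bandwidth is the algorithm bandwidth (bytes moved divided by time) scaled by 2(n-1)/n for n ranks, which makes the figure comparable to the hardware link speed regardless of rank count. A minimal sketch follows, with a hypothetical function name:

```python
def allreduce_bus_bandwidth(message_bytes, seconds, ranks):
    """Bus bandwidth in GB/s for an all-reduce, following the nccl-tests convention.

    algbw = bytes / time; busbw = algbw * 2 * (ranks - 1) / ranks.
    """
    if ranks < 2:
        raise ValueError("an all-reduce needs at least two ranks")
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    algbw_gb_per_s = message_bytes / seconds / 1e9
    return algbw_gb_per_s * 2 * (ranks - 1) / ranks
```

For example, a 1 GB all-reduce that completes in 10 ms across 8 ranks has an algorithm bandwidth of 100 GB/s and a bus bandwidth of 175 GB/s.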
5. Storage Validation
Storage performance is crucial for machine learning workloads. Tools like fio measure the performance characteristics of different storage configurations, including random reads, random writes, sustained reads, and sustained writes.
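fio can emit its results as JSON (for example with --output-format=json), which makes it straightforward to compare runs. The sketch below summarizes such a report per job. It assumes the report has a top-level "jobs" list whose entries carry "jobname" plus "read" and "write" sections with "iops" and "bw" (in KiB/s). The function name is hypothetical.

```python
import json


def summarize_fio(report):
    """Return {jobname: {"read"/"write": {"iops", "mib_per_s"}}} from a fio JSON report."""
    summary = {}
    for job in report["jobs"]:
        summary[job["jobname"]] = {
            direction: {
                "iops": job[direction]["iops"],
                # Assumes fio reports "bw" in KiB/s; convert to MiB/s.
                "mib_per_s": job[direction]["bw"] / 1024,
            }
            for direction in ("read", "write")
        }
    return summary


def load_fio_report(path):
    """Load a fio JSON report that was written to a file."""
    with open(path) as handle:
        return json.load(handle)
```

Running one job per access pattern (random read, random write, sequential read, sequential write) and summarizing each report this way yields a small table that can be compared across storage configurations.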
6. Model Build
The final phase involves running reference tasks tailored to customer use cases. This ensures the cluster can achieve the expected end-to-end performance. A popular task is building a model with frameworks like PyTorch’s Fully Sharded Data Parallel (FSDP) to evaluate training throughput, model FLOPs utilization (MFU), GPU utilization, and network communication latencies.
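Model FLOPs utilization can be estimated from the measured training throughput. The sketch below uses the common approximation that a dense transformer needs about 6 FLOPs per parameter per token for a forward and backward pass. That approximation ignores attention FLOPs and is an assumption, not a method stated in the article. The peak per-GPU figure is also an input the caller must supply from the vendor's datasheet.

```python
def model_flops_utilization(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Estimate MFU as achieved FLOPs/s divided by aggregate peak FLOPs/s.

    Uses the ~6 * params FLOPs-per-token approximation for dense transformers.
    """
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = 6 * params * tokens_per_second
    return achieved_flops_per_second / (num_gpus * peak_flops_per_gpu)
```

As an illustration with made-up numbers, a 7-billion-parameter model processing 80,000 tokens per second on 8 GPUs, each assumed to peak at 989 TFLOPS, gives model_flops_utilization(7e9, 80_000, 8, 989e12), roughly 0.42. Tracking this figure across otherwise identical clusters helps separate hardware problems from software configuration issues.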
7. Observability
Continuous monitoring for hardware failures is essential. Together AI uses Telegraf to collect system metrics, ensuring maximum uptime and reliability. Monitoring includes cluster-level and host-level metrics, such as CPU/GPU usage, available memory, disk space, and network connectivity.
Conclusion
Acceptance testing is indispensable for AI/ML startups delivering top-tier computational resources. A comprehensive and structured approach ensures stable and reliable infrastructure, supporting the intended GPU workloads. Companies are encouraged to run acceptance testing on delivered GPU clusters and report any issues for troubleshooting.