Scaling high-performance, critical bare metal Kubernetes environments is often seen as black magic. While it’s certainly not simple, it is possible. And the end results speak for themselves: a supercomputer with 3,500 NVIDIA H100 Tensor Core GPUs, 400 miles of networking fabric, and 40,000 fiber connections that trained MLPerf’s new GPT-3 LLM benchmark in under 11 minutes.
This supercomputer is just one of many that CoreWeave has built this year, all of which run Kubernetes on bare metal.
CoreWeave is a newer cloud than the traditional hyperscalers. We started building our stack in 2018, and unlike the legacy clouds, we had some very new and interesting technologies available to us, namely containerization and Kubernetes. We also realized that the major hyperscalers around the world were not running directly on bare metal; they leveraged a hypervisor layer to run virtual machines (VMs) on top of it.
So, the choice for us to run Kubernetes on bare metal was a pretty easy one. There would be much less overhead, and we could take advantage of really great patterns to manage and deploy software.
As you’ll see in the video recording, the road to running Kubernetes on bare metal is not as easy as it seems. But the results are record-breaking, and we believe it’s the best architecture for AI training and inference.
The Cluster That Set a New Record for MLPerf’s LLM Training Benchmark
This presentation focuses on a specific cluster we built as part of our cloud fleet: the one we used in June to set an MLPerf record for the LLM training benchmark. The project began in earnest in January, when we started building one of the first NVIDIA H100 Tensor Core GPU training clusters together with our customer, Inflection AI.
The MLPerf benchmark was run across 3,500 NVIDIA H100 GPUs and finished in approximately 11 minutes, which was 29x faster than the next leading competitor at the time of the June MLPerf submission. The benchmark uses a GPT-3 model architecture with 175 billion parameters. It’s not trained to convergence (a full run to convergence would take longer than 11 minutes on this cluster, and probably less on the next one), but it gives you a good approximation of a true LLM training workload.
The GPUs in the cluster are all interconnected using NVIDIA Quantum-2 InfiniBand, a high-speed 400 Gb/s end-to-end network technology. There are 400 miles of this fiber inside the supercomputer, all housed inside one sector of a data center, so it’s a lot of fiber in a small area.
There are 40,000 fiber connections between the systems. You have fiber that goes into a switch, which then goes into another switch, and each of these fibers connects to optics. They all need to be cleaned before you plug them in, and if any of them fails, you’ll see performance degradation in your cluster.
Let’s dig into how this cluster was built, what the components and potential failure points were, and how we handled them.
How CoreWeave Built the Supercomputer
Our servers feature a standard 8-rail (or “rail-optimized”) configuration. Each server has 8 NVIDIA GPUs and a total of 10 fibers coming out of the system. Any of those ten could fail, and each failure would be catastrophic to a job.
Because all 3,500 NVIDIA H100 GPUs work together on a single job, a failure in any one of these components will cause the job to fail. The job then has to restart from its last checkpoint, and you can lose a lot of training time. So, ensuring that your nodes and your entire fabric are healthy is critical to not losing performance on these very expensive AI training machines.
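To make that concrete, here is a minimal, purely illustrative sketch of the checkpoint-restart pattern (not CoreWeave’s or the benchmark’s actual training code): the job periodically saves its state, and after a failure it resumes from the most recent checkpoint, so everything computed since that checkpoint is lost. The file path, interval, and step function are placeholders.

```python
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"  # placeholder path on shared storage
CHECKPOINT_EVERY = 100              # steps between checkpoints (illustrative)
TOTAL_STEPS = 1_000


def load_checkpoint():
    """Resume from the last saved step, or start from scratch."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "state": None}


def save_checkpoint(step, state):
    """Persist state so a failure only loses work done since this step."""
    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)


def train_step(step, state):
    """Stand-in for one optimizer step across all GPUs."""
    return state


checkpoint = load_checkpoint()
state = checkpoint["state"]
for step in range(checkpoint["step"], TOTAL_STEPS):
    state = train_step(step, state)
    if (step + 1) % CHECKPOINT_EVERY == 0:
        save_checkpoint(step + 1, state)
```

The longer the gap between checkpoints, the more work a single failed fiber, optic, or GPU throws away, which is why node and fabric health matter so much at this scale.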
So to build this supercomputer, get it running, and make it reliable and fast, CoreWeave made a few key decisions and created some core components out of the gate.
- Stateless Kubernetes Nodes on Bare Metal
As discussed before, we chose Kubernetes to be very flexible in how we deploy our software, and we run instances on bare metal to skip the virtualization overhead. We also wanted to boot everything stateless to take full advantage of a high-performance, multi-tenant Kubernetes cluster at scale.
- CoreWeave Node Lifecycle Controller and Cloud-Native Observability Tools
After the nodes are booted, we run a full suite of validations, both during bring-up and continuously, gathering all the metrics and acting on them immediately if needed (see the first sketch after this list).
- ML Workload Scheduling with SUNK
Once the nodes are up and healthy, we need to run workloads on them. To do this, we built SUNK, an implementation of Slurm on Kubernetes that allows customers to schedule workloads using either Kubernetes or Slurm, a more traditional HPC scheduler (see the second sketch after this list).
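Here is the first sketch referenced above. It is not CoreWeave’s node lifecycle controller, just a minimal illustration of the idea using the official `kubernetes` Python client: check each node’s health signals and cordon anything unhealthy so the scheduler stops placing work on it. The GPU-health label (`example.com/gpu-healthy`) is a hypothetical stand-in for real validation results.

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    name = node.metadata.name
    labels = node.metadata.labels or {}

    # Kubernetes' built-in Ready condition plus a hypothetical GPU-health label
    ready = any(c.type == "Ready" and c.status == "True"
                for c in (node.status.conditions or []))
    gpu_healthy = labels.get("example.com/gpu-healthy", "true") == "true"

    if not (ready and gpu_healthy) and not node.spec.unschedulable:
        # Cordon the node so no new pods land on it while it is remediated.
        v1.patch_node(name, {"spec": {"unschedulable": True}})
        print(f"cordoned unhealthy node {name}")
```

A real controller would run continuously, export its validation results as metrics, and uncordon nodes once they pass their checks again.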
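And here is the second sketch referenced above. It is not SUNK’s API; it only illustrates the Kubernetes half of the dual scheduling model by submitting a GPU training job as an ordinary Kubernetes Job, the kind of workload that could equally be submitted to Slurm (for example with `sbatch`) when running under SUNK. The job name, image, namespace, and command are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "llm-train-example"},  # placeholder job name
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    "image": "example.com/llm-trainer:latest",  # placeholder image
                    "command": ["python", "train.py"],          # placeholder entrypoint
                    "resources": {
                        # Request one full 8-GPU server per pod, matching the
                        # rail-optimized servers described earlier.
                        "limits": {"nvidia.com/gpu": 8},
                    },
                }],
            },
        },
    },
}

batch.create_namespaced_job(namespace="default", body=job_manifest)
print("submitted Kubernetes Job llm-train-example")
```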
To learn more about these topics in depth, take a look at the video recording from KubeCon. If you’d like to discuss any of them with our team, reach out to us to set up a meeting.