Machine Learning & AI

How CoreWeave and Loft Labs Leverage vCluster to Run Virtual Clusters in Kubernetes at Scale

Running and managing a large number of Kubernetes clusters on bare metal poses significant challenges, from security to GPU provisioning to scalability. As a specialized cloud provider, CoreWeave experienced these challenges first-hand, operating 3,000+ Kubernetes clusters on top of 5,000 bare-metal nodes packed with GPUs to power modern AI applications at scale.

To help solve these challenges, CoreWeave partnered with Loft Labs, the maintainers of vCluster, to create a serverless Kubernetes experience for numerous companies running AI workloads at scale in multitenant environments. 

vCluster is an open-source tool for creating and managing virtual Kubernetes clusters, and it's the only certified Kubernetes distro for creating virtual K8s clusters. Since its launch in 2021, vCluster has seen over 40 million virtual clusters created. Take a look at the video recording from KubeCon's AI & HPC Day 2023 to learn more.

“Virtualized Kubernetes”

When you look at Kubernetes clusters today, you see a lot of replication. Each cluster has its own cert-manager, policy agent, Istio, and so on. Many enterprises today are essentially spinning up Kubernetes cluster after Kubernetes cluster on generalized clouds, resulting in thousands of clusters to maintain.

This was the problem Loft Labs sought to address. vCluster allows you to create multitenant clusters: one host cluster runs your platform stack, and instead of handing out namespaces, you launch new virtual clusters within it. Each virtual cluster's control plane runs as a pod in the host Kubernetes cluster, which you can then expose via an ingress, load balancer, etc.

This “virtualized Kubernetes” allows you to use the underlying platform stack across these virtual clusters.
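
For a concrete feel for the workflow, here's a minimal sketch using the open-source vcluster CLI (the cluster and namespace names are illustrative):

```bash
# A minimal sketch with the open-source vcluster CLI; names are illustrative.
# Create a virtual cluster inside a namespace of the host cluster.
vcluster create team-a --namespace team-a

# Run a command against the virtual cluster's API server. The tenant sees a
# full Kubernetes API, not just a namespace carved out of the host.
vcluster connect team-a --namespace team-a -- kubectl get namespaces
```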

What makes CoreWeave’s implementation different?

By default, the standard vCluster runs the tenant workloads (the virtual cluster's workloads) and the virtual control plane alongside each other, in one namespace of the same host Kubernetes cluster.

CoreWeave does things a little differently. We combine the vCluster Pro Syncer with our own Kubernetes control planes, essentially running only the control plane inside one Kubernetes cluster, separate from the multi-tenant workload cluster. This is called an isolated control plane.

Isolated control planes are an advanced feature of the vCluster.Pro Distro in which the vCluster control plane runs in one cluster and syncs workloads to other clusters. (FYI: vCluster.Pro Distro is Loft Labs' commercial distro; it's the open-source vCluster plus advanced features.) CoreWeave doesn't leverage the complete vCluster.Pro Distro, because we use our in-house control plane and run the vCluster Pro Syncer to create these vClusters at scale. That said, it's still an incredibly advanced tool.

What does this mean for enterprises running on CoreWeave? More security and resilience. The control planes cannot be affected by faulty workloads or noisy neighbors, making it easier to ensure SLAs (service level agreements) for control planes. It also allows for advanced workload topologies: you can sync workloads to different locations (they don't have to be in the same cluster).
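
As a rough illustration of this topology (not CoreWeave's in-house setup; the field names below are assumptions based on vCluster's published experimental options, so treat this as a sketch rather than a recipe):

```bash
# Hedged sketch: run only the virtual control plane in this cluster and leave
# workload scheduling to a separate cluster via the syncer. The
# experimental.isolatedControlPlane fields are assumptions; check the
# vCluster.Pro docs for the authoritative schema.
cat > vcluster.yaml <<'EOF'
experimental:
  isolatedControlPlane:
    headless: true   # run no tenant workloads next to this control plane
EOF

# Create the control-plane-only virtual cluster with those values.
vcluster create tenant-a --namespace tenant-a -f vcluster.yaml
```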

Essentially, we customized the traditional Kubernetes experience to allow clients to deploy their own full cluster experience without needing to provision nodes. That’s really the power that vCluster gave people.

We then took it a step further to give you full customization across your cluster for greater visibility and control with maximum performance. This solution gives you the best of both worlds: 

  1. Access to compute on demand from a multi-tenant data center, and/or
  2. A dedicated, isolated environment of their own.

Whichever option clients choose (or both), they get the same cluster experience. This new Kubernetes control plane enhancement is discussed in the video, but we'll share more details when we launch it in 2024.

How CoreWeave Cloud serves machine learning & AI workloads

More and more companies are expanding their engineering teams to build AI and ML applications, and many of them use or plan to use Kubernetes as a target environment for deploying these services in production. 

Given the specific nature of these workloads and the requirements they place on the production environment (on-demand spikes in compute, GPU-heavy workloads, etc.), it's key to build smart Kubernetes architectures designed for AI and ML workloads.
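
For instance, on such an architecture a GPU-hungry workload is just a standard Kubernetes resource request (the pod name and image below are illustrative):

```bash
# Request a GPU via the standard Kubernetes device-plugin resource name.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference        # illustrative name
spec:
  containers:
    - name: model-server
      image: nvcr.io/nvidia/pytorch:23.10-py3   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1  # schedule onto a node with one free GPU
EOF
```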

If you haven't noticed already, we love Kubernetes. We don't think you should have to learn everything from scratch in order to use our cloud, and our teams are very open to collaborating with you to help you leverage the tools you need.

Because CoreWeave hosts some of the biggest AI and ML applications today, we can provide a unique perspective to others in the ecosystem on what to pay attention to when building their Kubernetes architectures, and help them gain insights from our own Kubernetes journey.

For example, it’s a common assumption that users can’t customize their experience if they run on multi-tenant clusters. Our new Kubernetes control plane enhancement changes that, and it’s why CoreWeave is excited to launch it next year.

But it’s not the only differentiator that makes CoreWeave a more performant and resilient cloud for AI and ML applications.

  1. CoreWeave leverages Kubernetes on top of bare-metal infrastructure. We don't do hypervisor layers at all. As a CoreWeave Cloud user, you can access the GPU directly, scaling from zero to however many GPUs you need in a matter of minutes. Our users see faster spin-up times and more responsive autoscaling for inference thanks to Knative (see the sketch after this list), and you can see how those benchmarks compare to a generalized cloud.
  2. Our tech stack is GPU-optimized. Everything you need to run GPU workloads is managed by us: the drivers, the health checks, etc. You just come with your applications, download your model weights, and you're good to go.
  3. And, we're open-source friendly! This year we announced Tensorizer for serving inference (faster model loading) and are working on an update to support more resilient model training. A second project, SUNK (Slurm on Kubernetes), is coming out in early 2024.
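
As a rough sketch of the first point, a Knative Service can scale an inference endpoint from zero using standard autoscaling annotations (the name, image, and bounds below are illustrative):

```bash
# Illustrative Knative Service that scales from zero for inference traffic.
kubectl apply -f - <<'EOF'
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: inference            # illustrative name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # scale to zero when idle
        autoscaling.knative.dev/max-scale: "10"  # cap replica count
    spec:
      containers:
        - image: registry.example.com/model-server:latest  # illustrative
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
```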

To learn more about what we’re working on and chat with our team, reach out to us anytime. You can also visit vcluster.com to learn more about vCluster and Loft Labs, or schedule a vCluster.Pro demo today.
