Slurm on Kubernetes (SUNK)

The industry’s first unified training system for the most demanding AI workloads—delivering production-grade reliability and operational visibility for large, long-running training jobs.

Introducing guided self-service for SUNK (in preview)

Bring SUNK clusters online through a guided, opinionated setup experience based on CoreWeave best practices. SUNK allows teams to get started faster while maintaining the control expert users need. Now available as a preview for existing CoreWeave customers.

Redefining the AI research cluster for production-grade training

SUNK is built for AI research teams running large, long-running training jobs, where predictability, reliability, and operational visibility matter as much as raw performance. SUNK preserves the Slurm workflows researchers rely on while reducing infrastructure complexity that can bog down platform teams.

Lifecycle unity

Unify how researchers run Slurm and how platform teams operate clusters, without weeks of bespoke setup. SUNK User Provisioning (SUP) automates secure onboarding and reduces identity and configuration drift, so teams reach “time to science” faster with consistent cluster behavior.

Reliability

Run large, long-running training jobs with production-grade reliability. CoreWeave Mission Control monitors cluster health end-to-end, detects silent hardware issues and GPU stragglers, and mitigates failures before they stall synchronous training.
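
To make "GPU stragglers" concrete: in synchronous training every step waits for the slowest rank, so a single slow GPU sets the pace for the whole job. The sketch below is a generic illustration of that idea only, not a description of how CoreWeave Mission Control works. The function name, the per-rank step-time input, and the 1.2x threshold are all assumptions made for the example.

```python
from statistics import median


def find_stragglers(step_times, threshold=1.2):
    """Return ranks whose mean step time exceeds threshold x the median rank mean.

    step_times: dict mapping rank (int) -> list of step durations in seconds.
    The threshold is an illustrative assumption, not a recommended value.
    """
    # Mean step time per rank, skipping ranks with no samples.
    means = {rank: sum(t) / len(t) for rank, t in step_times.items() if t}
    if not means:
        return []
    # The median across ranks is a baseline that one slow rank cannot skew much.
    baseline = median(means.values())
    return sorted(rank for rank, m in means.items() if m > threshold * baseline)


# Hypothetical timings for a 4-rank job; rank 2 is consistently slow.
timings = {
    0: [1.00, 1.02, 0.99],
    1: [1.01, 1.00, 1.03],
    2: [1.45, 1.50, 1.48],
    3: [0.98, 1.01, 1.00],
}
print(find_stragglers(timings))  # prints [2]
```

Comparing against the median rather than the mean keeps one extreme rank from inflating the baseline it is measured against.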

Performance

Maximize productive training time with topology-aware scheduling and predictable cluster behavior tuned for distributed training. Keep multi-day runs moving forward by reducing disruption, retries, and wasted GPU time, so that more GPU-hours translate into real model progress.

Observability

Get operational visibility from infrastructure health to job-level behavior. Correlate Slurm metrics with GPU, network, and storage signals to spot bottlenecks fast, validate performance, and keep training outcomes predictable at scale.

Proven by leading pioneers at production scale

Streamline secure access with SUNK

Discover how Automated User Provisioning in SUNK automates identity management for AI research clusters. Reduce setup time, improve security, and keep teams focused on innovation.

Run on industry-leading cloud infrastructure services

SUNK runs on CoreWeave infrastructure services built for AI training performance, scale, and operational consistency.

Compute services

Get the latest GPU compute you need for your most complex AI workloads through a Kubernetes-native environment.

Storage services

Flexible, high-performance storage solutions purpose-built for AI.

Networking services

High-performance networking designed for optimal cluster scale-out and connectivity.

Supercomputing scale and enterprise-grade security

CoreWeave's massive GPU megaclusters help support multi-trillion-parameter model training.

Technical partnership and direct-to-expert support

Our team of experienced solution architects will get SUNK up and running for you in a matter of hours.

A partnership mindset

Experience a highly supported, automated onboarding process focusing on rapid GPU deployment and AI workload optimization.

Direct-to-expert support

When customers need deeper assistance, direct-to-expert support routes requests to the same engineers who build and operate the platform to ensure fast, accurate resolution.

Enhanced observability

Gain better visibility into critical hardware, Kubernetes, and Slurm job metrics via intuitive dashboards.

Frequently asked questions

How does CoreWeave’s Automated User Provisioning (AUP) work with my existing identity provider?

AUP connects directly to enterprise Identity Providers like Google Workspace, Okta, or Microsoft Entra using the SCIM protocol. It automatically syncs users and groups from your existing directory into CoreWeave IAM, keeping access policies consistent across environments without manual setup or custom scripts.
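
For readers unfamiliar with SCIM, the sketch below shows roughly what an identity provider sends when it creates and later deactivates a user. The schema URNs and field names come from the SCIM 2.0 standard (RFC 7643 and RFC 7644). The helper names and the sample user are hypothetical, and nothing here describes CoreWeave's actual endpoints or which attributes AUP consumes.

```python
import json

# Schema identifiers defined by the SCIM 2.0 RFCs.
SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"
SCIM_PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def scim_create_user(user_name, given_name, family_name, email):
    """Build the JSON body an IdP would POST to a /Users endpoint."""
    return {
        "schemas": [SCIM_USER_SCHEMA],
        "userName": user_name,
        "name": {"givenName": given_name, "familyName": family_name},
        "emails": [{"value": email, "primary": True}],
        "active": True,
    }


def scim_deactivate_user():
    """Build the JSON body an IdP would PATCH to /Users/{id} on offboarding."""
    return {
        "schemas": [SCIM_PATCH_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    }


# Hypothetical user, for illustration only.
create_body = scim_create_user("ada", "Ada", "Lovelace", "ada@example.com")
print(json.dumps(create_body, indent=2))
print(json.dumps(scim_deactivate_user(), indent=2))
```

Because the IdP pushes these messages itself, the receiving side needs no custom sync scripts; it only has to accept standard SCIM requests.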

What’s the difference between Automated User Provisioning (AUP) and SUNK User Provisioning (SUP)?

AUP handles identity federation: it brings users and groups from your enterprise IdP into CoreWeave IAM. SUP handles access provisioning: it automatically creates and manages accounts inside Slurm on Kubernetes (SUNK) clusters. Together, AUP and SUP automate the full lifecycle from identity to cluster access, eliminating manual onboarding and offboarding.
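
One way to picture the second half of that lifecycle is as a reconciliation step: compare the users that identity federation says should have access with the accounts that currently exist on the cluster, then create or remove accounts to close the gap. The sketch below is a conceptual illustration under that assumption. The function name and data shapes are hypothetical and do not reflect SUP's actual implementation.

```python
def plan_account_changes(iam_users, cluster_accounts):
    """Compute which cluster accounts to create and which to remove.

    iam_users: usernames that should have access (the desired state).
    cluster_accounts: usernames that currently exist on the cluster.
    """
    desired = set(iam_users)
    actual = set(cluster_accounts)
    return {
        "create": sorted(desired - actual),  # onboarded in the IdP, not yet on the cluster
        "remove": sorted(actual - desired),  # offboarded in the IdP, still on the cluster
    }


# Hypothetical state: "carol" was just added upstream, "bob" was just removed.
plan = plan_account_changes(
    iam_users=["alice", "carol"],
    cluster_accounts=["alice", "bob"],
)
print(plan)  # prints {'create': ['carol'], 'remove': ['bob']}
```

Running the comparison repeatedly is safe: once the cluster matches the directory, both lists come back empty.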

Does AUP or SUP help with access control and compliance?

Yes. AUP and SUP ensure every access change made in your IdP is reflected across CoreWeave IAM and SUNK in real time. That means instant deprovisioning when users leave and auditable, policy-driven access control for compliance and security reviews.

See what SUNK can do for you

Experience the resource flexibility your teams need to build, train, and deploy new models.

Request access to the SUNK self-service preview

Preview enrollment is now open to select CoreWeave customers.

Fill out the form below to express interest. Our team will review your request and follow up with enrollment details.