Real Cloud Infrastructure for Real AI Workloads: Training and Inference at Production Scale

Event details

Chen Goldberg

EVP, Product & Engineering

CoreWeave

Corey Sanders

SVP of Product

CoreWeave

30 minutes

Infrastructure Built for Production-Scale AI

Today’s frontier and mixture-of-experts models aren’t small. They span multiple trillions of parameters and require precise coordination across thousand-GPU clusters.

Traditional cloud environments simply weren’t built for this scale. To move from experimentation to real-world deployment, teams need infrastructure purpose-built for sustained, large-scale workloads.

In this session, CoreWeave details how we optimize every layer of the AI stack, from infrastructure to orchestration to observability, to run large-scale training and inference workloads efficiently. We also examine the architectural breakthroughs that enable rack-scale systems to operate with ultra-low latency and high reliability.

These are the essential cloud components that power the next generation of agentic AI. The question is: how does your infrastructure stack up?

What you’ll learn in this on-demand session

How infrastructure requirements change when scaling to trillion-parameter and mixture-of-experts models
How full-stack optimization across infrastructure, orchestration, and observability improves performance and efficiency
Architectural innovations enabling ultra-low latency, rack-scale AI systems
Best practices for running production-grade AI workloads, including agentic AI systems

More on-demand webinars

Strategies for Maximizing GPU Performance

AI Fleet Management 101

Feeding A 22,000 GPU Cluster with Data

The Zero Trust AI (Data) Cloud

The Best of Both Worlds: Slurm on Kubernetes

Accelerating HPC and AI with Slurm and SchedMD

Create a Self-Serve Platform for Kubernetes

Why Bare Metal is Better

Three AI Inference Execution Paths: How to Match Your Workload to a Solution

When AI Training Runs Fail: What Recovery Actually Costs You

From Experimentation to Production: Why Inference Is the Defining Layer of AI

Stragglers, Stalls, and Restarts: Why AI Training Throughput Breaks Down at Scale

SUNK: Scale AI Training Without Breaking Your Infrastructure

NVIDIA HGX B300 on CoreWeave: What Changes for Agentic AI at Scale

Inside CoreWeave ARENA: Proving AI Production Readiness

Inside the Rack Scale Revolution: How CoreWeave and NVIDIA Are Building the Foundation for the Next Leap in AI

Unlock Agentic Breakthroughs with a Purpose-Built AI Cloud

How to Maximize Resiliency with AI-Native Observability

How to Move Beyond Tiers, Tradeoffs, and Runaway Costs in AI Storage

On-demand Webinar: How to measure and optimize AI infrastructure for large-scale training

Decoding the Economics of AI Infrastructure

Why NVIDIA Blackwell on CoreWeave
