How to measure and optimize AI infrastructure for large-scale training
CoreWeave

Event details

Webinar: How to measure and optimize AI infrastructure for large-scale training

Schedule

Aug 28, 2025, 2:00 pm EDT (30 min)

Is infrastructure purpose-built for AI training at scale really better? If so, how much better?

Our engineering team set out to answer this question. That led to months of research and testing, and even to training our own AI model, all captured in our latest performance benchmarking whitepaper.

Join Distinguished Engineer Wes Brown and Product Manager Deok Filho as they pull back the curtain on the methodology, the surprises, and the hard-won optimizations that delivered up to 20% more throughput, 10x longer uptime, and 97–98% utilization.

In this session, you’ll learn:

  • The hard data, charts, and benchmarks that prove an AI-first cloud outperforms industry training benchmarks
  • How we measured MFU (model FLOPs utilization), MTTF (mean time to failure), and ETTR (effective training time ratio) at massive scale, and why those metrics matter
  • What optimizations move the needle, from high-throughput tokenization to async checkpointing and automated recovery
  • Actionable next steps for applying these measurement and optimization techniques to your own AI training workflows
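
For readers who want a concrete starting point before the session, the three metrics named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the methodology from the whitepaper: the `RunStats` fields are hypothetical names, the 6 × parameters × tokens estimate of training FLOPs is a common approximation for dense transformer models, and MTTF is simplified to wall-clock time divided by failure count.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical summary of one training run (all field names are assumptions)."""
    params: float              # model parameter count
    tokens_per_second: float   # observed throughput summed over the whole job
    num_gpus: int              # accelerators in the job
    peak_flops_per_gpu: float  # vendor-quoted peak FLOP/s per accelerator
    wall_clock_hours: float    # total elapsed time, including downtime
    productive_hours: float    # time spent making forward training progress
    failures: int              # job-interrupting failures observed


def mfu(s: RunStats) -> float:
    """Model FLOPs utilization: achieved FLOP/s over theoretical peak FLOP/s.

    Assumes ~6 FLOPs per parameter per token (forward plus backward pass),
    a common approximation for dense transformers.
    """
    achieved = 6.0 * s.params * s.tokens_per_second
    peak = s.num_gpus * s.peak_flops_per_gpu
    return achieved / peak


def mttf_hours(s: RunStats) -> float:
    """Mean time to failure, simplified here to elapsed hours per failure."""
    if s.failures == 0:
        return float("inf")
    return s.wall_clock_hours / s.failures


def ettr(s: RunStats) -> float:
    """Effective training time ratio: productive time over wall-clock time."""
    return s.productive_hours / s.wall_clock_hours


if __name__ == "__main__":
    # Made-up numbers purely for illustration; not results from any benchmark.
    run = RunStats(
        params=1e9,
        tokens_per_second=1e5,
        num_gpus=2,
        peak_flops_per_gpu=1e15,
        wall_clock_hours=100.0,
        productive_hours=97.0,
        failures=4,
    )
    print(f"MFU:  {mfu(run):.2%}")
    print(f"MTTF: {mttf_hours(run):.1f} h")
    print(f"ETTR: {ettr(run):.2%}")
```

Keeping the three calculations separate makes it easier to see which lever an optimization moves: faster tokenization mainly raises throughput and therefore MFU, while async checkpointing and automated recovery mainly raise ETTR by shrinking the gap between wall-clock and productive time.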

Speakers

Wes Brown, Distinguished Engineer, CoreWeave
Deok Filho, Product Manager, CoreWeave
