CoreWeave is the Essential Cloud for AI. CoreWeave Mission Control offloads cluster health management from your team to ours. Get industry-leading reliability and resiliency for AI infrastructure, with a typical node goodput of 96%. Each component of Mission Control is purpose-built to provide your teams with highly performant, resilient, and reliable AI infrastructure. Unlock higher performance and utilization from your clusters for faster time to market.
At CoreWeave, we're not just building a company; we intend to build the foundation of the AI era. Our pioneering AI cloud platform is the only cloud purpose-built for AI and accelerated workloads that is operating at hyperscale. At CoreWeave, we aim to help customers unlock more value from their infrastructure. We continue to push GPUs closer to their theoretical maximum model FLOPs utilization (MFU). In recent runs to benchmark our training capabilities, a CoreWeave cluster achieved an MFU that was 20% greater than the baseline, which we believe to be 35 to 45%. That incremental 20% FLOPs-utilization improvement is enormously valuable to customers. It means they are able to get their models to market faster, and it means significantly higher performance for inference compute, bringing down the total cost of ownership.

The enhanced performance and greater efficiency of our core cloud platform impact every step of the development process. You see it on day one, when developers are able to run workloads within hours on infrastructure without needing to spend time burning in or testing their clusters, and you see it on an ongoing basis as they experience fewer disruptions and better throughput. It extends to both training and inference. For training, we compared publicly available data on Llama 3 training-job performance to a comparable workload on our platform: you can save 3.1 million GPU hours and experience 50% fewer interruptions per day. For inference, you can experience five times faster model download speeds and ten times faster spin-up times. Your inference at scale is more performant and costs less. How do we achieve this?
We achieve it by being purpose-built for AI at every layer, and we reinforce our performance and efficiency advantage, and make it durable, through our consistent track record of innovation, as we continuously solve challenges for our customers and innovate for them to deliver a better high-performance cloud solution. At the infrastructure level, we deliver cutting-edge GPU and CPU components with our proprietary DPU architecture, Nimbus, which enables greater networking and storage efficiency. Our networking is optimized for high-performance throughput, and our storage is specifically built to handle high-performance workloads. At the managed-software-service level, our proprietary orchestration framework, CKS, is built on Kubernetes and is designed specifically to make it easier to schedule and run high-performance workloads with significant scaling demands. At the application level, we've built AI-specific tools for our customers that help them load models faster and run their clusters more efficiently, as well as dedicated capabilities for inference to handle burst workloads and reduce latency for end users. To bring this all together cohesively, the workload-monitoring and infrastructure-lifecycle-management capabilities that we call Mission Control are designed to ensure that our fleets run at peak performance by driving transparency, proactive remediation, and automation across our platform.

Purpose-built also extends to our data centers. We have created one of the most sophisticated high-performance data center footprints in the world, incorporating advanced cooling technologies to efficiently utilize space and deliver greater power per rack.
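As a hypothetical illustration of the Kubernetes-based scheduling described above: on any Kubernetes cluster running the NVIDIA device plugin, a high-performance workload claims GPUs through the `nvidia.com/gpu` extended resource, and the scheduler places the pod only on a node with that many GPUs free. This is the generic Kubernetes pattern, not a CKS-specific API; the pod name and image are invented for the example.

```python
# Generic Kubernetes GPU-scheduling pattern (not a CKS-specific API):
# GPUs are requested via the extended resource name exposed by the
# NVIDIA device plugin; the scheduler bin-packs pods onto nodes that
# have the requested GPU count available.

import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},               # hypothetical name
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "example.com/trainer:latest",   # hypothetical image
            "resources": {
                "limits": {"nvidia.com/gpu": "8"},   # claim all 8 GPUs on a node
            },
        }],
    },
}

print(json.dumps(pod, indent=2))
```

Requesting whole-node GPU counts like this is a common way to keep multi-GPU training jobs from being fragmented across nodes.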
Our approach is truly one of AI specialization in depth. Our entire technology stack, and everything we do, has AI and high-performance workloads in mind. This focus at each layer of our platform delivers improved MFU overall. We bring it all together into a comprehensive and highly flexible platform: the CoreWeave cloud platform, built with our customers top of mind. Our platform's composability allows AI teams to benefit from customizable solutions, deployment flexibility, and the ability to integrate with the emerging ecosystem of our AI partners. And crucially, it is highly secure, with zero-trust design at the core of everything we do. Our adherence to stringent security standards across our stack enables us to serve the most discerning customers and to build relationships with leading enterprises who trust us with their IP and most sensitive data.

A fundamental moat that I want to stress, and which differentiates us, is our ability to monitor aggregated performance statistics across our expanding infrastructure fleet. Purpose-built for AI means we see exactly the right metrics, we see more of them, and we're able to take action on them. These unique insights enable us to reinforce our solution and to continuously optimize, innovate, and improve performance across our platform. These metrics are used only internally, to improve our services for our customers, and this will only increase and sustain the durability and depth of our moat.

Our track record is one of relentless innovation on the most pressing challenges our customers face in a fast-moving market. I'll give you some specific examples. Each time new infrastructure technologies have rolled out and changed the infrastructure paradigm, CoreWeave has been on the leading edge of enabling that technology for our customers.
We were among the first to market with NVIDIA H100s and H200s at production scale, and recently we were acknowledged as the first cloud service provider to make Blackwell GB200 clusters generally available. When power became a major constraint, we identified it early and secured future capacity to enable growth. When one of our key customers needed high storage throughput, we developed an innovative new object storage solution to deliver that performance. When customers experienced challenges interoperating between Slurm and Kubernetes orchestration frameworks, we gave them that capability through our SUNK service, which integrates the two frameworks. This allows both training and inference to run on the same infrastructure, which is a massive efficiency unlock for our customers. And when GPUs started being pushed to their limits by AI, we innovated on Mission Control to predict and minimize performance failures.
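The proactive-remediation idea behind that kind of fleet management can be sketched with a deliberately simplified health policy. The signals used here (driver Xid errors, double-bit ECC errors, NVLink replays) are real GPU health indicators, but the thresholds, node names, and policy are invented for illustration and are not CoreWeave's actual Mission Control logic.

```python
# Illustrative sketch of proactive GPU-node remediation (not CoreWeave's
# actual Mission Control implementation): flag nodes for draining based
# on hardware error counters before they fail a running job.

from dataclasses import dataclass

@dataclass
class NodeHealth:
    name: str
    xid_errors: int       # GPU driver (Xid) errors since last reset
    ecc_dbe: int          # double-bit ECC errors (uncorrectable)
    nvlink_replays: int   # link-level retransmissions

def needs_remediation(n: NodeHealth) -> bool:
    # Thresholds are illustrative; real fleets tune them per failure mode.
    return n.ecc_dbe > 0 or n.xid_errors >= 3 or n.nvlink_replays >= 100

fleet = [
    NodeHealth("gpu-node-01", xid_errors=0, ecc_dbe=0, nvlink_replays=2),
    NodeHealth("gpu-node-02", xid_errors=5, ecc_dbe=1, nvlink_replays=0),
]

to_drain = [n.name for n in fleet if needs_remediation(n)]
print("drain:", to_drain)
```

The general pattern is the point: catch degrading nodes from telemetry and cycle them out of the fleet before they interrupt a training run.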
Our recent acquisition of Weights & Biases is a testament to the direction we are headed and accelerates our roadmap. It enables AI developers on our platform to access more innovative tools and environments to build and monitor model performance, create applications, leverage proprietary or open-source models, and put these applications into production use cases, including inference workloads. Specifically, the Weave solution extends our current offering by adding critical debugging capabilities and evaluation frameworks. These tools help ensure models work seamlessly when deployed into production environments and allow engineers to drive higher accuracy and better user experiences with their inference workloads.