
The CoreWeave Cloud Platform

CoreWeave is the Essential Cloud for AI. CoreWeave Mission Control offloads cluster health management from your team to ours, delivering industry-leading reliability and resiliency for AI infrastructure, with a typical node goodput of 96%. Each component of Mission Control is purpose-built to give your teams highly performant, resilient, and reliable AI infrastructure, so you can unlock higher performance and utilization from your clusters for faster time to market.

At CoreWeave, we're not just building a company; we intend to build the foundation of the AI era. Our pioneering AI cloud platform is the only cloud purpose-built for AI and accelerated workloads that operates at hyperscale. We aim to help customers unlock more value from their infrastructure, and we continue to push GPUs closer to their theoretical maximum model FLOPs utilization (MFU). In recent runs to benchmark our training capabilities, a CoreWeave cluster achieved an MFU that was 20% greater than the baseline, which we believe to be 35 to 45%. That incremental 20% utilization improvement is enormously valuable to customers: it means they can get their models to market faster, and it means significantly higher performance for inference compute, bringing down the total cost of ownership.
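For context, MFU (Model FLOPs Utilization) is the ratio of a training job's achieved FLOPs throughput to the hardware's theoretical peak. A minimal back-of-the-envelope sketch, assuming the "20% greater" figure is a relative improvement over the stated 35-45% baseline (our reading of the narration, not a published benchmark):

```python
# Illustrative arithmetic only; the baseline range and the relative-gain
# reading of "20% greater" are assumptions taken from the narration.

def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved throughput over theoretical peak."""
    return achieved_flops_per_s / peak_flops_per_s

# Example: 500 PFLOP/s sustained against a 1,000 PFLOP/s peak is 50% MFU
# (numbers made up for illustration).
print(f"Example MFU: {mfu(500e15, 1000e15):.0%}")

baseline_low, baseline_high = 0.35, 0.45
relative_gain = 0.20

implied_low = baseline_low * (1 + relative_gain)    # 0.42
implied_high = baseline_high * (1 + relative_gain)  # 0.54
print(f"Implied MFU range: {implied_low:.0%} to {implied_high:.0%}")
```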

The enhanced performance and greater efficiency of our core cloud platform impact every step of the development process. You see it on day one, when developers are able to run workloads within hours on infrastructure without needing to spend time burning in or testing their clusters, and you see it on an ongoing basis as they experience fewer disruptions and better throughput. It extends to both training and inference. For training, we compared publicly available data on Llama 3 training-job performance to a comparable workload on our platform: you can save 3.1 million GPU hours and experience 50% fewer interruptions per day. For inference, you can experience five times faster model download speeds and ten times faster spin-up times. Your inference at scale is more performant and costs less.
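To put the 3.1 million GPU-hour figure in rough perspective, a hedged sketch; the cluster size is an assumption drawn from public reporting that Llama 3 was trained on roughly 16,000 GPUs, not a CoreWeave number:

```python
# Hypothetical scale conversion: GPU-hours saved -> wall-clock days on a
# fixed-size cluster. Cluster size is an assumption, not a quoted figure.
gpu_hours_saved = 3.1e6
cluster_gpus = 16_000  # ~Llama 3 scale, per public reporting
wall_clock_days = gpu_hours_saved / cluster_gpus / 24
print(f"~{wall_clock_days:.0f} days of wall-clock time on {cluster_gpus:,} GPUs")
```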

How do we achieve this? By being purpose-built for AI at every layer. We reinforce our performance and efficiency advantage, and make it durable, through a consistent track record of innovation, continuously solving challenges for our customers to deliver a better high-performance cloud solution. At the infrastructure level, we deliver cutting-edge GPU and CPU components with our proprietary DPU architecture, Nimbus, which enables greater networking and storage efficiency. Our networking is optimized for high-performance throughput, and our storage is specifically built to handle high-performance workloads. At the managed-software-service level, our proprietary orchestration framework, CKS (CoreWeave Kubernetes Service), is built on Kubernetes and designed specifically to make it easier to schedule and run high-performance workloads with significant scaling demands.
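As a concrete illustration of what scheduling a GPU workload on a Kubernetes-based service such as CKS can look like, here is a minimal sketch using the official Kubernetes Python client. The pod name, image, namespace, and GPU count are placeholders; nothing below is a CoreWeave-specific API:

```python
# Minimal GPU pod submission via the standard Kubernetes Python client.
# Works against any cluster exposing the NVIDIA device plugin.
from kubernetes import client, config

config.load_kube_config()  # reads your local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.01-py3",  # example image
                command=["python", "train.py"],
                # Standard device-plugin resource request for NVIDIA GPUs.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "8"}
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```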

28
00:02:42,400 --> 00:02:48,240
tools for our customers that help them load models faster and run their clusters more efficiently,

29
00:02:48,240 --> 00:02:54,240
as well as dedicated capabilities for inference to handle burst workloads and reduce latency for

30
00:02:54,240 --> 00:02:59,840
end users. To bring this all together cohesively, our workload monitoring and infrastructure life

31
00:02:59,840 --> 00:03:06,240
cycle management capabilities that we call mission control is designed to ensure that our fleets run

32
00:03:06,240 --> 00:03:12,640
at peak performance by driving transparency, proactive remediation and automation across

33
00:03:12,640 --> 00:03:17,760
our platform. Purpose-built also extends to our data centers. We have created one of the most

34
00:03:17,760 --> 00:03:22,400
sophisticated high performance data center footprints in the world which incorporates

35
00:03:22,400 --> 00:03:28,640
advanced cooling technologies to efficiently utilize space and deliver greater power per rack.
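The page above cites a typical node goodput of 96%. As a hedged sketch of the kind of fleet metric Mission Control is built around, goodput can be computed as the fraction of scheduled node-hours spent doing useful work; the data structures here are illustrative, not a CoreWeave API:

```python
# Illustrative goodput calculation over per-node accounting windows.
from dataclasses import dataclass

@dataclass
class NodeWindow:
    scheduled_hours: float  # total hours the node was allocated
    lost_hours: float       # hours lost to faults, draining, remediation

def goodput(windows: list[NodeWindow]) -> float:
    """Useful node-hours divided by scheduled node-hours, fleet-wide."""
    scheduled = sum(w.scheduled_hours for w in windows)
    lost = sum(w.lost_hours for w in windows)
    return (scheduled - lost) / scheduled

fleet = [NodeWindow(24.0, 0.5), NodeWindow(24.0, 1.6)]
print(f"Fleet goodput: {goodput(fleet):.1%}")  # ~95.6% for this toy fleet
```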

Our approach is truly one of AI specialization in depth. Our entire technology stack, and everything we do, has AI and high-performance workloads in mind, and this focus at each layer of our platform delivers improved MFU overall. We bring it all together into a comprehensive, highly flexible platform: the CoreWeave Cloud Platform, built with our customers top of mind. Our platform's composability allows AI teams to benefit from customizable solutions, deployment flexibility, and the ability to integrate with the emerging ecosystem of our AI partners. And crucially, it is highly secure, with zero-trust design at the core of everything we do. Our adherence to stringent security standards across our stack enables us to serve the most discerning customers and to build relationships with leading enterprises who trust us with their IP and most sensitive data.

A fundamental moat that differentiates us, and one I want to stress, is our ability to monitor aggregated performance statistics across our expanding infrastructure fleet. Being purpose-built for AI means we see exactly the right metrics, we see more of them, and we're able to take action on them. These unique insights enable us to continuously optimize, innovate, and improve performance across our platform. These metrics are used only internally, to improve our services for our customers, and they will only increase and sustain the durability and depth of our moat.

Our track record is one of relentless innovation against the most pressing challenges our customers face in a fast-moving market. I'll give you some specific examples. Each time new infrastructure technologies have rolled out and changed the infrastructure paradigm, CoreWeave has been on the leading edge of enabling that technology for our customers. We were among the first to market with NVIDIA H100s and H200s at production scale, and recently we were acknowledged as the first cloud service provider to make Blackwell GB200 clusters generally available. When power became a major constraint, we identified it early and secured future capacity to enable growth. When one of our key customers needed high storage throughput, we developed an innovative new object storage solution to deliver that performance. When customers experienced challenges interoperating between the Slurm and Kubernetes orchestration frameworks, we gave them that capability through SUNK (Slurm on Kubernetes), our service that integrates the two frameworks; this allows both training and inference to run on the same infrastructure, which is a massive efficiency unlock for our customers. And when GPUs started being pushed to their limits with AI, we innovated on Mission Control to minimize and predict performance failures.
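For a sense of the Slurm side that SUNK integrates with Kubernetes, here is a minimal, standard sbatch submission of the kind that could target the same GPU nodes as a Kubernetes inference service. The partition name, GPU counts, and script path are illustrative assumptions:

```python
# Submit a generic Slurm batch job from Python; sbatch reads the script
# from stdin when no file argument is given. All names are placeholders.
import subprocess
import textwrap

script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=llm-train
    #SBATCH --partition=gpu        # illustrative partition name
    #SBATCH --gres=gpu:8           # 8 GPUs per node
    #SBATCH --nodes=4
    srun python train.py
""")

subprocess.run(["sbatch"], input=script, text=True, check=True)
```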

Our recent acquisition of Weights & Biases is a testament to the direction we are headed and accelerates our roadmap. It enables AI developers on our platform to access more innovative tools and environments to build and monitor model performance, create applications, leverage proprietary or open-source models, and put those applications into production use cases, including inference workloads. Specifically, the Weave solution extends our current offering by adding critical debugging capabilities and evaluation frameworks. These tools help ensure models work seamlessly when deployed into production environments and allow engineers to drive higher accuracy and better user experiences with their inference workloads.
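As a hedged sketch of the debugging workflow Weave enables, the public weave Python library records the inputs and outputs of decorated functions so that individual inference calls can be inspected and evaluated later. The project name and model stub below are placeholders:

```python
# Trace an inference call with W&B Weave; the decorated function's
# inputs and outputs are logged for later debugging and evaluation.
import weave

weave.init("example-inference-project")  # placeholder project name

@weave.op()
def generate(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"echo: {prompt}"

generate("Why does MFU matter for training cost?")
```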