CoreWeave is the Essential Cloud for AI. CoreWeave Mission Control offloads cluster health management from your team to ours. Get industry-leading reliability and resiliency for AI infrastructure, with a typical node goodput of 96%. Each component of Mission Control is purpose-built to provide your teams with highly performant, resilient, and reliable AI infrastructure. Unlock higher performance and utilization from your clusters for faster time to market.
At CoreWeave, we're not just building a company; we intend to build the foundation of the AI era. Our pioneering AI cloud platform is the only cloud purpose-built for AI and accelerated workloads that is operating at hyperscale. At CoreWeave, we aim to help customers unlock more value from their infrastructure. We continue to push GPUs closer to their theoretical maximum model FLOPs utilization (MFU). In recent runs to benchmark our training capabilities, a CoreWeave cluster achieved an MFU that was 20% greater than the baseline, which we believe to be 35 to 45%. That incremental 20% FLOPs-utilization improvement is enormously valuable to customers. It means they are able to get their models to market faster, and it means significantly higher performance for inference compute, bringing down the total cost of ownership.

The enhanced performance and greater efficiency of our core cloud platform impact every step of the development process. You see it on day one, when developers are able to run workloads within hours on infrastructure without needing to spend time burning in or testing their clusters, and you see it on an ongoing basis as they experience fewer disruptions and better throughput. It extends to both training and inference. For training, we compared publicly available data on Llama 3 training-job performance to a comparable workload on our platform: you can save 3.1 million GPU hours and experience 50% fewer interruptions per day. For inference, you can experience five times faster model download speeds and ten times faster spin-up times. Your inference at scale is more performant and costs less. How do we achieve this?
We achieve it by being purpose-built for AI at every layer, and we reinforce our performance and efficiency advantage, and make it durable, through our consistent track record of innovation, as we continuously solve challenges for our customers and innovate for them to deliver a better high-performance cloud solution. At the infrastructure level, we deliver cutting-edge GPU and CPU components with our proprietary DPU architecture, Nimbus, which enables greater networking and storage efficiency. Our networking is optimized for high-performance throughput, and our storage is specifically built to handle high-performance workloads. At the managed-software-service level, our proprietary orchestration framework, CKS, is built on Kubernetes and is designed specifically to make it easier to schedule and run high-performance workloads with significant scaling demands. At the application level, we've built AI-specific tools for our customers that help them load models faster and run their clusters more efficiently, as well as dedicated capabilities for inference to handle burst workloads and reduce latency for end users. To bring this all together cohesively, the workload-monitoring and infrastructure-lifecycle-management capabilities that we call Mission Control are designed to ensure that our fleets run at peak performance by driving transparency, proactive remediation, and automation across our platform.

Purpose-built also extends to our data centers. We have created one of the most sophisticated high-performance data center footprints in the world, incorporating advanced cooling technologies to efficiently utilize space and deliver greater power per rack.
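As a hypothetical illustration of the Kubernetes-based scheduling described above: on any Kubernetes cluster running the NVIDIA device plugin, a high-performance workload claims GPUs through the `nvidia.com/gpu` extended resource, and the scheduler places the pod only on a node with that many GPUs free. This is the generic Kubernetes pattern, not a CKS-specific API; the pod name and image are invented for the example.

```python
# Generic Kubernetes GPU-scheduling pattern (not a CKS-specific API):
# GPUs are requested via the extended resource name exposed by the
# NVIDIA device plugin; the scheduler bin-packs pods onto nodes that
# have the requested GPU count available.

import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},               # hypothetical name
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "example.com/trainer:latest",   # hypothetical image
            "resources": {
                "limits": {"nvidia.com/gpu": "8"},   # claim all 8 GPUs on a node
            },
        }],
    },
}

print(json.dumps(pod, indent=2))
```

Requesting whole-node GPU counts like this is a common way to keep multi-GPU training jobs from being fragmented across nodes.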
Our approach is truly one of AI specialization in depth. Our entire technology stack, and everything we do, has AI and high-performance workloads in mind. This focus at each layer of our platform delivers improved MFU overall. We bring it all together into a comprehensive and highly flexible platform: the CoreWeave cloud platform, built with our customers top of mind. Our platform's composability allows AI teams to benefit from customizable solutions, deployment flexibility, and the ability to integrate with the emerging ecosystem of our AI partners. And crucially, it is highly secure, with zero-trust design at the core of everything we do. Our adherence to stringent security standards across our stack enables us to serve the most discerning customers and to build relationships with leading enterprises who trust us with their IP and most sensitive data.

A fundamental moat that I want to stress, and which differentiates us, is our ability to monitor aggregated performance statistics across our expanding infrastructure fleet. Purpose-built for AI means we see exactly the right metrics, we see more of them, and we're able to take action on them. These unique insights enable us to reinforce our solution and to continuously optimize, innovate, and improve performance across our platform. These metrics are used only internally, to improve our services for our customers, and this will only increase and sustain the durability and depth of our moat.

Our track record is one of relentless innovation on the most pressing challenges our customers face in a fast-moving market. I'll give you some specific examples. Each time new infrastructure technologies have rolled out and changed the infrastructure paradigm, CoreWeave has been on the leading edge of enabling that technology for our customers.
We were among the first to market with NVIDIA H100s and H200s at production scale, and recently we were acknowledged as the first cloud service provider to make Blackwell GB200 clusters generally available. When power became a major constraint, we identified it early and secured future capacity to enable growth. When one of our key customers needed high storage throughput, we developed an innovative new object storage solution to deliver that performance. When customers experienced challenges interoperating between Slurm and Kubernetes orchestration frameworks, we gave them that capability through our SUNK service, which integrates the two frameworks. This allows both training and inference to run on the same infrastructure, which is a massive efficiency unlock for our customers. And when GPUs started being pushed to their limits by AI, we innovated on Mission Control to predict and minimize performance failures.
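The proactive-remediation idea behind that kind of fleet management can be sketched with a deliberately simplified health policy. The signals used here (driver Xid errors, double-bit ECC errors, NVLink replays) are real GPU health indicators, but the thresholds, node names, and policy are invented for illustration and are not CoreWeave's actual Mission Control logic.

```python
# Illustrative sketch of proactive GPU-node remediation (not CoreWeave's
# actual Mission Control implementation): flag nodes for draining based
# on hardware error counters before they fail a running job.

from dataclasses import dataclass

@dataclass
class NodeHealth:
    name: str
    xid_errors: int       # GPU driver (Xid) errors since last reset
    ecc_dbe: int          # double-bit ECC errors (uncorrectable)
    nvlink_replays: int   # link-level retransmissions

def needs_remediation(n: NodeHealth) -> bool:
    # Thresholds are illustrative; real fleets tune them per failure mode.
    return n.ecc_dbe > 0 or n.xid_errors >= 3 or n.nvlink_replays >= 100

fleet = [
    NodeHealth("gpu-node-01", xid_errors=0, ecc_dbe=0, nvlink_replays=2),
    NodeHealth("gpu-node-02", xid_errors=5, ecc_dbe=1, nvlink_replays=0),
]

to_drain = [n.name for n in fleet if needs_remediation(n)]
print("drain:", to_drain)
```

The general pattern is the point: catch degrading nodes from telemetry and cycle them out of the fleet before they interrupt a training run.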
Our recent acquisition of Weights & Biases is a testament to the direction we are headed and accelerates our roadmap. It enables AI developers on our platform to access more innovative tools and environments to build and monitor model performance, create applications, leverage proprietary or open-source models, and put these applications into production use cases, including inference workloads. Specifically, the Weave solution extends our current offering by adding critical debugging capabilities and evaluation frameworks. These tools help ensure models work seamlessly when deployed into production environments and allow engineers to drive higher accuracy and better user experiences with their inference workloads.