Observability | CoreWeave

Trusted by leading AI labs, platforms, and enterprises

More visibility, fewer issues

The larger the cluster, the more opportunities there are for interruptions or unexpected issues. More interruptions lead to less time spent on training and inference, ultimately increasing the cost of building GenAI.

AI innovators need next-level visibility to achieve peak reliability, performance, and cost-effectiveness.

From AI model development to training to inference, CoreWeave Observe™ gives deep insights into granular performance metrics. That’s why CoreWeave customers experience up to 96% goodput, compared to the industry average of 90%.
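
As a rough illustration of what those goodput figures mean in practice, the sketch below treats goodput as the share of reserved cluster time spent on productive work. That definition, the function name, and the 1,000-hour reservation are assumptions made for illustration, not CoreWeave's published methodology.

```python
def goodput(productive_hours: float, total_hours: float) -> float:
    """Fraction of reserved cluster time spent on useful work (assumed definition)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return productive_hours / total_hours


# Hypothetical 1,000-hour reservation, compared at the two goodput levels quoted above.
total_hours = 1000.0
lost_at_90 = total_hours * (1 - 0.90)  # hours lost to interruptions at 90% goodput
lost_at_96 = total_hours * (1 - 0.96)  # hours lost to interruptions at 96% goodput
recovered = lost_at_90 - lost_at_96

print(f"Productive hours recovered per {total_hours:.0f} reserved: {recovered:.0f}")
```

Under these assumptions, the gap between the two figures is the number of GPU-hours that go back into training or inference rather than recovery.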

Actionable insights

With CoreWeave Observe™, an array of detailed metrics and data is available for your clusters out of the box, with no setup or extra charges.


Comprehensive hardware dashboards

Visualize your entire fleet of NVIDIA GPUs in one place, complete with details about every node.

SUNK integration

Overlay telemetry from hardware, Kubernetes, and Slurm jobs to quickly identify problem areas.

Cluster health management

Instead of worrying about infrastructure, offload GPU cluster health management to CoreWeave so that you can focus on shipping cutting-edge AI applications.

Network backbone visibility

See real-time ingress and egress throughput between each node in your cluster and external Internet endpoints, such as external sources of model weights, to identify under-optimized workloads.

Easy visualization

CoreWeave makes it easy to trace training job interruptions all the way down to a networking or infrastructure problem. Optimize your workloads to take full advantage of CoreWeave’s blazing-fast Kubernetes-on-bare-metal infrastructure.

Faster debugging for training jobs

When your training jobs are interrupted or your system’s training performance unexpectedly declines, pinpointing the root cause often requires hours combing through cluster or infrastructure logs, consulting AI platform engineers, and worrying about whether or not the run should be restarted.

CoreWeave’s infrastructure observability is directly integrated into Weights & Biases workspaces, providing a differentiated, seamless debugging experience. Infrastructure-level alerts, such as node failures and network timeouts, are embedded in training metric plots and run tables. This allows AI engineers to instantly determine whether an issue originates from their infrastructure or their model training routine, saving valuable engineering time and GPU resources.


Unprecedented transparency


GPU performance metrics

Get metrics from individual NVIDIA GPUs and network interfaces on temperature, power consumption, and more, to help ensure effective utilization of your clusters. All with zero setup.

Logs and metrics

See logs and metrics from all levels of the software stack for a holistic view of your entire setup, including audit logs from all services to help ensure full compliance and auditability.

You see what we see

Down-to-the-metal metrics give you the same visibility that we have into performance, server-level status, node lifecycle orchestration, and storage and network metrics. Unlike other CSPs, your view is our view.

Fast access to granular metrics

With CoreWeave Observe™, an array of detailed metrics and data is available for your clusters out of the box, with no setup or extra charges.

CoreWeave Grafana

Fully managed Grafana experience with curated dashboards based on CoreWeave’s expertise in operating massive-scale supercomputing fleets.

CoreWeave Metrics

Fully managed VictoriaMetrics API, backed by one of the largest VictoriaMetrics deployments in the world.
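
VictoriaMetrics supports the Prometheus querying API, so an instant query can be expressed as a plain HTTP request. The sketch below only builds the request URL. The base URL is a placeholder, and the metric name (taken from NVIDIA's DCGM exporter) is an assumption about what a cluster exposes, not a documented CoreWeave endpoint or metric.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-style instant query URL (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder endpoint and an assumed GPU temperature metric name.
metrics_url = build_instant_query_url(
    "https://metrics.example.com",
    "avg(DCGM_FI_DEV_GPU_TEMP)",
)
print(metrics_url)
```

The resulting URL can be passed to any HTTP client together with whatever authentication the managed endpoint requires.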

CoreWeave Logs

Fully managed Loki API with data available for immediate access—no need to wait to rehydrate logs or retrieve them from cold storage.
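
As a sketch of what querying a Loki API looks like, the snippet below builds a range query against Loki's standard `/loki/api/v1/query_range` endpoint for the last 15 minutes of a window. The base URL, the `namespace` label, and the search string are hypothetical placeholders rather than CoreWeave-specific values.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_log_query_url(base_url: str, logql: str, end: datetime,
                        window: timedelta, limit: int = 100) -> str:
    """Build a Loki range query URL covering [end - window, end]."""
    start = end - window
    params = {
        "query": logql,
        # Loki accepts start and end as Unix epoch timestamps in nanoseconds.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical endpoint and label selector; filters log lines containing "error".
window_end = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
logs_url = build_log_query_url(
    "https://logs.example.com",
    '{namespace="training"} |= "error"',
    end=window_end,
    window=timedelta(minutes=15),
)
print(logs_url)
```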

Fully managed telemetry forwarding

Experience fully managed telemetry forwarding to external data platforms or on-prem endpoints via CoreWeave Telemetry Relay.

A holistic view of your workloads, backed by a dedicated team

With CoreWeave Observe™, you'll get the support of a dedicated team in addition to detailed metrics.

FleetOps

Our FleetOps team monitors common signs of deterioration across the entire fleet, leveraging extensive know-how around cluster health and status.

24/7 support

Around-the-clock support helps keep your clusters online and restores them quickly when interruptions happen.

A deep technical partnership

When building on CoreWeave, your teams will never feel like they’re on their own. Dedicated teams monitor your entire cluster environment, helping ensure you get the most out of CoreWeave Cloud.

Responsive and collaborative

Jay Shin, CEO and Co-Founder of Trillion Labs

Ready to get started?

See more with CoreWeave Observe™.
