Transforming AI Fabric with ONES: Enhanced Observability for GPU Performance
Explore the latest in AI network management with our ONES 3.0 series
Future of Intelligent Networking for AI Fabric Optimization
If you’re operating a high-performance data center or managing AI/ML workloads, ONES 3.0 offers advanced features that ensure your network remains optimized and congestion-free, with lossless data transmission as a core priority.
In today’s fast-paced, AI-driven world, network infrastructure must evolve to meet the growing demands of high-performance computing, real-time data processing, and seamless communication. As organizations build increasingly complex AI models, the need for low-latency, lossless data transmission and sophisticated scheduling of network traffic has become critical. ONES 3.0 is designed to address these requirements by offering cutting-edge tools for managing AI fabrics with precision and scalability.
Building on the solid foundation laid by ONES 2.0, where RoCE (RDMA over Converged Ethernet) support enabled lossless communication and enhanced proactive congestion management, ONES 3.0 takes these capabilities to the next level. We’ve further improved RoCE features with the introduction of PFC Watchdog (PFCWD) for enhanced fault tolerance, Scheduler for optimized traffic handling, and WRED for intelligent queue management, ensuring that AI workloads remain highly efficient and resilient, even in the most demanding environments.
Why RoCE is Critical for Building AI Models
As the next generation of AI models requires vast amounts of data to be transferred quickly and reliably across nodes, RoCE becomes an indispensable technology. By enabling remote direct memory access (RDMA) over Ethernet, RoCE facilitates low-latency, high-throughput, and lossless data transmission — all critical elements in building and training modern AI models.
In AI workloads, scheduling data packets effectively ensures that model training is not delayed due to network congestion or packet loss. RoCE’s ability to prioritize traffic and ensure lossless data movement allows AI models to operate at optimal speeds, making it a perfect fit for today’s AI infrastructures. Whether it’s transferring large datasets between GPU clusters or ensuring smooth communication between nodes in a distributed AI system, RoCE ensures that critical data flows seamlessly without compromising performance.
Enhancing RoCE Capabilities from ONES 2.0 to ONES 3.0
In ONES 3.0, we’ve taken RoCE management even further, enhancing the ability to monitor and optimize Priority Flow Control (PFC) and ensuring lossless RDMA traffic under heavy network loads. The new PFC Watchdog (PFCWD) ensures that any misconfiguration or failure in flow control is detected and addressed in real-time, preventing traffic stalls or congestion collapse in AI-driven environments.
Additionally, ONES 3.0’s Scheduler allows for more sophisticated data packet scheduling, ensuring that AI tasks are executed with precision and efficiency. Combined with WRED (Weighted Random Early Detection), which intelligently manages queue drops to prevent buffer overflow in congested networks, ONES 3.0 provides a holistic solution for RoCE-enabled AI fabrics.
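To make the queue-management behavior concrete, here is a minimal Python sketch of the classic WRED curve: no drops below a minimum threshold, a linear ramp in drop (or ECN-mark) probability up to a maximum threshold, and forced drop beyond it. The threshold and probability values are illustrative only, not ONES or switch defaults.

```python
import random

def wred_drop_probability(avg_queue_depth, min_th, max_th, max_p):
    """Classic WRED curve: no drops below min_th, a linear ramp up to
    max_p at max_th, and forced (tail-style) drop above max_th."""
    if avg_queue_depth < min_th:
        return 0.0
    if avg_queue_depth >= max_th:
        return 1.0
    return max_p * (avg_queue_depth - min_th) / (max_th - min_th)

def should_drop(avg_queue_depth, min_th=20, max_th=80, max_p=0.1):
    """Randomly drop (or ECN-mark) a packet based on the WRED curve."""
    return random.random() < wred_drop_probability(avg_queue_depth, min_th, max_th, max_p)

if __name__ == "__main__":
    for depth in (10, 40, 70, 90):
        p = wred_drop_probability(depth, 20, 80, 0.1)
        print(f"avg queue depth {depth:>3} -> drop/mark probability {p:.2f}")
```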
The Importance of QoS and RoCE in AI Networks
Quality of Service (QoS) and RoCE are pivotal in ensuring that AI networks can handle the rigorous demands of real-time processing and massive data exchanges without performance degradation. In environments where AI workloads must process large amounts of data between nodes, QoS ensures that critical tasks receive the required bandwidth, while RoCE ensures that this data is transmitted with minimal latency and no packet loss.
With AI workloads demanding real-time responsiveness, any network inefficiency or congestion can slow down AI model training, leading to delays and sub-optimal performance. The advanced QoS mechanisms in ONES 3.0, combined with enhanced RoCE features, provide the necessary tools to prioritize traffic, monitor congestion, and optimize the network for the low-latency, high-reliability communication that AI models depend on.
In ONES 3.0, QoS features such as DSCP mapping, WRED, and scheduling profiles allow customers to:
1. Prioritize AI-related traffic over other types of traffic, ensuring faster model training and lower latency.
2. Avoid congestion and packet loss, especially during periods of high traffic (a DSCP classification sketch follows this list).
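As a concrete illustration of DSCP-based classification, the sketch below maps DSCP values to egress traffic classes. The DSCP values and queue numbers are hypothetical placeholders; actual mappings are deployment-specific and configured per fabric.

```python
# Hypothetical DSCP -> traffic class mapping. DSCP 26 is often used for RoCEv2
# data and DSCP 48 for CNPs, but the exact values are deployment-specific.
DSCP_TO_TC = {
    26: 3,   # RoCE data traffic -> lossless queue 3
    48: 6,   # Congestion Notification Packets -> high-priority queue 6
}
DEFAULT_TC = 0   # everything else -> best-effort queue

def classify(dscp: int) -> int:
    """Return the egress traffic class (queue) for a packet's DSCP value."""
    return DSCP_TO_TC.get(dscp, DEFAULT_TC)

if __name__ == "__main__":
    for dscp in (26, 48, 0):
        print(f"DSCP {dscp:>2} -> queue {classify(dscp)}")
```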
By leveraging QoS in combination with RoCE, ONES 3.0 creates an optimized environment for AI networks, allowing customers to confidently build and train next-generation AI models without worrying about data bottlenecks.
ONES 3.0 surfaces these capabilities through its observability UI; the following sections highlight its key features.
1. Comprehensive Interface and Performance Metrics
The UI showcases essential network performance indicators such as In/Out packet rates, errors, and discards, all displayed in real time. These metrics give customers the ability to:
1. Track traffic patterns and identify congestion points.
2. Quickly detect and troubleshoot network anomalies, ensuring smooth data transmission.
By having access to real-time and historical data, customers can make data-driven decisions to optimize network performance without sacrificing the quality of their AI workloads.
2. RoCE Config Visualization
RoCE (RDMA over Converged Ethernet) is a key technology used to achieve high-throughput and low-latency communication, especially when training AI models, where data packets must flow without loss. In ONES 3.0, the RoCE tab within the UI offers full transparency into how data traffic is managed:
1. DSCP Mapping: Differentiated Services Code Point (DSCP) values are mapped to specific traffic queues, ensuring proper prioritization of packets.
2. 802.1p Mapping: Allows customers to map Layer 2 traffic to different queues based on priority, which helps in optimizing scheduling for time-sensitive traffic.
3. WRED and Scheduler Profiles: The Weighted Random Early Detection (WRED) profile and Scheduler profile work together to prevent congestion, ensuring that traffic is queued and forwarded efficiently and that no packet is lost, which is critical for AI data pipelines.
4. PFC and PFC Watchdog: Priority Flow Control (PFC) is a cornerstone feature that allows customers to create lossless data paths for high-priority traffic, ensuring that important data is never dropped, even during network congestion. The PFC Watchdog monitors lossless queues and recovers them if flow control leaves them stalled, preventing data bottlenecks (a simplified watchdog sketch follows this list).
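For readers unfamiliar with the watchdog concept, here is a deliberately simplified Python sketch of the general PFC watchdog idea: detect a lossless queue that has been paused for too long, apply a mitigation, and then restore normal operation. The interval values and callback names are illustrative assumptions, not ONES or switch internals.

```python
import time

DETECTION_INTERVAL_S = 0.2    # how long a queue may stay paused before acting (illustrative)
RESTORATION_INTERVAL_S = 2.0  # how long to hold the mitigation before restoring (illustrative)

def pfc_watchdog(queue, is_paused, mitigate, restore):
    """Deliberately simplified PFC watchdog loop for a single lossless queue.

    is_paused(queue) -> bool : True while the queue is held by PFC pause frames
    mitigate(queue)          : e.g. drop or forward traffic so the fabric does not stall
    restore(queue)           : re-enable normal lossless behaviour for the queue
    """
    paused_since = None
    while True:
        if is_paused(queue):
            if paused_since is None:
                paused_since = time.monotonic()
            elif time.monotonic() - paused_since >= DETECTION_INTERVAL_S:
                mitigate(queue)                     # break the pause storm instead of stalling
                time.sleep(RESTORATION_INTERVAL_S)  # give the fabric time to drain and recover
                restore(queue)
                paused_since = None
        else:
            paused_since = None
        time.sleep(0.05)                            # polling interval (illustrative)
```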
3. Visual Traffic Monitoring: A Data-Driven Experience
The UI doesn’t just give you raw data — it helps you visualize it. With multiple graphing options and real-time statistics, customers can easily monitor:
1. Transmit Packets: Keep an eye on both RoCE and normal traffic, and make sure that high-priority AI data packets are transmitted efficiently across the network.
2. PFC Counters: Get detailed insights into PFC activity, including inbound and outbound traffic, ensuring that flow control mechanisms are functioning as intended.
3. Queue Drop Counters: Understand where and when packet drops happen. By tracking packet discards, you can identify and address congestion issues, improving your network’s overall performance.
4. Congestion Notification Packets (CNP): RoCE relies heavily on lossless data transmission, and CNPs play a crucial role in signaling congestion. Monitoring CNP activity ensures that your network responds dynamically to traffic demands, minimizing packet loss and delay (a counter-analysis sketch follows this list).
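The sketch below shows one way such counters can be turned into actionable congestion signals: take two snapshots, compute per-second rates, and flag sustained queue drops or CNP bursts. The counter names, thresholds, and data structure are hypothetical and only illustrate the kind of analysis the UI automates.

```python
from dataclasses import dataclass

@dataclass
class QueueCounters:
    """Hypothetical per-queue counter snapshot (field names are illustrative)."""
    tx_packets: int
    pfc_rx_frames: int
    pfc_tx_frames: int
    queue_drops: int
    cnp_packets: int

def congestion_signals(prev: QueueCounters, curr: QueueCounters, interval_s: float) -> dict:
    """Turn two counter snapshots into per-second rates and simple congestion flags."""
    rate = lambda a, b: (b - a) / interval_s
    signals = {
        "tx_pps": rate(prev.tx_packets, curr.tx_packets),
        "pfc_pps": rate(prev.pfc_rx_frames + prev.pfc_tx_frames,
                        curr.pfc_rx_frames + curr.pfc_tx_frames),
        "drop_pps": rate(prev.queue_drops, curr.queue_drops),
        "cnp_pps": rate(prev.cnp_packets, curr.cnp_packets),
    }
    # Illustrative thresholds: any sustained drops or a CNP burst suggests congestion.
    signals["congested"] = signals["drop_pps"] > 0 or signals["cnp_pps"] > 100
    return signals

if __name__ == "__main__":
    before = QueueCounters(1_000_000, 10, 5, 0, 50)
    after = QueueCounters(1_450_000, 220, 80, 12, 900)
    print(congestion_signals(before, after, interval_s=10.0))
```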
4. Flexible Time-Based Monitoring and Analysis
Customers have the option to track metrics over various time periods, from live updates (1 hour) to historical views (12 hours, 2 weeks, etc.). This flexibility allows customers to:
1. Analyze short-term network behavior for immediate troubleshooting.
2. Review long-term trends for capacity planning and optimization.
This feature is especially valuable for customers running AI workloads, where consistent performance over extended periods is vital for the accuracy and efficiency of model training.
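Conceptually, long-term views are built by rolling fine-grained samples up into coarser buckets. The sketch below averages per-minute samples into hourly points, roughly the kind of aggregation a 2-week capacity-planning view relies on; the bucket size and sample data are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_average(samples, bucket_seconds=3600):
    """Average (timestamp, value) samples into fixed-size time buckets,
    e.g. per-hour points for a long-term capacity-planning view."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts // bucket_seconds) * bucket_seconds].append(value)
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }

if __name__ == "__main__":
    # Four hours of synthetic per-minute utilization samples.
    samples = [(1_700_000_000 + i * 60, 40 + (i % 30)) for i in range(240)]
    for hour, avg in bucket_average(samples).items():
        print(hour.isoformat(), f"{avg:.1f}")
```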
Centralized QoS View
ONES 3.0 offers a unified interface for all QoS configurations, including DSCP to TC mappings, WRED/ECN, and scheduler profiles, making traffic management simpler for network admins.
1. Convenient UI Access: Eliminates the need for manual CLI commands by providing a UI-based platform to view all QoS configurations across switches, saving time and effort.
2. Real-Time Monitoring: Enables immediate detection and resolution of traffic issues with live updates on queue and profile status, reducing troubleshooting time.
This page provides administrators with comprehensive insights into how traffic flows through the network, allowing them to fine-tune and optimize their configurations to meet the unique demands of modern workloads.
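The value of a centralized view is easiest to see as a fabric-wide consistency check. The sketch below gathers hypothetical per-switch QoS snapshots and flags any switch whose settings deviate from the fabric-wide baseline; the switch names, field names, and profile names are placeholders, not the ONES data model.

```python
from collections import Counter

# Hypothetical per-switch QoS snapshots; all names are illustrative placeholders.
switch_qos = {
    "leaf-01": {"dscp_to_tc": {26: 3, 48: 6}, "wred_profile": "roce-ecn", "scheduler": "DWRR"},
    "leaf-02": {"dscp_to_tc": {26: 3, 48: 6}, "wred_profile": "roce-ecn", "scheduler": "DWRR"},
    "leaf-03": {"dscp_to_tc": {26: 4, 48: 6}, "wred_profile": "default",  "scheduler": "STRICT"},
}

def find_outliers(configs):
    """Flag switches whose QoS settings differ from the most common configuration."""
    fingerprints = {name: repr(sorted(cfg.items())) for name, cfg in configs.items()}
    baseline = Counter(fingerprints.values()).most_common(1)[0][0]
    return [name for name, fp in fingerprints.items() if fp != baseline]

if __name__ == "__main__":
    print("switches deviating from the fabric-wide QoS baseline:", find_outliers(switch_qos))
```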

Comprehensive Topology View
ONES offers a comprehensive, interactive map of network devices and their connectivity, ideal for monitoring AI/ML and RoCE environments. It provides an actionable overview that simplifies network management.

Key features include:
1. Real-Time Device Status: Users can monitor individual devices, such as servers and switches, viewing critical details like availability, region, port speeds, and telemetry, helping to quickly detect and resolve issues like non-streaming devices.
2. Fault Detection: The interface highlights issues such as faulty fans, PSUs, or downed links, enabling swift corrective action to prevent network disruptions.
3. Detailed Device Information: Upon selecting a device, detailed metadata is displayed, including hardware specifics (e.g., GPU/CPU info), agent details, uptime, and port configuration, aiding in troubleshooting and performance assessment.
4. Link Details: When clicking on a link in the topology map, users can view all devices connected via that link, providing deeper insight into traffic paths and dependencies. This feature is crucial for diagnosing link-related issues and understanding how data flows between devices.
5. Traffic Monitoring: Analyze metrics like traffic load across links to detect bottlenecks and optimize traffic flows, ensuring smooth performance for high-priority workloads like AI/ML tasks.
6. Easy Navigation and Filtering: The ability to filter by topology type, device statuses, and regions simplifies the monitoring of large, complex networks, increasing management efficiency.
Overall, the Topology Page in ONES enhances network observability and control, making it easier to optimize performance, troubleshoot issues, and ensure the smooth operation of AI/ML and RoCE workloads.
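As a rough illustration of what such a topology view computes under the hood, the sketch below models devices and links with a few illustrative fields and produces the at-a-glance health summary described above (non-streaming devices, hardware faults, hot links). The data model is an assumption for demonstration purposes, not the ONES schema.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    role: str                 # e.g. "switch" or "server"
    streaming: bool           # whether telemetry is currently being received
    faults: list = field(default_factory=list)  # e.g. ["fan-1", "psu-2"]

@dataclass
class Link:
    a: str
    b: str
    utilization: float        # fraction of link capacity in use, 0.0 - 1.0

def health_report(devices, links, hot_link_threshold=0.8):
    """Summarize the issues a topology view surfaces at a glance."""
    return {
        "non_streaming": [d.name for d in devices if not d.streaming],
        "faulted": {d.name: d.faults for d in devices if d.faults},
        "hot_links": [(l.a, l.b, l.utilization)
                      for l in links if l.utilization >= hot_link_threshold],
    }

if __name__ == "__main__":
    devices = [Device("spine-01", "switch", streaming=True),
               Device("leaf-02", "switch", streaming=True, faults=["fan-1"]),
               Device("gpu-node-07", "server", streaming=False)]
    links = [Link("spine-01", "leaf-02", 0.91), Link("leaf-02", "gpu-node-07", 0.35)]
    print(health_report(devices, links))
```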
Proactive Monitoring and Alerts with the Enhanced ONES Rule Engine
The ONES Rule Engine has been a standout feature in previous releases, providing robust monitoring and alerting capabilities for network administrators. With the latest update, we’ve enhanced the usability and functionality, making rule creation and alert configuration even smoother and more intuitive. Whether monitoring RoCE metrics or AI-Fabric performance counters, administrators can now set up alerts with greater precision and ease. This new streamlined experience allows for better anomaly detection, helping prevent network congestion and data loss before they impact performance.
The ONES Rule Engine offers cutting-edge capabilities for proactive network management, enabling real-time anomaly detection and alerting. It provides deep visibility into AI-Fabric metrics like queue counters, PFC events, packet rates, and link failures, ensuring smooth performance for RoCE-based applications. By allowing users to set custom thresholds and conditions for congestion detection, the Rule Engine ensures that network administrators can swiftly address potential bottlenecks before they escalate.
With integrated alerting systems such as Slack and Zendesk, administrators can respond instantly to network anomalies. The ONES Rule Engine’s automation streamlines monitoring and troubleshooting, helping prevent data loss and maintain optimal network conditions, ultimately enhancing the overall network efficiency.
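The following sketch captures the general pattern of threshold-based alerting with a webhook notification: evaluate an observed metric against a rule and, on violation, post a message to a Slack incoming webhook. The rule schema and field names are hypothetical and do not reflect the actual ONES Rule Engine configuration format.

```python
import json
import urllib.request

# Illustrative rule: field names, thresholds, and the webhook URL are placeholders.
RULE = {
    "metric": "queue_drop_rate_pps",
    "threshold": 0,
    "comparison": "greater_than",
    "message": "Queue drops detected on {device}/{queue}: {value} pps",
}

def evaluate(rule, device, queue, value):
    """Return an alert message if the observed value violates the rule, else None."""
    if rule["comparison"] == "greater_than" and value > rule["threshold"]:
        return rule["message"].format(device=device, queue=queue, value=value)
    return None

def notify_slack(webhook_url, text):
    """Post an alert to a Slack incoming webhook (standard JSON payload)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    alert = evaluate(RULE, device="leaf-02", queue="uc3", value=12)
    if alert:
        print(alert)  # swap for notify_slack("<your-webhook-url>", alert) in practice
```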
Conclusion
In an era where AI and machine learning are driving transformative innovations, the need for a robust and efficient network infrastructure has never been more critical. ONES 3.0 ensures that AI workloads can operate seamlessly, with minimal latency and no packet loss.
FAQs
1. Why is RoCE critical for AI infrastructure and model training?
A. RoCE (RDMA over Converged Ethernet) is essential for AI because it enables:
- Low-latency, high-throughput data transfers between GPU nodes
- Lossless communication, vital for real-time model training
- Efficient memory access without CPU involvement
This makes RoCE a foundational technology for building and scaling AI/ML workloads.
2. How does ONES 3.0 improve RoCE management and observability?
A. ONES 3.0 advances RoCE integration through:
- PFC Watchdog (PFCWD) for monitoring and recovering from flow control issues
- Advanced scheduling tools (DWRR, WRR, STRICT) to manage packet priorities
- WRED-based queue management to prevent buffer overflows
These features ensure network reliability, even under high AI traffic loads (a simplified DWRR sketch follows this answer).
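For readers curious how weighted scheduling translates into transmission opportunities, here is a simplified Deficit Weighted Round Robin (DWRR) sketch: each queue earns credit proportional to its weight every round and sends packets until the credit is spent. The quantum, weights, and packet sizes are illustrative, and real switch schedulers run in hardware rather than in software loops like this.

```python
from collections import deque

def dwrr_round(queues, weights, deficits, quantum=1500):
    """One Deficit Weighted Round Robin pass over per-queue packet-size deques.

    Each queue earns weight * quantum bytes of credit per round and may send
    packets until its credit runs out; leftover credit carries to the next round.
    """
    sent = []
    for name, q in queues.items():
        credit = deficits.get(name, 0) + weights[name] * quantum
        while q and q[0] <= credit:
            credit -= q.popleft()
            sent.append(name)
        deficits[name] = credit if q else 0  # drop leftover credit once the queue empties
    return sent

if __name__ == "__main__":
    queues = {"roce": deque([1500] * 8), "best_effort": deque([1500] * 8)}
    weights = {"roce": 4, "best_effort": 1}  # illustrative 4:1 weighting, not an ONES default
    deficits = {}
    print(dwrr_round(queues, weights, deficits))  # the RoCE queue gets ~4x the opportunities
```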
3. What QoS features are included in ONES 3.0 for optimizing AI network traffic?
A. Quality of Service (QoS) is crucial for prioritizing AI tasks. ONES 3.0 includes:
- DSCP and dot1p mapping for accurate traffic classification
- Priority queue configuration to handle mission-critical packets
- Real-time congestion alerts and traffic shaping for lossless AI data transmission
Together, these ensure uninterrupted, high-performance AI workloads.
4. How does ONES 3.0 visualize and monitor RoCE configurations in real time?
A. The ONES UI provides deep visibility into:
- DSCP and 802.1p mapping to queues and priority groups
- WRED and PFC stats for congestion handling
- Scheduler profiles and queue usage across switches
This empowers network admins to proactively tune RoCE traffic and avoid disruptions.
5. What role does the ONES Rule Engine play in maintaining AI network performance?
A. The enhanced ONES Rule Engine enables proactive, automated management through:
- Custom alert rules for RoCE, queue drops, and link failures
- Slack/Zendesk integration for instant anomaly notifications
- Granular threshold settings to prevent issues before they affect AI training
It turns ONES into an intelligent observability and incident response system.