
As AI training and inference demands grow, AI networks have evolved into complex multi-tiered systems, often including distinct front-end networks for user access and management, back-end networks for GPU-to-GPU communication, and dedicated storage networks.
This evolution introduces significant hurdles, including the configuration and management of networking systems from diverse vendors, the need for high throughput and lossless behavior, and the critical synchronization of settings across both network devices and compute nodes.
Aviz Open Networking Enterprise Suite (ONES): Intent-Based Automation for AI Fabrics
At the AI Infrastructure Field Day, Aviz provided a deeper look into its Open Networking Enterprise Suite (ONES), an automation suite it built to facilitate the design, deployment, and monitoring of open networking platforms. Aviz engineered ONES to be hardware-agnostic, working with any networking ASIC, switch, or network operating system that can run SONiC (Software for Open Networking in the Cloud) or Cumulus Linux, thereby supporting a multi-vendor, multi-hardware environment from a single controller.
ONES uses an intent-based management and automation model, where users define the desired state of their entire network fabric rather than configuring devices individually. Fabric-level intent is crucial for operating complex AI networks efficiently, since they can span thousands of compute nodes, switches, and connections.
Network admins use ONES to define the desired state of the network: the inventory, physical connections, and desired configurations like BGP settings, IP pools, and Quality of Service (QoS) parameters. ONES maintains the desired state in user-editable YAML files, which it uses to automate, orchestrate, and maintain each device’s configuration. Aviz provides an open-source repository containing over 70 tested YAML templates for various topologies, from a standard IP Clos architecture to multi-tenancy with VXLAN.
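To make the intent-based idea concrete, here is a minimal Python sketch of expanding a fabric-level description into per-device settings. Everything in it is an assumption for illustration: the field names (`loopback_pool`, `base_asn`, and so on) and the `render_device_configs` helper are hypothetical and do not reflect the actual ONES YAML schema, and the intent is written as a plain dict rather than loaded from YAML so the example needs only the standard library.

```python
import ipaddress

# Hypothetical intent document; field names are illustrative, not the ONES schema.
INTENT = {
    "fabric": "ai-backend",
    "loopback_pool": "10.0.0.0/24",
    "base_asn": 65000,
    "spines": ["spine1", "spine2"],
    "leaves": ["leaf1", "leaf2", "leaf3"],
}


def render_device_configs(intent):
    """Expand a fabric-level intent into one config dict per device."""
    pool = iter(ipaddress.ip_network(intent["loopback_pool"]).hosts())
    spines, leaves = intent["spines"], intent["leaves"]
    configs = {}
    for name in spines + leaves:
        is_spine = name in spines
        # Assumed ASN plan: spines share one ASN, each leaf gets its own.
        asn = intent["base_asn"] if is_spine else intent["base_asn"] + 1 + leaves.index(name)
        configs[name] = {
            "role": "spine" if is_spine else "leaf",
            "asn": asn,
            # Loopbacks are handed out from the pool in device order.
            "loopback": f"{next(pool)}/32",
            # In a two-tier Clos, leaves peer with every spine and vice versa.
            "bgp_neighbors": sorted(leaves if is_spine else spines),
        }
    return configs


if __name__ == "__main__":
    for device, cfg in render_device_configs(INTENT).items():
        print(device, cfg)
```

The point of the sketch is the direction of data flow: the operator edits one fabric-wide document, and per-device values such as ASNs, loopbacks, and neighbor lists are derived from it rather than typed in switch by switch.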
The ONES orchestration process includes two kinds of validation:
- Configuration validation ensures that configurations are syntactically correct and don’t throw errors during application.
- Operational checks verify control plane and data path integrity. In cases of miswiring, for instance, configuration validation might pass, but BGP won’t come up, and ONES will clearly indicate the failure.
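The difference between the two kinds of validation can be sketched in a few lines of Python. This is an illustration only: the required fields, the `validate_config` and `check_bgp_sessions` helpers, and the shape of the observed-state dict are assumptions, not how ONES implements its checks.

```python
REQUIRED_FIELDS = {"hostname", "asn", "bgp_neighbors"}  # hypothetical schema


def validate_config(cfg):
    """Static check: return a list of problems found in one device config."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - set(cfg))]
    asn = cfg.get("asn")
    # BGP AS numbers are 32-bit; 0 is reserved.
    if asn is not None and not (isinstance(asn, int) and 1 <= asn <= 4294967295):
        problems.append(f"invalid ASN: {asn!r}")
    return problems


def check_bgp_sessions(cfg, observed_states):
    """Operational check: return intended neighbors whose session is not Established."""
    return [n for n in cfg["bgp_neighbors"] if observed_states.get(n) != "Established"]


if __name__ == "__main__":
    leaf = {"hostname": "leaf1", "asn": 65001, "bgp_neighbors": ["spine1", "spine2"]}
    print(validate_config(leaf))  # the config itself is well-formed
    # A miswired uplink to spine2 leaves that session stuck short of Established.
    print(check_bgp_sessions(leaf, {"spine1": "Established", "spine2": "Active"}))
```

In this example the static check finds nothing wrong, while the operational check reports `spine2`, which mirrors the miswiring case described above.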
For NVIDIA-based AI reference architectures, ONES can construct the YAML based on simple inputs like the number of GPUs and desired IP addresses. It can also generate NVIDIA Air DOT files, enabling the creation of a digital twin of the physical topology for simulation and testing before actual deployment. This allows users to test configurations and validate network behavior in a virtual environment that mirrors real-world conditions, accelerating the time to operation of complex AI networking fabrics.
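As a rough illustration of turning a topology description into a DOT file, the Python sketch below emits plain Graphviz DOT for a two-tier Clos fabric. It is not the generator ONES uses: the `clos_to_dot` helper and the `swpN` port naming are assumptions, and any node attributes a simulation platform such as NVIDIA Air may expect (for example, the OS image for each node) are deliberately left out.

```python
def clos_to_dot(name, spines, leaves):
    """Return Graphviz DOT text for a two-tier Clos: every leaf links to every spine."""
    lines = [f'graph "{name}" {{']
    for node in spines + leaves:
        lines.append(f'  "{node}"')
    for leaf_idx, leaf in enumerate(leaves, start=1):
        for spine_idx, spine in enumerate(spines, start=1):
            # Assumed port plan: leaf uplink N goes to spine N,
            # and spine downlink M goes to leaf M.
            lines.append(f'  "{leaf}":"swp{spine_idx}" -- "{spine}":"swp{leaf_idx}"')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(clos_to_dot("ai-backend", ["spine1", "spine2"], ["leaf1", "leaf2", "leaf3"]))
```

Because the file is generated from the same inventory that drives the real fabric, the simulated topology and the physical cabling plan cannot silently diverge.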
Streamlining Day 2 Operations and Network Copilot
Aviz simplifies Day 2 operations for AI networks, offering features essential for ongoing management and maintenance, including:
- Backup and restore: ONES automatically makes a known-good baseline backup upon successful orchestration and allows for multiple backups and restores for individual switches or the entire fabric.
- Configuration comparison: Users can compare the current running configuration against the original intent, a known-good backup, or even compare configurations between different switches, to identify configuration drift or unauthorized changes.
- Direct configuration changes: Operators can pull the current running configuration, make specific changes (e.g., adding a VLAN or modifying a router ID), and apply them without re-orchestrating the entire fabric.
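The configuration comparison described above amounts to a text diff between two configurations. A minimal sketch using Python's standard `difflib` module follows; the `config_drift` helper and the sample configuration lines are hypothetical and are not taken from ONES.

```python
import difflib


def config_drift(intended, running, from_label="intent", to_label="running"):
    """Return unified-diff lines between two configuration texts (empty if identical)."""
    return list(difflib.unified_diff(
        intended.splitlines(),
        running.splitlines(),
        fromfile=from_label,
        tofile=to_label,
        lineterm="",
    ))


if __name__ == "__main__":
    intended = "router bgp 65001\n neighbor 10.0.0.1 remote-as 65000\n"
    # Someone added a peer by hand that the intent does not know about.
    running = intended + " neighbor 10.0.0.9 remote-as 65099\n"
    for line in config_drift(intended, running):
        print(line)
```

An empty result means the device matches the reference; any `+` or `-` line is drift to investigate. The same function works whether the reference is the original intent, a known-good backup, or another switch's configuration.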
Further enhancing operational efficiency, Aviz offers Network Copilot, an AI-driven chat interface to all networking tools. Using Network Copilot, operators can ask complex questions in natural language, such as “Tell me what’s going on in my fabric?” or “Compare these two configs and summarize the differences,” and receive quick summaries that would otherwise require deep UI navigation or manual log analysis.
Comprehensive Telemetry for Diagnostics and Fine-Tuning
For diagnostics and fine-tuning of AI networks, Aviz provides comprehensive telemetry supporting end-to-end monitoring across network switches, servers, network interface cards (NICs), and GPUs. This includes:
- RoCE (RDMA over Converged Ethernet) telemetry: Crucial for AI fabrics requiring lossless communication, ONES monitors utilization counters, QoS drops, Priority Flow Control (PFC), and Explicit Congestion Notification (ECN).
- GPU metrics: A lightweight agent on GPU servers collects per-GPU metrics for health, performance, and utilization.
- Anomaly detection: A built-in rule engine allows users to create custom rules for specific metrics (e.g., GPU utilization thresholds), triggering warnings or critical alerts.
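A threshold-based rule engine of the kind described in the last bullet can be sketched as follows. This is a toy model under stated assumptions: the `Rule` class, the `evaluate` function, the metric names, and the thresholds are all hypothetical and do not represent the ONES rule syntax or its telemetry field names.

```python
from dataclasses import dataclass
import operator

# Comparison operators a rule may use.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Rule:
    metric: str
    op: str
    threshold: float
    severity: str  # "warning" or "critical"


def evaluate(rules, sample):
    """Return (severity, metric, value) for each rule the sample violates."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are skipped rather than treated as violations.
        if value is not None and OPS[rule.op](value, rule.threshold):
            alerts.append((rule.severity, rule.metric, value))
    return alerts


# Illustrative rules; metric names and thresholds are made up.
RULES = [
    Rule("gpu_utilization_pct", "<", 20.0, "warning"),
    Rule("gpu_temperature_c", ">=", 90.0, "critical"),
    Rule("pfc_pause_frames_per_s", ">", 1000.0, "warning"),
]

if __name__ == "__main__":
    sample = {
        "gpu_utilization_pct": 12.5,
        "gpu_temperature_c": 71.0,
        "pfc_pause_frames_per_s": 2500.0,
    }
    for alert in evaluate(RULES, sample):
        print(alert)
```

Keeping rules as data rather than code is what lets users add their own thresholds without changing the engine, and the same structure covers GPU metrics and network counters such as PFC pause frames.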
By combining intent-based automation, robust Day 2 operations, AI-powered insights, and comprehensive telemetry, Aviz delivers a software-defined networking solution that helps enterprises navigate the unique challenges of building, deploying, and managing high-performance networks for AI infrastructure at scale.