
AI workloads are shifting from primarily model training to serving those models through inferencing, making network performance and stability critically important. GPUs are data-hungry, demanding continuous, low-latency data streams, so it is paramount to optimize the network infrastructure before the expensive GPUs arrive, ensuring readiness for new generations of hardware and complex workloads. The goal is to avoid treating users as “beta testers” by proactively validating system performance and limits.
Keysight’s Pivotal Role in Network Validation
At AI Infrastructure Field Day, Keysight positioned itself as an “emulation company” dedicated to testing and optimizing network infrastructure to ensure it “operates better than it was before.” Keysight aims to help operators understand the absolute limits of any system, and its approach relies on specialized tools: Keysight CyPerf for front-end networks and the Keysight AI (KAI) Data Center Builder for backend networks.
Front-End Network Testing with CyPerf
The front-end network is crucial for AI workloads, managing the movement of data from external sources such as storage, data lakes, and the public internet to GPU clusters for training and inference. Keysight’s CyPerf is a software-based traffic generator equipped with advanced features for comprehensive front-end validation, including:
- Traffic Emulation—generating many different application traffic patterns, including those found in AI front-end networks as well as regular application traffic such as voice, video, and UDP streaming, along with low-latency flows, to accurately simulate real-world conditions.
- Performance Measurement—measuring critical parameters such as bandwidth, latency, connections and packets per second, and quality of service. CyPerf also assesses the efficiency impact of infrastructure complexities like NAT, proxying, encapsulation, and encryption.
- Real-World Stress Testing—identifying the system’s breaking points by exercising worst-case scenarios, such as simulating a denial-of-service (DoS) attack. This rigorous testing ensures the network can handle unexpected spikes in load and maintain stability under extreme conditions.
- Mitigating Noisy Neighbor Problems—setting and testing metering thresholds for each customer, ensuring that tenants do not bleed over and consume other tenants’ resources in multi-tenant environments.
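To make the front-end metrics above concrete, here is a minimal sketch of the kind of measurement a traffic generator performs — this is a hypothetical illustration using Python’s asyncio against a local echo server, in no way CyPerf’s implementation, and tools like CyPerf do this at vastly larger scale with richer application mixes:

```python
import asyncio
import time

async def echo_handler(reader, writer):
    # Trivial stand-in for a front-end service: echo the request back.
    data = await reader.read(1024)
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def one_request(host, port, payload):
    # Open a connection, send a payload, wait for the echo; return latency.
    t0 = time.perf_counter()
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(payload)
    await writer.drain()
    await reader.read(1024)
    writer.close()
    await writer.wait_closed()
    return time.perf_counter() - t0

async def run_load(n_conns=200, host="127.0.0.1", port=8631):
    server = await asyncio.start_server(echo_handler, host, port)
    t0 = time.perf_counter()
    latencies = await asyncio.gather(
        *(one_request(host, port, b"x" * 512) for _ in range(n_conns)))
    elapsed = time.perf_counter() - t0
    server.close()
    await server.wait_closed()
    lat = sorted(latencies)
    return {
        "connections_per_sec": n_conns / elapsed,
        "p50_ms": lat[len(lat) // 2] * 1e3,
        "p99_ms": lat[int(len(lat) * 0.99)] * 1e3,
    }

if __name__ == "__main__":
    print(asyncio.run(run_load()))
```

Even this toy version surfaces the same questions a real front-end test answers: how many connections per second the path sustains, and how the latency tail (p99) diverges from the median under concurrency.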
Backend Network Testing with KAI Data Center Builder
Backend AI networks facilitate data exchange between GPUs, primarily during model training and increasingly for inference workloads. Keysight designed KAI Data Center Builder to emulate AI workloads to benchmark, fine-tune, reproduce issues, and plan for new network designs while reducing the need for expensive hardware in lab environments.
KAI Data Center Builder can operate either as software on real servers, mimicking GPU data movements using RoCE (RDMA over Converged Ethernet) traffic, or by emulating network cards using Keysight’s 81 traffic generators. This allows comprehensive testing of new network generations even before new GPUs are available. Key aspects of backend network validation include:
- Collective Operations Benchmarking—AI workloads frequently use collective operations to move data between GPUs repeatedly during training. With KAI Data Center Builder, network architects can zoom in on a single transaction to fine-tune network parameters for maximum utilization and fastest completion time, and can compare different network topologies, collective algorithms, and load-balancing parameters.
- Workload Emulation—identifying the critical path in workloads, especially when GPUs compute and move data simultaneously, by focusing on the exposed communication time (the time GPUs spend waiting for data movement to finish) to optimize overall performance.
- Fine-Tuning Congestion Control—effective congestion control is a critical aspect of AI network design. While architects can design a lossless fabric, it’s crucial to understand and fine-tune PFC (Priority-based Flow Control), ECN (Explicit Congestion Notification), and DCQCN (Data Center Quantized Congestion Notification) for real-world AI workloads and traffic patterns.
- Tuning Inferencing Workloads—these workloads demand extremely low-latency data flow, particularly for continuously fetching incremental data required by techniques like retrieval-augmented generation (RAG). The unpredictable nature and potentially sudden, high-volume demands of inferencing require robust and highly optimized networks that can maintain performance and stability under pressure.
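The collective-operations and exposed-communication ideas above can be sketched analytically. The following functions are an illustrative back-of-the-envelope model — assuming uniform link bandwidth, the textbook ring all-reduce cost formula, and a single overlap fraction, not KAI Data Center Builder’s methodology — for estimating a collective’s completion time and how much communication stays exposed when compute only partially hides it:

```python
def ring_allreduce_time(msg_bytes, n_gpus, link_bw_gbps, latency_us=2.0):
    """Textbook ring all-reduce cost: 2*(N-1) steps, each moving msg/N bytes.

    Returns estimated completion time in seconds.
    """
    steps = 2 * (n_gpus - 1)
    chunk_bytes = msg_bytes / n_gpus
    # Each step pays per-hop latency plus serialization of one chunk.
    per_step_s = latency_us * 1e-6 + (chunk_bytes * 8) / (link_bw_gbps * 1e9)
    return steps * per_step_s

def exposed_comm_time(comm_s, compute_s, overlap_fraction):
    """Communication time NOT hidden behind compute (what GPUs wait on).

    overlap_fraction: share of compute time usable to hide communication.
    """
    hidden_s = min(comm_s, compute_s * overlap_fraction)
    return comm_s - hidden_s

if __name__ == "__main__":
    # Hypothetical comparison: 1 GiB all-reduce across 8 GPUs at two link speeds.
    for bw in (400, 800):
        t = ring_allreduce_time(1 << 30, n_gpus=8, link_bw_gbps=bw)
        print(f"{bw} Gbps link: {t * 1e3:.1f} ms")
```

Doubling link bandwidth roughly halves the bandwidth-bound term, which is exactly the kind of topology-versus-algorithm trade-off an emulator lets architects quantify before hardware arrives.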
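Similarly, the DCQCN tuning mentioned above centers on how a sender reacts to ECN-triggered congestion notification packets (CNPs). A highly simplified reaction-point update — omitting DCQCN’s fast-recovery and hyper-increase phases, with illustrative parameter values rather than any vendor’s defaults — might look like this:

```python
def dcqcn_update(rate_gbps, alpha, cnp_received,
                 g=1 / 256, rai_gbps=0.5, max_rate_gbps=400.0):
    """One simplified DCQCN-style sender rate update.

    alpha is the sender's running estimate of congestion severity.
    On a CNP: raise alpha and cut rate multiplicatively.
    Otherwise: decay alpha and recover rate additively toward line rate.
    """
    if cnp_received:
        alpha = (1 - g) * alpha + g              # congestion seen: raise estimate
        rate_gbps = rate_gbps * (1 - alpha / 2)  # multiplicative decrease
    else:
        alpha = (1 - g) * alpha                  # no congestion: decay estimate
        rate_gbps = min(rate_gbps + rai_gbps, max_rate_gbps)  # additive increase
    return rate_gbps, alpha
```

Parameters like g and the additive-increase step are precisely the knobs that behave very differently under bursty AI collectives than under ordinary traffic, which is why emulating realistic workloads before tuning them matters.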
By providing comprehensive tools and methodologies, Keysight empowers AI infrastructure network architects to rigorously test, validate, and optimize their network designs. This proactive approach ensures that data-hungry GPUs receive the data they need efficiently, securely, and with predictable performance, ultimately delivering a world-class AI experience.