During the AI Field Day event hosted last week in Silicon Valley by Tech Field Day, a Futurum Group company, Arista Networks introduced its newly launched Arista AI Agent.
Integrated with Arista’s network configuration and monitoring software, CloudVision, the new AI agent is designed to help manage and monitor top-of-rack (ToR) switches as well as NIC connections, and to debug issues at the server level, all through Arista’s flagship network operating system, EOS (Extensible Operating System).
The goal, Arista says, is to give users “comprehensive visibility into all the places that can impact the performance of the network” within a single package.
Explaining the unique traffic patterns in the AI fabric, software engineering lead for AI Network, Tom Emmons, said, “The timeframes that AI interacts with the network is on the microseconds and milliseconds.”
Many visibility solutions lack the fidelity to catch microbursts at such high frequency. These solutions, built to look only at predictable phenomena such as packet drops that manifest as network problems, fail to spot other key indicators.
Another major drawback is time scale. “All of your traditional counters are on the time scale of seconds,” he noted.
Two purpose-built solutions in Arista’s portfolio provide fine-grained visibility into traffic counters in large AI clusters.
The AI Analyzer captures traffic statistics at 100-microsecond intervals, recording the smallest inflections and deviations, said Emmons. When analyzed, this data reveals microscopic variations happening at the nanosecond level, which helps in correlating events, finding sources of congestion, and understanding the factors that affect job completion time (JCT).
“This gives you great insight about what’s really happening on the time scales that the AI traffic is actually operating on, [rather than] counters over seconds and minutes and hours.”
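To see why the sampling interval matters, consider a rough sketch with made-up numbers. It assumes a 400 Gbps link carrying a steady 10% background load plus a single 2-millisecond burst at line rate. The link speed, load and burst length are illustrative assumptions, and nothing here uses an Arista API; it only shows how a one-second counter averages away a burst that 100-microsecond buckets expose.

```python
# Hypothetical numbers throughout; this is not Arista data or an Arista API.
LINK_GBPS = 400                              # assumed link speed
BUCKET_US = 100                              # sampling interval cited in the article
BUCKETS_PER_SEC = 1_000_000 // BUCKET_US     # 10,000 buckets per second

# Bytes a fully saturated link carries in one 100 µs bucket.
BYTES_PER_BUCKET_AT_LINE_RATE = (
    LINK_GBPS * 1_000_000_000 // 8 * BUCKET_US // 1_000_000
)


def utilization(byte_counts):
    """Average link utilization (0.0 to 1.0) over the given buckets."""
    capacity = BYTES_PER_BUCKET_AT_LINE_RATE * len(byte_counts)
    return sum(byte_counts) / capacity


# One second of traffic: 10% background load in every bucket...
trace = [BYTES_PER_BUCKET_AT_LINE_RATE // 10] * BUCKETS_PER_SEC
# ...plus a 2 ms burst at line rate (20 buckets of 100 µs each).
for i in range(5000, 5020):
    trace[i] = BYTES_PER_BUCKET_AT_LINE_RATE

# What a one-second counter reports versus what the fine-grained view shows.
one_second_view = utilization(trace)
peak_bucket_view = max(utilization([b]) for b in trace)
burst_buckets = [i for i, b in enumerate(trace) if utilization([b]) >= 0.95]

print(f"1 s average utilization: {one_second_view:.2%}")
print(f"peak 100 µs utilization: {peak_bucket_view:.2%}")
print(f"saturated buckets: {len(burst_buckets)}")
```

In this toy trace the one-second average stays close to 10%, while the 100-microsecond view shows 20 consecutive buckets at full line rate, which is the kind of event that can stall a collective operation without ever showing up in a seconds-scale counter.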
The new Arista AI Agent is a software program that sits on a direct-attached NIC, extending EOS to the servers. EOS, which runs on all of Arista’s switches and routers, comes with extensive monitoring, mirroring and time-stamping AI capabilities built into it. By bridging the network and server systems, the AI agent boosts the debugging capabilities of EOS, allowing the fiber between the network and the servers to be monitored and managed from a single console.
“We all know the network doesn’t end with the top of rack switches. It extends all the way to the server.”
While most companies use some kind of tooling to aggregate network counters for troubleshooting, Arista takes it a step further by adding NIC counters to the mix. EOS configuration and telemetry data, which includes traditional counters, PFC counters, ECN counters, queueing statistics and discards, is stored in a centralized database called EOS NetDL and is augmented by the NIC data that the AI agents bring in.
“Most NICs deploy a very extensive set of RDMA counters. All of those can be exposed to the switches to get a more holistic view of the network as a whole.”
In effect, EOS provides a fuller and more pinpointed view of the fabric, as well as uniform control of resources on the dashboard.
“Being able to correlate all of these network counters with statistics from the actual servers themselves really enables engineers to more accurately and faster debug their networks,” Emmons said.
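As a loose illustration of what correlating switch and NIC counters can look like, the sketch below groups counter increases into fixed time windows and reports the windows in which both sides saw activity. The counter names, the 10-millisecond window and the `Sample` structure are all hypothetical; the article does not describe how EOS NetDL or the AI Agent actually store or join this data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    ts_ms: int      # sample timestamp in milliseconds
    source: str     # "switch" or "nic"
    counter: str    # hypothetical counter name
    delta: int      # increase since the previous sample


def correlate(samples, window_ms=10):
    """Return {window_start_ms: [(source, counter), ...]} for every window
    in which both the switch and the NIC reported a non-zero increase."""
    windows = {}
    for s in samples:
        if s.delta > 0:
            windows.setdefault(s.ts_ms // window_ms, []).append(s)

    hits = {}
    for window, group in sorted(windows.items()):
        if {s.source for s in group} == {"switch", "nic"}:
            hits[window * window_ms] = sorted((s.source, s.counter) for s in group)
    return hits


# Made-up samples: a PFC pause on the switch lines up with out-of-sequence
# RDMA packets on the NIC; the later ECN marks have no NIC-side counterpart.
samples = [
    Sample(1003, "switch", "pfc_pause_tx", 12),
    Sample(1007, "nic", "rdma_out_of_sequence", 3),
    Sample(1042, "switch", "ecn_marked", 5),
    Sample(1090, "nic", "rdma_out_of_sequence", 0),
]
result = correlate(samples)
print(result)
```

Only the window starting at 1000 ms is reported here, because it is the only one where a switch counter and a NIC counter moved together. A real system would need synchronized clocks across switches and servers for this kind of join to be meaningful.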
Emmons highlighted the potential of the AI agent to interact at two layers – with EOS at the top, and with the NIC-specific plugin below. This allows all generic and specific counters to be mapped precisely.
“One very common issue we see in networks is that the Class of Service (CoS) configuration on the network is different from the CoS configuration on the servers, and this leads to poor behavior,” he said, pointing to the inconsistencies that arise from the lack of a single point of control.
By allowing configuration to happen from a single dashboard, the new feature helps avoid misalignment and ensures that all configurations are optimized and deployed across the fleet.
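The kind of misalignment Emmons describes can be pictured as a simple comparison between the switch’s intended settings and what each server reports. The sketch below is an assumption-laden illustration: the setting names, host names and dictionary layout are invented for the example and do not reflect EOS, CloudVision or any NIC vendor’s configuration schema.

```python
def cos_mismatches(switch_cfg, server_cfgs):
    """Return {server: {setting: (switch_value, server_value)}} for every
    setting on which a server disagrees with the switch."""
    report = {}
    for server, cfg in server_cfgs.items():
        diff = {
            key: (switch_cfg.get(key), cfg.get(key))
            for key in sorted(set(switch_cfg) | set(cfg))
            if switch_cfg.get(key) != cfg.get(key)
        }
        if diff:
            report[server] = diff
    return report


# Hypothetical settings: the switch expects RDMA traffic on priority 3.
switch_cfg = {"rdma_priority": 3, "pfc_enabled_priorities": (3,), "ecn": True}
server_cfgs = {
    "gpu-node-01": {"rdma_priority": 3, "pfc_enabled_priorities": (3,), "ecn": True},
    "gpu-node-02": {"rdma_priority": 4, "pfc_enabled_priorities": (3,), "ecn": True},
}

mismatches = cos_mismatches(switch_cfg, server_cfgs)
print(mismatches)
```

In this example only `gpu-node-02` is flagged, because it sends RDMA traffic on a priority the switch is not treating as lossless. Pushing both sides from one source of configuration, as the article describes, removes the need for this kind of after-the-fact audit.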
Using the data exported by the AI agents, Arista runs inferencing in the cloud to generate AI status reports.
Currently, Arista Networks is working on making the AI Agent compatible with the NVIDIA BlueField SuperNIC and ConnectX NIC, as well as the Broadcom Thor.
“We’re also actively talking to other companies who are looking to build this,” Emmons said.
A new event correlation capability is also on the roadmap for the AI Agent.
Watch Arista Networks’ presentations from the AI Field Day event at Techfieldday.com. Learn more about the AI Agent and AI Analyzer at Aristanetworks.com.