Enfabrica

In an effort to give AI computing in data centers a jolt, networking silicon and software provider Enfabrica has landed on a disruptive architecture that proposes to solve the I/O bottlenecks and scaling challenges of massive AI clusters with a one-chip solution.

The company presented the Accelerated Compute Fabric SuperNIC this month at AI Field Day, touting it as the industry’s first SuperNIC chip to deliver 8 terabits per second of bandwidth.

For comparison, the NICs used for high-performance computing in AI clusters top out at 400 gigabits per second, said founder and CEO Rochan Sankar while briefing the audience.

Code-named Millennium, the SuperNIC is the product of three years of research and development work, and it heralds a new category of networking silicon.

The Millennium SuperNIC is enabled by Enfabrica’s patented Accelerated Compute Fabric (ACF) architecture, unveiled last year, which this Gestalt IT article by Tech Field Day lead Tom Hollingsworth covers in detail.


“[It] is developed with the mission to enable accelerated computing and AI at unprecedented scale” with “equal parts hardware and software”, said Sankar.

Deep learning and large language models (LLMs) have witnessed tremendous growth in the past couple of years. “Models are already exceeding what can be housed in 8 or even 64 GPUs,” Sankar noted.

The data center compute model is changing rapidly in response to this rising demand. “[A] few years back we saw the first foray of accelerators – GPUs and TPUs – and the changing of the population of processing and surface of compute,” he said.

The latest AI clusters enclose a heterogeneous mix of these processors – CPUs alongside specialized accelerators like GPUs, TPUs, FPGAs, and custom ASICs. Big companies like Google, Meta, and Microsoft, the ones with the most skin in the game, are building supersize pods loaded with thousands of these specialized processors to run their LLMs. Futurum Intelligence projects that GPU spend will continue to escalate, approaching a whopping $103 billion within the next four years.

The growing demand has ensured that FLOPS, or floating-point operations per second – the unit of measurement of a processor’s computing power – continues to mount. But I/O bandwidth and memory bandwidth have departed significantly from this trend. That gap, Sankar highlighted, is the biggest barrier facing organizations looking to get higher utilization out of their hardware and maximize model efficiency.
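
To see why that gap bites, consider a rough roofline-style estimate. The Python sketch below uses purely illustrative numbers (a hypothetical 1-petaFLOPS accelerator behind a 400 Gb/s NIC), not figures from Enfabrica’s presentation:

```python
# Back-of-the-envelope roofline check: is an accelerator compute-bound
# or I/O-bound? All numbers are illustrative assumptions, not vendor specs.

peak_flops = 1_000e12        # hypothetical accelerator: 1,000 TFLOPS peak
io_bandwidth = 400e9 / 8     # a 400 Gb/s NIC moves roughly 50 GB/s

# Arithmetic intensity needed to keep the accelerator busy:
# FLOPs performed per byte moved over the network.
break_even = peak_flops / io_bandwidth
print(f"Break-even intensity: {break_even:,.0f} FLOPs per byte")

# A communication-heavy training step doing, say, 2,000 FLOPs per byte
# exchanged leaves this accelerator mostly idle:
workload_intensity = 2_000
utilization = min(1.0, workload_intensity / break_even)
print(f"Estimated compute utilization: {utilization:.1%}")
```

In this made-up scenario, the accelerator sits idle 90% of the time waiting on the network; widening the pipe raises that utilization ceiling directly.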

“We are able to pack more compute in the surface area of a server both through architecture and silicon design, but we haven’t reinvented I/O and networking in the context of AI,” he remarked. “System I/O architectures that are being used today for accelerated computing are basically the same ones that have been used for traditional computing.”

Amdahl’s law states that the performance improvement gained by upgrading or optimizing a single component within a system is limited by the fraction of time that component is actually in use.

In other words, if the goal is to build a highly optimized system that delivers the kind of holistic performance LLMs demand, one needs not only extreme compute intensity but a balance of compute, memory, and I/O performance.
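
A minimal sketch of that arithmetic, with assumed runtime fractions rather than measured ones:

```python
# Amdahl's law: overall speedup from accelerating one component is capped
# by the fraction of time that component is actually in use.

def amdahl_speedup(fraction: float, component_speedup: float) -> float:
    """Overall speedup when `fraction` of the runtime gets
    `component_speedup` times faster and the rest stays unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# If compute is 60% of a training step and I/O the other 40%, even an
# infinitely fast GPU caps the overall speedup at 2.5x:
print(amdahl_speedup(0.60, 1e9))    # -> ~2.5

# Speeding up the I/O path instead attacks the remaining 40% directly:
print(amdahl_speedup(0.40, 4.0))    # -> ~1.43
```

Once compute has been accelerated, the I/O share of the step time becomes the binding constraint, which is exactly the gap Sankar describes.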

“You can try to build everything in one big chip to get rid of the bottlenecks,” he said, “but it’s pretty impossible.”

So Enfabrica developed ACF, a design that elastically binds multiple resources into one compact unit. The philosophy is that if heterogeneous compute resources such as CPUs and accelerators can be linked directly to memory and storage, a single device can effectively replace the multiple NICs, ToR switch chips, and PCIe switches in a rack. This design slashes the number of hops and the overall congestion, speeding up data movement between GPUs, as the toy model below illustrates.
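
The sketch tallies per-hop latencies along a conventional GPU-to-GPU path and a consolidated one. Every latency value is invented for illustration; none comes from Enfabrica:

```python
# Toy hop-count model: conventional GPU-to-GPU path vs. a consolidated fabric.
# Every latency below is an invented, illustrative value, not a vendor spec.

PER_HOP_NS = {
    "pcie_switch": 150,   # PCIe switch traversal
    "nic": 500,           # NIC processing
    "tor_switch": 300,    # top-of-rack switch
}

# Conventional path: GPU -> PCIe switch -> NIC -> ToR -> NIC -> PCIe switch -> GPU
conventional = ["pcie_switch", "nic", "tor_switch", "nic", "pcie_switch"]

# Consolidated path: one fabric device on each side replaces all three stages.
ACF_HOP_NS = 200          # assumed single-device traversal time
consolidated_hops = 2

conventional_ns = sum(PER_HOP_NS[hop] for hop in conventional)
consolidated_ns = consolidated_hops * ACF_HOP_NS

print(f"conventional: {conventional_ns} ns, consolidated: {consolidated_ns} ns")
print(f"latency cut:  {1 - consolidated_ns / conventional_ns:.0%}")
```

The exact numbers do not matter; the point is that removing intermediate stages shrinks both the latency sum and the number of places where congestion can build.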

According to Enfabrica, the direct connection that ACF-S provides doubles I/O bandwidth for every dollar spent while cutting node-to-node latency by a staggering 75%.

Using “copy engines,” the ACF-S chip can single-handedly move data between the processors’ native memories, a job that formerly took a collection of PCIe switches, NICs, and ToR switches to perform, enabling faster, high-performance communication.

Additionally, furnished with 800-gigabit network interfaces, ACF-S can tap into terabits of bandwidth to move that data across the network.

The Enfabrica system design completely transforms I/O and memory scalability within AI clusters. By consolidating formerly separate chips into one silicon die, it keeps the total cost of ownership from ballooning with component sprawl. The design lets communication travel the optimal path while lowering network power draw and component count – a win on both fronts.

To learn more about ACF-S, check out Enfabrica’s presentations from the AI Field Day event at Techfieldday.com, or visit Enfabrica.net for more resources on the solution.
