
Alluxio today added finer-grained control over I/O throughput to a version of its data orchestration and virtual file system platform optimized for artificial intelligence (AI) applications, making it possible to drive higher utilization of graphics processing units (GPUs).
In addition, version 3.2 of Alluxio Enterprise AI adds cache management capabilities, exposed via a REST application programming interface (API), along with a Python FileSystem API that makes the platform accessible to a wider array of applications.
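A Python FileSystem API of this kind typically follows the fsspec-style filesystem interface that is common in the Python data ecosystem. The sketch below illustrates that general access pattern with a minimal in-memory stand-in class; the class, its methods, and the paths are illustrative assumptions, not Alluxio's documented client API.

```python
# Illustrative sketch only: a minimal fsspec-style filesystem interface.
# DemoFileSystem is a hypothetical stand-in, not Alluxio's actual client;
# it shows the kind of access pattern a Python FileSystem API exposes.
import io


class DemoFileSystem:
    """In-memory stand-in mimicking an fsspec-like filesystem."""

    def __init__(self):
        self._files = {}

    def pipe(self, path, data):
        # Write raw bytes to a path.
        self._files[path] = data

    def cat(self, path):
        # Read the full contents of a path as bytes.
        return self._files[path]

    def ls(self, prefix):
        # List paths under a prefix.
        return sorted(p for p in self._files if p.startswith(prefix))

    def open(self, path):
        # Return a file-like object, as data loaders expect.
        return io.BytesIO(self._files[path])


fs = DemoFileSystem()
fs.pipe("/datasets/train/sample-0.bin", b"tensor-bytes")
print(fs.ls("/datasets/train"))                # ['/datasets/train/sample-0.bin']
print(fs.cat("/datasets/train/sample-0.bin"))  # b'tensor-bytes'
```

Because the interface mirrors ordinary file operations (`ls`, `cat`, `open`), existing Python data loaders can consume data through such an API with little or no change.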
The ability to optimize data throughput enables organizations to achieve GPU utilization rates in excess of 97%, says Adit Madan, director of product for Alluxio. The file system developed by Alluxio achieves that goal by providing up to 10GB/s throughput and 200K IOPS with a single client, which can fully saturate eight NVIDIA A100 GPUs on a single node.
That’s crucial because, absent that capability, one of the most expensive infrastructure resources IT organizations employ today sits idle due to a mismatch between GPU speeds and the data access speeds of the underlying storage platform, he adds.
Originally developed for high-performance computing (HPC) platforms, the Alluxio storage platform is based on a unified namespace that enables it to serve data to GPU resources in the cloud or in an on-premises IT environment.
Optimizing I/O to maximize usage of processors has been a tactic employed by HPC teams for decades. In the age of AI, however, it has taken on a greater sense of urgency because GPUs are an expensive resource that is often hard to find. IT teams need, as a result, to ensure the GPU resources they do have available are used to the fullest extent possible. “There’s a real scarcity,” says Madan.
Responsibility for acquiring and managing AI infrastructure will vary from one organization to the next. In some cases, a data engineering team is specifically tasked with managing storage platforms, while in other instances a platform engineering team has emerged. Some organizations have even gone so far as to create a dedicated AI infrastructure team, notes Madan.
Regardless of approach, there’s a much greater appreciation in the age of AI for the nuances of data storage optimization and management, he adds.
It may be a while before every organization masters those nuances, but as more AI models are trained and deployed, much of the existing IT infrastructure, including storage systems, will need to be upgraded. Many of the AI models being deployed today by enterprise IT teams in production environments are especially latency sensitive. Legacy storage systems are typically not designed to support the requirements of AI applications, which often share the same performance characteristics as HPC applications.
The degree to which AI applications might make storage expertise sought after again remains to be seen. Outside of a narrow range of high-performance database applications, most legacy applications are not especially latency sensitive. Nevertheless, the overall performance of any application has always been partially determined by how quickly data can be accessed, and AI models are no exception.