Alluxio today announced Enterprise AI 3.5, the latest version of its artificial intelligence (AI)/machine learning (ML) acceleration platform. The update introduces significant improvements to accelerate AI model training workflows and optimize infrastructure utilization.

The release addresses critical bottlenecks in AI training processes – particularly for organizations managing large-scale GPU deployments.

A key feature is the new CACHE_ONLY Write Mode, which tackles checkpoint file creation, one of the most time-consuming aspects of model training. “These checkpoint files are large and can take hours to create, during which model training completely pauses,” explains Bill Hodak, VP of Product Marketing at Alluxio. “Our new caching mode improves checkpoint file writing performance by two to three times, significantly reducing training interruptions,” he adds.
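In open-source Alluxio, a client's write behavior is selected through a write-type property; a new mode like the one described above would plausibly be enabled the same way. The property key and placement below are assumptions modeled on Alluxio's open-source client configuration — consult the Enterprise AI 3.5 documentation for the exact key and value:

```properties
# Hypothetical client setting (key modeled on the open-source
# alluxio.user.file.writetype.default property; verify against
# the Enterprise AI 3.5 docs before use):
alluxio.user.file.writetype.default=CACHE_ONLY
```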

The impact is substantial – what previously took an hour can now be completed in 20 minutes, Alluxio says. This efficiency gain translates directly into faster end-to-end model training times, as systems spend less time paused for checkpointing and more time on actual training computations.

For data scientists and ML engineers working with popular AI frameworks, Alluxio has enhanced its Python SDK with built-in integrations for PyTorch, PyArrow, and Ray. These integrations enable interaction with cached data without requiring custom code modifications. “Infrastructure teams can now place data in our cache without data scientists necessarily needing to change their workflow,” Hodak notes. “It just works out of the box.”
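The "no workflow changes" idea amounts to transparent read-through caching: application code asks for a file, and a cache layer silently serves it locally after the first fetch. The sketch below illustrates that pattern with hypothetical names and plain local files — it is not Alluxio's SDK, whose actual interfaces the article does not detail:

```python
import hashlib
import pathlib
import shutil

# Illustrative read-through cache: callers use read() exactly as they
# would a direct read; the cache directory is an internal detail.
# All class and method names here are hypothetical.
class ReadThroughCache:
    def __init__(self, cache_dir: pathlib.Path):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _cache_path(self, source: pathlib.Path) -> pathlib.Path:
        # Hash the source path to get a stable local cache filename.
        digest = hashlib.sha256(str(source).encode()).hexdigest()
        return self.cache_dir / digest

    def read(self, source: pathlib.Path) -> bytes:
        cached = self._cache_path(source)
        if not cached.exists():              # miss: fetch from the source once
            shutil.copyfile(source, cached)
        return cached.read_bytes()           # hit: serve the local copy
```

After the first read, repeated reads never touch the original source — which is the property that lets infrastructure teams place data in a cache without data scientists changing their code.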

The platform also introduces improved cache management features to optimize resource utilization. New TTL Cache Eviction Policies automatically manage less frequently accessed data. At the same time, Priority-based Cache Eviction Policies ensure critical data remains cached even when standard LRU algorithms would typically evict it.
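The two eviction policies compose naturally: a TTL sweep expires stale entries, and when space must still be freed, eviction prefers low-priority entries over high-priority ones rather than applying pure LRU. The sketch below illustrates that combination in a few lines — it is a conceptual model, not Alluxio's implementation:

```python
import time
from collections import OrderedDict

# Toy cache combining TTL expiry with priority-aware LRU eviction.
# Purely illustrative of the policies described above.
class PriorityTTLCache:
    def __init__(self, capacity: int, ttl_seconds: float):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.entries = OrderedDict()  # key -> (value, priority, stored_at)

    def put(self, key, value, priority=0):
        self._expire()
        if key not in self.entries and len(self.entries) >= self.capacity:
            self._evict_one()
        self.entries[key] = (value, priority, time.monotonic())
        self.entries.move_to_end(key)        # mark most recently used

    def get(self, key):
        self._expire()
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key][0]
        return None

    def _expire(self):
        # TTL policy: drop anything older than the time-to-live.
        now = time.monotonic()
        stale = [k for k, (_, _, t) in self.entries.items() if now - t > self.ttl]
        for k in stale:
            del self.entries[k]

    def _evict_one(self):
        # Priority policy: evict the lowest-priority entry; among equal
        # priorities, min() returns the least recently used (LRU order).
        victim = min(self.entries, key=lambda k: self.entries[k][1])
        del self.entries[victim]
```

A high-priority entry survives capacity pressure that would evict it under plain LRU, which is exactly the behavior the priority-based policy promises for critical training data.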

A common frustration among data scientists is having to wait extended periods to view directory contents, especially in environments with hundreds of millions of files. Alluxio claims that the new Index Service in Enterprise AI 3.5 delivers a 3-5x improvement in directory listing performance for organizations managing massive datasets.
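Why an index helps: listing a directory by scanning a flat namespace costs time proportional to the total file count, while a precomputed parent-to-children index makes each listing proportional only to that directory's own size. This sketch shows the idea in miniature — the article does not describe the Index Service's internals, so this is an illustration, not Alluxio's design:

```python
from collections import defaultdict

# Build a parent-directory -> children index from a flat list of paths,
# so listing any directory is a single dictionary lookup.
def build_index(paths):
    index = defaultdict(set)
    for p in paths:
        parts = p.strip("/").split("/")
        for depth in range(1, len(parts) + 1):
            parent = "/" + "/".join(parts[:depth - 1]) if depth > 1 else "/"
            index[parent].add(parts[depth - 1])
    return index

def list_dir(index, directory):
    # O(k) for a directory with k children, regardless of total file count.
    return sorted(index.get(directory, ()))
```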

The release also includes a UFS Rate Limiter, which helps maintain stable GPU utilization by preventing individual nodes from monopolizing system resources. “Just by implementing the Alluxio cache, customers typically see GPU utilization increase from 40-50% to 80-90%,” says Hodak. “The rate limiter ensures this performance remains stable throughout the training process.”
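Rate limiters of this kind are commonly built as token buckets: each node may consume under-store (UFS) bandwidth only as fast as its tokens refill, so no single node can monopolize the shared resource. The sketch below shows the standard algorithm — it is a generic illustration, not Alluxio's UFS Rate Limiter code:

```python
import time

# Generic token-bucket rate limiter: capacity bounds bursts, and tokens
# refill continuously at a fixed rate.
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, amount: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at the bucket capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False  # over the limit: caller should back off and retry
```

A node whose requests exceed the refill rate is throttled rather than allowed to starve its neighbors, which is how a limiter keeps per-node throughput — and hence GPU feeding — stable.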

Alluxio now supports heterogeneous worker resource configurations for greater deployment flexibility: organizations can incorporate nodes with varying CPU, memory, disk, and network specifications into their clusters, making better use of the hardware they already have.

On the security front, the platform has added TLS encryption for data traffic between Alluxio S3 API endpoints. The S3 API has also gained persistent connections and multipart uploads, resulting in up to a 40% improvement in API latency, according to Alluxio.

These improvements come at a crucial time, as organizations invest heavily in GPU infrastructure for AI training. Many companies spend over $20 million on GPUs, making efficient utilization essential. By addressing data access bottlenecks and optimizing infrastructure usage, organizations can maximize their return on these investments while accelerating their development timelines.
