NVIDIA has extended its Compute Unified Device Architecture (CUDA) to provide access to a virtual instructed set capability that makes it simpler to partition data sets running in parallel using a set of tiles.

A CUDA Tile Intermediate Representation (IR) capability added to version 13.1 of CUDA makes it possible to manage data-parallel programs at a higher level of abstraction using tiles and tile blocks, which can then be automatically mapped onto hardware resources such as threads, the memory hierarchy, and tensor cores.

As a result, it will become simpler to build higher-level hardware-specific compilers, frameworks, and domain-specific languages (DSLs) for NVIDIA hardware, says Stephen Jones, a distinguished software architect for NVIDIA. Tile-based programming enables developers to run an algorithm by specifying sets of data, also known as tiles, and then define the computations that should be performed without having to specify how the algorithm is executed at an element-by-element level. Instead, the compiler and runtime handle how the algorithm is executed.

Most developers will invoke CUDA tile programming through tools such as NVIDIA cuTile Python versus having to rely on single instruction, multiple threads (SMIT) programming at a lower kernel level. “It’s much more intuitive,” says Jones.

The overall goal is to make it easier to optimally consume IT infrastructure resources such as Tensor Cores that have been developed to optimize specific tasks running on a graphical processors unit (GPU) in a way that is more accessible by invoking a CUDA Tile IR capability either directly or via higher level frameworks that add support for CUDA Tile IR, notes Jones.

The approach provides the added benefit of making it easier to ensure backwards-compatibility as additional Tensor cores are developed, he adds.

In general, NVIDIA has been spurring adoption of CUDA by providing access to open source tools and frameworks that invoke lower-level proprietary functions. The primary focus is on building, for example, Python applications that can run anywhere. However, there are artificial intelligence (AI) applications that require access to lower level functionality in the CUDA stack, with CUDA Tile IR now providing another option to reduce the overall cognitive load on application development teams that would otherwise need to understand how to programmatically optimize GPUs at the kernel level, says Jones.

It’s not clear how many classes of AI applications have been developed using different tools and frameworks that have been developed for CUDA, but the framework NVIDIA has developed has become a de facto standard. The challenge is striking a balance between tools that optimize GPUs from NVIDIA at a deeper level versus the need to make it possible to one day run an AI application on another class of processors.

Regardless of approach, the one thing that is certain is the tools needed to optimally run AI applications on NVIDIA GPUs are becoming more accessible. The challenge and the opportunity now is determining what level of the CUDA stack to invoke based on the level of performance required and the business goals the organization actually wants to achieve.