The rise of generative artificial intelligence (GenAI) presents unique challenges for software supply chain security. As technology companies continue introducing GenAI into their applications, the attack surface widens. In particular, secrets – sensitive data such as passwords, access keys, or API tokens that should be kept confidential – can be exposed through AI models. Our recent analysis of AI model repositories developed by various Fortune 500 companies found that these repositories contain secrets that could allow malicious actors to access valuable data.

AI Models’ Source Code 

Although models are typically stored on the well-known GitHub, the AI world also relies on dedicated model hubs. The most popular source for AI models is Hugging Face, which hosts over 500,000 models and is the primary source of open-source AI models. Hugging Face can adequately secure organizations’ models, scanning them for secrets, dangerous pickle code, and more.

However, Hugging Face is not the only hub that stores models. Other popular hubs include the PyTorch Hub, Model Zoo, Kaggle, and more. My research team recently scanned repositories developed by various Fortune 500 companies and found that these other hubs are more likely to host secret-leaking models because they lack internal secrets scanners. In fact, when scanning the PyTorch Hub, we found secrets in approximately 20% of the models.

The hubs alone are not to blame for this untenable situation. Developers often store their machine learning models on GitHub due to its popularity and ease of version control. However, since these repositories are scattered across different accounts and organizations, there is no centralized location to organize all models.

As a result, model hubs aggregate these links, providing a convenient way for users to explore and access a wide range of machine-learning models available on GitHub. This means the hubs themselves do not store the code; they simply serve as the central pool where you can browse all the models and their information.

When using GitHub as your source code management (SCM) system, secrets pose a significant threat because they persist forever once introduced to the repository. Git history exposes every previous commit, so even if a developer accidentally committed a secret and removed the file in a subsequent commit, the original secret-leaking commit is still there, waiting to be found.
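
To illustrate, here is a minimal sketch, assuming Python 3 and a local clone, of how a secret that was later deleted can still be recovered by walking the full Git history; the regex patterns are illustrative placeholders rather than a production rule set:

```python
import re
import subprocess

# Illustrative patterns only: an AWS access key ID and a generic
# "api_key = ..." / "token = ..." assignment.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"(?i)(api[_-]?key|token)\s*[=:]\s*['\"][^'\"]{16,}['\"]"),
]

def scan_git_history(repo_path: str) -> None:
    """Walk every commit diff in the repository, including files that were
    later deleted, and report added lines that look like hard-coded secrets."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "-p", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in log.splitlines():
        # Lines added in a diff start with a single '+'.
        if line.startswith("+") and not line.startswith("+++"):
            for pattern in SECRET_PATTERNS:
                if pattern.search(line):
                    print(f"possible secret: {line[1:].strip()}")

if __name__ == "__main__":
    scan_git_history(".")
```

Because the scan walks `git log --all -p` rather than the current working tree, a secret removed in a later commit still shows up.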

Importantly, secrets can exist in various parts of a project beyond the model itself. These may include sensitive information stored in READMEs, configuration (config) files, environment variables, or other ancillary components. Therefore, thoroughly examining the entire project, including documentation and config files, is necessary to ensure comprehensive security.
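
As a rough illustration, the sketch below scans these ancillary files (READMEs, config files, .env files) in a project directory for secret-looking strings; the file globs and regex rules are assumptions for demonstration and are far from exhaustive:

```python
import re
from pathlib import Path

# Illustrative rules only; a real scanner would ship a much larger rule set.
SECRET_RE = re.compile(
    r"(AKIA[0-9A-Z]{16}"                      # AWS access key ID
    r"|glpat-[0-9A-Za-z_\-]{20,}"             # GitLab personal access token
    r"|(?i:password|secret|token)\s*[=:]\s*\S{8,})"
)

# Ancillary files that frequently leak credentials alongside the model itself.
CANDIDATE_GLOBS = ["**/README*", "**/*.md", "**/*.yaml", "**/*.yml",
                   "**/*.json", "**/*.cfg", "**/*.ini", "**/.env*"]

def scan_project(root: str) -> None:
    for pattern in CANDIDATE_GLOBS:
        for path in Path(root).glob(pattern):
            if not path.is_file():
                continue
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            for lineno, line in enumerate(text.splitlines(), start=1):
                if SECRET_RE.search(line):
                    print(f"{path}:{lineno}: possible secret")

if __name__ == "__main__":
    scan_project(".")
```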

Ultimately, the attack path for harvesting secrets to infiltrate organizations looks like this:

  1. Scan for secrets in AI model repositories
  2. Authenticate using stolen secrets
  3. Infiltrate internal networks
  4. Take over internal assets

Findings in the Wild

Most model hubs link to a GitHub repository maintained by the model’s developers. When the hub is not the authority over the security measures applied during the model’s development, it effectively serves as a directory of potentially vulnerable models. Our recent analysis uncovered several large companies linking their models’ GitHub repositories from the PyTorch Hub, such as the FastPitch 2 model developed and maintained by NVIDIA.

The GitHub project linked from the PyTorch Hub shows that the model lives inside a larger repository named DeepLearningExamples, which contains far more than a single model. After cloning the repository, we scanned its history and found three interesting commits. In one README file, a developer explains that internally developed packages must be installed and provides instructions for installing them using his own internal GitLab token.
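
The commit contents are not reproduced here, but a token leaked this way typically appears as a credential embedded directly in an install URL. The snippet below is a hypothetical illustration (the host, token value, and package path are invented) of how such a line can be flagged:

```python
import re

# Matches credentials embedded directly in a URL, e.g.
# https://oauth2:<token>@gitlab.example.com/group/pkg.git
URL_CREDENTIAL_RE = re.compile(r"https?://[^/\s:@]+:([^@\s]+)@\S+")

# Hypothetical README excerpt; the real commit is not reproduced here.
readme_excerpt = """
## Installation
pip install git+https://oauth2:glpat-EXAMPLEONLY1234567890@gitlab.example.com/tools/internal-pkg.git
"""

for match in URL_CREDENTIAL_RE.finditer(readme_excerpt):
    print("credential embedded in URL:", match.group(1))
```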

For an adversary targeting a large organization’s software supply chain, this is the holy grail: with such a token, an attacker could control and corrupt the organization’s GitLab, poison its source code, and gain access to the production environment. NVIDIA confirmed the token had been revoked after receiving notice of the disclosure.

Detecting and Securing Secrets in AI Model Repositories

As an AI model developer, safeguarding data and code is paramount, especially regarding hard-coded secrets. Although challenging, uncovering these secrets within a Git repository is crucial for robust software supply chain security.

AI model developers should regularly scan code for secrets and consider choosing an AI hub with an internal scanner. This extra layer of security ensures that potential vulnerabilities are identified early in the development process, minimizing the risk of exposed secrets.
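
One lightweight way to make such scanning routine is to check staged files before every commit. The sketch below assumes it is invoked from a Git pre-commit hook (for example, called from .git/hooks/pre-commit); the detection patterns are illustrative only:

```python
import re
import subprocess
import sys

# Illustrative patterns; extend with rules for the providers you actually use.
SECRET_RE = re.compile(r"AKIA[0-9A-Z]{16}|glpat-[0-9A-Za-z_\-]{20,}")

def staged_files() -> list[str]:
    """Return the paths staged for the next commit (added, copied, modified)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [p for p in out.splitlines() if p]

def main() -> int:
    findings = []
    for path in staged_files():
        try:
            with open(path, errors="ignore") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if SECRET_RE.search(line):
                        findings.append(f"{path}:{lineno}")
        except OSError:
            continue
    if findings:
        print("Refusing to commit; possible secrets found at:")
        print("\n".join(findings))
        return 1  # non-zero exit aborts the commit
    return 0

if __name__ == "__main__":
    sys.exit(main())
```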

In the unfortunate event of a secret exposure, swift action is crucial. Model developers should revoke the compromised secret promptly by changing passwords, rotating keys, or taking necessary steps to render the exposed data obsolete.
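
As a concrete example, here is a sketch of rotating a compromised AWS access key with boto3; the IAM user name and key ID are hypothetical, and it assumes the caller has permissions to manage access keys for that user:

```python
import boto3

iam = boto3.client("iam")

USER = "model-ci-bot"                  # hypothetical IAM user
LEAKED_KEY_ID = "AKIAIOSFODNN7EXAMPLE"  # the exposed access key ID

# 1. Create a replacement key so dependent services can be updated first.
new_key = iam.create_access_key(UserName=USER)["AccessKey"]
print("new key id:", new_key["AccessKeyId"])

# 2. Disable the leaked key immediately, then delete it once nothing depends on it.
iam.update_access_key(UserName=USER, AccessKeyId=LEAKED_KEY_ID, Status="Inactive")
iam.delete_access_key(UserName=USER, AccessKeyId=LEAKED_KEY_ID)
```

The same pattern applies to other providers: issue a replacement credential, cut services over to it, then disable and delete the exposed one.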

Integrating these practices into AI development workflows will create a more resilient and secure model development environment.

As AI technology becomes increasingly prevalent in software development, protecting sensitive data and secrets in code is imperative to securing organizations’ software supply chains.
