Last week’s news that Microsoft unintentionally exposed 38 terabytes of AI training data, reportedly including employee endpoint backups, passwords and more, is a warning to enterprises rushing to embark on their own AI initiatives: securing your AI also means securing the data that feeds the model.
As Microsoft detailed in a post-mortem update, a Microsoft employee shared, in a public GitHub repository, a URL pointing to a trove of unstructured data being used for open-source AI learning models and held in an internal Azure storage account. As has been the case with many cloud storage-related leaks, the URL included a highly permissive Shared Access Signature (SAS) token, a credential granting access to that storage account. Independent security researchers were able to access the data and subsequently notified Microsoft.
Microsoft’s post-mortem was part of a coordinated disclosure of the incident with security firm Wiz, which discovered the exposure. “Data exposed in this storage account included backups of two former employees’ workstation profiles and internal Microsoft Teams messages of these two employees with their colleagues,” Microsoft said. The software and technology services giant added that no customer data was exposed, and no other internal services were put at risk because of this issue.
As Microsoft explained, SAS tokens provide a mechanism to restrict access and allow specific clients to connect to specified Azure Storage resources. “In this case, a researcher at Microsoft inadvertently included this SAS token in a blob store URL while contributing to open-source AI learning models and provided the URL in a public GitHub repository. There was no security issue or vulnerability within Azure Storage or the SAS token feature. Like other secrets, SAS tokens should be created and managed properly. Additionally, we are making ongoing improvements to harden the SAS token feature further and continue to evaluate the service to bolster our secure-by-default posture,” the company said.
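For teams issuing their own SAS tokens, the practical lesson is scope and lifetime: grant the narrowest permission on the smallest resource for the shortest time a workload needs. As a minimal sketch, assuming the azure-storage-blob Python SDK and hypothetical account, container and blob names, a read-only token on a single blob that expires after one hour could be generated like this:

```python
# Minimal sketch: issuing a narrowly scoped, short-lived SAS token with the
# azure-storage-blob Python SDK. Account, container and blob names are
# hypothetical placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

ACCOUNT_NAME = "exampletrainingdata"    # hypothetical storage account
ACCOUNT_KEY = "<storage-account-key>"   # load from a secrets manager, never hard-code
CONTAINER = "training-sets"
BLOB = "dataset-v1.parquet"

# Read-only permission on one blob, expiring in one hour, rather than a
# highly permissive, long-lived token covering the whole account.
sas_token = generate_blob_sas(
    account_name=ACCOUNT_NAME,
    container_name=CONTAINER,
    blob_name=BLOB,
    account_key=ACCOUNT_KEY,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

url = f"https://{ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER}/{BLOB}?{sas_token}"
print(url)  # this scoped URL, not an account-level token, is what gets shared
```

The narrower and shorter-lived the token, the less a leaked URL can expose.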
“This was a human mistake as the SAS token used for access was highly permissive. The primary way to avoid that is to build a process with multiple permission checks. They may have that in place, in which case it is even more of a human mistake,” said Michael Farnum, advisory CISO at cybersecurity consultancy Trace3.
“There’s a level of control missing on correlating that data with its use in these kinds of environments, and this is just kind of the state of play with a lot of this stuff,” said Scott Crawford, information security research head at S&P Global Market Intelligence.
John Pescatore, director of emerging security trends at SANS Institute, agreed and added that the incident highlights the need for solid data governance around AI model training data. “This Microsoft data wrangling, where you’re pulling together the data to feed the model, you have to realize that if that’s really good data, it’s probably sensitive data. You have to understand how you’re protecting that data. The second thing is knowing whether or not you understand what’s in that data and sanitizing sensitive and personally identifiable information. It’s not a trivial task when you are talking about terabytes of data. The whole thing points to the need for strong governance over the data and these models,” he said.
While Microsoft provided cursory guidance on better managing SAS tokens, there are additional steps enterprises should take to ensure that data is protected. In recent interviews, I gathered insights from experts on how enterprise security teams can protect their AI/ML data:
Conduct a threat model: Evaluate the value of the data, identify the threats that would target it, and mitigate vulnerabilities within the data management pipeline.
Analyze the input data: “Ask yourselves where the data is coming from. Identify privacy and security concerns and sanitize it when possible,” advises Pescatore.
Effectively manage the AI/ML data pipeline: Look at tools such as Kubeflow Pipelines to manage and automate the flow of data from well-known, controlled repositories (see the sketch after this list).
Continuously identify and inventory ML assets: your models, data sources, pipelines and supporting infrastructure. Make sure good access controls are in place.
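To make the pipeline recommendation concrete, below is a minimal sketch of a governed data flow using the Kubeflow Pipelines (kfp v2) SDK. The component names and bodies are placeholders for real ingestion and sanitization logic, not a prescribed design:

```python
# Minimal sketch of a governed training-data flow with the Kubeflow Pipelines
# (kfp v2) SDK. Component bodies are placeholders for real logic.
from kfp import compiler, dsl


@dsl.component
def ingest_data(source_uri: str) -> str:
    # Pull training data only from an approved, access-controlled repository.
    return source_uri


@dsl.component
def sanitize_data(dataset: str) -> str:
    # Placeholder for scrubbing PII and scanning for secrets before training.
    return dataset


@dsl.pipeline(name="governed-training-data-pipeline")
def training_data_pipeline(source_uri: str):
    raw = ingest_data(source_uri=source_uri)
    sanitize_data(dataset=raw.output)


# Compile to a versioned definition so the data flow is explicit and auditable.
compiler.Compiler().compile(training_data_pipeline, "pipeline.yaml")
```

Keeping the flow in a compiled, versioned pipeline definition makes it auditable, in contrast with ad hoc scripts pulling data from wherever it happens to live.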
“Putting in a cloud security tool looking for permission issues can also help. It makes sense to have a tool that can go behind the humans to ensure these mistakes weren’t made. Humans are going to be human. We must build processes and deploy technology to find these mistakes,” Farnum says.
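One lightweight way to “go behind the humans,” as Farnum puts it, is to scan for secrets before they ever reach a public repository. The following Python sketch is illustrative rather than a production detector: it flags URLs carrying SAS-style query parameters in whatever files it is given, for example from a pre-commit hook or CI job:

```python
# Illustrative sketch: flag URLs that look like they carry an Azure SAS token
# before they are committed. The regex is a heuristic, not an exhaustive detector.
import re
import sys
from pathlib import Path

# An Azure SAS query string typically carries sv= (service version) and sig= (signature).
SAS_PATTERN = re.compile(r"[?&]sv=\d{4}-\d{2}-\d{2}&.*sig=[A-Za-z0-9%+/=]+")


def scan(paths):
    findings = []
    for path in paths:
        text = Path(path).read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SAS_PATTERN.search(line):
                findings.append(f"{path}:{lineno}: possible SAS token in URL")
    return findings


if __name__ == "__main__":
    hits = scan(sys.argv[1:])
    print("\n".join(hits))
    sys.exit(1 if hits else 0)  # non-zero exit blocks the commit or build for review
```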
“And as I see it, this is symptomatic of the pace at which innovation is happening,” Crawford added. “The application of generative AI, particularly compared to security readiness, is similar to previous periods of rapid innovation. We’ve been here before when innovation gets ahead of security.”