Microsoft's 38TB Data Leak

Among the exposed data were secrets, code and AI training data

A single misconfigured shared access signature (SAS) token, published in a GitHub repository of Microsoft's AI research team, exposed 38 terabytes of data. The data included secrets, private keys, passwords to Microsoft services, private source code and more than 30,000 internal Microsoft Teams messages from 359 Microsoft employees. Let's see what we can learn about and from this leak.

Cause, description and threats of this massive leak

Microsoft researchers goofed up when publishing open-source AI training data in their GitHub repository named "robust-models-transfer." The repository was created to make open-source code and AI models for image recognition available. The problem lay in both the scope of access and the permission level granted through the link used to share the data. The researchers created the link with Azure's SAS token feature, which allows sharing data from Azure storage accounts. That's just great when the token is configured to give access only to the data intended to be shared. In this incident, unfortunately, the entire storage account was shared, containing both what was meant to be seen and what wasn't. Talk about oversharing! To add insult to injury, the link granted permission to overwrite and delete files, and its expiry date was set to October 6, 2051.

Among the exposed private information were disk backups of two employees' workstations, which included credentials and more than 30,000 internal messages. What is truly most worrying about this leak, though, is the possibility of malicious hackers tampering with the AI models shared by Microsoft. The repository tells readers to follow the URL, download a model file and feed it into a script. That file is serialized with Python's pickle format, which by design allows arbitrary code execution (a file can run Python code when it is loaded). Since the link also granted write permission, an attacker could have injected malicious code into the AI model and executed commands on the machines of unsuspecting users.
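To make the pickle risk concrete, here is a minimal and harmless sketch. The class name `Payload`, the helper `safe_loads` and the expression being evaluated are made up for illustration; they are not taken from the leaked repository. The second half follows the "restricting globals" pattern from Python's own pickle documentation, which is one possible mitigation, not a complete defense.

```python
import io
import pickle


class Payload:
    """Hypothetical stand-in for a tampered model file."""

    def __reduce__(self):
        # On load, pickle calls eval("6 * 7"). A real attacker would put
        # something far less innocent here (e.g. a shell command).
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Merely loading the bytes runs the embedded call.
result = pickle.loads(blob)
print(result)  # 42


class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuses every global lookup, so no callable can be smuggled in."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data: bytes):
    """Load plain containers and primitives only (illustrative helper)."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()
```

With this sketch, `safe_loads(blob)` raises `pickle.UnpicklingError`, while plain data such as a dictionary of lists still loads. Formats that do not execute code on load are generally a safer choice for sharing models.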

Shortly after being tipped off about this leak, Microsoft invalidated the SAS token to prevent access to the storage account. Not two weeks later, they replaced the token on GitHub. Reportedly, "no customer data was exposed" (thankfully) and "no customer action is required in response to this issue."

Possible security risks when using Azure SAS tokens

Undoubtedly, more and more organizations handle massive piles of data, e.g., as they decide to implement AI, and the cloud provides the availability and scalability they need. The solutions these organizations use to manage their data therefore need to allow secure configurations. This does not negate, though, that security is a shared responsibility: Organizations should configure those solutions so that cybercriminals' prying eyes cannot catch a glimpse of sensitive information.

That being said, let's acknowledge Azure's share of responsibility in an incident like this. Account SAS tokens (the kind involved in this leak) are granular: Before generating one, it's possible to limit what is shared, choose the permissions granted (among 10) and set the start and expiry dates and times. So, the security features are there. However, a couple of nuances make this service less than perfect.

Firstly, generating an account SAS token is not an Azure event but something done on the client side: The client's browser downloads the account key from Azure and signs the generated token with that key. As a result, the token is not an Azure object, and its creation cannot be monitored. A token can be issued, and an admin whose knowledge is based only on what Azure tells them would never learn of it. Even if they were made aware of its issuance, where the token circulates would remain unknown. There is a way, though, to learn of a token once it's used to access a storage account: The storage account must have logging enabled. This can be costly, as logging is paid for per account and prices go up with each account's request volume.

Secondly, an account SAS token cannot be revoked on its own; the only option is to rotate the account key that signed it. Efficient management is thus impaired, as every other token signed by that key is revoked along with it.
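Both nuances follow from how a signed token works, which the sketch below tries to illustrate. It is a simplification under stated assumptions: Azure does sign SAS tokens with HMAC-SHA256 using the account key, but the string-to-sign layout, the function names and the key values here are hypothetical and much simpler than the real format.

```python
import base64
import hashlib
import hmac


def sign_token(fields: dict, account_key: bytes) -> str:
    """Sign token fields locally; nothing is sent to the storage service."""
    # Hypothetical, simplified string-to-sign. The real layout has more
    # fields in a fixed order; only the HMAC idea is illustrated here.
    string_to_sign = "\n".join(f"{k}={fields[k]}" for k in sorted(fields))
    digest = hmac.new(account_key, string_to_sign.encode(), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def is_valid(fields: dict, signature: str, current_key: bytes) -> bool:
    """All the service can do per request: recompute the signature and compare."""
    return hmac.compare_digest(sign_token(fields, current_key), signature)


old_key = b"account-key-1"  # made-up key material
narrow = {"sp": "r", "se": "2023-10-01"}      # read-only, short-lived
broad = {"sp": "rwdl", "se": "2051-10-06"}    # write/delete, decades-long

# Both tokens are minted offline, so no creation event exists to monitor.
sig_narrow = sign_token(narrow, old_key)
sig_broad = sign_token(broad, old_key)

# Rotating the key is the only "revocation", and it hits every token at once.
new_key = b"account-key-2"
print(is_valid(broad, sig_broad, new_key))    # False
print(is_valid(narrow, sig_narrow, new_key))  # False
```

Because validity depends only on the key, the service keeps no list of issued tokens to inspect or cancel individually, which is why the well-configured token dies together with the dangerous one.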

The researchers at Wiz, who discovered the data leak, thoroughly explained these issues, as well as the security risks of a service that allows creating links with (optionally) excessive permissions and (optionally) a practically infinite lifetime. It's true, Azure's tool makes dangerous combinations possible; but, as we said above, the client's responsibility for a secure configuration needs to be taken into account as well. Specifically, Microsoft's data leak sends a message of caution for organizations to review their data management governance in the cloud, lest they end up having their sensitive information up for grabs. So, let's look at what organizations can do from a preventive approach to cybersecurity.

How to prevent leaks like this

The following are some recommendations for using SAS tokens securely:

  • Take a good look at the data intended to be shared and identify the ways in which they can be misused.

  • Look into leveraging service SAS tokens, instead of account ones, and establishing a server-side stored access policy. This is a combination that grants access at the resource level rather than the whole storage account level and allows managing permissions and expiry time.

  • Create SAS tokens only for storage accounts dedicated to external sharing.

  • Check how long data should be shared, as some portion of it might not need to be shared indefinitely.

  • Check that the permissions are just those necessary to fulfill the objective(s) of sharing the data.

  • If the budget allows it, enable logs that detail SAS token access, signing key and permissions assigned.

  • Scan repositories continuously with a cloud security posture management (CSPM) tool to identify SAS tokens and detect leakage and misconfigurations regarding scope and permissions.
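As a rough idea of what such checks look like, the sketch below searches text for SAS-style URLs and flags broad permissions, long lifetimes and account-level scope. It is no substitute for a CSPM tool. The query parameters `sp`, `se` and `srt` are standard SAS fields, but the regular expression, the seven-day limit, the set of "risky" permissions and the sample URL are assumptions chosen for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# Heuristic: an https URL whose query string carries a "sig=" parameter.
SAS_URL = re.compile(r"https://[^\s\"'<>]+\?[^\s\"'<>]*\bsig=[^\s\"'<>]+")

RISKY_PERMISSIONS = set("acduw")   # add, create, delete, update, write
MAX_LIFETIME = timedelta(days=7)   # assumed policy, adjust to your needs


def audit_sas_url(url: str, now: datetime) -> list:
    """Return human-readable findings for one SAS URL."""
    query = parse_qs(urlsplit(url).query)
    findings = []

    risky = sorted(set(query.get("sp", [""])[0]) & RISKY_PERMISSIONS)
    if risky:
        findings.append("grants more than read/list: " + "".join(risky))

    expiry_raw = query.get("se", [""])[0]
    if expiry_raw:
        expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry - now > MAX_LIFETIME:
            findings.append("expires too far in the future: " + expiry_raw)

    if "srt" in query:  # signed resource types appear in account SAS tokens
        findings.append("account SAS: consider a service SAS instead")
    return findings


def scan_text(text: str, now: datetime) -> dict:
    """Map every SAS-looking URL found in the text to its findings."""
    return {m.group(0): audit_sas_url(m.group(0), now) for m in SAS_URL.finditer(text)}


# Made-up README line with a token shaped like the one in this incident.
sample = ("Download the model: https://example.blob.core.windows.net/models/"
          "model.ckpt?sv=2020-08-04&ss=b&srt=sco&sp=rwdl"
          "&se=2051-10-06T07:00:00Z&sig=REDACTED")
report = scan_text(sample, datetime(2023, 9, 18, tzinfo=timezone.utc))
for url, findings in report.items():
    print(findings)
```

Running a check like this in a pre-commit hook or CI job would catch a link like the sample before it is published, while a read-only, short-lived service SAS would pass without findings.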

To prevent the creation of SAS tokens altogether, the recommendation has been to block access to the operation that lists storage account access keys. Generating user delegation SAS tokens, which rely on a user key instead of an account key, remains possible, though.

Manage your cloud security posture with Fluid Attacks

We know that many organizations need to handle progressively greater amounts of data in the cloud. If they misconfigure the security features of the cloud services they use, the sensitive data and users they are supposed to secure are at risk. We've talked elsewhere about the importance of checking with CSPM that your cloud-based systems and infrastructures comply with security requirements, prioritizing detected issues and resolving them as soon as possible. Moreover, we've argued that such activities need to be done all the time, keeping pace with development and the evolution of cyber threats (in DevSecOps fashion). That is why we offer Continuous Hacking, which performs CSPM continuously, along with other techniques, and provides the means and guidance to fulfill further vulnerability management steps. If you would like a taste of how we can help you prevent data leaks now, start your free trial.
