
The quest for data is never-ending for data scientists. With machine learning underpinning the large language models (LLMs) and decision-making algorithms accelerating innovation across industries, the need for data to advance their capabilities has never been more pressing.
Real data is the gold standard, but we are often constrained in obtaining and using it, whether by regulation, the expense of purchasing data sets or the cost of collecting data manually.
But one solution to data scarcity is becoming an increasingly crucial element of machine learning engineering: synthetic data. At its core, synthetic data is new data generated from real data. It is used to automate the data collection process, train traditional machine learning models, fill gaps where real-world data is not available and address imbalances in data sets.
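As a minimal illustration of the idea (the sensor readings below are entirely hypothetical), synthetic records can be produced by fitting simple statistics to a small real sample and then drawing as many new values as needed from the fitted distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# A small sample of real measurements (hypothetical sensor readings).
real_readings = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])

# Fit a simple parametric model to the real data...
mu, sigma = real_readings.mean(), real_readings.std(ddof=1)

# ...and draw as many synthetic readings as we need from it.
synthetic_readings = rng.normal(mu, sigma, size=1000)

print(f"real:      mean={mu:.2f}, std={sigma:.2f}")
print(f"synthetic: mean={synthetic_readings.mean():.2f}, "
      f"std={synthetic_readings.std(ddof=1):.2f}")
```

Real generators are far more sophisticated, ranging from GANs to copulas and simulators, but the principle is the same: learn the shape of the real data, then sample from it.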
Low-risk use cases are growing, such as training self-driving cars in test environments where data can be generated from real-world scenarios. The same applies to robotics, particularly where the range of tasks and the level of automation required are minimal.
The Benefits and Use Cases of Synthetic Data
Synthetic data enables us to plug data gaps by generating net new data that accurately reflects real-world data. Data sets used to train models will always be weighted towards real-world data, but sometimes we simply don't have enough of it to create a viable model. That is where synthetic data helps us reach the critical mass needed for a minimum viable product.
Take fraud data, for example. Fraud detection is an area seeing increasing research and investment in machine learning algorithms. Fraudulent transactions are vastly outnumbered by routine ones, yet even if only 1% of global transactions are fraudulent, the potential cost is astronomical: financial crime costs the global economy an estimated $1.6 trillion annually, a key reason for the increasing investment in AI-based fraud detection solutions. Synthetic data that mimics real fraud data can therefore be used to train fraud-detection models.
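One common way to put this into practice is to oversample the rare fraud class so the model trains on a more balanced set. The sketch below is illustrative rather than a production recipe: it assumes the scikit-learn and imbalanced-learn packages and uses a generated toy dataset in place of real transactions. SMOTE creates new minority-class examples by interpolating between existing ones.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Stand-in for real transaction features: roughly 1% of samples are "fraud".
X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99, 0.01], random_state=0
)
print("before:", Counter(y))   # heavily imbalanced, e.g. ~19,800 vs ~200

# SMOTE synthesises new fraud examples by interpolating between
# real minority-class neighbours, balancing the training set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now roughly equal

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```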
Often, the main cause of data scarcity is a privacy or permissions restriction. For example, a retailer that wants to build a recommendation engine for customers may be limited by the number of customers who have opted in to more advanced data sharing. In this context, synthetic data can help fill the gap, drawing on the data and general demographics of the customers who have opted in. While a small margin of error might be acceptable when recommending clothing, a financial services use case, such as a loan approval model, would require a far more careful application of synthetic data.
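One way that gap-filling might look for tabular customer data is a Gaussian copula: learn each column's distribution and the correlations between columns from the opted-in customers, then sample new, statistically similar records. The column names and figures below are invented for illustration, and this is only one of several possible approaches (libraries such as SDV package similar techniques).

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for opted-in customer demographics.
real = pd.DataFrame({
    "age": rng.normal(38, 10, 500).clip(18, 80),
    "annual_spend": rng.lognormal(6, 0.5, 500),
    "visits_per_month": rng.poisson(3, 500).astype(float),
})

# 1. Map each column into standard-normal space via its empirical ranks.
n = len(real)
normal_scores = pd.DataFrame({
    col: stats.norm.ppf(real[col].rank(method="average") / (n + 1))
    for col in real.columns
})

# 2. Fit the correlation structure in that space (the Gaussian copula).
corr = np.corrcoef(normal_scores.to_numpy(), rowvar=False)

# 3. Sample correlated normal vectors and map them back through each
#    column's empirical quantiles to get synthetic customers.
samples = rng.multivariate_normal(np.zeros(len(real.columns)), corr, size=1000)
uniform = stats.norm.cdf(samples)
synthetic = pd.DataFrame({
    col: np.quantile(real[col], uniform[:, i])
    for i, col in enumerate(real.columns)
})

print(synthetic.describe())
```

The synthetic customers preserve the marginal distributions and correlations of the opted-in sample without copying any individual record, which is exactly the property a privacy-constrained use case needs.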
Another benefit of synthetic data is that, in some contexts, it is more reliable than real-world data. Take an LLM trained on openly available data and focused on NLP tasks: given the amount of bias and misinformation that exists online, synthetic data can be used to mitigate these problems in the training data.
Challenges of Synthetic Data
Of course, synthetic data is always going to be of lower quality than real data, but for use cases where data is lacking, it has enabled us to develop models that would otherwise have had no chance of existing until the requisite real-world data became available.
A more serious concern often raised in relation to synthetic data is model collapse, in which models degrade because of the quality and reliability of the synthetic data they have been trained on. The phenomenon is not entirely new: generative adversarial networks (GANs), for example, have long suffered from the closely related problem of mode collapse, in which the generator converges on returning the same or very similar outputs over and over again, and the model effectively tops out.
More and more data is being created every second. Articles and blogs are written, books are published and conversations happen in real time across the web. The prevalence and availability of real data is unlikely to be challenged by synthetic data, so model collapse isn't something we need to worry about too much. The main risk would arise if generative models began self-publishing material online at scale; data scraped or collected from those sources for training would then increase the likelihood of poor outputs and model collapse.
The Future of Machine Learning and Synthetic Data
More data will always be preferable, so I expect synthetic data and machine learning to go hand in hand for the foreseeable future. The largest LLM providers today are open about their use of synthetic data to train models, but the weighting of real-world versus synthetic data will always depend on the use case. For critical use cases in healthcare, for example, its use will be limited, but for smaller, local use cases, such as small language models (SLMs) that run on a personal device, synthetic data can be used more extensively.
As compute power increases, I can see developers, data scientists and machine learning engineers increasingly creating synthetic data and training smaller models with it. This movement will help to supercharge innovation at every level, enabling the creation of solutions that might otherwise have been out of reach.