Businesses may be sitting on a gold mine of data that’s crucial for innovation. But it can be tricky to make use of it securely – whether it’s sharing sensitive information with prospective third-party software vendors or using it to train AI. Increasingly, companies are turning to a relatively novel enterprise solution to end the stand-off between data compliance and innovation: synthetic data, which is artificially created but often based on real-world datasets.
To develop synthetic data, information from almost any source is analysed to detect structures and patterns, which are then used as the foundation for creating new datasets that mimic the core characteristics of the original. Large language models excel at generating realistic synthetic data. However, careful solution design and validation are needed to ensure statistical fidelity while preserving privacy – because realistic doesn’t always mean reliable. Strict application of Responsible AI principles is critical.
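To make that fit-then-generate pattern concrete, the short Python sketch below fits a simple multivariate normal model to a small, hypothetical tabular dataset, samples new records from it, and checks statistical fidelity with a two-sample Kolmogorov–Smirnov test per column. The data, column names and model choice are all illustrative assumptions; real projects use far richer generators, from copulas to GANs to LLM-based synthesisers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Stand-in for a real dataset: 1,000 records with two correlated numeric
# columns (say, age and income). Hypothetical data for illustration only.
age = rng.normal(45, 12, 1_000)
income = 800 * age + rng.normal(0, 5_000, 1_000)
real = np.column_stack([age, income])

# 1. Detect structure: column means and the covariance between columns.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# 2. Generate: sample brand-new, artificial records from the fitted model.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

# 3. Validate fidelity: a two-sample KS test per column checks that each
#    synthetic marginal is statistically close to the real one.
for i, name in enumerate(["age", "income"]):
    result = stats.ks_2samp(real[:, i], synthetic[:, i])
    print(f"{name}: KS statistic = {result.statistic:.3f}, "
          f"p-value = {result.pvalue:.3f}")
```

A high p-value suggests the synthetic column is statistically indistinguishable from the real one; in practice, fidelity checks on correlations and on downstream model performance would follow.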
What business problems can it address, and for whom?
Identifying, collecting, and structuring relevant data in ways that enable it to inform business decisions is time-consuming, expensive and potentially risky. Synthetic data can convey the shape of the real thing – a directional signal on the path forward – without exposing the original records or the sensitive information they contain.
There are potential applications for synthetic data in almost every business, with CIOs, CTOs, CISOs, and the research and development, data and analytics, legal and compliance, and marketing and sales departments likely already exploring their options. Industries that deal with issues of data privacy and access – notably, healthcare, pharmaceuticals and life sciences, and financial services – are likely to see the greatest benefits.
How does it create value?
Synthetic data is often a lower-cost, faster way to access vast quantities of data than traditional data collection and curation methods. This means it has the potential to turbocharge the data-driven transformation of every industry by becoming the foundation for training machine-learning models and AI. This in turn enables the development of new products, services and ways of working – finally delivering on the promise of ‘big data’ that got us all so excited a few years back.
Beyond cost savings, synthetic data fundamentally transforms innovation cycles by removing data sourcing as a bottleneck. Traditional R&D processes often stall while waiting for sufficient real-world data collection, but synthetic data enables rapid experimentation and iteration.
Synthetic data is already being used in many industries. Amazon used synthetic data about speech patterns, syntax, and semantics to improve multilingual speech recognition in its Alexa virtual assistant.1 The UK’s National Health Service (NHS) has converted real-world data on patient admissions for accidents and emergency (A&E) treatment into a statistically similar but anonymised open-source dataset to help NHS care organisations better understand and meet the needs of patients and healthcare providers.2 This kind of health data has also been leveraged by Alphabet and US insurance company Anthem to improve insurance fraud detection.3
More advanced applications are now emerging, including truly dynamic digital twins that simulate behaviours rather than just replicating static properties. These simulations enable testing in environments that would be dangerous, costly or impossible to reproduce in the real world. There is also the AI ‘data flywheel’ effect: as exemplified by DeepSeek’s R1 model, an AI system can generate realistic synthetic data at scale, which is then used to train more capable models – a recursive loop that compounds capability gains.
What are the risks?
This is still a relatively early-stage technology, and as with any other machine-generated information, the output is only as good as the inputs and algorithms. Anomalies and outliers in the source data can be amplified or lost altogether; either outcome makes the end product less representative of the real data it’s meant to replace. Synthetic datasets might also inadvertently retain personally identifiable information from the source, which could violate people’s privacy and expose organisations using the data to legal action.
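One common safeguard against the leakage risk is to screen a synthetic dataset for near-copies of real records before it is released. The sketch below is a simplified illustration – the datasets, feature encoding and distance threshold are all assumptions – that flags any synthetic record whose nearest real neighbour is suspiciously close.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Stand-ins for the real and synthetic datasets (hypothetical numeric features).
real = rng.normal(size=(1_000, 4))
synthetic = rng.normal(size=(1_000, 4))
synthetic[0] = real[17]   # simulate an accidental near-copy of a real record

# Distance from each synthetic record to its nearest real record.
tree = cKDTree(real)
distances, _ = tree.query(synthetic, k=1)

# Records essentially identical to a real record are leakage candidates
# that should be reviewed (or dropped) before the dataset is released.
threshold = 1e-6          # illustrative value; tune per dataset and encoding
flagged = np.where(distances < threshold)[0]
print(f"{flagged.size} synthetic record(s) flagged as near-copies: {flagged}")
```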
Generative AI has been known to ‘hallucinate’ incorrect information when it fails to recognise anomalies in its underlying model and draws conclusions that seem statistically likely but are not supported by the actual data. Any synthetic datasets created from those hallucinations inherit the errors. Some fear that, because of this phenomenon, the proliferation of synthetic data could over time introduce feedback loops – a failure mode researchers call ‘model collapse’ – that make AI-generated information progressively less reliable.
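That concern can be made tangible with a toy simulation of the feedback loop. In this deliberately simplified sketch, a single Gaussian distribution stands in for a full generative model, and every parameter is illustrative: each ‘generation’ fits the model to the previous generation’s output and then trains only on its own samples.

```python
import numpy as np

rng = np.random.default_rng(7)
samples = rng.normal(loc=0.0, scale=1.0, size=50)  # generation 0: "real" data

for generation in range(1, 301):
    mu, sigma = samples.mean(), samples.std()      # fit the toy "model"
    samples = rng.normal(mu, sigma, size=50)       # train only on own output
    if generation % 100 == 0:
        print(f"generation {generation}: std = {samples.std():.4f}")
```

Over successive generations the spread typically drifts towards zero: rare cases and outliers disappear first, and each generation becomes a less faithful copy of the original data.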