Synthetic data: The new fuel for AI?

  • Synthetic data mimics real-world data to enable secure innovation
  • It can be a lower-cost, faster way to access vast quantities of data than traditional methods
  • It is still relatively early-stage tech requiring robust human due diligence

Businesses may be sitting on a gold mine of data that’s crucial for innovation. But it can be tricky to make use of it securely – whether it’s sharing sensitive information with prospective third-party software vendors or using it to train AI. Increasingly, companies are turning to a relatively novel enterprise solution to end the stand-off between data compliance and innovation: synthetic data, which is artificially created but often based on real-world datasets.

To develop synthetic data, information from almost any source is analysed to detect structures and patterns, which are then used as the foundation for creating new datasets that mimic the core characteristics of the original. Large language models excel at generating realistic synthetic data. However, careful solution design and validation are needed to ensure statistical fidelity while preserving privacy – because realistic doesn’t always mean reliable. Strict application of Responsible AI principles is critical.
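At its simplest, that ‘analyse patterns, then generate’ step can be illustrated by fitting summary statistics to a table and sampling new records from them. The sketch below is purely illustrative and uses a hypothetical numeric table of our own invention: it captures means and pairwise correlations only and applies no formal privacy guarantee, whereas production solutions rely on purpose-built synthesisers and explicit privacy controls.

```python
import numpy as np
import pandas as pd

def generate_synthetic(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Sample new rows that mimic the means and correlations of a numeric table.

    Illustrative only: ignores categorical fields, non-linear structure and
    outliers, and provides no formal privacy guarantee.
    """
    rng = np.random.default_rng(seed)
    values = real.to_numpy(dtype=float)
    mean = values.mean(axis=0)          # per-column averages
    cov = np.cov(values, rowvar=False)  # pairwise covariances
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=real.columns)

# Hypothetical example: a tiny table of customer records.
real = pd.DataFrame({"age": [34, 45, 29, 52, 41],
                     "income": [62_000, 81_000, 48_000, 95_000, 70_000]})
synthetic = generate_synthetic(real, n_rows=1_000)
print(synthetic.describe())  # similar spread to the original, but no row is a real customer
```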

What business problems can it address, and for whom?

Identifying, collecting and structuring relevant data in ways that enable it to inform business decisions is time-consuming, expensive and potentially risky. Synthetic data can give a clear impression of the real-world data it is modelled on – a directional suggestion on the path forward – without exposing its origins or any sensitive information.

There are potential applications for synthetic data in almost every business, with CIOs, CTOs, CISOs, and the research and development, data and analytics, legal and compliance, and marketing and sales departments likely already exploring their options. Industries that deal with issues of data privacy and access – notably, healthcare, pharmaceuticals and life sciences, and financial services – are likely to see the greatest benefits.

How does it create value?

Synthetic data is often a lower-cost, faster way to access vast quantities of data than traditional data collection and curation methods. This means it has the potential to turbocharge the data-driven transformation of every industry by becoming the foundation for training machine-learning models and AI. This in turn enables the development of new products, services and ways of working – finally delivering on the promise of ‘big data’ that got us all so excited a few years back.

Beyond cost savings, synthetic data fundamentally transforms innovation cycles by removing data sourcing as a bottleneck. Traditional R&D processes often stall while waiting for sufficient real-world data collection, but synthetic data enables rapid experimentation and iteration. 

Synthetic data is already being used in many industries. Amazon used synthetic data about speech patterns, syntax, and semantics to improve multilingual speech recognition in its Alexa virtual assistant.1 The UK’s National Health Service (NHS) has converted real-world data on patient admissions for accident and emergency (A&E) treatment into a statistically similar but anonymised open-source dataset to help NHS care organisations better understand and meet the needs of patients and healthcare providers.2 This kind of health data has also been leveraged by Alphabet and US insurance company Anthem to improve insurance fraud detection.3

More advanced applications are now emerging, including truly dynamic digital twins that simulate behaviours rather than just replicating static properties. These simulations enable testing in environments that would be dangerous, costly, or impossible in the real world. There is also the AI data ‘flywheel’ effect: as exemplified by DeepSeek’s R1 model, AI can generate realistic synthetic data at scale, which is then used to train more advanced models, compounding the gains with each cycle.

What are the risks?

This is still relatively early-stage technology and, as with any other machine-generated information, the output is only as good as the inputs and the algorithms. Anomalies and outliers in the source data can be amplified or lost altogether; either outcome makes the end product less representative of the real data it’s meant to replace. Synthetic datasets might also accidentally retain some personally identifiable information from the source, which could violate people’s privacy and expose organisations using the data to legal action.
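One form that human due diligence can take is an automated screen for source records leaking into the synthetic output. The sketch below is a minimal, hypothetical check using pandas and made-up quasi-identifier columns: it flags only exact matches, so it complements rather than replaces formal privacy evaluation.

```python
import pandas as pd

def exact_match_leakage(real: pd.DataFrame, synthetic: pd.DataFrame,
                        quasi_identifiers: list[str]) -> pd.DataFrame:
    """Return synthetic rows whose quasi-identifier values also appear in the real data.

    A basic screen only: it misses near-duplicates and re-identification via
    attribute combinations not listed here.
    """
    seen = real[quasi_identifiers].drop_duplicates()
    return synthetic.merge(seen, on=quasi_identifiers, how="inner")

# Hypothetical patient-style records and quasi-identifiers.
real = pd.DataFrame({"postcode": ["2000", "3000"], "birth_year": [1980, 1975], "condition": ["A", "B"]})
synthetic = pd.DataFrame({"postcode": ["2000", "4000"], "birth_year": [1980, 1990], "condition": ["A", "C"]})
leaks = exact_match_leakage(real, synthetic, quasi_identifiers=["postcode", "birth_year"])
print(f"{len(leaks)} synthetic rows share quasi-identifiers with real records")  # 1 in this toy example
```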

Generative AI has been known to ‘hallucinate’ incorrect information when it fails to recognise anomalies in the underlying model and draws conclusions that seem statistically likely but are not supported by the actual data. Any synthetic datasets created from those hallucinations inherit the errors. Some fear that, because of this phenomenon, the proliferation of synthetic data could over time create feedback loops that make AI-generated information progressively less reliable.

Cultural and societal implications

As synthetic data creation accelerates, we may soon enter a world where artificially generated content significantly outpaces human-created information. This shift raises thought-provoking questions about culture and discourse when AI increasingly shapes our information landscape.

The industrialisation of food production created unprecedented abundance but also unintended consequences, eventually necessitating more mindful consumption. Similarly, we may face an impending "data abundance" challenge where the ability to distinguish signal from noise becomes paramount. On social platforms today, AI-generated content already achieves remarkable engagement despite having no human creator.

Perhaps our greatest future challenge won’t be generating more data, but rather developing sophisticated ‘data diets’ – prioritising data qualification, filtering and remediation over sheer quantity – that help us identify what’s genuinely valuable in an ocean of synthetic information.

Ensuring the value of synthetic data will require robust human due diligence. Following the guidance of PwC’s ‘Responsible AI’ toolkit can help.

This is an abridged version of an article that originally appeared in PwC’s TechEffect. To learn more about synthetic data and how it may help you, contact Matt Benwell and Murad Khan.


Contact the authors

Matt Benwell

Partner, Digital Strategy and Transformation, PwC Australia


Murad Khan

Partner, Advisory, PwC Australia



References

  1. https://www.amazon.science/blog/tools-for-generating-synthetic-data-helped-bootstrap-alexas-new-language-releases
  2. https://digital.nhs.uk/services/artificial-data
  3. https://www.wsj.com/articles/anthem-looks-to-fuel-ai-efforts-with-petabytes-of-synthetic-data-1165278160