Synthetic data: What you need to know

Content

What is synthetic data?
Practical applications
Controversies and criticisms
Final thoughts

Published by Forsta

July 31, 2024 / September 4, 2024

There’s a major buzz around AI in the market research space, and synthetic data is on everyone’s lips. Imagine having a dataset that behaves like your target market but doesn’t involve waiting for responses or asking personal questions. That’s the magic of synthetic data.

Naturally, there are concerns and criticisms, namely whether synthetic data comes close to replicating organic human responses.

We’re going to explore what’s got everyone hyped up, and what you need to keep in mind if you’re going to use synthetic data.

What is synthetic data?

It’s time to understand exactly what synthetic data is. It’s a broad term for information that is artificially created. That can cover many types of data, but for research purposes there are a few specific use cases, which we explain later. For now, we can narrow the definition down by saying synthetic data is:

Artificially generated
Designed to mimic the real world
Customizable

Synthetic data is, by nature, manufactured or artificial data. Instead of being collected from real-world events like surveys, it’s created algorithmically by AI models. In market research, it’s typically used to augment participant responses from traditionally run primary research or to create digital personas (more on these later).

Tabular and structured synthetic data is the new frontier in market research AI. Computer-generated information has become indispensable in this new data-driven era. It’s cost effective, can be automatically annotated and analyzed, and gets around the logistical, privacy, and some of the ethical issues associated with sensitive or hard-to-reach target audiences. It’s so powerful that Gartner estimates synthetic data will overshadow real data for training AI models by 2030.

An important distinction is that synthetic data doesn’t come from nothing. It’s informed and supported by real-world data from real people. The algorithm you’re using or developing must first learn the patterns, correlations, and statistical properties of the training data. The synthetic part then expands the original data set, giving you options for advanced analysis. You can even go further and test new experiences or questions to see how your target audience would react.

Synthetic data models are also more flexible than human researchers. Unlike a person, a model isn’t going to get overwhelmed by a mass delivery of data. You can create bigger, smaller, fairer, or richer versions of the original data in an instant, producing new perspectives and decisions backed by evidence.

This flexibility also blurs the distinction between qualitative and quantitative data. AI’s language tools can create detailed, descriptive data and measure it accurately in real time.

Practical applications

Digital personas

The real hot topic is creating personas entirely or mostly from AI. Running these personas through traditional research methods produces new results to work with.

Ever thought “ugh, really should have included that in the survey”? Well, it might just be possible to ask after the fact. Built from existing target audience data, virtual personas are designed to react and respond like the real thing. Used for both qualitative and quantitative data, these digital personas give you an opportunity to gather insights without relying on survey responses, which is particularly useful for re-working collected data on sensitive topics.

One of the most powerful use cases for this application is testing. Trialing hypotheses and checking research designs can slash costs and produce a ready-to-go approach you know will work to get the data you’re after.

Expanding audiences

Augmenting data is one of the things researchers are most excited about. It’s been around a while, but development and adoption have accelerated recently. The idea of supplementing traditional research to reduce research time and get more out of difficult-to-reach audiences is thrilling.

AI learns the underlying probability distribution of your sample audience. By identifying these patterns, the models can then generate additional sample members that resemble the original audience. It’s not just analysis; it’s new data points that reflect the answers your target audience would give. This is particularly useful in scenarios where traditional collection is limited or expensive, for example when you are trying to reach a hard-to-reach audience like busy parents or collecting data on a sensitive topic like healthcare.
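To make the idea concrete, here is a minimal sketch of learning a distribution from a sample and then drawing new respondents from it. It is a toy, not how any particular product works: the rows, the segment names, and the `fit` and `sample` helpers are all hypothetical, and the model assumes that answers are independent of each other once the segment is known.

```python
import random
from collections import Counter, defaultdict

# Hypothetical survey rows: (segment, answer to Q1, answer to Q2).
real_rows = [
    ("parent", "agree", "weekly"),
    ("parent", "agree", "monthly"),
    ("parent", "disagree", "weekly"),
    ("non_parent", "disagree", "monthly"),
    ("non_parent", "disagree", "rarely"),
    ("non_parent", "agree", "rarely"),
]

def fit(rows):
    """Count how often each segment occurs and how each segment answers each question."""
    segment_counts = Counter(row[0] for row in rows)
    answer_counts = defaultdict(Counter)  # (segment, question index) -> answer counts
    for segment, *answers in rows:
        for q_index, answer in enumerate(answers):
            answer_counts[(segment, q_index)][answer] += 1
    n_questions = len(rows[0]) - 1
    return segment_counts, answer_counts, n_questions

def sample(model, n, seed=0):
    """Draw n synthetic respondents: pick a segment, then an answer per question."""
    segment_counts, answer_counts, n_questions = model
    rng = random.Random(seed)
    segments = list(segment_counts)
    segment_weights = [segment_counts[s] for s in segments]
    synthetic = []
    for _ in range(n):
        segment = rng.choices(segments, weights=segment_weights)[0]
        answers = []
        for q_index in range(n_questions):
            counts = answer_counts[(segment, q_index)]
            options = list(counts)
            weights = [counts[o] for o in options]
            answers.append(rng.choices(options, weights=weights)[0])
        synthetic.append((segment, *answers))
    return synthetic

model = fit(real_rows)
synthetic_rows = sample(model, 1000)
```

Because answers are drawn per question, the sketch can produce combinations that never appeared in the sample (a parent who disagrees and answers "monthly", say) while never producing an answer a segment did not give. Real generators model far richer dependencies, but the principle of fitting first and sampling second is the same.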

Privacy-safe versions of datasets for sharing

Even when you have substantial human-gathered data, sharing it can be a roadblock. Instead of masking or randomizing to anonymize the results, why not create meaningful copies of sensitive data that reflect all your findings? Synthetic customer datasets can be shared and collaborated on safely without fear of privacy breaches. Because generated data is made from scratch, you don’t risk identifying original subjects or losing utility by removing information. Traditional anonymization techniques suffer from the so-called privacy-utility trade-off: the more you anonymize your data, the less useful it becomes. With synthetic data, all the original patterns of correlation are still present, so you can avoid that trade-off.
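If you want a quick sanity check before sharing, one rough option is to measure how many synthetic rows are exact copies of real ones. The sketch below assumes rows are simple tuples, and `exact_copy_rate` and the sample data are hypothetical names made up for illustration. An exact-match rate is only a crude proxy, not a privacy guarantee.

```python
def exact_copy_rate(real_rows, synthetic_rows):
    """Share of synthetic rows that are identical to some real row."""
    if not synthetic_rows:
        return 0.0
    real_set = set(real_rows)
    copies = sum(1 for row in synthetic_rows if row in real_set)
    return copies / len(synthetic_rows)

# Hypothetical records: (age, city, opted_in).
real = [("34", "London", "yes"), ("51", "Leeds", "no")]
synthetic = [
    ("34", "London", "yes"),  # identical to a real record
    ("36", "London", "yes"),
    ("49", "Leeds", "no"),
    ("52", "Leeds", "no"),
]

rate = exact_copy_rate(real, synthetic)
print(f"{rate:.0%} of synthetic rows duplicate a real row")
```

A high rate suggests the generator is memorizing rather than generating, which is worth investigating before the dataset leaves the building.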

Controversies and criticisms

Synthetic data has its downsides.

Algorithm limitations

Large Language Models (LLMs) like ChatGPT work with data to create statistical models of text, but they don’t understand the meaning of the sample or of their results. The ultimate question here is: are the results correct? Tiny human nuances aren’t picked up by AI, and sometimes it just can’t handle more complex issues or context.

It should be kept in mind that synthetic data models can only repeat patterns and likely results already found in the sample data. This isn’t to say they can’t find patterns you wouldn’t have noticed yourself, or extrapolate results into similar situations, but the output is only as good as the input: the original human-first market research results.

Bias amplification

AI ethics and bias are a vast topic by themselves. In short, AI is likely to show bias, and we can’t fix this. Human-sourced content will naturally contain some bias; it’s an intrinsic part of being human. And AI learns from us. If these patterns are present in the source data, they can be repeated or amplified in AI-generated results. Take the famous case of Amazon’s scrapped recruiting tool. Despite not being asked for a gender split in results, the AI accidentally learnt that male candidates were preferable, because the training data reflected human bias.
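A toy example shows how amplification can happen. The sketch below is not a reconstruction of any real system: the groups, the numbers, and the `majority_model` helper are hypothetical. It assumes a deliberately naive model that always outputs the most common historical outcome for each group.

```python
from collections import Counter

# Hypothetical historical decisions: (group, outcome).
history = (
    [("group_a", "hired")] * 70 + [("group_a", "rejected")] * 30
    + [("group_b", "hired")] * 40 + [("group_b", "rejected")] * 60
)

def hire_rate(rows, group):
    """Share of rows for this group with the outcome 'hired'."""
    outcomes = [outcome for g, outcome in rows if g == group]
    return outcomes.count("hired") / len(outcomes)

def majority_model(rows):
    """Learn the single most common outcome for each group."""
    counts = {}
    for group, outcome in rows:
        counts.setdefault(group, Counter())[outcome] += 1
    return {group: c.most_common(1)[0][0] for group, c in counts.items()}

model = majority_model(history)

# Generate 100 new decisions per group from the learned model.
generated = [(group, model[group]) for group in ("group_a", "group_b") for _ in range(100)]

gap_before = hire_rate(history, "group_a") - hire_rate(history, "group_b")
gap_after = hire_rate(generated, "group_a") - hire_rate(generated, "group_b")
print(f"gap in source data: {gap_before:.0%}, gap in generated data: {gap_after:.0%}")
```

A 30-point gap in the source data becomes a 100-point gap in the generated data, because the model turns a tendency into a rule. Real models are subtler, but the direction of the risk is the same.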

Confirmation bias can also result from generated data. After all, the model only has the provided training data to work with, so unexpected results or deeper meanings can be missed.

Trying to use AI itself to detect bias falls short because these models have no concept of ‘right’ and ‘wrong’, so there is no place to start, and there is no bias-free, human-made training data from which to build a new algorithm. Machines aren’t great with ambiguity, and bias can be subtle. The AI might not understand the data being fed into it, but you should. Inspect your data for bias or collection gaps and acknowledge the relevant issues that may occur in your research.

Reliability and validity

Two vital words in research. How does AI stack up? Some critics go as far as saying that, because AI models don’t understand what they’re saying, ‘synthetic users’ are useless. AI models can only replicate patterns of emotion, not express true feelings when asked for more context. Ultimately, until more studies comparing human and synthetic data are published, we won’t know how far we can push AI before it becomes unreliable and invalid.

Understanding that you may not be getting the full breadth of human emotions or experiences can keep you from becoming too reliant on generative models and from neglecting human-run research.
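One simple way to compare human and synthetic answers to the same question is to measure how far apart the two answer distributions are. The sketch below uses total variation distance, which is 0 when the distributions match and 1 when they share no answers. The function names and the sample answers are hypothetical, and a single number like this is only a starting point, not a full validity study.

```python
from collections import Counter

def answer_shares(answers):
    """Turn a list of answers into a mapping of answer -> share of responses."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}

def total_variation_distance(human_answers, synthetic_answers):
    """Half the summed absolute difference between the two answer distributions."""
    p = answer_shares(human_answers)
    q = answer_shares(synthetic_answers)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in set(p) | set(q))

# Hypothetical answers to one survey question.
human = ["agree"] * 50 + ["neutral"] * 30 + ["disagree"] * 20
synthetic = ["agree"] * 60 + ["neutral"] * 30 + ["disagree"] * 10

tvd = total_variation_distance(human, synthetic)
print(f"total variation distance: {tvd:.2f}")
```

Here the synthetic panel overstates agreement by ten points and understates disagreement by ten, giving a distance of 0.10. What counts as close enough is a judgment call that depends on the decision the research is meant to support.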

Overreliance

The era of synthetic data has clearly arrived, but some think we’ve been too eager to embrace convenience. With fears that the alluring potential of synthetic data will wipe out traditional research, we need to stay realistic about what it can and can’t do. If we become reliant on it and blind to potential errors or to a lack of evidence for synthesized results, decisions will fall flat, and the output could be damaging.

Final thoughts

News of substituting humans makes for a great headline, but the experts using this tech understand its limitations and know that traditional research isn’t going anywhere any time soon. The landscape has changed, and AI can be a powerhouse when used properly. For example, Forsta Surveys can work with synthetic panels: it can collect data from real respondents as well as from a synthetic panel, where an AI algorithm answers the questions, and present this data alongside the human results.

The option to augment existing data and wring every opportunity out of a sample using synthetic data is a game changer, cutting costs and time and allowing you to discover more than ever before. But the hard work is in the setup and in working with the limitations. Synthetic data expands, expresses, and supports traditional data capture; it is not a replacement.
