Is synthetic data the future of customer listening?

Voice of the customer (VoC) is a foundational pillar of CX. Understanding and acting on customer needs is central to ensuring customers not only feel heard, but also receive the experiences they expect. However, VoC can be labor intensive. Survey design and analysis, focus groups, and sentiment analysis all require hours of work. Recent years have seen a shift to passive VoC, where behavioral signals are mined for insights at scale using AI. While this has brought benefits and new capabilities, it still has limits. Now, CX is pivoting again and experimenting with synthetic data.

This article explores what synthetic data is, how it is being used in CX, and what practitioners need to know when adding it to their own customer listening toolkit.

What is synthetic data and why is it important to CX?

Driving innovation while protecting privacy, synthetic data is seen as a way for CX teams to test, as many times as they need, how customers will respond to an experience or product change, an operational update, a new digital feature, or a new policy before it goes live.

At AT&T, synthetic data and generative AI are used to predict storms and natural disasters to help the carrier manage its network infrastructure and user demand.

Synthetic customer profiles can be used to soft launch a new feature or product and iron out the creases before real customers experience it. Some organizations, like CVS Health, have taken this a step further to create an army of agentic AI twins that are used to road test customer sentiment on every customer-related decision the business makes.

The potential is so vast that Bill Staikos, founder and managing partner of Be Customer Led, says synthetic data is one of two major trends he sees shaping the future of CX.

He told CX Network: "Synthetic data will let companies test changes and predict customer reactions faster, cheaper, and with fewer privacy risks. We will also see things like synthetic NPS emerge as a new offering."

Research published by Qualtrics in 2025 found 95 percent of market researchers are already using or planning to use synthetic data within the next 12 months to generate new customer insights, fill data gaps, replace or augment traditional surveys, and simulate audience segments. Qualtrics' respondents said it allowed for improved speed (84 percent) and greater depth (79 percent) of insights.

While use cases exist, Alexandre Mahe, partner at EY Studio+, says CX is only just getting started with synthetic data. Most organizations currently prioritize investment in AI use cases, such as copilots and employee-facing tools, rather than fully scaling capabilities across CX.

He explains: "In practice, synthetic data is typically deployed through targeted initiatives or pilot projects, often within data platforms. Typical use cases include scaling prototypes, simulating customer scenarios, or enriching datasets without exposing personal data. At this stage, it has not yet become a standard, enterprise-wide capability."

The benefits of synthetic data in CX

When it comes to customer understanding, there are many benefits to using synthetic data over live data, such as reducing research costs and enabling instant insights into potentially thousands of scenarios without the fatigue or capacity limits that plague real-life fieldwork.

Synthetic data can help represent diverse customer personas, or reflect the views of underserved populations. It can also be used to help predict CLTV and churn, and because it is AI-generated, the data is clean and ready to use.

Mahe lists several advantages:

Reduced privacy and compliance risk: "Synthetic data mirrors behavioral patterns without exposing real individuals or personally identifiable information (PII)," Mahe says.
Scalability: Organizations can also generate large volumes of data on demand, "including rare or sensitive scenarios that may not occur frequently in real-world datasets," Mahe explains.
Faster experimentation: The use of synthetic data allows teams to train, test, and validate models "without waiting for sufficient real-world data to accumulate," Mahe says.
Enhanced insights: Finally, synthetic data enables the exploration of edge cases – such as low-potential customers or non-conversion scenarios – and, Mahe says, "helps uncover patterns that may be difficult to identify in customer behavior."

Unlike real customer data, synthetic data is not subject to the same privacy and security compliance obligations. This is because of how it is created and anonymized.

However, rules do still apply.

"While synthetic data reduces risk, it does not eliminate regulatory obligations," Mahe says. "Organizations must still comply with core data protection principles, including GDPR requirements around lawful data usage, data minimization and privacy by design."

A critical point, Mahe says, is re-identification risk. "Synthetic data must not be traceable back to real individuals. Clear governance, strong controls and robust documentation are essential to ensure synthetic datasets cannot be used as a backdoor to personal data," he explains.

Consent is also key. The raw data that feeds synthetic models comes from real customers who have allowed their data to be captured and analyzed for specific use cases. Storing that data long-term and processing it for additional purposes for which consent has not been granted can lead to significant legal issues.

The agentic twins created by CVS Health are modeled on data from real customers and patients – rather than averages – enabling more precise insights across different audiences, particularly harder to reach populations. To do this, it collected 2.9 million consented responses from more than 400,000 participants across more than 200 behavioral scenarios.

Getting it right: Accuracy and updates

Although there are many benefits to using synthetic over real data, not all practitioners are sold on the idea and much of the hesitation is focused on accuracy. Like all AI, rubbish in equals rubbish out.

Synthetic data can amplify any bias that exists in the source data, and it can generate false positives. Despite being used as a way to better represent customers who typically do not engage with VoC, it can sometimes lack nuance. It may also not be adjusted to reflect emerging developments, such as rapidly changing world events or interest rate changes.

Ipsos says the "quality and reliability of synthetic data is entirely dependent on the real human data used to create and update it, as well as the expertise of the people behind it all".

In tests to evaluate whether GPT-4o could replicate consumer choice data accurately and reliably through synthetic data generation, marketing strategy consultancy Forethought Outcomes recorded mixed results.

It found the tool was capable of mimicking "broad consumer patterns found in human data, which is reflected in a high correlation in the rank-ordering of preferences for attributes across synthetic and human data".
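To make "high correlation in the rank-ordering of preferences" concrete, the following is a minimal Python sketch that computes a Spearman rank correlation between human and synthetic preference scores. The attribute names and scores are invented for illustration and are not Forethought Outcomes' data, and the choice of Spearman correlation is an assumption, since the measure the consultancy used is not stated here.

```python
from math import sqrt

def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical attribute importance scores (assumed values, not study data).
attributes = ["price", "quality", "brand", "delivery"]
human_scores = [0.40, 0.25, 0.20, 0.15]
synthetic_scores = [0.55, 0.20, 0.22, 0.03]

rho = spearman(human_scores, synthetic_scores)
print(f"Rank-order agreement: {rho:.2f}")
```

In this invented example the synthetic data swaps the order of two middle attributes, so the rank correlation stays high (0.8) even though the underlying scores differ noticeably. That is the kind of gap between broad patterns and finer detail the test results describe.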

However, when it drilled down into the results, the synthetic data struggled with the "nuanced behavioural insight essential for accurate business decision-making". It also lacked variability and so could not provide "meaningful market segments".

Bayesian AI consultancy PyMC Labs evaluated how synthetic consumers answered political and lifestyle questions in a bid to establish if LLMs could replace human survey respondents, based on today's available technology.

It found that performance varied across the LLMs tested and that even though some LLMs performed better than others, the same LLM didn't necessarily perform the best every time. In conclusion, PyMC Labs said the best approach to take was to apply an ensemble algorithm that combines responses from multiple LLMs. It said this "might perform better than any of them alone".
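One simple way to picture an ensemble of this kind is to average the answer distributions that several models produce for the same survey question. The Python sketch below does that. It is an assumption for illustration only: the specific ensemble algorithm PyMC Labs had in mind is not described here, and the model names and answer shares are hypothetical.

```python
def ensemble_distribution(model_distributions, weights=None):
    """Combine per-model answer distributions into one weighted average."""
    if weights is None:
        # Default assumption: every model counts equally.
        weights = [1.0] * len(model_distributions)
    total_weight = sum(weights)
    options = sorted({option for dist in model_distributions for option in dist})
    combined = {}
    for option in options:
        weighted_sum = sum(
            weight * dist.get(option, 0.0)
            for weight, dist in zip(weights, model_distributions)
        )
        combined[option] = weighted_sum / total_weight
    return combined

# Hypothetical answer shares from three unnamed LLMs for one survey question.
model_a = {"agree": 0.6, "neutral": 0.3, "disagree": 0.1}
model_b = {"agree": 0.3, "neutral": 0.3, "disagree": 0.4}
model_c = {"agree": 0.6, "neutral": 0.3, "disagree": 0.1}

combined = ensemble_distribution([model_a, model_b, model_c])
for option, share in combined.items():
    print(f"{option}: {share:.2f}")
```

The optional `weights` argument is a hypothetical extension that would let a team give more influence to models that have matched human responses more closely in past validation.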

Will 2026 be the year of synthetic data?

While the technology exists and the benefits are clear, safeguards need to be in place to ensure synthetic data can be trusted to produce the immutable truths required to guide strategy.

The better and broader the data, the more chance an organization has of generating trustworthy insights. Validation must also be robust.

"Accuracy and reliability depend on systematic validation against real data," Mahe says. "Organizations need to ensure that key distributions, correlations and underlying business logic remain consistent."

In practice, Mahe says this involves a combination of:

Quantitative validation, such as statistical similarity testing and model performance comparison.
Qualitative validation, drawing on business expertise.
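As a rough illustration of statistical similarity testing, the Python sketch below computes a two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between the cumulative distributions of a real and a synthetic sample. The choice of this statistic and the sample values are assumptions for illustration; Mahe does not name a specific test.

```python
from bisect import bisect_right

def ks_statistic(real, synthetic):
    """Largest absolute gap between the two empirical cumulative distributions.

    Returns a value between 0.0 (identical distributions) and 1.0 (no overlap).
    """
    real_sorted = sorted(real)
    synthetic_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap can only change at observed values, so check each one.
    for value in sorted(set(real_sorted) | set(synthetic_sorted)):
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synthetic = bisect_right(synthetic_sorted, value) / len(synthetic_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synthetic))
    return max_gap

# Hypothetical monthly spend values (assumed numbers, not real customer data).
real_spend = [20, 35, 40, 55, 60, 75, 80, 95]
synthetic_spend = [22, 30, 45, 50, 65, 70, 85, 90]

gap = ks_statistic(real_spend, synthetic_spend)
print(f"KS statistic: {gap:.3f}")
```

A small statistic suggests the synthetic values follow a similar distribution to the real ones for that single variable. It says nothing about correlations between variables or business logic, which is why it would sit alongside the other checks listed above rather than replace them.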

"If models trained on synthetic data perform consistently when applied to real world data, this provides confidence that the synthetic data is fit for purpose," Mahe adds.
