AI voice clones are easier to understand than real humans, even in noisy environments

Artificial intelligence voice clones tend to be easier to understand in noisy environments than the actual human voices they imitate. This finding suggests that synthetic speech technology has the potential to significantly improve communication assistance devices for people with language disorders. The study was published in the Journal of the Acoustical Society of America.

Synthetic voices are increasingly part of everyday life, from digital assistants like Siri and Alexa to automated telemarketers and answering machines. With the expansion of generative artificial intelligence, voice cloning has emerged as a new type of synthetic voice. Traditional synthetic voice requires voice actors to spend hours in a recording booth. In contrast, artificial intelligence can generate highly realistic voice clones based on just 10 seconds of recorded audio. This minimal requirement greatly expands the number of potential voices and applications.

People often worry about the social risks of this technology, such as deepfake voices being used for fraud and misinformation. However, the potential benefits of personalized speech synthesis for medical and communication purposes have received less attention. People facing degenerative diseases like Parkinson’s disease or recovering from throat cancer often rely on computers to speak for them. Having a personalized artificial voice helps maintain personal identity. These assistive devices are most useful when the speech produced can be easily understood by people around the user.

Patti Adank, a researcher at University College London, and Han Wang, a researcher at the University of Roehampton, specialize in studying human perception of unclear speech. They were fascinated by the idea of machine-reproduced voices and wanted to see how easily these clones could be understood by ordinary people. The ease with which natural voices are understood varies greatly depending on factors like speaking speed, slight hoarseness, and the strength of regional accents. The researchers expected that voice clones would be poorer representations of real human voices and that people would have a hard time understanding them.

“At first, I thought voice cloning would be unfamiliar and I wouldn’t understand it,” Adank said. “We found that the intelligibility of the voice clones was improved by up to 20%, which was quite shocking. A small part of our paper talks about that experiment, but the bulk of what followed is me and my collaborators desperately trying to figure out what makes the voice clones more intelligible.”

To test the initial intelligibility of these voices, the scientists set up an online experiment with 80 participants. The sample included 40 men and 40 women between the ages of 18 and 35. All participants were native speakers of British English living in the UK and wore wired headphones to ensure optimal sound quality during testing.

The scientists started with an existing database of 10 human voices from different parts of the UK. They extracted approximately 348 seconds of audio for each person. They fed these audio clips into ElevenLabs, a popular artificial intelligence speech generation program. This process created 10 synthetic voice clones, one matching each original human speaker.

The authors then generated 80 different test sentences designed to assess hearing and comprehension. Half of the sentences were spoken by the original humans, and the other half were generated by artificial intelligence. The researchers mixed all the audio clips with a background sound called speech-shaped noise. This type of noise resembles continuous static and effectively masks the sound of the human voice.

The background noise was presented at four different volume levels, ranging from louder than the voice to softer than the voice. Participants listened to each sentence and typed the words they heard. The researchers scored the typed responses to measure how well listeners understood the spoken words in the presence of the noise.
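As a rough illustration of what presenting noise "louder" or "softer" than the voice means, the sketch below scales a noise signal to a chosen signal-to-noise ratio (SNR) before adding it to speech. It is not the researchers' procedure: the four SNR values, the sine-wave stand-in for speech, and the white-noise stand-in for speech-shaped noise are all assumptions made for this example.

```python
import math
import random

def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise RMS ratio equals snr_db, then add it."""
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled

random.seed(0)
# Stand-in signals (assumptions): a 150 Hz tone instead of real speech,
# and white Gaussian noise instead of true speech-shaped noise.
speech = [math.sin(2 * math.pi * 150 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

# Hypothetical levels: negative SNR means the noise is louder than the voice.
for snr_db in (-6, -3, 0, 3):
    mixed, scaled = mix_at_snr(speech, noise, snr_db)
    achieved = 20 * math.log10(rms(speech) / rms(scaled))
    print(f"requested {snr_db:+d} dB, achieved {achieved:+.2f} dB")
```

At 0 dB the voice and the noise have equal energy, so listeners generally find the negative levels hardest.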

Cloned voices provided significant benefits to listeners. Participants correctly identified words from the artificial speech 67.5% of the time. When listening to a real human voice, that accuracy dropped to just 54.1%. This 13.4 percentage point advantage for cloned speech remained consistent across all four background noise levels.
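For readers curious how typed responses can be turned into a percentage, here is a minimal sketch of word-level scoring. The article does not describe the study's exact scoring rules, so the approach here (counting target words that appear in the response, ignoring case, punctuation, and word order) and the example sentences are assumptions.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def words_correct(target, response):
    """Count target words found in the response; each response word matches once."""
    target_words = normalize(target)
    remaining = Counter(normalize(response))
    hits = 0
    for word in target_words:
        if remaining[word] > 0:
            remaining[word] -= 1
            hits += 1
    return hits, len(target_words)

def percent_correct(trials):
    """Pool hits over all (target, response) pairs and return a percentage."""
    hits = total = 0
    for target, response in trials:
        h, n = words_correct(target, response)
        hits += h
        total += n
    return 100.0 * hits / total

# Hypothetical trials, not sentences from the study.
trials = [
    ("The cat sat on the mat", "the cat sat on a mat"),
    ("She bought fresh bread", "she brought fresh bread"),
]
print(f"{percent_correct(trials):.1f}% of words correct")

# The gap reported in the article, in percentage points.
print(round(67.5 - 54.1, 1))
```

The last line reproduces the reported difference as percentage points, which is a subtraction of the two accuracy figures rather than a relative change.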

According to a press release accompanying the study, the pair repeated the experiment with different groups to see if the benefits persisted. They tested older volunteers to see if hearing loss made a difference. They tested American volunteers to determine whether the speakers’ British accents were playing a role. They also used a filter designed to mimic a cochlear implant. In each case, the voice clone won.

Returning to the main study group, participants also completed two subjective rating tasks for speech. They rated how clear or crisp each voice sounded on a scale of 1 to 7. They also rated the strength of each speaker’s regional accent on a similar 7-point scale. Listeners judged the synthetic clones to be much clearer than the original humans, and also rated the clones as having a slightly stronger regional accent.

The researchers also wanted to know whether people could tell the difference between real and artificial voices. They asked participants to listen to pairs of identical sentences and pick out the one spoken by a real person. Listeners correctly identified the real human 70.4% of the time. This suggests that while synthetic copies are very understandable, they contain slightly unnatural qualities that indicate they are computer-generated.

To find out why clones are easier to understand, the scientists analyzed 47 different acoustic properties of the audio files. They used computer software to measure characteristics such as pitch, speech rate, and harmonic richness. Pitch refers to how high or low a sound is perceived to be. Harmonics are overlapping frequencies that give the voice its unique texture and resonance.

They looked at specific vocal instability markers known as jitter and shimmer. Jitter measures the small involuntary changes in pitch that occur naturally when humans breathe or speak. Shimmer measures minute changes in loudness from moment to moment. Analysis revealed that synthetic speech lacks these natural micro-fluctuations, resulting in a smoother, more stable sound profile.
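Jitter and shimmer are commonly summarized as the average change between consecutive voice cycles, relative to the average cycle. The sketch below applies that common "local" definition to made-up numbers. The article does not name the software or the exact formulas the researchers used, so the function name, the formula choice, and all of the values are assumptions for illustration only.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value.

    Applied to cycle durations this gives a local jitter estimate;
    applied to cycle peak amplitudes it gives a local shimmer estimate.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical cycle durations in seconds (roughly a 100 Hz voice).
human_periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0098]
clone_periods = [0.0100, 0.0100, 0.0100, 0.0100, 0.0100]  # idealized, perfectly steady

# Hypothetical peak amplitudes per cycle (arbitrary units).
human_amplitudes = [0.80, 0.84, 0.78, 0.82, 0.79]
clone_amplitudes = [0.80, 0.80, 0.80, 0.80, 0.80]

print(f"human jitter:  {100 * local_perturbation(human_periods):.2f}%")
print(f"clone jitter:  {100 * local_perturbation(clone_periods):.2f}%")
print(f"human shimmer: {100 * local_perturbation(human_amplitudes):.2f}%")
print(f"clone shimmer: {100 * local_perturbation(clone_amplitudes):.2f}%")
```

Real synthetic voices would not measure exactly zero; the flat clone values simply exaggerate the smoother profile the analysis describes.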

Statistical models suggested that listeners rely on different cues for the two types of voices. For human speech, understanding depended on formants. Formants are concentrated bands of acoustic energy created by the physical shape of a person’s vocal tract. Listeners relied on these cues, shaped by the mouth and throat, to decipher human speech.

For cloned speech, listener comprehension depended primarily on overall pitch and smooth harmonic structure. Artificial intelligence appears to increase intelligibility by amplifying broad structural elements of sound. It appears to favor these smooth, stable sound waves over exactly copying the mouth movements of the original speaker. This acoustic stabilization may make it easier for the human brain to separate speech from background noise.

This study has several limitations that warrant future investigation. The experiment used highly structured, pre-written sentences that did not mimic natural everyday conversation. In real life, people tend to speak more casually, which could change how accurately artificial intelligence captures their speech patterns. The authors suggest that future studies should test conversational speech rather than text reading.

In the main experiment, the researchers tested speech against only one specific type of steady background noise. Real-world environments contain many different types of noise. Future research should assess how well these artificial copies perform when mixed with the sounds of a crowded restaurant or multiple competing speakers. Scientists could also intentionally manipulate the settings of artificial voices to see if adding or removing roughness changes listeners’ understanding.

After examining more than 100 acoustic measurements to understand the intelligibility gap, Adank plans to work with text-to-speech experts to adapt an open-source cloning system for future tests.

“I’m now trying to recreate (the effect) by researching how synthesizers work and how digital signal processing is used to generate audio, to understand a little bit about this,” Adank said.

The findings highlight a fascinating psychological phenomenon often referred to as the “uncanny valley.” Even though computer-generated audio is mathematically optimized for easy listening, listeners still noticed something slightly artificial about the audio. As technology improves, developers must strike a balance between making voices easy to hear and making them sound authentic and human. A perfectly smooth voice may be easy to understand, but it may lack the emotional warmth of a real human.

These discoveries hold great promise for medical and assistive technologies. People suffering from diseases that rob them of their ability to speak could use artificial intelligence to preserve their voices. The resulting communication device may actually make conversation easier in noisy environments than the original physical voice would. Hearing aids could also incorporate this technology to process and enhance the sounds received by the wearer.

The study, “Voice clones are easier to understand in noise than human originals: The intelligibility advantage of voice clones,” was authored by Patti Adank and Han Wang.
