The battle for supremacy in generative AI is now centered on speech and voice technology, with numerous companies striving to build models that can comprehend and replicate natural voice patterns. While innovations like ChatGPT Voice have the potential to revolutionize storytelling, Microsoft claims it has reached a milestone in speech generation: human parity.
Microsoft’s researchers assert that their VALL-E 2 text-to-speech (TTS) generator is so advanced that releasing it publicly would be irresponsible and potentially dangerous. According to a research paper highlighted by Live Science, the generator requires only a few seconds of audio to mimic a human voice convincingly.
Specifically, Microsoft’s scientists report that speech produced by VALL-E 2 matches or even surpasses the quality of human recordings when evaluated against samples from the LibriSpeech and VCTK speech benchmarks.
“VALL-E 2 represents a significant advancement in neural codec language models, achieving human parity in zero-shot text-to-speech synthesis (TTS) for the first time,” the researchers explained. “Additionally, VALL-E 2 consistently generates high-quality speech, even for sentences that are complex or contain repetitive phrases.”
Though the researchers are not releasing the model publicly, they have provided several audio samples in a blog post. Each sample pairs a speaker prompt from LibriSpeech with a newly generated, complex sentence rendered by both VALL-E and VALL-E 2. While the first-generation model sounds somewhat stiff, VALL-E 2 excels at replicating the speaker’s resonance and articulation.
Too Dangerous to Release?
Despite recognizing the potential benefits of an AI speech generator of this caliber, such as aiding people with conditions that impair speech, like aphasia or amyotrophic lateral sclerosis (ALS), Microsoft is keeping VALL-E 2 for research purposes only.
“We have no plans to incorporate VALL-E 2 into a product or make it publicly accessible,” the scientists stated. In their ethical considerations, the researchers acknowledged that the model could be misused, for example to spoof voice-identification systems or impersonate specific speakers.
Microsoft is not alone in this cautious approach. OpenAI, the creator of ChatGPT, has also restricted some of its voice technologies and developed a deepfake detector to help users identify AI-generated images. The future availability of VALL-E 2 or its successors remains uncertain. As the AI race heats up, companies and scientists will face increasing pressure to innovate and push boundaries.