Wednesday, January 11, 2023

AI-driven voice simulation - do we really want to go there?

Photo by Aditya Saxena on Unsplash
The old saying that curiosity killed the cat seems to apply equally well to us. Even when we see the dangerous potential of a new technology, we just keep on developing it. We continued building nuclear weapons even after we had seen the devastation they caused, and our curiosity about artificial intelligence may well lead us to new disasters. As I wrote in the last post, we can't resist opening Pandora's box.

In the wake of the panic caused by ChatGPT (an excellent overview of what we know so far is in a post by Mark Brown, Ten facts about ChatGPT), I found an article in Ars Technica: Microsoft's new AI can simulate anyone's voice with 3 seconds of audio. Microsoft has seemingly developed an AI text-to-speech model called VALL-E that can simulate a voice based on a short recording; presumably, the more input it gets, the better it can imitate the voice. You can then have it read any text you wish in that person's voice, which makes it possible to create fake statements. Even if there are certainly beneficial uses for this, the potential for and consequences of misuse are terrifying.
Its creators speculate that VALL-E could be used for high-quality text-to-speech applications; for speech editing, where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't); and for audio content creation when combined with other generative AI models like GPT-3.
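To make the "3 seconds of audio" idea concrete, here is a minimal toy sketch of the neural-codec-language-model approach that systems like VALL-E are reported to use. Everything in it is an illustrative assumption, not Microsoft's actual code: the function names, token counts and sampling are hypothetical stand-ins that only show the shape of the pipeline.

```python
# Toy sketch of the idea behind prompt-based voice cloning.
# All names and numbers are illustrative assumptions, not VALL-E's code.
import numpy as np

rng = np.random.default_rng(0)

def encode_audio_to_tokens(seconds: float, tokens_per_second: int = 75) -> np.ndarray:
    # Stand-in for a neural audio codec: 3 seconds of the target voice
    # becomes a short sequence of discrete acoustic tokens (the "prompt").
    return rng.integers(0, 1024, size=int(seconds * tokens_per_second))

def text_to_phonemes(text: str) -> list[str]:
    # Stand-in for a grapheme-to-phoneme front end.
    return list(text.lower())

def generate_acoustic_tokens(phonemes, prompt_tokens, tokens_per_phoneme: int = 8) -> np.ndarray:
    # The core trick: an autoregressive model predicts acoustic tokens for
    # the new text, conditioned on the speaker's acoustic prompt, so the
    # output keeps that speaker's voice. Here: random stand-in sampling.
    return rng.integers(0, 1024, size=len(phonemes) * tokens_per_phoneme)

prompt = encode_audio_to_tokens(3.0)               # 3 s of the target voice
phonemes = text_to_phonemes("Any text you wish")   # the fake statement
output = generate_acoustic_tokens(phonemes, prompt)
print(f"prompt tokens: {len(prompt)}, generated tokens: {len(output)}")
# A codec decoder would then turn the generated tokens back into audio.
```

The unsettling part is how little the prompt needs to contain: once the voice is reduced to a short token sequence, the model can condition any text on it.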
At first the fakes will be detectable, but the whole point of AI is that it improves. Combine this with tools for text, photo and video generation, and the potential for misuse by governments, corporations, political parties, extremists and conspiracy theorists is enormous. Just because we can develop this technology doesn't mean that we should, to paraphrase the famous line from Jurassic Park. Do we really want to open this box? Can't we just step back?

Microsoft tries to sound reassuring in the article, but I don't think we are capable of following any principles, no matter how well intentioned:
"Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."'

So what happens when AI becomes ever smarter and we can no longer trust anything we read, hear or see? In case you were wondering, I actually wrote this post myself.
