Event recap – Voice Cloning for the Media and entertainment industry
February 17, 2021
We are entering an era where you can’t assume that the video or audio that you watch or listen to is real.
If you have watched a deepfake video like the one above, in which a well-known personality appears to speak, then you have already encountered voice cloning technology. Voice cloning has become a useful technology in the commercial advertising, media, and movie industries. It has also become a threat to democracy and security in equal measure when used maliciously. While modern voice cloning technology is decades old, the history of voice cloning can be traced back to the 1770s. In 1779, German-born professor Christian Kratzenstein built acoustic resonators that mimicked the vocal tract by means of vibrating reeds. With today’s artificial intelligence technology, voice cloning can be used to generate synthetic speech that closely resembles a targeted human voice. Given a few minutes of recorded speech, AI-enabled voice cloning software can read any text in the target voice. Synthetic speech of this kind already powers everyday tools: Google Maps reads directions aloud to guide people to their destination, and Google Translate can read out a translation from one language to another.
But how does voice cloning technology work? What are some of its applications and industrial use cases? These are some of the questions we aimed to answer during our last event on 13th February 2020 at Baraza Media Lab. In the fifth edition of the Ai Kenya Show and Tell event series, the Ai Kenya Community hosted Grant Reaber, Head of Research at Respeecher, a Ukraine-based startup that provides voice cloning services.
There are two main types of speech synthesis: Text-to-Speech (TTS) and Speech-to-Speech (STS), also known as Voice Conversion. In TTS, the computer reads out text in a chosen voice, which requires it to understand both what is being read and how to read it. For example, if we want it to read a piece of text in Morgan Freeman’s voice, we need to feed it the text and a sample of Morgan Freeman’s voice so that it can learn how he sounds. A deep learning model is then trained on the two data sets to produce audio of the text being read in Morgan Freeman’s voice. Voice conversion, or STS, is a newer speech synthesis technology. In theory, an AI-enabled system accepts speech as input, extracts information such as the text, style, and emotion, has the text read by a new voice, and then adds the style and emotion to that new voice to produce natural-sounding, usable speech.
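The difference between the two approaches can be sketched in a few lines of Python. This is only a toy illustration of the data flow described above, not Respeecher's implementation or any real library's API: the `Utterance` class and the `text_to_speech` and `speech_to_speech` functions are hypothetical names, and audio is stood in for by plain labels rather than waveforms.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical stand-in for a piece of speech; real systems hold audio."""
    speaker: str
    text: str
    style: str
    emotion: str


def text_to_speech(text: str, target_voice: str) -> Utterance:
    # TTS receives only text and a target voice, so the model has to decide
    # how to read it. Here that choice is a fixed, assumed default.
    return Utterance(speaker=target_voice, text=text,
                     style="neutral", emotion="neutral")


def speech_to_speech(source: Utterance, target_voice: str) -> Utterance:
    # STS starts from speech: extract the text, style, and emotion ...
    text, style, emotion = source.text, source.style, source.emotion
    # ... then re-render them in the new voice, keeping the performance.
    return Utterance(speaker=target_voice, text=text,
                     style=style, emotion=emotion)


if __name__ == "__main__":
    performance = Utterance("voice actor", "Welcome back.", "whisper", "warm")
    print(text_to_speech("Welcome back.", "target voice"))
    print(speech_to_speech(performance, "target voice"))
```

In this sketch the TTS result falls back to a default delivery, while the STS result keeps the style and emotion of the source performance and changes only the speaker.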
In his presentation, Grant introduced the concept of voice cloning and discussed use cases for voice cloning and conversion, such as making an older movie actor sound younger. Respeecher’s technology has been applied in various Hollywood projects, where it is used to create voices that are not easily available, such as those of actors who have died. One of the most recent applications saw the technology being used to recreate the voice of Vince Lombardi, a famous American football coach. Grant also touched on the challenges of voice conversion across the industry, pointing out that a lack of sufficient training data is one of the key factors limiting output quality. While the technology is improving fast, incorporating accents, emotion, and multiple languages is still a challenge that also affects the quality of the results. However, the technology has received tremendous support through research, and he is hopeful that these challenges will be overcome in the near future. Normally, at least one hour of recorded speech for each of the source and destination voices is required to build an effective clone, Grant noted. He shared a few demos of the technology, one of which you can view in the video below.
Towards the end of his presentation, Grant highlighted some of the malicious uses of this technology. It can be used in phishing scams where the victim is made to believe that they are talking to a trusted person. One example is a UK-based CEO who was tricked into transferring $240,000 during a phone call with a voice that resembled that of his boss, who was based in Germany. Other malicious uses include fake news, involuntary porn, and the use of an actor’s likeness without their permission. Speaking for Respeecher, he noted that part of their effort to ensure ethical practice is requiring the consent of the voice owner before use. Moreover, they prohibit deceptive use of the technology in ways that violate other people’s rights. During the question-and-answer session, the audience raised various topics such as the viability of the technology for the call center industry, the open-source components being used, and pricing.
Voice cloning and conversion technologies are disrupting various industries, including film and education. In Kenya and Africa at large, they can be used to facilitate interactive teaching and dynamic storytelling, among other uses such as advertising and movie production. They can help transform education in Africa by reducing the operational cost of professionally recorded class sessions. The technology is in its early stages, so accessibility can be a problem for many Africans due to the relatively high cost. However, Grant is hopeful that this will be addressed with time as computational resources become more affordable. With high internet penetration in Kenya and other African countries, the same technology can also be used to spread fake news, which can undermine security or even democracy. Moreover, it can worsen the existing problem of cyberbullying in Kenya through manipulated audio of people saying things they did not say.
We are thankful to the community members who made the event engaging and productive. We would also like to thank Baraza Media Lab for hosting us.
Follow us on our meetup page for updates on future events. Keep building!
Written by Eugene Oduma and edited by Alfred Ongere