SAN FRANCISCO, CALIFORNIA — Researchers at the University of Washington (UW) have developed new algorithms that turn audio clips into realistic, lip-synced video of the person speaking those words.
Visual lip-syncing is considered a thorny challenge in the field of computer vision. The system converts audio files of an individual's speech into mouth shapes, which are then grafted onto and blended with the head of that person in another existing video.
As detailed in a paper to be presented at SIGGRAPH 2017, a conference and exhibition on computer graphics and interactive techniques scheduled for August 2 in Los Angeles, the research team generated video of former U.S. President Barack Obama talking about terrorism, fatherhood, job creation and other topics, using audio clips of those speeches and existing weekly video addresses that were originally on a different topic.
“These type of results have never been shown before,” said Ira Kemelmacher-Shlizerman, an assistant professor at the UW’s Paul G. Allen School of Computer Science & Engineering. “Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio.”
The team chose Obama because the machine learning technique needs available video of the person to learn from, and there were hours of presidential videos in the public domain. “In the future, video chat tools like Skype or Messenger will enable anyone to collect videos that could be used to train computer models,” Kemelmacher-Shlizerman was quoted as saying in a news release.
The machine learning tool makes significant progress in overcoming what is known as the “uncanny valley” problem, which has dogged efforts to create realistic video from audio. “People are particularly sensitive to any areas of your mouth that don’t look realistic,” explained lead author Supasorn Suwajanakorn, a recent doctoral graduate in the Allen School. “If you don’t render teeth right or the chin moves at the wrong time, people can spot it right away and it’s going to look fake. So you have to render the mouth region perfectly to get beyond the uncanny valley.”
Rather than synthesizing the final video directly from audio, the researchers tackled the problem in two steps. The first involved training a neural network to watch videos of an individual and translate different audio sounds into basic mouth shapes. By combining previous research with a new mouth synthesis technique, they were then able to realistically superimpose and blend those mouth shapes and textures on an existing reference video of that person.
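The two-step pipeline described above can be sketched in a highly simplified form. Everything below is a hypothetical illustration: the dimensions, the single linear mapping standing in for the trained neural network, and the per-pixel alpha blend standing in for the researchers' mouth synthesis and compositing are all assumptions for clarity, not the authors' implementation.

```python
import random

random.seed(0)

# --- Step 1 (illustrative sketch): map per-frame audio features to mouth
# keypoints. A single linear mapping stands in for the recurrent network the
# researchers trained; the dimensions below are invented for illustration.
AUDIO_DIM = 13      # e.g. 13 MFCC-style coefficients per audio frame (assumed)
MOUTH_COORDS = 36   # 18 hypothetical (x, y) lip landmarks, flattened

WEIGHTS = [[random.gauss(0, 0.1) for _ in range(MOUTH_COORDS)]
           for _ in range(AUDIO_DIM)]

def audio_to_mouth_shape(audio_features):
    """Predict a flattened list of (x, y) mouth landmarks from one audio frame."""
    return [sum(a * w for a, w in zip(audio_features, column))
            for column in zip(*WEIGHTS)]

# --- Step 2 (illustrative sketch): blend a synthesized mouth texture into a
# reference frame with a soft alpha mask so the edges transition smoothly.
def blend_pixel(reference, mouth, alpha):
    """Per-pixel alpha composite: alpha=1.0 keeps the mouth, alpha=0.0 the frame."""
    return alpha * mouth + (1.0 - alpha) * reference

landmarks = audio_to_mouth_shape([0.5] * AUDIO_DIM)
```

The separation matters for the blending step: because viewers are most sensitive to artifacts around the mouth, compositing a well-rendered mouth region onto real footage of the rest of the head sidesteps having to synthesize an entire photorealistic face.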
Another key insight was to allow a small time shift to enable the neural network to anticipate what the speaker is going to say next.
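One way such a shift could be implemented, as a toy illustration only (the function, the data, and the shift size are assumptions, not taken from the paper), is to pair each video frame with audio features slightly in the future, so a model predicting mouth shapes effectively sees the upcoming sounds:

```python
def shift_pairs(audio_frames, video_frames, shift=2):
    """Pair video frame t with audio frame t + shift, dropping the unmatched tail.

    The model trained on these pairs sees audio from slightly ahead of each
    video frame, letting the mouth begin moving before a sound fully lands.
    """
    return [
        (audio_frames[t + shift], video_frames[t])
        for t in range(len(video_frames) - shift)
    ]

audio = ["a0", "a1", "a2", "a3", "a4"]
video = ["v0", "v1", "v2", "v3", "v4"]
pairs = shift_pairs(audio, video, shift=2)
# pairs -> [("a2", "v0"), ("a3", "v1"), ("a4", "v2")]
```

In practice a real system would shift windows of audio features rather than single frames, but the alignment idea is the same.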
By reversing the process, feeding video into the network instead of just audio, the researchers could potentially develop algorithms that detect whether a video is real or fabricated.