AI Avatars, Virtual Assistants, and Deepfakes: A Real-Time Look

Pinscreen © 2020

Did you hear that Will Smith made an appearance at SIGGRAPH 2020? Well, actually, it was just a real-time, deep learning-based facial synthesis technology for photoreal AI avatars posing as Will Smith. But, potato/po-tah-to, right?

We caught up with the team lead behind this technology, Hao Li, who presented “AI-synthesized Avatars: From Real-time Deepfakes to Photoreal AI Virtual Assistant” during SIGGRAPH 2020’s livestreamed Real-Time Live! event. Here, Li shares the process of developing AI avatars and virtual assistants, emerging trends in the deepfakes space, and his advice for those planning to submit to a future SIGGRAPH conference.

SIGGRAPH: Real-Time Live! at SIGGRAPH 2020 looked a bit different than in years past. Tell us about your experience presenting during the livestream.

Hao Li (HL): In previous years, one of the most exciting things about Real-Time Live! was presenting a live demo in front of thousands of people, showcasing your work on multiple big screens, and, of course, having the over-the-shoulder cameraman bringing the real-time experience as close to the audience as possible. There was always an element of suspense, a cheering crowd, and the uncertainty that something could go wrong. In 2020, when the event was held virtually for the first time, we had to rethink how to perform our demos and make them as engaging as possible. One big difference is that we couldn’t see or feel the audience, and our presentation needed to feel like it was happening live rather than like a pre-recorded video.

Our demo was ideal for this virtual format, in the sense that we presented a cutting-edge neural rendering technology, paGAN, applied to two video-streaming related applications:

  • A real-time deepfake technology for swapping faces in video chats; and,
  • A photoreal AI chatbot based on livestreaming.

In addition to showcasing our neural rendering technology, we were demonstrating for the first time an end-to-end NLP system, based on a deep generative language model, for fully unscripted, freestyle conversation with a human-like virtual assistant. I had no idea what the avatar would say prior to the demonstration, which meant that I had to fully improvise the conversation. From a presentation standpoint, we focused on using multiple cameras pointing at our screens and ourselves rather than pure screen sharing in order to enhance the live experience during the show.

SIGGRAPH: Not only does this technology allow a user to transform into the face of a public figure, but it also enables the creation of a virtual companion. Walk us through the process of developing both novel applications of “AI-synthesized Avatars: From Real-time Deepfakes to Photoreal AI Virtual Assistant.”

HL: The key technology of our Real-Time Live! demo is our neural face-rendering technology called paGAN (photoreal avatar GAN). We first published this technology as a Technical Paper at SIGGRAPH Asia 2018, where we found a way to generate arbitrary photorealistic facial expressions of a person in 3D given a single input photograph using a generative adversarial network (GAN). This technology was developed to digitize and animate photorealistic 3D avatars from a single photo. The idea is to train a deep neural network that can be conditioned on a rigged CG face model, which allows us to directly generate a lifelike face without the need for a traditional CG rendering pipeline. This process is called neural face rendering.

A year later, we further extended our paGAN technology to generate personalized expressions with training data obtained from a few minutes of video recordings instead of a single input photo. This approach is similar to “deepfakes” for face swapping, except that our system runs in real time. Also, we can properly integrate it into state-of-the-art game engines such as Unreal and Unity, as well as disentangle 3D information, textures, and lighting. This allows us to manipulate 3D faces that are generated by neural networks in the same way as traditional CG face assets. Although our ultimate goal is to build photorealistic AI virtual assistants, we could easily modify our technology to demonstrate live face-swapping capabilities. This led to the implementation of our real-time deepfake demo.

Now, back to the photoreal AI virtual companion demo. I actually digitized my wife, Frances. We introduced a complete end-to-end system with full conversational AI capabilities. Not only can the virtual assistant understand our words, speak, and generate lifelike lip animations and gestures, it also has the ability to respond naturally without a pre-scripted conversation. In particular, we developed a deep generative language model that has the ability to generate sentences that make sense based on a history of conversation with the avatar.

SIGGRAPH: How many people were involved? How long did it take?

HL: We had 12 people working on this project, so a large part of our company was involved in its development. From its initial prototype (based on single-input photos) to our latest one (video input), as well as its integration into game engines, we spent about two years in development. The effort includes heavy R&D in GAN design, neural network and low-level GPU optimization, and extensive data collection.

SIGGRAPH: What inspired this project?

HL: Our goal at Pinscreen is to democratize the digitization of lifelike humans and to build fully autonomous virtual assistants as the next-generation human-machine interface. We want to go beyond traditional chatbots and voice assistants and enable natural, face-to-face interactions between a person and an AI agent, pretty much like the virtual companion Joi in “Blade Runner 2049” or some episodes of “Black Mirror.” While other companies are advancing immersive AR/VR headsets and developing hardware systems for holographic displays, we are building the software solution for digital humans that will become essential for these systems. As we see a future where avatars can be fully personalized, one of our key features is the ability to make the digitization of compelling virtual avatars highly accessible and automated.

SIGGRAPH: In a recent SIGGRAPH Spotlight Podcast episode, we discussed the history and current trends of deepfakes. What trends do you notice are emerging in the field?

HL: As our neural rendering technologies for generating 3D avatars and for synthesizing realistic facial expressions are closely related to deepfakes and AI-media synthesis in general, we have an obligation to implement such technology responsibly from an ethical and privacy standpoint.

These techniques can be used to manipulate humans in media content to a point where it becomes indistinguishable from reality. As a result, there are risks of potential misuse for purposes such as harassment or the targeted spread of disinformation. In addition to our efforts with DARPA in developing new technologies for analyzing next-generation media manipulations, Pinscreen is actively engaged in raising awareness of these dangers and educating the public about emerging capabilities of AI-synthesized media. In particular, we anticipate seeing higher-resolution, real-time deepfakes; new video manipulation capabilities, such as changing appearances; and content generation methods beyond the domains of faces and humans. For instance, the groundbreaking work, DALL·E, from OpenAI demonstrates how arbitrary images can be generated from textual descriptions using powerful transformer models such as GPT-3. I believe that, in the coming years, we will be able to see highly convincing video content that is generated or manipulated using intuitive high-level descriptions. Imagine if you had a YouTube that could generate the content you want instead of retrieving existing content.

SIGGRAPH: How do you envision your project being used in the future? What’s next for it?

HL: PaGAN was the core technology showcased during our Real-Time Live! demo, for which we recently won an Epic MegaGrant. It has the ability to revolutionize CG-rendered humans in real-time environments and virtual production. Virtual human faces are one of the most difficult things to create in CG, and even in high-end VFX the computations for rendering are still excessive. Our neural rendering approach has two important advantages over traditional asset creation and rendering pipelines. Instead of crafting complex 3D models and rigs, and without relying on expensive 3D scanning equipment, all we need is a few minutes of video footage to train a deep model that can synthesize highly accurate photorealistic 3D faces. Our integration with game engines such as Unreal and Unity further allows us to render and animate faces in real time within arbitrary virtual environments.

We have developed paGAN to make autonomous virtual agents, game characters, and fashion influencers indistinguishable from real people. At Pinscreen, we are fully focused on this moonshot project and working with major e-commerce platforms, telecommunication companies, game studios, and fashion brands to commercialize this technology. As the ongoing pandemic is accelerating the virtualization of businesses and the need for immersive communication, we believe that AI-synthesized humans will become a core element of our society within the next decade. We may interact with them more than real humans. I think that virtual humans will not only serve as more natural human-machine interfaces, but every person may someday have their own personal set of virtual teachers, doctors, and even companions.

SIGGRAPH: What advice would you share with others considering submitting their research to a future SIGGRAPH conference?

HL: Whether it’s Technical Papers research, a Real-Time Live! demo, or an Emerging Technologies presentation, SIGGRAPH and SIGGRAPH Asia remain the most prestigious conferences to publish and showcase the latest technological advancements in computer graphics. For folks who have never submitted their work to SIGGRAPH, check out past submissions to get an idea of the expected quality, the level of innovation, and the technical depth. Don’t get discouraged if it looks overwhelming or impossible — the goal is to make the impossible possible. As long as you have drive and passion, your hard work will pay off and you can get your work shown to the world. Reach out to folks who have published at SIGGRAPH before, be inspired, and build a network. SIGGRAPH is the best place to showcase research, art, and technology for anything related to visual computing, interactive techniques, and creativity!

Ready to showcase your creativity in research, art, technology, and more? Many SIGGRAPH 2021 programs are now accepting submissions. Submit your latest work now.

Hao Li is CEO and co-founder of Pinscreen, a startup that builds cutting-edge, AI-driven, virtual avatar technologies. He is also a Distinguished Fellow of the Computer Vision Group at UC Berkeley. Before that, he was an associate professor of computer science at the University of Southern California, as well as the director of the Vision and Graphics Lab at the USC Institute for Creative Technologies. Hao’s work in computer graphics and computer vision focuses on digitizing humans and capturing their performances for immersive communication, telepresence in virtual worlds, and entertainment. His research involves the development of novel deep learning, data-driven, and geometry-processing algorithms. He is known for his seminal work in avatar creation, facial animation, hair digitization, dynamic shape processing, and his recent efforts in preventing the spread of malicious deepfakes. He was previously a visiting professor at Weta Digital, a research lead at Industrial Light & Magic / Lucasfilm, and a postdoctoral fellow at Columbia and Princeton Universities. He was named one of the top 35 innovators under 35 by MIT Technology Review in 2013 and was awarded the Google Faculty Award, the Okawa Foundation Research Grant, and the Andrew and Erna Viterbi Early Career Chair. He won the Office of Naval Research (ONR) Young Investigator Award in 2018 and was named to the DARPA ISAT Study Group in 2019. In 2020, he won the ACM SIGGRAPH Real-Time Live! “Best in Show” award. Hao obtained his Ph.D. at ETH Zurich and his M.Sc. at the University of Karlsruhe (TH).
