Photo © Jesús Hormigo
We checked in with the team lead behind SIGGRAPH 2020 Real-Time Live!’s “Chroma Tools: AI-enhanced Storytelling for Motor Racing Sports” to learn how they prepared for their real-time presentation and the technology behind the Chroma Tools project.
SIGGRAPH: What is unique about your work on Chroma Tools in terms of real-time technology?
Jesús Hormigo (JH): Chroma Tools is the only live-television context-generation system that works by autonomously tagging the objects shown on screen with frame-accurate precision.
Tagging different moving objects on live TV is incredibly valuable because it adds context to what the viewer watches, which is key to engaging them. Live TV production is a very demanding industry of perfectionists where creators only get one shot to do things in real time.
Chroma Tools detects the objects on screen as they move, tags them, and produces the TV graphics that point to these objects, all in real time and autonomously. Effectively, it runs the full pipeline in well under 40 ms, which is the duration of a single frame in 25 fps television.
Anything produced more than one or two frames late (40-80 ms) would be too late to show and is therefore unusable in live TV.
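To make that budget concrete, here is a minimal sketch (not Chroma Tools' actual code) of a per-frame loop that measures how long detection takes and drops any result that arrives more than two frames late; `grab_frame`, `detect_objects`, and `publish` are hypothetical placeholders.

```python
import time

FRAME_BUDGET_S = 0.040               # one frame at 25 fps television
MAX_LATENCY_S = 2 * FRAME_BUDGET_S   # results older than ~2 frames are unusable on air

def run_live_loop(grab_frame, detect_objects, publish):
    """Hypothetical per-frame loop: detect, then publish only if still on time."""
    while True:
        frame, captured_at = grab_frame()      # frame plus its time.monotonic() capture timestamp
        detections = detect_objects(frame)     # e.g. a YOLO-style detector
        latency = time.monotonic() - captured_at
        if latency <= MAX_LATENCY_S:
            publish(detections)                # still within the live TV window
        else:
            # Too late for air: dropping the result is better than showing
            # graphics that no longer line up with the moving objects.
            pass
```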
SIGGRAPH: Walk us through the technology behind your demo of Chroma Tools during SIGGRAPH 2020.
JH: When we were invited to present, we built a small TV studio in our office to reproduce a larger production room. We used several components, including:
- BlackMagic ATEM TV Studio HD
- Zoom H4 audio recorder and audio mixer with some lapel microphones
- A 50” TV, two monitors, and some Elgato capture devices to send the live output feed directly to the show's production team in Texas
- SDI cables, converters, etc.
- A set of AI hardware composed of an NVIDIA Jetson and an RTX 2080 Ti-equipped computer that serves as the inference machine we use live on the show
We also ended up mixing our own video feed for Real-Time Live! We provided just one live Zoom channel to the production team so that we could handle the full pipeline on our own, showing everything we wanted to show in the six minutes provided.
Chroma Tools has three main components:
- The synthetic image generation and training system: Our custom-built system that generates our annotated datasets from pure CGI. This enables us to train our custom neural network very quickly. The neural network is built on a combination of YOLO and Keras.
- The Inferencer: A computer that takes the video feed captured from a BlackMagic SDI capture card and decomposes it into individual frames, which we then analyze and run our model on. This provides the GUI with a JSON feed containing all the positioning data and relevant information (a minimal sketch of this stage follows the list).
- The GUI: A graphical user interface built with Unity that hooks into the JSON feed and produces the fill and key signals that we hand over to the producer via SDI, so they can be mixed, in real time, with the live camera signal.
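As a rough illustration of the Inferencer stage described above, the sketch below runs a generic detector on a captured frame and serializes the results as a JSON message for a downstream GUI. The field names, the `detector` callable, and the output schema are assumptions for illustration, not the project's actual format.

```python
import json
import time

def frame_to_json(detector, frame, frame_index):
    """Run a detector on one frame and serialize the results for the GUI.

    `detector(frame)` is assumed to return (label, confidence, x, y, w, h)
    tuples in pixel coordinates; the JSON layout here is illustrative only.
    """
    detections = []
    for label, confidence, x, y, w, h in detector(frame):
        detections.append({
            "label": label,            # e.g. a driver or car identifier
            "confidence": round(float(confidence), 3),
            "box": {"x": int(x), "y": int(y), "w": int(w), "h": int(h)},
        })
    return json.dumps({
        "frame": frame_index,
        "timestamp": time.time(),
        "objects": detections,
    })
```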
SIGGRAPH: Where did the inspiration come from for Chroma Tools?
JH: It was about four years ago, when I was heading a project to recreate a motorsport event in real time in a CGI world that allowed users to interact with it as it was happening. We collaborated closely with the host TV broadcaster in charge of the world feed, providing them with unique CGI footage and perspectives, among other assets, that allowed viewers to see replays from different points of view using our CGI recreation, which was then shown on live TV.
From this collaboration came the idea of combining CGI produced from virtual-world camera perspectives with real camera positions and footage, to superimpose driver names and other details pointing at the drivers and giving context. In broad terms, this required high-tech pan-tilt-zoom camera trackers that would transmit these details in real time to a computer, which would extrapolate the information into the virtual world, generate a perspective matching that of the real camera, and then produce an alpha image to mix with the TV signal. The cost of producing such a prototype ran into the hundreds of thousands of dollars and required levels of complexity that introduced many uncertainties.
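For context on that tracker-based approach, the core operation is projecting a known 3D position (for example, a car on the track) into the real camera's image using the measured pan and tilt, so a label can be anchored at that point. The sketch below shows that projection with a simple pinhole model; all names and conventions are chosen for illustration and omit the lens and zoom calibration a real broadcast tracker would need.

```python
import numpy as np

def project_point(world_point, cam_pos, pan_deg, tilt_deg, focal_px, image_size):
    """Project a 3D world point into pixel coordinates for a pan/tilt camera.

    Illustrative pinhole model only: real broadcast trackers also account for
    lens distortion, sensor offsets, and zoom-dependent calibration.
    """
    pan, tilt = np.radians(pan_deg), np.radians(tilt_deg)
    # Rotation: pan about the vertical axis, then tilt about the camera's x axis.
    r_pan = np.array([[np.cos(pan), 0, np.sin(pan)],
                      [0, 1, 0],
                      [-np.sin(pan), 0, np.cos(pan)]])
    r_tilt = np.array([[1, 0, 0],
                       [0, np.cos(tilt), -np.sin(tilt)],
                       [0, np.sin(tilt), np.cos(tilt)]])
    p_cam = r_tilt @ r_pan @ (np.asarray(world_point) - np.asarray(cam_pos))
    if p_cam[2] <= 0:
        return None                      # behind the camera, nothing to draw
    u = focal_px * p_cam[0] / p_cam[2] + image_size[0] / 2
    v = focal_px * p_cam[1] / p_cam[2] + image_size[1] / 2
    return u, v                          # where to anchor the driver label
```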
It was around then (2016/2017) that significant progress started to surface in computer vision, which made me rethink the whole approach as a fully unsupervised, AI-based system that would produce the same results without over-engineering it. It wasn't any easier to develop, but we came up with a prototype that our TV partners loved, and they encouraged us to iterate quickly and deliver them a unique product.
SIGGRAPH: What technology (old or new) has had the greatest influence on your work?
JH: YOLO v2 had the biggest influence on this product. Seeing the work and website of Joseph Redmon, where he published a video of real-time inference on a James Bond movie in full HD, with DragonForce as the soundtrack (necessary for encouragement), definitely shifted the focus of my initial product-definition efforts in that direction.
Additionally, our neural network is trained only with synthetically generated imagery from the CGI world I referred to before. This was unique at the time of production, and I got the inspiration specifically from a SIGGRAPH talk I attended in 2016. This initial idea helped us produce high-quality results in the shortest possible time frame, without depending on humans to generate bounding boxes and tag the datasets.
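The reason synthetic CGI data removes the human labelling step is that the renderer already knows exactly where every object is, so bounding boxes can be derived directly from the scene. The sketch below illustrates the idea by turning an object's projected corner points into a YOLO-format annotation line; the function and its conventions are hypothetical, not the team's actual tooling.

```python
def yolo_label_from_corners(corners_px, image_w, image_h, class_id):
    """Derive a YOLO-format annotation line from an object's projected corners.

    `corners_px` is a list of (x, y) pixel positions of the object's 3D bounding
    box corners as projected by the CGI renderer; since the renderer knows these
    exactly, no human has to draw the box. Format and names are illustrative.
    """
    xs = [x for x, _ in corners_px]
    ys = [y for _, y in corners_px]
    x_min, x_max = max(min(xs), 0), min(max(xs), image_w)
    y_min, y_max = max(min(ys), 0), min(max(ys), image_h)
    # YOLO expects class id, box center x/y, and width/height, all normalized to [0, 1].
    cx = (x_min + x_max) / 2 / image_w
    cy = (y_min + y_max) / 2 / image_h
    w = (x_max - x_min) / image_w
    h = (y_max - y_min) / image_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```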
SIGGRAPH: What is your all-time favorite SIGGRAPH memory?
JH: It was SIGGRAPH 2017 in Los Angeles, where one of the booths showed “Meet Mike”, a live rendering demo that starred fxguide’s Mike Seymour and was created with Unreal Engine, 3Lateral, and Cubic Motion. I was blown away by the process they followed to “scan” Mike at the highest level possible at that time, and by the computing power needed to replicate him in full CGI, coming very close to overcoming the uncanny valley.
Little did I know that AI was becoming powerful enough to create deepfakes with a fraction of the effort involved in creating “Meet Mike”, and that the technology shown at SIGGRAPH would likely come to rely on AI to produce such results to perfection. Overcoming the uncanny valley is extraordinarily difficult, and it is very likely to be solved consistently with AI before it is with a 3D rig.
SIGGRAPH: What advice do you have for someone looking to submit a Real-Time Live! demo for a future SIGGRAPH conference?
JH: SIGGRAPH Real-Time Live! is one of the coolest places to showcase your work. It’s where research, technology, VR, art, and other disciplines meet. As long as you are achieving something technologically challenging that has rarely, if ever, been done before, and you think you have something worth showing the world, don’t be afraid to submit your work.
It definitely helps if you’re already familiar with SIGGRAPH. Some submitters may have attended previous editions or watched previous years’ sessions on video, and some folks (like me) have followed SIGGRAPH since it was featured on “Metropolis”, a TV show that has aired here in Spain since I was a kid.
If you think that you have done something awesome and previously impossible that needs to be shown to the world, submit it!
Press play below to watch Jesús’s 2020 demo, which appears around minute four!
SIGGRAPH 2021 is currently accepting submissions for Real-Time Live! Learn more about submitting a demo to Real-Time Live! via the SIGGRAPH 2021 website, and be sure to submit your work by 22:00 UTC/GMT on Tuesday, 6 April.
Jesús Hormigo is a computer scientist, technologist, and inventor committed to the development of information and communication technology (ICT) hardware and software that delivers meaningful and sustainable impact in all facets of daily life, ranging from healthcare to sports entertainment. He has led the technical development of the ICT elements of NeuroPro’s next generation of computing infrastructure for managing healthcare data in the cloud. He is also CTO and co-founder of Virtually Live, a new age in immersive, interactive global media, where he has led development and deployment. He currently leads a team of talented engineers creating unique AI- and computer vision-based cloud solutions for the inspection industry.