New AI tech to bring human-like understanding of our 3D world

Read how AIML’s leading research is bridging the 3D / 2D domain gap.

Humans move effortlessly around our rich and detailed three-dimensional world without much second thought. But, like most mammals, our eyes actually sense the world two-dimensionally — it’s our brains that take those 2D images and interpret them into a 3D understanding of reality.

Even without the stereo visual input from our two eyes, we’re experts at looking at a flat 2D image and instantly ‘lifting’ it back to its 3D origins; we do it every time we watch TV or look at photos on our phone.

But computers and robots have a much harder time doing this ‘lifting’. It’s a problem that AI researchers are hard at work to fix.

Making computers able to understand 3D space from only 2D input is considered such an important capability—with diverse applications ranging from mobile phones to driverless vehicles—that Professor Simon Lucey, director of the Australian Institute for Machine Learning (AIML), has received a $435,000 grant from the Australian Research Council to build a geometric reasoning system that can exhibit human-like performance.

Contrary to popular belief, humans can't actually sense the world in 3D. Our brains interpret the stereoscopic 2D input from our eyes into a 3D understanding of the world. Photo: iStock / Mark Kuiken.

“When cameras try to sense the world, like humans do, what’s coming into the robot is still just 2D. It’s missing that component that we have in our brains that can lift it out to 3D, that’s what we’re trying to give it,” Lucey says.
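To see why the 2D input alone is not enough, consider a minimal sketch of an idealised pinhole camera. The function name `project` and the focal length of 1.0 are illustrative assumptions, not part of the AIML research. The sketch shows that two different 3D points on the same viewing ray land on exactly the same 2D pixel, so depth cannot be read off a single image.

```python
def project(point, focal_length=1.0):
    """Idealised pinhole projection of a 3D point (x, y, z) to 2D (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)  # same viewing ray as `near`, but twice as far away

print(project(near))  # (0.25, 0.5)
print(project(far))   # (0.25, 0.5)
```

Both points produce the same image coordinates, which is the ambiguity that any ‘lifting’ method has to resolve using extra views or prior knowledge.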

If 3D understanding from normal cameras is so difficult, why not instead equip computer vision systems with proper 3D sensors like LiDAR, a sensing method that uses lasers? It’s not that easy. Building and improving hardware technology is slow and expensive, and often out of reach for the many smaller tech startups seeking to bring AI research to market.

“You could take ten years and billions of dollars and it would still be very, very risky to generate…but when you’re doing something in software, you can deploy it straight away, and you can continually update and make it better,” Lucey explains.

AIML researchers are among the world’s leaders in computer vision, a field of AI that enables computers to obtain meaningful information from digital images and video footage.

Building computer vision systems that can understand the real world typically relies on something called supervised machine learning, which requires vast troves of labeled training data. That means millions of images, each labeled ‘dog’, ‘strawberry’ or ‘President Obama’; or thousands of hours of driving footage where coloured boxes are drawn to mark each pedestrian, stop sign and traffic light. If you’ve ever had a website ask you to ‘click all the squares with bicycles’ to prove you’re really human, you’ve helped train a supervised machine learning model.

AI researchers are now working out how to apply these vast collections of labeled 2D training data so that AI systems can develop a 3D geometric understanding similar to that of humans.

“How can I take 2D supervision that humans can easily provide,” asks Professor Lucey, “and, using some elegant math, allow it to act as 3D supervision for modern AI systems?”
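One common way to let 2D labels supervise a 3D estimate is a reprojection error: project the 3D guess back into the image and measure how far it lands from the human-provided 2D annotation. The sketch below is a generic illustration under an assumed idealised pinhole camera, not the specific formulation used by Professor Lucey’s team; `project` and `reprojection_error` are hypothetical names.

```python
def project(point, focal_length=1.0):
    """Idealised pinhole projection of a 3D point (x, y, z) to 2D (u, v)."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def reprojection_error(points_3d, labels_2d, focal_length=1.0):
    """Mean squared image-plane distance between projected 3D guesses and 2D labels."""
    total = 0.0
    for point, (u, v) in zip(points_3d, labels_2d):
        pu, pv = project(point, focal_length)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(points_3d)


# Hypothetical 2D annotations, generated here from known 3D points.
true_points = [(0.0, 0.0, 2.0), (1.0, 1.0, 4.0)]
labels = [project(p) for p in true_points]

good_guess = true_points
bad_guess = [(0.5, 0.0, 2.0), (1.0, 1.0, 4.0)]

print(reprojection_error(good_guess, labels))  # 0.0
print(reprojection_error(bad_guess, labels))   # 0.03125
```

Note that with a single view this error is also zero for any guess lying on the correct viewing rays, which is why additional views or prior knowledge are needed to pin down depth.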

One application of this kind of computer vision is something called 3D motion capture, where earlier advances brought us Gollum in The Lord of the Rings movies. It’s still a popular technique and one that’s widely used in film visual effects, video game production and even medicine and sports science. But even today it relies on a number of expensive, finely calibrated cameras, and sometimes still requires people to wear special reflective dots on their bodies and perform in front of a green screen. That’s a problem.

“People want the data they’re collecting to be realistic…they don’t want a white background. They don’t want a green screen. They would love to be out in the field, or in areas that are highly unconstrained. And the sheer cost of this limits the application of technology at the moment,” says Professor Lucey. “You can only apply it to problems where companies are willing to invest millions to build these things.”

Traditional motion capture systems require as many as 40 to 60 finely calibrated cameras to accurately track a person's movement in 3D space. New developments in computer vision AI could reduce that to just two or three. Photo: iStock / Anake Seenadee.

But in a 2021 project that saw Professor Lucey collaborate with researchers from Apple and Carnegie Mellon University[1], the team demonstrated a new AI method for 3D motion capture that could make the technology far more accessible and affordable.

“The work we’ve done on this paper has tried to ask the question: how few cameras could we get away with if we were willing to use AI to do this 3D lifting trick?”

The team used something called a neural prior, a mathematical way of giving an AI system an initial set of beliefs, expressed as a probability distribution, before any real data is provided.
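The general idea of a prior can be illustrated with a tiny Bayesian update. This is only a textbook sketch of how initial beliefs combine with data, not the neural prior from the paper; the depth hypotheses, the probabilities and the name `update_belief` are all made up for illustration.

```python
def update_belief(prior, likelihood):
    """Apply Bayes' rule over a discrete set of hypotheses."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: weight / total for h, weight in unnormalised.items()}


# Hypothetical initial beliefs about depth in metres, set before any data is seen.
prior = {2.0: 0.6, 4.0: 0.3, 8.0: 0.1}

# A single 2D view cannot tell these depths apart: all are equally likely.
ambiguous_view = {2.0: 1.0, 4.0: 1.0, 8.0: 1.0}
# A hypothetical further observation that favours a depth of 4 m.
informative_view = {2.0: 0.1, 4.0: 0.8, 8.0: 0.1}

belief = update_belief(prior, ambiguous_view)
print(max(belief, key=belief.get))  # 2.0 (the prior alone decides)

belief = update_belief(belief, informative_view)
print(max(belief, key=belief.get))  # 4.0 (the data overrides the prior)
```

When the data is ambiguous the prior decides the answer, and when the data is informative the belief shifts toward it.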

As a result, the new method can perform 3D motion capture from normal video footage (no green screens or special reflective dots required) using only two or three uncalibrated camera views. It delivers 3D reconstruction accuracy similar to what would otherwise require as many as 40 to 60 cameras using earlier methods.
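To give a feel for why even a second view helps so much, here is the classic depth-from-disparity calculation for two side-by-side cameras. It is a simplified sketch that assumes a calibrated, perfectly aligned camera pair with a known baseline, which is a much easier setting than the uncalibrated views handled by the team’s method; `depth_from_disparity` and all the numbers are illustrative.

```python
def depth_from_disparity(u_left, u_right, focal_length, baseline):
    """Depth of a point seen by two aligned cameras separated by `baseline`."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_length * baseline / disparity


focal_length = 1.0
baseline = 0.5  # assumed distance between the two cameras, in metres

# Two 3D points that look identical in the left camera (u = 0.25 for both).
for x, z in [(1.0, 4.0), (2.0, 8.0)]:
    u_left = focal_length * x / z                # left camera at the origin
    u_right = focal_length * (x - baseline) / z  # right camera shifted along x
    print(depth_from_disparity(u_left, u_right, focal_length, baseline))
# prints 4.0, then 8.0
```

The two points are indistinguishable in one image, but the small shift between the two views recovers their different depths.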

Professor Lucey highlights the importance of AI research that focuses on finding efficiencies and significant cost breakthroughs as a way of bringing technology to those who’d otherwise not have been able to afford it.

“It’s democratic AI. You could be a small startup and you could use this, whereas with other methods you’d need to be very well resourced financially,” he says.

The potential applications are broad and go well beyond capturing humans in motion, ranging from mobile phone filters and autonomous vehicles to wildlife conservation, improved robots and even space satellites.

[1] ‘High Fidelity 3D Reconstructions with Limited Physical Views’ was presented at the 2021 International Conference on 3D Vision, 1 December 2021.
