Perhaps the world’s most exciting new technology today are deep neural networks, in particular the convolutional neural networks such as “Deep Learning.” These networks are conquering some of the most well known problems in artificial intelligence and pattern matching, and since their development just a few years ago, milestones in AI have been falling as computer systems that match or surpass human capability have been demonstrated. Playing Go is just the most recent famous example.
This is particularly true in image recognition. Over the past several years, neural network systems have gotten better than humans at problems like recognizing street signs in camera images and even beating radiologists at identifying cancers in medical scans.
These networks are having their effect on robocar development. They are allowing significant progress in the use of vision systems for robotics and driving, making those progress much faster than expected. 2 years ago, I declared that the time when vision systems would be good enough to build a safe robocar without lidar was still fairly far away. That day has not yet arrived, but it is definitely closer, and it’s much harder to say it won’t be soon. At the same time, LIDAR and other sensors are improving and dropping in price. Quanergy (to whom I am an advisor) plans to ship $250 8-line LIDARs this year, and $100 high resolution LIDARS in the next couple of years.
The deep neural networks are a primary tool of MobilEye, the Jerusalem company which makes camera systems and machine-vision ASICs for the ADAS (Advanced Driver Assistance Systems) market. This is the chip used in Tesla’s autopilot, and Tesla claims it has done a great deal of its own custom development, while MobilEye claims the important magic sauce is still mostly them. NVIDIA has made a big push into the robocar market by promoting their high end GPUs as the supercomputing tool cars will need to run these networks well. The two companies disagree, of course, on whether GPUs or ASCICs are the best tool for this – more on that later.
In comes comma.ai
In February, I rode in an experimental car that took this idea to the extreme. The small startup comma.ai, lead by iPhone hacker George Hotz, got some press by building an autopilot similar in capability to many others from car companies in a short amount of time. In January, I wrote an introduction to their approach including how they used quick hacking of the car’s network bus to simplify having the computer control the car. They did it with CNNs, and almost entirely with CNNs. Their car feeds the images from a camera into the network, and out from the network come commands to adjust the steering and speed to keep a car in its lane. As such, there is very little traditional code in the system, just the neural network and a bit of control logic.
The network is built instead by training it. They drive the car around, and the car learns from the humans driving it what to do when it sees things in the field of view. To help in this training, they also give the car a LIDAR which provides an accurate 3D scan of the environment to more absolutely detect the presence of cars and other users of the road. By letting the network know during training that “there is really something there at these coordinates,” the network can learn how to tell the same thing from just the camera images. When it is time to drive, the network does not get the LIDAR data, however it does produce outputs of where it thinks the other cars are, allowing developers to test how well it is seeing things.
This approach is both interesting and frightening. This allows the development of a credible autopilot, but at the same time, the developers have minimal information about how it works, and never can truly understand why it is making the decisions it does. If it makes an error, they will generally not know why it made the error, though they can give it more training data until it no longer makes the error. (They can also replay all other scenarios for which they have recorded data to make sure no new errors are made with the new training data.)
You need a lot of training data. To drive using vision, you have to segment the 2D images that cameras provide into independent objects to create a 3D map of what you see. You can use parallax, relative motion and two cameras to do that, as humans do, but you also need to see patterns in the shapes to classify them when you see them from every distance and angle, and in every lighting condition. That’s been one of the big challenges for vision – you must understand things in sunlight, in diffuse light, in night-time (illuminated by headlights or other lights,) when the sun is pointed directly into your camera, and when complex objects like trees and casting moving shadows on the desired objects. You must tell pavement from lane markers from shoulders from debris from other road users in all those situations.
And you must do it so well that you only make a dangerous mistake perhaps once every million miles. A million miles is perhaps 2.7 billion frames of video. Fortunately you don’t have to be that good. It’s OK to not perceive an object every so often. It’s OK to miss it in one frame if you get it the next. Indeed, you as a human are probably looking at multiple frames at once to track motion, and so you must just make sure you don’t miss something for more like a couple of seconds. Human brains, good as they are, can’t figure out everything in an arbitrary still photograph, but they do a good job at spotting everything in a second or two of video.