I see it differently. As a robotics engineer, I know the biggest impediment to robotics development is getting computers to understand the real world. The work on multimodal neurons, which see the word “cake” and know to associate it with images of cake, is a key stepping stone on the way to a fully functional embodied AI that can solve difficult real-world problems. CLIP, DALL-E, and all these offshoots represent what we can pull from these efforts today. But long term, this work will be incorporated into bigger and more capable AI systems.

Just think: when I ask you to “walk into the workshop, grab a hammer and a box of nails, and meet me on the roof to help me secure some loose shingles,” your mind is already imagining the path you will take to get there and what it will look like when you locate and grab the hammer and nails. You’ve also filled in that to get on the roof you have to meet me in the back yard to climb the ladder, which I never mentioned.

All these tiny details that your mind fills in effortlessly take huge efforts like CLIP to work out. And even CLIP only handles text and images. There is a lot further to go from there.

A lot of people focus on DALL-E and the artifacts that come out along the way, but these are not the destination, just little stops showing the progress we are making on a much larger journey.