Droid future draws near with Google PaLM-E
Advanced deep learning models such as GPT-3 have paved the way for chatbot development, but physical robots have not been left behind. Google and Microsoft have recently begun applying similar AI models to robots, with impressive early results.
Researchers at Google and the Technical University of Berlin have introduced a new AI model called PaLM-E. It combines language and vision skills so that robots can operate independently in real-world situations, such as fetching a bag of chips from a kitchen or sorting colored blocks into designated areas of a rectangle.
PaLM-E builds on PaLM, Google's existing large language model. The "E" stands for "embodied," reflecting the model's ability to interact with physical objects and control robots. PaLM-E also builds on Google's RT-1 model, which takes robot inputs such as camera images and task instructions and outputs motor commands. In addition, it draws on ViT-22B, a vision transformer model trained on tasks such as image classification, object detection, and image captioning.
PaLM-E draws praise from researchers
With 562 billion parameters, PaLM-E is the largest visual-language model (VLM) reported to date. Its abilities include mathematical reasoning, multi-image reasoning, and chain-of-thought reasoning. The researchers explain in their report that training the model on multiple tasks at once lets skills transfer from one task to another, so it performs better than it would if trained on each task individually.
PaLM-E is an illustration of how the increased scale and advancement of large language models lead to improved capabilities, such as the ability to perform multimodal tasks with greater ease, accuracy, and autonomy.
These capabilities have drawn praise from academic researchers, and AI systems that act in the physical world may be closer than we think.
According to Jeff Clune, an Associate Professor of Computer Science at the University of British Columbia, as reported by Motherboard:
“This work represents a major step forward, but on an expected path. It extends recent, exciting work out of DeepMind to the important and difficult arena of robotics (their work on ‘Frozen’ and ‘Flamingo’). More broadly, it is part of the recent tsunami of amazing AI advances that combine a simple, but powerful formula”.
Google is not alone in the VLM market
In addition to Google, Microsoft has also been exploring the application of multimodal AI and large language models in robotics. Microsoft's research involves extending the capabilities of ChatGPT to robotics and introducing a multimodal model named Kosmos-1, which can perform tasks such as image content analysis, visual puzzle-solving, visual recognition, and IQ tests.
According to a report by Microsoft researchers, integrating language models with robotic capabilities is a significant step toward artificial general intelligence (AGI), meaning AI with a level of intelligence comparable to that of human beings.
However, the researchers acknowledge that there are still real-world challenges to be addressed, such as navigating around obstacles in a kitchen or avoiding the risk of slipping.