Large Language Model-Powered Human-Robot Interactions
March 17, 2023
Due to the rapid empirical improvements in large language models (LLMs), there is growing interest in the robotics community in leveraging them for more natural human-robot interactions.
At a high level, there is a chasm between the natural language in which a human user expresses their instructions, goals, and questions, such as English, and the native language a robot executes, such as torque commands. This mismatch makes a human-robot conversation difficult. Traditional approaches primarily focus on subproblems, often in specific environments, such as grounding/reference resolution and efficient motion planning. One frequent shortcoming of this line of work is the very limited language that a robot can understand (think of the small palette of inflexible commands that virtual assistants like Siri can resolve). In fact, the rigidly constructed commands that can be parsed often feel robotic. In contrast, LLMs, trained on enormous corpora, invite us to tackle this challenge in a more holistic way.
Large Language Models
LLMs such as OpenAI GPT-3.5 are quite capable of code generation given just a textual description, such as a programming problem statement. Instead of relying on a clever intermediate structured representation, LLMs are trained end-to-end to produce the final answer (by continuing the supplied prompt). Having been trained on a massive corpus covering many topics on the Internet, they seem to have implicitly captured an unprecedented amount of the common sense that we share. This capability suggests a way forward beyond the tedious engineering of effectively domain-specific languages for each robot and each environment.
For example, we can get Anthropic Claude v1.2 to brew and serve coffee from a suggestive prompt consisting of an incomplete Python snippet (zero-shot!).
from env import turn_on_coffee_machine, turn_off_coffee_machine, locate_cup
from cooking import resolve_ingredients
from two_arm_robot import grasp_object_left, grasp_object_right, place_object_left, place_object_right, plan_motion

# A capable household robot.
def brew_and_serve_coffee(coffee_type, kitchen_layout, serving_location):
The LLM completes it into:
def brew_and_serve_coffee(coffee_type, kitchen_layout, serving_location):
    turn_on_coffee_machine()  # Turn on the coffee machine.
    ingredients = resolve_ingredients(coffee_type)  # Figure out the ingredients we need.
    cup_location = locate_cup(kitchen_layout)  # Find a cup.
    grasp_object_left(cup_location)  # Grasp the cup with the left arm.
    for ingredient in ingredients:
        ingredient_location = locate_object(ingredient, kitchen_layout)  # Find the ingredient.
        grasp_object_right(ingredient_location)  # Grasp the ingredient container with the right arm.
        place_object_left(ingredient_location)  # Pour the ingredient into the cup.
        grasp_object_right(ingredient_location)  # Put the ingredient container back.
    place_object_left(serving_location)  # Place the full cup at the serving location.
    turn_off_coffee_machine()  # Turn off the coffee machine.
Without definitions of the underlying functions, it is not clear that this will work, but it clearly demonstrates that the LLM knows about coffee making in a kitchen at some level of detail and can express it in Python (in terms of controlling a two-armed robot).
Approach in Code-as-Policies
An interesting recent work that attempts to bridge this gap with an LLM is Code-as-Policies. The work has some attractive features: it uses a pretrained LLM without any extra training, and the LLM generates executable Python code for the robot. But an LLM such as GPT-3.5 cannot perceive an image directly. How, then, can we specialize its generation to a specific situation?
Main ideas:
- Via prompt engineering, we can reduce the problem to a format akin to programming challenges, a domain where LLMs have shown some success.
- We implement (motion) primitives for the environment, along with the basic functions that serve those primitives. They are responsible for low-level robot control. For example, in the demo below, the only motion primitive is picking from one position and placing at another position on a tabletop. We also need functions that return the position of an object (identified by its ID) to support this primitive. These APIs effectively translate information from different sensing modalities into code that the LLM can read, and as a whole they implicitly define an ontology for the environment. (A sketch of such an API appears after this list.)
- With a few examples involving each primitive (few-shot prompting), we show the LLM how this specific robot and environment work.
- The LLM does the heavy lifting of grounding (resolving references) and planning, which turns instructions into calls to these primitives. Interestingly, the solutions can contain unimplemented functions whose names are suggestive of their utility. We leverage the "common sense" encoded in the LLM to bypass some of the hardest subproblems in robotics, though not without silly mistakes at times.
- Whenever we encounter an undefined function in the generated code, we prompt the LLM to generate its implementation (see the second sketch after this list).
- Question & answering, conversational capabilities naturally arise via a few appropriate examples and passing past messages in prompts, respectively. Again, no specialized algorithms are needed. A trained LLM already comes with some ability to reply to questions (and generate code that finds the answer) and use contexts.
Some Observations
- GPT-3.5 seems quite capable of reducing natural language descriptions of positions into arithmetic on coordinates (see the short illustration after this list).
- Despite not being a code generation-centric model, text-davinci-003 tends to output better solutions than code-davinci-002.
- Recursive function generation is a nice idea, but it is quite error-prone in practice, e.g., mismatched semantics between a function's usage and its implementation.
- The solution is very much top-down, functionally speaking: the LLM has to come up with a solution that covers all kinds of possible situations. This seems more complex than necessary; we only need a solution that solves the problem in the current situation.
- The LLM can generate code involving objects not in the scene, which causes errors. Generated code can also contain syntax errors, and executing strings as Python code is messy.
- Prompts can grow long quickly due to the many demonstrative examples and the growing conversational history. This worsens latency and places an upper limit on the length of a conversation. How can we maintain a more concise context?
- Nevertheless, I am impressed by how far this solution gets us (in this tabletop environment demo). I also cannot help but think that some of the observed mistakes could be remedied through prompt engineering, e.g., providing better demonstrative examples or adding more concepts. But at some point, will prompt engineering be as tedious as the traditional decompositional solution? The former is an extensional definition, whereas the latter is intensional. When is it a good idea to use the latter? Is there some way to combine them?
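As an illustration of the first observation, an instruction like "place the cup halfway between the plate and the bowl" might get reduced to coordinate arithmetic along these lines (a hypothetical completion, reusing the sketched API above):

plate_pos = get_object_position("plate")
bowl_pos = get_object_position("bowl")
cup_pos = get_object_position("cup")
pick_place(cup_pos, (plate_pos + bowl_pos) / 2)  # midpoint between the plate and the bowl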
Demo
Building on the original demo, I improved the UI and added shortcuts for experimenting with prompting techniques. Chain-of-thought prompting (asking the LLM for a step-by-step explanation) tends to produce better solutions. Explicitly asking the LLM to request clarifications can trigger more conversational behavior when our instructions are ambiguous.
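As a rough sketch of how these two shortcuts might be wired into the prompt (the wording below is illustrative, not the demo's actual prompts):

CHAIN_OF_THOUGHT_HINT = "# First explain your plan step by step in comments, then write the code.\n"
CLARIFICATION_HINT = (
    "# If the instruction is ambiguous, ask a clarifying question instead of writing code.\n"
)

def make_prompt(base_prompt, instruction, chain_of_thought=False, allow_clarification=False):
    """Append the optional hints to the base prompt before the new instruction."""
    hints = ""
    if chain_of_thought:
        hints += CHAIN_OF_THOUGHT_HINT
    if allow_clarification:
        hints += CLARIFICATION_HINT
    return base_prompt + hints + f"# Instruction: {instruction}\n"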