Motion diffusion models and Reinforcement Learning (RL)-based control for physics-based simulation have complementary strengths for human motion generation. The former can generate a wide variety of motions and adheres to intuitive controls such as text, while the latter offers physically plausible motion and direct interaction with the environment. In this work, we present a method that combines their respective strengths. CLoSD is a text-driven, RL-based physics controller guided by diffusion generation for various tasks. Our key insight is that motion diffusion can serve as an on-the-fly universal planner for a robust RL controller. To this end, CLoSD maintains a closed-loop interaction between two modules: a Diffusion Planner (DiP) and a tracking controller. DiP is a fast-responding autoregressive diffusion model controlled by textual prompts and target locations, and the controller is a simple and robust motion imitator that continuously receives motion plans from DiP and provides feedback from the environment. CLoSD can seamlessly perform a sequence of different tasks, including navigating to a goal location, striking an object with a hand or foot as specified in a text prompt, sitting down, and getting up.
CLoSD generates text-controlled, diverse, and versatile motion for an SMPL-compatible robot that is interchangeable with the SMPL mesh.
Specify the motion with the text prompt! By simply changing the prompt, we can reach the target location with any desired type of locomotion. Target striking can also be controlled through the text prompt and corresponding targets, which can be specified for both hands, a foot, or a wrist, as seen in the examples below. Since a single instance of CLoSD can perform all of the above, these tasks can be natively combined into a sequence.
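To make the interface concrete, here is a minimal sketch of how a sequence of tasks for a CLoSD-style system could be expressed as (text prompt, target location) pairs. The `Task` structure and all field names are illustrative assumptions, not the authors' API.

```python
# Hypothetical task-sequence specification for a CLoSD-style system.
# Each task pairs a textual prompt with an optional 3D target
# (goal location to reach, or point to strike); names are assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Task:
    prompt: str                                           # e.g., "a person runs forward"
    target: Optional[Tuple[float, float, float]] = None   # goal or strike point, if any

# A sequence of different tasks performed by a single controller instance.
tasks = [
    Task("a person walks to the sofa", target=(2.0, 0.0, 0.0)),
    Task("a person sits down on the sofa"),
    Task("a person gets up from the sofa"),
    Task("a person kicks a target with the right foot", target=(1.0, 0.5, 1.0)),
]
```

Because one model handles every task, chaining them is just iterating over this list and updating the prompt and target between tasks.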
Our key insight is that motion diffusion can generate plans in real time for a tracking controller, flexibly guiding it across various tasks. However, diffusion models are slow and unaware of the physical world, and tracking controllers are not trained to cope with generation-related inaccuracies.
To overcome these challenges, we designed CLoSD, which maintains a closed-loop interaction between a real-time diffusion model and a robust tracking controller. Our real-time diffusion planner, DiP, autoregressively generates short-term motion sequences conditioned on its inputs and the previous motion. The tracking controller is an RL policy that consumes the DiP motion plan frame by frame and produces the actions needed to track it. After the plan is followed, we close the loop by feeding the frames performed in practice back into DiP as the new motion prefix. Depicted here is a sitting-down motion performed by CLoSD; getting up from the sofa, for example, is as easy as changing the prompt and target.
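The plan-track-replan cycle above can be sketched in a few lines. This is a schematic of the control flow only, assuming a planner that maps a motion prefix and prompt to a short future window, and a tracker that physically executes one frame at a time; the function names and signatures are ours, not the paper's.

```python
# Sketch of a CLoSD-style closed loop (illustrative names, not the authors' API).
# A planner produces a short-term plan from the motion prefix and text prompt;
# a tracker executes it frame by frame in simulation; the frames actually
# performed become the prefix for the next planning call.
from typing import Callable, List

Frame = List[float]  # e.g., joint positions for one time step

def closed_loop(
    plan: Callable[[List[Frame], str], List[Frame]],  # DiP-like planner
    track: Callable[[Frame], Frame],                  # RL tracking step
    prefix: List[Frame],
    prompt: str,
    n_replans: int,
) -> List[Frame]:
    executed: List[Frame] = []
    for _ in range(n_replans):
        window = plan(prefix, prompt)            # short-term plan from prefix + text
        performed = [track(f) for f in window]   # physics-corrected frames
        executed.extend(performed)
        prefix = performed[-len(prefix):]        # feed executed motion back as prefix
    return executed
```

The key design choice is that the prefix comes from the simulation, not from the planner's own output, so planning always restarts from the physically realized state.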
Here, the blue markers depict the diffusion plan, and the orange character shows the motion executed by the tracking controller. Note how the controller corrects small artifacts of the diffusion plan in real time.
We compare our work with UniHSI [Xiao 2024], a state-of-the-art multi-task controller. UniHSI formulates tasks as reaching a sequence of contact points. Without a semantic description of the task, UniHSI touches the target instead of striking it and bends the pelvis instead of getting up. The semantic understanding of our diffusion model enables CLoSD to perform these tasks correctly.
We compare CLoSD to two baselines: an open-loop variant without re-planning or feedback, and a closed-loop variant in which the components are not fine-tuned to communicate with each other. As can be seen, only the full CLoSD system, with its semantic planning and object-interaction reasoning, produces natural human motion for sitting down and getting up.