Biography


I am a Research Scientist at the Institute of Artificial Intelligence (TeleAI), China Telecom, and the Director of the Embodied AI Research Center, specializing in the cutting-edge fields of Embodied AI and Reinforcement Learning (RL). Our group is dedicated to developing embodied technologies encompassing perception, planning, locomotion, and manipulation, and to promoting the industrial application of embodied AI. Our group thrives under the leadership of Prof. Xuelong Li, who serves as the dean of TeleAI. Previously, I was a Researcher at Shanghai AI Laboratory, affiliated with the IPEC group. My research interests include diffusion/transformer policies, LLM-driven planning, world models, preference learning, RL/MPC-based locomotion, dexterous manipulation, representation learning, sim-to-real transfer, and multi-agent collaboration, as well as real-world applications for robot arms, dexterous hands, quadruped robots, and humanoid robots.

I hold a Ph.D. in Computer Science from Harbin Institute of Technology (HIT), where I was advised by Prof. Peng Liu. I have been fortunate to collaborate with many fantastic researchers. I was a joint PhD student at the University of Toronto and the Vector Institute, working with Prof. Animesh Garg. I also interned at Huawei Noah’s Ark Lab (advised by Prof. Jianye Hao), Tencent Robotics X (advised by Dr. Lei Han), and Alibaba. I received my Bachelor’s and Master’s degrees in Computer Science from HIT.

Chenjia Bai (白辰甲), Ph.D., is a Research Scientist at the Institute of Artificial Intelligence (TeleAI), China Telecom, and the lead of the Embodied AI team. His research covers embodied AI, humanoid robots, large models for locomotion and manipulation, and reasoning alignment. He has published over 50 papers in venues including the AI Journal, TPAMI, and NeurIPS, and has authored one monograph. He has led projects funded by the National Natural Science Foundation of China and the National Key R&D Program of China. He was selected for the Young Elite Scientists Sponsorship Program of the China Association for Science and Technology, the Shanghai Rising-Star (Sailing) Program, and the Shanghai Guangqi Young Talent Program; he received the WAIC Outstanding Paper Nomination Award and the HIT Outstanding Doctoral Dissertation Award, and serves as an area chair and reviewer for multiple top international conferences and journals.

Our team is recruiting full-time researchers, interns, and jointly supervised PhD students in embodied AI; see the link for details.

Interests
  • Embodied AI
  • Reinforcement Learning
  • Foundation Model for Decision Making
Education
  • PhD in Computer Science, 2017-2022

    Harbin Institute of Technology

  • Joint PhD Program, 2021-2022

    University of Toronto

Publications

"✉" denotes corresponding author
Online Preference Alignment for Language Models via Count-based Exploration.
In International Conference on Learning Representations (ICLR), 2025     Spotlight
We propose count-based online preference optimization for LLM alignment that leverages coin-flip counting to encourage exploration in online RLHF.
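As a rough illustration of the count-based idea, an exploration bonus can be added to the preference reward so that rarely visited prompt-response pairs are favored during online data collection. This is a minimal sketch in Python; the names (exploratory_reward, pseudo_count, beta) are illustrative, and the paper's coin-flip counting module is abstracted away as a generic pseudo-count estimator.

    import math

    def exploratory_reward(reward: float, pseudo_count: float, beta: float = 1.0) -> float:
        # UCB-style bonus: the fewer times a prompt-response pair has
        # (approximately) been seen, the larger the exploration incentive.
        return reward + beta / math.sqrt(pseudo_count + 1e-8)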
On the Value of Myopic Behavior in Policy Reuse.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
We present a framework called Selective Myopic bEhavior Control (SMEC), which results from the insight that the short-term behaviors of prior policies are shareable across tasks.
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning.
In Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024
We introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods where one needs to transfer policies across different domains with dynamics mismatch.
Robust Quadrupedal Locomotion via Risk-Averse Policy Learning.
In IEEE International Conference on Robotics and Automation (ICRA), 2024     Oral
We consider a novel risk-sensitive perspective to enhance the robustness of legged locomotion.
False Correlation Reduction for Offline Reinforcement Learning.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
RORL: Robust Offline Reinforcement Learning via Conservative Smoothing.
In Neural Information Processing Systems (NeurIPS), 2022     Spotlight
We propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique.
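A minimal sketch of the smoothing idea, assuming a critic callable q_net(states, actions) (a hypothetical interface, not the paper's code): Q-values are regularized to change little under small state perturbations, so noise or adversarial inputs cannot easily inflate value estimates.

    import torch

    def smoothing_loss(q_net, states, actions, eps: float = 0.01):
        # Penalize sharp changes of Q around observed states; an
        # illustrative regularizer, not the paper's exact objective.
        perturbed = states + eps * torch.randn_like(states)
        return ((q_net(states, actions) - q_net(perturbed, actions)) ** 2).mean()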
Monotonic Quantile Network for Worst-Case Offline Reinforcement Learning.
IEEE Transactions on Neural Networks and Learning Systems, 2022
We propose monotonic quantile network (MQN) with conservative quantile regression (CQR) for risk-averse policy learning.
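For intuition about the worst-case (risk-averse) objective, conditional value-at-risk (CVaR) can be computed from a quantile critic by averaging the lowest alpha-fraction of return quantiles; the sketch below is illustrative and not the paper's implementation.

    import numpy as np

    def cvar_from_quantiles(quantiles: np.ndarray, alpha: float = 0.25) -> float:
        # Rank actions by their worst-case outcomes rather than the mean return.
        sorted_q = np.sort(quantiles)
        k = max(1, int(alpha * len(sorted_q)))
        return float(sorted_q[:k].mean())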
Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain.
IEEE Transactions on Neural Networks and Learning Systems, 2022
We conduct a comprehensive survey on existing exploration methods for both single-agent RL and multiagent RL.
Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning.
In International Conference on Learning Representations (ICLR), 2022     Spotlight
We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints.
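The core pessimism mechanism can be summarized in a few lines: with several bootstrapped Q-estimates, out-of-distribution actions tend to show high disagreement, and subtracting the ensemble standard deviation penalizes them. A minimal sketch, with hypothetical names:

    import numpy as np

    def pessimistic_value(q_ensemble: np.ndarray, beta: float = 1.0) -> float:
        # Ensemble disagreement serves as an uncertainty estimate;
        # beta controls the strength of the pessimistic penalty.
        return float(q_ensemble.mean() - beta * q_ensemble.std())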
Dynamic Bottleneck for Robust Self-Supervised Exploration.
In Neural Information Processing Systems (NeurIPS), 2021
We propose a Dynamic Bottleneck (DB) model, which attains a dynamics-relevant representation based on the information-bottleneck principle.
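As background, a generic information-bottleneck objective for dynamics-relevant representation learning (standard notation, not necessarily the paper's exact formulation) is

    \max_{\phi} \; I\left(Z_t;\, S_{t+1}\right) \;-\; \beta\, I\left(Z_t;\, [S_t, A_t]\right), \qquad Z_t = \phi(S_t, A_t),

where the first term keeps the representation predictive of the next state and the second compresses away dynamics-irrelevant information.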
Principled Exploration via Optimistic Bootstrapping and Backward Induction.
In International Conference on Machine Learning (ICML), 2021     Spotlight
We propose a principled exploration method for DRL through Optimistic Bootstrapping and Backward Induction (OB2I).
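A minimal sketch of how a UCB-style bonus can be combined with backward induction (illustrative names, not the paper's code): bonuses collected along an episode are propagated backward, so uncertainty about future states also raises the learning targets of earlier ones.

    def backward_targets(rewards, bonuses, gamma: float = 0.99):
        # Walk the episode from the last step to the first, accumulating
        # reward plus exploration bonus into discounted return-style targets.
        target, targets = 0.0, []
        for r, b in zip(reversed(rewards), reversed(bonuses)):
            target = r + b + gamma * target
            targets.append(target)
        return targets[::-1]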

Talks

Service

  • Senior Program Committee Member (SPC) / Area Chair (AC) of AAMAS (2024 - 2025)
  • Program Committee Member (PC) / Conference Reviewer of RSS (2024 - 2025)
  • Program Committee Member (PC) / Conference Reviewer of NeurIPS (2021 - 2025)
  • Program Committee Member (PC) / Conference Reviewer of ICLR (2021 - 2025)
  • Program Committee Member (PC) / Conference Reviewer of ICML (2022 - 2025)
  • Program Committee Member (PC) / Conference Reviewer of AAAI (2021 - 2025)
  • Program Committee Member (PC) / Conference Reviewer of ICRA (2024 - 2025)
  • Program Committee Member (PC) / Conference Reviewer of ECAI (2023 - 2025)
  • Journal Reviewer: IEEE Transactions on Cybernetics, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI), IEEE Transactions on Intelligent Vehicles, Pattern Recognition.

Experience

Research Scientist
TeleAI, China Telecom
2024 – Present, China

Researcher
Shanghai AI Laboratory
2022 – 2024, China

Joint PhD Student
University of Toronto
2021 – 2022, Canada

PhD Student
Harbin Institute of Technology
2017 – 2022, China