Publications

Online Preference Alignment for Language Models via Count-based Exploration.

In International Conference on Learning Representations (ICLR), 2025 Spotlight

We propose count-based online preference optimization for LLM alignment that leverages coin-flip counting to encourage exploration in online RLHF.

Chenjia Bai , Yang Zhang , Shuang Qiu , Qiaosheng Zhang , Kang Xu , Xuelong Li^✉

Online Preference Alignment for Language Models via Count-based Exploration.

Discriminator-Guided Embodied Planning for LLM Agent.

In International Conference on Learning Representations (ICLR), 2025

We propose a novel framework that generalizes demonstrations to establish critic-regularized grounding and optimization in the long-term planning of LLMs.

Haofu Qian , Chenjia Bai^✉ , Jiatao Zhang , Fei Wu , Wei Song , Xuelong Li

Discriminator-Guided Embodied Planning for LLM Agent.

Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning.

In International Conference on Learning Representations (ICLR), 2025

We introduce ExpoComm, a scalable communication protocol that leverages exponential topologies for efficient information dissemination among many agents in large-scale multi-agent reinforcement learning.

Xinran Li , Xiaolu Wang , Chenjia Bai , Jun Zhang

Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning.

Preference Aligned Diffusion Planner for Quadrupedal Locomotion Control.

In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

We develop a learning framework combining offline diffusion planner and online preference alignment with weak preference labeling for legged locomotion control.

Xinyi Yuan , Zhiwei Shang , Zifan Wang , Chenkai Wang , Zhao Shan , Zhenchao Qi , Meixin Zhu^✉ , Chenjia Bai^✉ , Xuelong Li

Preference Aligned Diffusion Planner for Quadrupedal Locomotion Control.

Radiology Report Generation via Multi-objective Preference Optimization.

In AAAI Conference on Artificial Intelligence (AAAI), 2025

We propose a new radiology report generation method that aligns the pre-trained model with multiple human preferences via preference-guided multi-objective optimization reinforcement learning.

Ting Xiao , Lei Shi , Peng Liu , Zhe Wang , Chenjia Bai^✉

Radiology Report Generation via Multi-objective Preference Optimization.

Forward KL Regularized Preference Optimization for Aligning Diffusion Policies.

In AAAI Conference on Artificial Intelligence (AAAI), 2025

We propose Forward KL regularized Preference optimization for aligning Diffusion policies to align the diffusion policy with preferences, learning to align the policy output with human intents in various tasks.

Zhao Shan , Chenyou Fan , Shuang Qiu , Jiyuan Shi , Chenjia Bai^✉

SelfBC: Self Behavior Cloning for Offline Reinforcement Learning .

In European Conference on Artificial Intelligence (ECAI), 2024

We propose a novel dynamic policy constraint that restricts the learned policy on the samples generated by the exponentional moving average of previously learned policies for offline RL.

Shirong Liu , Chenjia Bai , Zixian Guo , Hao Zhang , Gaurav Sharma , Yang Liu^✉

SelfBC: Self Behavior Cloning for Offline Reinforcement Learning .

Online Iterative Self-Alignment for Radiology Report Generation.

In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

We propose an online iterative self-alignment method for radiology report generation that iteratively generates unlimited preference data and automatically aligns with radiologists’ multiple objectives.

Ting Xiao , Lei Shi , Yang Zhang , HaoFeng Yang , Zhe Wang , Chenjia Bai^✉

Online Iterative Self-Alignment for Radiology Report Generation.

Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration.

In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.

Yang Zhang , Shixin Yang , Chenjia Bai^✉ , Fei Wu , Xiu Li , Xuelong Li , Zhen Wang

Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration.

Towards Reliable LLM-based Robots Planning via Combined Uncertainty Estimation

In Neural Information Processing Systems (NeurIPS), 2025

We propose CURE, a method that splits LLM planning uncertainty into epistemic and intrinsic parts for more reliable robot decision-making.

Shiyuan Yin , Chenjia Bai^✉ , Zhang Zizhao , Junwei Jin , Xinxin Zhang , Chi Zhang , Xuelong Li

Towards Reliable LLM-based Robots Planning via Combined Uncertainty Estimation

Contrastive Representation for Data Filtering in Cross-Domain Offline Reinforcement Learning.

In International Conference on Machine Learning (ICML), 2024

We propose a novel representation-based approach to measure the domain gap, where the representation is learned through a contrastive objective by sampling transitions from different domains.

Xiaoyu Wen , Chenjia Bai^✉ , Kang Xu , Xudong Yu , Yang Zhang , Xuelong Li , Zhen Wang

Contrastive Representation for Data Filtering in Cross-Domain Offline Reinforcement Learning.

SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation.

In International Conference on Machine Learning (ICML), 2024

We propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning.

Junjie Zhang , Chenjia Bai^✉ , Haoran He , Zhigang Wang , Bin Zhao , Xiu Li , Xuelong Li

SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation.

Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning.

In Artificial Intelligence (AIJ), 2025

This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS).

Qiaosheng Zhang , Chenjia Bai , Shuyu Hu , Zhen Wang^✉ , Xuelong Li^✉

Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning.

Constrained Ensemble Exploration for Unsupervised Skill Discovery.

In International Conference on Machine Learning (ICML), 2024

We propose a novel unsupervised RL framework via an ensemble of skills, where each skill performs partition exploration based on the state prototypes.

Chenjia Bai , Rushuai Yang , Qiaosheng Zhang , Kang Xu , Yi Chen , Ting Xiao , Xuelong Li

Constrained Ensemble Exploration for Unsupervised Skill Discovery.

Regularized Conditional Diffusion Model for Multi-Task Preference Alignment.

In Neural Information Processing Systems (NeurIPS), 2024

We adopt multi-task preferences as a unified condition for both single- and multi-task decision-making, and propose preference representations aligned with preference labels.

Xudong Yu , Chenjia Bai^✉ , Haoran He , Changhong Wang , Xuelong Li

Regularized Conditional Diffusion Model for Multi-Task Preference Alignment.

How Does Goal Relabeling Improve Sample Efficiency?

In International Conference on Machine Learning (ICML), 2024

We construct an example to show the information-theoretical improvement in sample efficiency achieved by goal relabeling and develop an RL algorithm called GOALIVE.

Sirui Zheng , Chenjia Bai , Zhuoran Yang , Zhaoran Wang

How Does Goal Relabeling Improve Sample Efficiency?

Cross-Domain Policy Adaptation by Capturing Representation Mismatch.

In International Conference on Machine Learning (ICML), 2024

We consider dynamics adaptation settings where there exists dynamics mismatch between the source domain and the target domain, and one can get access to sufficient source domain data, while can only have limited interactions with the target domain.

Jiafei Lyu , Chenjia Bai , Jing-Wen Yang , Zongqing Lu , Xiu Li

Cross-Domain Policy Adaptation by Capturing Representation Mismatch.

Skill Matters: Dynamic Skill Learning for Multi-Agent Cooperative Reinforcement Learning.

Neural Networks, 2024

We propose a novel Dynamic Skill Learning (DSL) framework to enable more effective adaptation and collaboration in complex tasks.

Tong Li , Chenjia Bai^✉ , Kang Xu , Chen Chu , Peican Zhu , Zhen Wang^✉

Skill Matters: Dynamic Skill Learning for Multi-Agent Cooperative Reinforcement Learning.

Robust Quadrupedal Locomotion via Risk-Averse Policy Learning.

In IEEE International Conference on Robotics and Automation (ICRA), 2024 Oral

We consider a novel risk-sensitive perspective to enhance the robustness of legged locomotion.

Jiyuan Shi , Chenjia Bai^✉ , Haoran He , Lei Han , Dong Wang , Bin Zhao , Mingguo Zhao , Xiu Li , Xuelong Li

Robust Quadrupedal Locomotion via Risk-Averse Policy Learning.

OVD-Explorer: Optimism should not be the Sole Pursuit of Exploration in Noisy Environments.

In AAAI Conference on Artificial Intelligence (AAAI), 2024

We propose Optimistic Value Distribution Explorer (OVD-Explorer) to achieve a noise-aware optimistic exploration for continuous control.

Jinyi Liu , Zhi Wang , Yan Zheng , Jianye Hao , Chenjia Bai , Junjie Ye , Zhen Wang , Et Al.

OVD-Explorer: Optimism should not be the Sole Pursuit of Exploration in Noisy Environments.

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training.

In Neural Information Processing Systems (NeurIPS), 2024

We introduce a novel framework that leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.

Haoran He , Chenjia Bai^✉ , Ling Pan , Weinan Zhang , Bin Zhao , Xuelong Li

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training.

Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

under review

We propose Align-Then-stEer (ATE), a framework that adapts VLAs to novel robots and tasks through unified latent guidance. ATE can handle significant domain shifts without compromising performance and compatible to Pi0, RDT, and etc.

Yang Zhang , Chenwei Wang , Ouyang Lu , Yuan Zhao , Yunfei Ge , Zhenglong Sun , Xiu Li , Chi Zhang , Chenjia Bai^✉ , Xuelong Li^✉

Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control

under review

We present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy through hybrid tracking and orthogonal mixture of experts.

Jinrui Han , Weiji Xie , Jiakun Zheng , Jiyuan Shi , Weinan Zhang , Ting Xiao , Chenjia Bai^✉

KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control

Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness.

In Journal of Artificial Intelligence Research (JAIR), 2023

We propose the Robust Offline-to-Online (RO2O) algorithm, designed to enhance offline policies through uncertainty and smoothness, and to mitigate the performance drop in online adaptation.

Xiaoyu Wen , Xudong Yu , Rui Yang , Chenjia Bai^✉ , Zhen Wang

Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness.

On the Value of Myopic Behavior in Policy Reuse.

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

We present a framework called Selective Myopic bEhavior Control~(SMEC), which results from the insight that the short-term behaviors of prior policies are sharable across tasks.

Chenjia Bai , Kang Xu , Shuang Qiu , Haoran He , Bin Zhao , Zhen Wang , Wei Li , Xuelong Li

On the Value of Myopic Behavior in Policy Reuse.

大模型驱动的具身智能：发展与挑战

中国科学：信息科学

We give a comprehensive survey for embodied AI driven by large-scale models.

Chenjia Bai , Huazhe Xu , Xuelong Li^✉

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning.

In Neural Information Processing Systems (NeurIPS), 2023

We aim to investigate the effectiveness of a single diffusion model in modeling large-scale multi-task offline data, which can be challenging due to diverse and multimodal data distribution.

Haoran He , Chenjia Bai^✉ , Kang Xu , Zhuoran Yang , Weinan Zhang , Dong Wang , Bin Zhao , Xuelong Li

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning.

Cross-Domain Policy Adaptation via Value-Guided Data Filtering.

In Neural Information Processing Systems (NeurIPS), 2023

We reveal the limitations of these methods and explore the problem from the value difference perspective via a novel insight on the value consistency across domains.

Kang Xu , Chenjia Bai^✉ , Xiaoteng Ma , Dong Wang , Bin Zhao , Zhen Wang , Xuelong Li , Wei Li

Cross-Domain Policy Adaptation via Value-Guided Data Filtering.

Ensemble Successor Representations for Task Generalization in Offline-to-Online Reinforcement Learning.

In SCIENCE CHINA Information Sciences, 2023

Our work builds upon the investigation of successor representations for task generalization in online RL and extends the framework to incorporate offline-to-online learning.

Changhong Wang , Xudong Yu , Chenjia Bai , Qiaosheng Zhang , Zhen Wang^✉

Ensemble Successor Representations for Task Generalization in Offline-to-Online Reinforcement Learning.

Pessimistic Value Iteration for Multi-Task Data Sharing in Offline Reinforcement Learning.

In Artificial Intelligence (AIJ), 2023

We propose an uncertainty-based MTDS approach that shares the entire dataset without data selection.

Chenjia Bai , Lingxiao Wang , Jianye Hao , Zhuoran Yang , Bin Zhao , Zhen Wang^✉ , Xuelong Li^✉

Pessimistic Value Iteration for Multi-Task Data Sharing in Offline Reinforcement Learning.

Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner.

In International Conference on Machine Learning (ICML), 2025

We develop a versatile diffusion planner that can leverage large-scale inferior data that contains task-agnostic sub-optimal trajectories, with the ability to fast adapt to specific tasks.

Chenyou Fan , Chenjia Bai^✉ , Zhao Shan , Haoran He , Yang Zhang , Zhen Wang

Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner.

ODRL: A Benchmark for Off-Dynamics Reinforcement Learning.

In Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024

We introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods where one needs to transfer policies across different domains with dynamics mismatch.

Jiafei Lyu , Kang Xu , Jiacheng Xu , Mengbei Yan , Jing-Wen Yang , Zongzhang Zhang , Chenjia Bai^✉ , Zongqing Lu^✉ , Xiu Li^✉

ODRL: A Benchmark for Off-Dynamics Reinforcement Learning.

Behavior Contrastive Learning for Unsupervised Skill Discovery.

In International Conference on Machine Learning (ICML), 2023

We propose a novel unsupervised skill discovery method through contrastive learning among behaviors, which makes the agent produce similar behaviors for the same skill and diverse behaviors for different skills.

Rushuai Yang , Chenjia Bai^✉ , Hongyi Guo , Siyuan Li , Bin Zhao , Zhen Wang , Peng Liu , Xuelong Li

Behavior Contrastive Learning for Unsupervised Skill Discovery.

Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning.

In Information Sciences, 2023

We introduce a novel strategy employing diverse randomized value functions to estimate the posterior distribution of Q-values.

Xudong Yu , Chenjia Bai^✉ , Hongyi Guo , Changhong Wang^✉ , Zhen Wang

Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning.

False Correlation Reduction for Offline Reinforcement Learning.

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.

Zhihong Deng , Zuyue Fu , Lingxiao Wang , Zhuoran Yang , Chenjia Bai , Tianyi Zhou , Jing Jiang

False Correlation Reduction for Offline Reinforcement Learning.

Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

under review

We propose a novel bimanual foundation policy that leverages text-to-video models to predict robot trajectories and uses optical flow as an intermediate variable to improve generalization.

Chenyou Fan , Fangzheng Yan , Chenjia Bai^✉ , Jiepeng Wang , Chi Zhang , Zhen Wang , Xuelong Li^✉

Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

Learn as Individuals, Evolve as a Team: Multi-agent LLMs Adaptation in Embodied Environments

under review

We propose the Learn as Individuals, Evolve as a Team (LIET) framework to enable multi-agent LLMs to adapt to embodied environments through individual learning and team evolution

Xinran Li , Chenjia Bai^✉ , Zijian Li , Jiakun Zheng , Ting Xiao , Jun Zhang^✉

Learn as Individuals, Evolve as a Team: Multi-agent LLMs Adaptation in Embodied Environments

MoRE: Mixture of Residual Experts for Humanoid Lifelike Gaits Learning on Complex Terrains

under review

We propose a novel framework that enables humanoid robots to traverse complex terrains with controllable human-like gaits using a mixture of latent residual experts and multi-discriminators.

Dewei Wang , Xinmiao Wang , Xinzhe Liu , Jiyuan Shi , Yingnan Zhao , Chenjia Bai^✉ , Xuelong Li^✉

MoRE: Mixture of Residual Experts for Humanoid Lifelike Gaits Learning on Complex Terrains

KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

In Neural Information Processing Systems (NeurIPS), 2025

We propose a physics-based humanoid control framework, aiming to master highly-dynamic human behaviors such as Kungfu and dancing through multi-steps motion processing and adaptive motion tracking.

Weiji Xie(+) , Jinrui Han(+) , Jiakun Zheng(+) , Huanyu Li , Xinzhe Liu , Jiyuan Shi , Weinan Zhang , Chenjia Bai^✉ , Xuelong Li^✉

KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning.

In Neural Information Processing Systems (NeurIPS), 2025

we propose HumanoidGen, an automated task creation and demonstration collection framework that leverages atomic dexterous operations and LLM reasoning to generate relational constraints.

Zhi Jing , Siyuan Yang , Jicong Ao , Ting Xiao , Yugang Jiang , Chenjia Bai^✉

HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning.

Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective

In Neural Information Processing Systems (NeurIPS), 2025

We propose Diffusion-Inspired Multi-Agent world model (DIMA), a novel framework for multi-agent reinforcement learning that leverages diffusion models to reduce modeling complexity and improve sample efficiency.

Yang Zhang , Xinran Li , Jianing Ye , Delin Qu , Shuang Qiu , Chongjie Zhang , Xiu Li , Chenjia Bai^✉

Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective

Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning

In Neural Information Processing Systems (NeurIPS), 2025

We propose Adversarial Locomotion and Motion Imitation (ALMI) for humanoid robots, which serves as a novel framework for loco-manipulation tasks, enabling adversarial policy learning between upper and lower body.

Jiyuan Shi , Xinzhe Liu , Dewei Wang , Ouyang Lu , Sören Schwertfeger , Fuchun Sun , Chenjia Bai^✉ , Xuelong Li^✉

Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning

Information-Theoretic Reward Decomposition for Generalizable RLHF.

In Neural Information Processing Systems (NeurIPS), 2025

We decompose the reward value in RLHF into two independent components that consists prompt-free reward and prompt-related reward, and propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values.

Liyuan Mao , Haoran Xu , Amy Zhang , Weinan Zhang^✉ , Chenjia Bai^✉

Information-Theoretic Reward Decomposition for Generalizable RLHF.

Humanoid Whole-Body Locomotion on Narrow Terrain via Dynamic Balance and Reinforcement Learning.

In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

we propose a novel whole-body locomotion algorithm based on dynamic balance and Reinforcement Learning (RL) that enables humanoid robots to traverse extreme terrains, particularly narrow pathways and unexpected obstacles, using only proprioception.

Weiji Xie , Chenjia Bai^✉ , Jiyuan Shi , Junkai Yang , Yunfei Ge , Weinan Zhang^✉ , Xuelong Li

Humanoid Whole-Body Locomotion on Narrow Terrain via Dynamic Balance and Reinforcement Learning.

Distributional Off-Policy Evaluation in Reinforcement Learning

In Journal of the American Statistical Association (JASA), 2025

This paper proposes an offline Wasserstein-based approach to estimate the joint distribution of multivariate discounted cumulative rewards, establishes finite sample error bounds in the batch setting, and demonstrates its superior performance through extensive numerical studies.

Zhengling Qi , Chenjia Bai , Zhaoran Wang , Lan Wang^✉

Distributional Off-Policy Evaluation in Reinforcement Learning

VLP: Vision-Language Preference Learning for Embodied Manipulation.

In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

we propose a novel Vision-Language Preference learning framework that learns a vision-language preference model to provide preference feedback for embodied manipulation tasks.

Runze Liu , Chenjia Bai^✉ , Jiafei Lyu , Shengjie Sun , Yali Du , Xiu Li^✉

VLP: Vision-Language Preference Learning for Embodied Manipulation.

Bridging the Sim-to-Real Gap from the Information Bottleneck Perspective.

In Annual Conference on Robot Learning (CORL), 2024 Oral

We propose a novel single-stage privileged knowledge distillation method called the Historical Information Bottleneck (HIB) to narrow the sim-to-real gap for legged locomotion.

Haoran He , Peilin Wu , Chenjia Bai , Hang Lai , Lingxiao Wang , Ling Pan, , Xiaolin Hu , Weinan Zhang^✉

Bridging the Sim-to-Real Gap from the Information Bottleneck Perspective.

RORL: Robust Offline Reinforcement Learning via Conservative Smoothing.

In Neural Information Processing Systems (NeurIPS), 2022 Spotlight

We propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique.

Rui Yang^✉ , Chenjia Bai^✉ , Xiaoteng Ma , Zhaoran Wang , Chongjie Zhang , Lei Han

RORL: Robust Offline Reinforcement Learning via Conservative Smoothing.

Self-Supervised Imitation for Offline Reinforcement Learning with Hindsight Relabeling.

IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2022

We present an offline RL algorithm that combines hindsight relabeling and supervised regression to predict actions without oracle information.

Xudong Yu , Chenjia Bai , Changhong Wang , Dengxiu Yu , C. L. Philip Chen , Zhen Wang^✉

Self-Supervised Imitation for Offline Reinforcement Learning with Hindsight Relabeling.

Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning.

In International Conference on Machine Learning (ICML), 2022 Spotlight

We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions. For both models, we propose to extract the correct feature representations of the low-rank model by minimizing a contrastive loss.

Shuang Qiu , Lingxiao Wang , Chenjia Bai , Zhuoran Yang , Zhaoran Wang

Monotonic Quantile Network for Worst-Case Offline Reinforcement Learning.

IEEE Transactions on Neural Networks and Learning Systems, 2022

We propose monotonic quantile network (MQN) with conservative quantile regression (CQR) for risk-averse policy learning.

Chenjia Bai , Ting Xiao , Zhoufan Zhu , Lingxiao Wang , Fan Zhou , Peng Liu

Monotonic Quantile Network for Worst-Case Offline Reinforcement Learning.

Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain.

IEEE Transactions on Neural Networks and Learning Systems, 2022

We conduct a comprehensive survey on existing exploration methods for both single-agent RL and multiagent RL.

Jianye Hao , Tianpei Yang , Hongyao Tang , Chenjia Bai , Jinyi Liu , Zhaopeng Meng , Peng Liu , Zhen Wang

Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain.

Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning.

International Conference on Learning Representations (ICLR), 2022 Spotlight

We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints.

Chenjia Bai , Lingxiao Wang , Zhuoran Yang , Zhihong Deng , Animesh Garg , Peng Liu , Zhaoran Wang

Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning.

Dynamic Bottleneck for Robust Self-Supervised Exploration.

In Neural Information Processing Systems (NeurIPS), 2021

We propose a Dynamic Bottleneck (DB) model, which attains a dynamics-relevant representation based on the information-bottleneck principle.

Chenjia Bai , Lingxiao Wang , Lei Han , Animesh Garg , Jianye Hao , Peng Liu , Zhaoran Wang

Dynamic Bottleneck for Robust Self-Supervised Exploration.

Principled Exploration via Optimistic Bootstrapping and Backward Induction.

In International Conference on Machine Learning (ICML), 2021 Spotlight

We propose a principled exploration method for DRL through Optimistic Bootstrapping and Backward Induction (OB2I).

Chenjia Bai , Lingxiao Wang , Lei Han , Jianye Hao , Animesh Garg , Peng Liu , Zhaoran Wang

Principled Exploration via Optimistic Bootstrapping and Backward Induction.

Addressing Hindsight Bias in Multi-Goal Reinforcement Learning.

IEEE Transactions on Cybernetics, 2021

We analyze the hindsight bias due to this use of hindsight goals and propose the bias-corrected HER (BHER), an efficient algorithm that corrects the hindsight bias in training.

Chenjia Bai , Lingxiao Wang , Yixin Wang , Rui Zhao , Chenyao Bai , Peng Liu

Addressing Hindsight Bias in Multi-Goal Reinforcement Learning.

Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning.

IEEE Transactions on Neural Networks and Learning Systems, 2021

We propose a variational dynamic model based on the conditional variational inference to model the multimodality and stochasticity.

Chenjia Bai , Peng Liu , Kaiyu Liu , Lingxiao Wang , Yingnan Zhao , Lei Han

Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning.

Generating Attentive Goals for Prioritized Hindsight Reinforcement Learning.

Knowledge-Based Systems (KBS), 2020

We propose a novel prioritized hindsight model for multi-goal RL in which the agent is provided with more valuable goals, as measured by the expected temporal-difference (TD) error.

Peng Liu , Chenjia Bai , Yingnan Zhao , Chenyao Bai , Wei Zhao , Xianglong Tang

Generating Attentive Goals for Prioritized Hindsight Reinforcement Learning.

Obtaining Accurate Estimated Action Values in Categorical Distributional Reinforcement Learning.

Knowledge-Based Systems (KBS), 2020

This paper describes a method of obtaining more accurate estimated action values for CDRL using adaptive bounds.

Yingnan Zhao , Peng Liu , Chenjia Bai , Wei Zhao , Xianglong Tang

Obtaining Accurate Estimated Action Values in Categorical Distributional Reinforcement Learning.

Active Sampling for Deep Q-learning Based on TD-error Adaptive Correction.

Journal of Computer Research and Development (in Chinese), 2019

We propose an active sampling method based on TD-error adaptive correction in order to solve sample efficiency problem in deep Q-learning.

Chenjia Bai , Peng Liu , Wei Zhao , Xianglong Tang

Active Sampling for Deep Q-learning Based on TD-error Adaptive Correction.

Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models.

In Transactions on Machine Learning Research (TMLR), 2025

we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents.

Yang Zhang , Chenjia Bai^✉ , Bin Zhao , Junchi Yan , Xiu Li^✉ , Xuelong Li

Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models.