Achieving robust vision-based humanoid locomotion remains challenging due to two fundamental issues: the sim-to-real gap introduces significant perception noise that degrades performance on fine-grained tasks, and training a unified policy across diverse terrains is hindered by conflicting learning objectives. To address these challenges, we present an end-to-end framework for vision-driven humanoid locomotion.
For robust sim-to-real transfer, we develop a high-fidelity depth sensor simulation that captures stereo matching artifacts and calibration uncertainties inherent in real-world sensing. We further propose a vision-aware behavior distillation approach that combines latent space alignment with noise-invariant auxiliary tasks, enabling effective knowledge transfer from privileged height maps to noisy depth observations. For versatile terrain adaptation, we introduce terrain-specific reward shaping integrated with multi-critic and multi-discriminator learning, where dedicated networks capture the distinct dynamics and motion priors of each terrain type.
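The distillation objective described above can be sketched as a weighted sum of three terms: behavior cloning against the teacher's actions, alignment between the student's depth-derived latent and the teacher's height-map latent, and a noise-invariant auxiliary reconstruction term. The function name, weights, and exact form of each term are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def distillation_loss(student_actions, teacher_actions,
                      student_latent, teacher_latent,
                      denoised_depth, clean_depth,
                      w_bc=1.0, w_align=0.5, w_denoise=0.1):
    """Hypothetical combined loss: behavior cloning + latent alignment
    + denoising auxiliary task (weights are illustrative)."""
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    bc = mse(student_actions, teacher_actions)   # imitate the privileged teacher
    align = mse(student_latent, teacher_latent)  # pull depth features toward height-map features
    denoise = mse(denoised_depth, clean_depth)   # reconstruct clean depth from noisy input
    return w_bc * bc + w_align * align + w_denoise * denoise
```

In practice the latent-alignment target would be detached from the gradient graph so only the student is updated.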
We validate our approach on two humanoid platforms equipped with different stereo depth cameras. The resulting policy demonstrates robust performance across diverse environments, seamlessly handling extreme challenges such as high platforms and wide gaps, as well as fine-grained tasks including bidirectional long-term staircase traversal.
Our framework consists of two stages: (1) Privileged RL Training: A teacher policy is trained with height-scan observations using multi-critic and multi-discriminator learning, where terrain-specific reward shaping and dedicated value networks handle diverse terrain categories (stairs/platforms, gaps, rough terrain). (2) Vision-Aware Distillation: The privileged policy is distilled into a deployment policy operating on augmented depth images, combining behavior cloning with denoising objectives for robust sim-to-real transfer.
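One way to realize the multi-critic stage is to give each terrain category its own value network, compute generalized advantage estimates per critic, normalize them independently, and sum them for the policy update. This is a minimal sketch under that assumption; the grouping, normalization, and combination scheme are illustrative, not the paper's exact formulation:

```python
import numpy as np

def combined_advantage(rewards_by_group, values_by_group, gamma=0.99, lam=0.95):
    """Per-terrain-group GAE with per-critic normalization, summed into a
    single advantage signal (hypothetical multi-critic combination)."""
    total = None
    for group, r in rewards_by_group.items():
        v = values_by_group[group]  # V(s_0..s_T): one extra bootstrap value
        adv = np.zeros(len(r))
        gae = 0.0
        for t in reversed(range(len(r))):
            delta = r[t] + gamma * v[t + 1] - v[t]  # TD residual for this critic
            gae = delta + gamma * lam * gae
            adv[t] = gae
        # Normalizing per critic keeps terrain groups with large reward
        # scales from dominating the update.
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        total = adv if total is None else total + adv
    return total
```

Separate critics let each value network fit the return statistics of one terrain family, avoiding the conflicting objectives a single critic would face.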
Visualization of the depth augmentation pipeline. Starting from clean left and right depth images, the pipeline sequentially applies: (1) stereo fusion, (2) random convolution, (3) Gaussian noise, (4) Perlin noise, (5) scale randomization, (6) zero pixel failures, (7) max pixel failures, (8) depth clipping and spatial cropping to produce realistic depth observations for sim-to-real transfer.
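The eight-step augmentation sequence above can be sketched as a single numpy function. All thresholds, noise scales, and failure probabilities below are illustrative assumptions, and the Perlin-noise stage is approximated by upsampled low-resolution noise rather than true gradient noise:

```python
import numpy as np

def augment_depth(left, right, rng, d_max=2.0):
    """Hypothetical sketch of the depth augmentation pipeline on HxW
    depth maps in meters (parameters are illustrative)."""
    # (1) Stereo fusion: keep pixels where the views roughly agree;
    # mismatches become holes (0), mimicking stereo-matching failures.
    fused = np.where(np.abs(left - right) < 0.05, (left + right) / 2, 0.0)
    # (2) Random 3x3 convolution with normalized random weights.
    k = rng.uniform(0.5, 1.5, (3, 3)); k /= k.sum()
    pad = np.pad(fused, 1, mode="edge")
    out = sum(k[i, j] * pad[i:i + fused.shape[0], j:j + fused.shape[1]]
              for i in range(3) for j in range(3))
    # (3) Depth-dependent Gaussian noise.
    out += rng.normal(0.0, 0.01, out.shape) * out
    # (4) Perlin-style structured noise, approximated by nearest-neighbor
    # upsampling of a coarse noise grid.
    coarse = rng.normal(0.0, 0.02, (out.shape[0] // 8 + 1, out.shape[1] // 8 + 1))
    out += np.kron(coarse, np.ones((8, 8)))[:out.shape[0], :out.shape[1]]
    # (5) Global scale randomization (calibration uncertainty).
    out *= rng.uniform(0.95, 1.05)
    # (6) Zero-pixel failures: random dropout to 0.
    out[rng.random(out.shape) < 0.02] = 0.0
    # (7) Max-pixel failures: random saturation to the far plane.
    out[rng.random(out.shape) < 0.01] = d_max
    # (8) Clip to the valid depth range and crop the border.
    return np.clip(out, 0.0, d_max)[4:-4, 4:-4]
```

Applying the stages in this order matters: fusion holes must exist before the blur and noise stages so the artifacts interact the way they do on real sensors.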
Additional depth augmentation examples across diverse terrains. Each triplet shows (left to right): left camera depth, right camera depth, and augmented output before spatial cropping. Depth values are normalized to [0, 2] m and rendered as color maps (cool = near, warm = far). The augmented images exhibit realistic stereo fusion holes (black regions), depth-dependent noise, and structured Perlin patterns while preserving terrain geometry essential for locomotion control.