
EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots

Intermediate
Boyuan An, Zhexiong Wang, Yipeng Wang et al. Ā· 2/20/2026
arXiv

Key Summary

  • EgoPush teaches a small mobile robot to push multiple objects into patterns (like a cross or a line) using only what it sees from its own camera, without any global map.
  • The big idea is to think in terms of how objects relate to each other (object-centric, relative positions) instead of where they are on a global map (absolute positions).
  • They first train a ā€˜teacher’ policy with simplified keypoint data but limit the teacher to only what the robot could reasonably see, so the teacher’s moves are copyable by a camera-only ā€˜student’.
  • They split long, tricky tasks into stages (reach, then place) and give bigger rewards for finishing each stage sooner, which speeds up learning and reduces confusion.
  • The student uses RGB just to separate objects and then relies on depth for control, plus a special ā€œrelationalā€ loss to learn how objects relate (active–anchor–obstacles).
  • In simulations, EgoPush beats classical mapping and end-to-end visual RL baselines by a wide margin; the student even reaches 100% success on a simplified benchmark.
  • The trained student transfers zero-shot to a real TurtleBot with only an egocentric camera and achieves 80% success on building cross-shaped configurations.
  • Ablations show why each piece matters: restricted teacher views make its behavior learnable; stage rewards fix long-horizon credit assignment; relational distillation helps on asymmetric goals.
  • The method handles different object shapes (cubes, cylinders, prisms), though contact stability gets harder on non-cuboids.
  • Main limits: the student is mostly reactive with little memory, so narrow passages and heavy occlusions can still cause hesitations or deadlocks.

Why This Research Matters

Robots that can rearrange objects using only their own camera view are cheaper, simpler, and more robust than ones that depend on fragile global maps. This makes them practical in homes, hospitals, and warehouses where scenes are messy and change a lot. By focusing on relative object relationships, EgoPush stays reliable even when some objects go off-screen for a moment. Splitting long jobs into timed stages helps robots learn faster and act more decisively, which saves battery life and time. Most importantly, the approach transfers from simulation to real hardware without costly data collection, speeding up development and deployment.

Detailed Explanation


01Background & Problem Definition

You know how you can clean your room just by looking around from where you stand, even when your hands are full, and you can’t see everything at once? Robots want to do that too—especially small mobile robots that can’t pick up heavy things but can nudge and push them. Before this work, many robots that rearranged objects leaned on a ā€œworld mapā€ or precise GPS-like coordinates. That sounds nice, but in real homes or warehouses, scenes change, objects move, and cameras get blocked, so maps can go stale fast.

šŸž Hook for Ego-centric perception: You know how you only see what’s in front of you, not the whole house at once? 🄬 The Concept: Ego-centric perception means the robot only uses the view from its own camera right now.

  • How it works: (1) Take a camera image from the robot’s point of view. (2) Use that local view to decide what to do next. (3) Move, then look again. (4) Repeat.
  • Why it matters: Without this, robots depend on fragile global maps; with it, they can handle occlusions and moving objects. šŸž Anchor: A TurtleBot sees a box ahead, turns to face it, and nudges it closer to another box without needing a floor plan.

Many earlier methods either (a) built detailed maps and planned routes, or (b) learned directly from pixels using reinforcement learning (RL). The map-based ways often failed when objects moved or the camera saw too little texture for good localization (SLAM breaks if the world won’t sit still). End-to-end RL from raw images avoided mapping but learned slowly and could get confused when important objects went off-screen.

šŸž Hook for Non-prehensile manipulation: Imagine moving a big couch by pushing it with your legs instead of lifting it. 🄬 The Concept: Non-prehensile manipulation means moving objects without grasping them—often by pushing.

  • How it works: (1) Find the object. (2) Get into a good pushing position. (3) Apply gentle, steady nudges. (4) Re-adjust if it drifts.
  • Why it matters: Robots can handle big or awkward items they can’t pick up. šŸž Anchor: A robot gently pushes a cube box into place to form a cross shape around an anchor box.

šŸž Hook for Reinforcement Learning (RL): Picture training a puppy: do a trick, get a treat; do the wrong thing, no treat. 🄬 The Concept: RL teaches a policy to pick actions by rewarding good outcomes.

  • How it works: (1) See the current situation. (2) Try an action. (3) Get a reward based on progress. (4) Adjust the policy to do better next time.
  • Why it matters: RL can learn complex sequences, like approach → align → push → fine-tune. šŸž Anchor: The robot learns that lining up with the box and pushing straight gets more reward than spinning aimlessly.

The problem: Make a mobile robot rearrange several objects into a target formation (like a cross or a line) using only its own camera view, while dealing with occlusions and changing scenes over long time horizons. Past attempts tried a teacher–student approach where a privileged ā€œteacherā€ knows the full global state and the ā€œstudentā€ copies it from camera images. But if the teacher uses info the student can’t see (like objects behind the robot), the student receives confusing instructions and can’t imitate well. Also, long tasks with sparse rewards make it hard to know which early steps were helpful—called credit assignment.

Failed attempts:

  • Classical mapping/planning: strong if maps are perfect, but brittle when depth is noisy, texture is sparse, or objects move.
  • End-to-end pixel RL: avoids maps but struggles with sample efficiency and occlusions; cues vanish when objects leave the frame.
  • Omniscient teacher distillation: teacher behaviors aren’t reproducible from the student’s viewpoint, causing imitation to fail.

The gap: We need (1) a viewpoint-robust way to think about objects using relative relations, not precise global positions, (2) a teacher whose behavior stays visible and copyable from the student’s camera, and (3) stage-wise rewards so long tasks feel like bite-sized steps with timely feedback.

Real stakes: Home robots could tidy toys by pushing them into bins; warehouse robots could relieve jams by nudging boxes into lanes; hospital robots could clear corridors without needing perfect maps. Doing this with only a forward-facing camera reduces sensors, cost, and failure points, making robots more practical in everyday, messy environments.

02Core Idea

The ā€œaha!ā€ moment: Teach the robot to think in relative object relationships (not global coordinates), train a teacher that only uses what the camera could have seen, and split long tasks into stages with time-decayed rewards—then distill all of that into a camera-based student.

Three analogies:

  1. Compass vs. map: Instead of a full city map, the robot uses a compass of ā€œwho is near whomā€ (relative object relations) and keeps re-checking what it sees.
  2. Coach with cones: A coach runs drills in a visible part of the field (restricted teacher view), so players can later copy those moves in a real game.
  3. Lego plan: Build a model step by step (stages), praising faster completion of each piece to keep momentum.

Before vs. After:

  • Before: Either rely on brittle global maps or try to learn directly from pixels and get lost when objects leave view; teachers used secrets students couldn’t see.
  • After: Use an object-centric latent that stores who-is-where relative to whom; a constrained teacher that never hides info from the student; and stage-wise, time-weighted rewards that keep learning steady. The student becomes reliable with just an egocentric camera and depth.

Why it works (intuition):

  • Relative relations are stable under camera motion; the robot only needs to know ā€œactive box near anchor, obstacles aroundā€ to act.
  • Constraining the teacher to visible cues prevents impossible-to-imitate behaviors; student supervision stays consistent.
  • Stage-wise, time-decayed rewards give timely credit for finishing each subgoal quickly, avoiding long-delayed, fuzzy signals.
  • Distilling not just actions but also the ā€œrelationshipsā€ among group latents (active–anchor–obstacles) teaches the student what to notice.

Building blocks (with sandwich explanations for each new concept):

šŸž Hook for Object-centric latent representation: Imagine arranging friends in a photo by how close they are to the birthday kid, not by exact GPS coordinates. 🄬 The Concept: An object-centric latent representation encodes each group (active object, anchor, obstacles) so the policy reasons about their relative spatial relations.

  • How it works: (1) Identify three roles: active, anchor, obstacles. (2) Encode each group into a compact vector (same units so they’re comparable). (3) Concatenate these to form the scene’s ā€œrelationship summary.ā€ (4) Act based on these relations.
  • Why it matters: It avoids fragile global poses and focuses the policy on what drives decisions. šŸž Anchor: The robot learns ā€œactive is left of anchor and farā€ → move right and forward; ā€œtoo closeā€ → slow and align.
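As a toy illustration of the grouping-and-encoding idea (the paper uses learned CNN/PointNet encoders; the `encode_group` function and its centroid/spread features here are purely hypothetical stand-ins), each role could be summarized into a small fixed-size vector and the vectors concatenated into the scene's "relationship summary":

```python
import math

def encode_group(points):
    """Toy stand-in for a learned group encoder: summarize a role group
    (list of 2D points) as (centroid_x, centroid_y, mean spread)."""
    if not points:
        return [0.0, 0.0, 0.0]  # empty group, e.g., no obstacles in view
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    spread = sum(math.hypot(p[0] - cx, p[1] - cy) for p in points) / len(points)
    return [cx, cy, spread]

def scene_latent(active, anchor, obstacles, prev_action):
    """Concatenate per-role latents plus the previous action into the policy input."""
    return (encode_group(active) + encode_group(anchor)
            + encode_group(obstacles) + list(prev_action))

z = scene_latent(active=[(1.0, 0.5)], anchor=[(2.0, 2.0)],
                 obstacles=[], prev_action=(0.1, 0.0))
```

Note what the policy never sees here: a global pose. Only where groups sit relative to the camera frame, which is exactly what survives camera motion.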

šŸž Hook for Cross-modal distillation: Think of a music teacher (hears notes precisely) teaching a student who only watches finger positions—knowledge passes across senses. 🄬 The Concept: Cross-modal distillation transfers know-how from a privileged teacher (keypoints) to a visual student (camera/depth).

  • How it works: (1) Train teacher with simple, low-dimensional inputs. (2) Collect the teacher’s actions and relational cues. (3) Train student to match them from images/depth. (4) Repeat online so labels match current states.
  • Why it matters: It speeds up learning and reduces trial-and-error from pixels alone. šŸž Anchor: Teacher says ā€œturn 10°, push softlyā€; student learns to do that from a depth view.
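The online-querying loop can be sketched with a deliberately tiny setup (the one-parameter linear ā€œpoliciesā€ below are hypothetical stand-ins for the paper's networks; only the loop structure is the point):

```python
def dagger_round(student, teacher, states, lr=0.2):
    """One DAgger-style pass: states visited by the STUDENT are labeled
    with the TEACHER's action, and the student steps toward those labels."""
    for x in states:
        target = teacher["w"] * x                  # teacher's action label at this state
        pred = student["w"] * x                    # student's current action
        student["w"] -= lr * (pred - target) * x   # squared-error gradient step
    return student

teacher, student = {"w": 2.0}, {"w": 0.0}
for _ in range(50):
    dagger_round(student, teacher, states=[0.5, 1.0, 1.5])
```

The key difference from offline behavior cloning is that the labeled states come from the student's own rollouts, so small errors don't compound into states the student was never taught about.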

šŸž Hook for Constrained Teacher RL: Imagine your coach only uses drills you can see and copy, not secret tricks behind your back. 🄬 The Concept: Constrained Teacher RL limits the teacher to egocentric, visibility-limited inputs so its behavior is camera-recoverable.

  • How it works: (1) Mask out points outside a virtual camera FOV. (2) Only reveal target-reference hints when the anchor is centered (center-gated). (3) Learn with RL under these rules. (4) Produce demonstrations the student can imitate.
  • Why it matters: Without constraints, the teacher moves in ways the student can’t explain from images, breaking imitation. šŸž Anchor: The teacher turns to keep the anchor in view before pushing, so the student can see why those actions make sense.
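The two visibility constraints can be sketched in a few lines (angles and tolerances here are illustrative, not the paper's values; the robot sits at the origin facing +x):

```python
import math

def in_fov(point, fov_deg=90.0):
    """Virtual-FOV mask: keep a point only if it lies within the
    camera's horizontal field of view."""
    angle = math.degrees(math.atan2(point[1], point[0]))
    return abs(angle) <= fov_deg / 2

def teacher_observation(keypoints, reference, center_tol_deg=10.0):
    """Hide keypoints outside the FOV; reveal the reference target cloud
    only while the anchor is roughly centered in view (center-gated)."""
    visible = {role: [p for p in pts if in_fov(p)]
               for role, pts in keypoints.items()}
    anchor = visible.get("anchor", [])
    centered = bool(anchor) and all(
        abs(math.degrees(math.atan2(p[1], p[0]))) <= center_tol_deg
        for p in anchor)
    return visible, (reference if centered else None)
```

Because the reference only ā€œunlocksā€ when the anchor is centered, the RL teacher is pushed to learn look-then-push behavior that a camera-only student can later explain from its own images.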

šŸž Hook for Stage-wise training with temporally decayed rewards: Think of a video game with levels—finishing earlier gets more points. 🄬 The Concept: Split tasks into stages (reach, then place) and make completion rewards bigger if you finish each stage sooner.

  • How it works: (1) Define stages. (2) Give a completion bonus per stage. (3) Multiply by a timer that decays as steps pass. (4) Reset the timer at each new stage.
  • Why it matters: It gives clear, timely credit for progress and avoids waiting until the very end. šŸž Anchor: The robot earns more by reaching the box quickly, then more by aligning it fast near the anchor, keeping momentum high.
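A minimal sketch of the per-stage decayed bonus (the bonus magnitude and decay rate are illustrative, not the paper's numbers):

```python
def stage_reward(stage_done, steps_in_stage, bonus=10.0, decay=0.99):
    """Completion bonus for the current stage, scaled down the longer
    the stage took; the step counter resets when a new stage begins."""
    return bonus * (decay ** steps_in_stage) if stage_done else 0.0

# Reaching the box in 10 steps pays more than reaching it in 50:
fast, slow = stage_reward(True, 10), stage_reward(True, 50)
```

Resetting `steps_in_stage` at each stage boundary is what makes the time pressure per subgoal consistent, matching the per-stage-timer ablation discussed later.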

03Methodology

At a high level: Camera RGB-D → (Instance masks from RGB) → (Group-wise depth layers) → (Latent encoding of active/anchor/obstacles) → Policy head → Robot velocities (v, ω)

Phase 1: Train a constrained teacher with RL on sparse keypoints; Phase 2: Distill to a visual student that takes only egocentric RGB-D (RGB for masks, depth for control) and imitates the teacher’s actions and relationships.
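The role-grouped depth input at the front of this pipeline can be sketched on a toy 2Ɨ2 ā€œimageā€ (real inputs are full depth frames with learned or color-based masks; the plain-list representation here is just for illustration):

```python
def role_depth_layers(depth, masks):
    """One fixed-size depth layer per role: keep a pixel's depth only
    where that role's instance mask is True, else zero it out."""
    return {role: [[d if m else 0.0 for d, m in zip(drow, mrow)]
                   for drow, mrow in zip(depth, mask)]
            for role, mask in masks.items()}

depth = [[1.0, 2.0],
         [3.0, 4.0]]
masks = {"active":    [[True, False],  [False, False]],
         "anchor":    [[False, True],  [False, False]],
         "obstacles": [[False, False], [True, True]]}
layers = role_depth_layers(depth, masks)
```

Every frame thus arrives at the policy already sorted into ā€œthe thing I'm pushing, the thing I'm pushing toward, and everything in the way.ā€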

Step-by-step recipe with what/why/examples:

  1. Observation grouping (student view)
  • What happens: The robot takes an RGB-D frame. Simple color segmentation (or a zero-shot model) finds object masks. Each detected instance is assigned a role: active, anchor, obstacles. We sum the masked depth per role to form three fixed-size depth layers.
  • Why it exists: Policies are easier to learn when the input has consistent shape and role labels; otherwise the network has to rediscover ā€œwho mattersā€ every frame.
  • Example: The ā€˜active’ layer shows a nearby cube’s depth silhouette, the ā€˜anchor’ layer shows a stationary cube, and the ā€˜obstacles’ layer shows everything else.
  2. Object-centric latent encoding
  • What happens: A CNN (student) or PointNet (teacher) encodes each role’s depth or keypoints into a compact vector. These group latents plus the previous action feed a small MLP policy head to output (v, ω).
  • Why it exists: The latent collapses raw geometry into actionable summaries, focusing on relations over absolute coordinates.
  • Example: If Z_active and Z_anchor encode ā€œfar and misaligned,ā€ the policy outputs ā€œturn a bit, then go forward.ā€
  3. Phase 1 — Constrained Teacher RL (keypoints only)
  • What happens: The teacher gets sparse 3D keypoints for the active object, the anchor, obstacles, and a reference target cloud (the ideal placement relative to the anchor). Two constraints apply: (a) Virtual FOV masking hides anything outside a camera-like view; (b) Center-gated visibility hides the reference unless the anchor is centered in view. The teacher learns with PPO to output smooth, safe diff-drive actions.
  • Why it exists: It keeps teacher behavior recoverable from camera views, inducing ā€œlook-then-pushā€ and other active perception habits the student can see and copy. Without constraints, the teacher may back into targets it never looks at—impossible for the student to justify.
  • Example: The constrained teacher first turns so the anchor is centered (unlocking the reference), then approaches the active box and pushes while keeping the anchor visible.
  4. Rewards and long-horizon structure
  • What happens: Each stage has two phases—reach (robot to active) and place (active to reference near anchor). Completion rewards are time-decayed within each stage, giving more points for finishing sooner. Progress shaping rewards distance decreases (reach: robot→active; place: active→reference). Smoothness discourages jerky commands. A slowdown bonus near the target encourages settling before stopping. Early termination on collisions/out-of-bounds avoids reward hacking.
  • Why it exists: Long sequences with sparse end rewards make learning guessy and slow. Stage-timed rewards give prompt, comparable feedback and stabilize optimization. Without them, policies often stall or spin.
  • Example: The agent earns a bigger bonus if it reaches the box quickly, then another if it aligns and stabilizes it near the anchor without jitter.
  5. Phase 2 — Student distillation (RGB-D only)
  • What happens: Online DAgger-style imitation: the student acts in the environment; at each visited state, we query the teacher for the target action; the student immediately updates to reduce action error. We also use a relational distillation loss: we match the pairwise cosine similarities between the teacher’s role latents (active–anchor–obstacles) and the student’s, forcing the student to learn the same relational geometry even without the teacher’s explicit reference target latent.
  • Why it exists: Pure behavior cloning drifts under small errors; online querying fixes this by labeling current states. The relational loss transfers the teacher’s ā€œwhat matters between whomā€ structure. Without it, the student may learn actions but not the fine-grained spatial reasoning.
  • Example: Even when the student can’t see the reference, matching the teacher’s ā€œactive-vs-anchorā€ relationship teaches it to push toward the correct formation.
  6. Sim-to-real details
  • What happens: During training, we randomize physics and camera pose and inject depth noise. For real deployment, simple HSV color segmentation isolates boxes; depth is stabilized with fast inpainting to reduce holes and flicker; commands are scaled to the real robot’s torque limits.
  • Why it exists: This shrinks the sim-to-real gap without heavy sensing stacks. Without denoising and randomization, good sim policies can wobble or fail on real sensors.
  • Example: The TurtleBot pushes four colored cubes to form a cross around the anchor in a 3Ɨ3 m arena using only its forward camera.
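The relational distillation loss from the student-distillation step can be sketched as matching pairwise cosine similarities among the three role latents (the toy vectors below stand in for the teacher's PointNet latents and the student's CNN latents):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def relational_loss(teacher_latents, student_latents):
    """Mean squared error between teacher and student over all pairwise
    role-latent similarities (active-anchor, active-obstacles, ...)."""
    roles = sorted(teacher_latents)
    loss, n = 0.0, 0
    for i, a in enumerate(roles):
        for b in roles[i + 1:]:
            t = cosine(teacher_latents[a], teacher_latents[b])
            s = cosine(student_latents[a], student_latents[b])
            loss += (t - s) ** 2
            n += 1
    return loss / max(n, 1)
```

Matching the similarity *structure* rather than the raw latents lets the student copy ā€œhow the roles relateā€ even though its encoder and input modality differ from the teacher's.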

Secret sauce (what’s clever):

  • Make the teacher teachable: limit its vision so student imitation is actually well-posed.
  • Think in relationships, not maps: object-centric relative latents are robust to occlusions and camera motion.
  • Reward the journey in chunks: stage-wise, time-decayed rewards turn a marathon into sprints.
  • Distill structure, not just actions: relational loss transfers ā€œwhat to pay attention toā€ across modalities.

Extra sandwich concept — Sim-to-real transfer: šŸž Hook: Imagine practicing a dance with smooth floors and bright lights, then performing on a rough stage with dim lights. 🄬 The Concept: Sim-to-real transfer means training in simulation and working in the real world without extra fine-tuning.

  • How it works: (1) Randomize sim physics/camera. (2) Add realistic sensor noise. (3) Clean up real depth with fast, stable filters. (4) Keep inputs simple (masks + depth) so it generalizes.
  • Why it matters: It cuts cost and time; you don’t need huge real-world datasets. šŸž Anchor: The trained policy runs zero-shot on a real TurtleBot and builds a cross of boxes using only egocentric vision.
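A sketch of the two sensing-side tricks, with illustrative noise levels and a deliberately crude left-fill standing in for the paper's fast inpainting:

```python
import random

def corrupt_depth(depth, noise=0.01, dropout=0.1, rng=None):
    """Training-time augmentation: jitter each depth reading and randomly
    zero some out (holes), mimicking consumer depth-sensor artifacts."""
    rng = rng or random.Random(0)
    return [[0.0 if rng.random() < dropout else d + rng.gauss(0.0, noise)
             for d in row] for row in depth]

def inpaint_row_holes(depth):
    """Crude hole-filling stand-in for real inpainting: replace a zero
    reading with the nearest valid value to its left in the same row."""
    out = []
    for row in depth:
        filled, last = [], 0.0
        for d in row:
            last = d if d > 0.0 else last
            filled.append(last)
        out.append(filled)
    return out
```

Training on corrupted depth while deploying with hole-filled depth attacks the sim-to-real gap from both ends: the policy tolerates noise, and the real sensor stream looks more like simulation.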

04Experiments & Results

What they measured and why:

  • Success Rate (SR): Did the robot finish the pattern correctly? This matches the end-goal users care about.
  • Execution Time and Trajectory Length: How fast and how far did it move? Shorter and quicker means more efficient, safer, and cheaper.
  • Reach rate: Did the robot at least make good contact with the active object?

Competitors:

  • Classical mapping+planning with Spatial Intention Maps (SIM) adapted to egocentric input without ground-truth pose.
  • End-to-end RL baselines from RGB or RGB-D, with and without perfect segmentation, and with/without RNN memory.
  • EgoPush teacher/student with and without key design pieces (FOV masking, center-gated reference, stage-timed rewards, relational distillation).

Scoreboard highlights with context:

  • Baseline comparison (simplified two-object push-toward-anchor task): EgoPush student achieved 100% reach and 100% success with a modest path length (~4.66 units). Others struggled: SIM got ~19% success (map/pose drift over time); RGB/RGB-D end-to-end variants had <1% success (some reached but couldn’t complete). This is like EgoPush getting a perfect score while others barely pass the first checkpoint.
  • Teacher constraints matter: All teachers (global, w/o center gate, ours) had very high SR (>98%), but only the constrained teacher produced a high-performing student (~70% SR in that benchmark), while the global teacher’s student got 0%. Translation: a top coach who uses invisible tricks is a bad teacher; a slightly less powerful but fair coach makes great students.
  • Credit assignment ablation: Starting from a sparse end-only reward (16% SR), adding stage-wise completion boosts SR to 88%; adding decay with a global timer reaches ~98%; using per-stage timers (reset each stage) gives ~99% SR with faster convergence. That’s like going from a confusing single final grade to clear, timed checkpoints that turbocharge learning.
  • Relational distillation: On symmetric cross-shaped goals, action-only distillation was close. But on asymmetric line-shaped goals, removing relational loss increased action error and led to complete failure—proving that modeling relationships among roles (active, anchor, obstacles) matters especially when the goal isn’t symmetric.
  • Different shapes: The student kept near-perfect reach on cylinders/prisms but success dropped (e.g., ~67% on cylinders, ~54% on prisms). This says: perception and approach transfer well; fine contact control is harder when shapes add tricky contact dynamics.
  • Real robot: Zero-shot on a TurtleBot in a 3Ɨ3 m arena, 80% success on cross-shaped formations under a forgiving metric, finishing four pushes within about two minutes. That’s solid for no real-world fine-tuning, with only egocentric sensing.

Surprises and insights:

  • The globally omniscient teacher was not the best teacher—its student couldn’t imitate. Restricting what the teacher ā€œknowsā€ made students far stronger.
  • Stage timers that reset per stage worked better than a single episode timer—consistent time pressure per subgoal steadies learning.
  • With unconstrained views, the teacher sometimes learned to push with the robot’s rear (more stable torque geometry), a clever but non-imitable trick for the egocentric student. This underlines why constraining the teacher matters.

Overall: EgoPush wins because its teacher is designed to be imitable, its representation focuses on relations, and its rewards make long tasks learnable. The results back each choice with clear, practical gains.

05Discussion & Limitations

Limitations (specific):

  • Reactive policy with little memory: The student mainly uses the current frame and a short action history. In long occlusions or narrow corridors, it may oscillate—face the anchor (lose path) vs. face the path (lose anchor).
  • Shape-dependent contact: Cylinders and prisms complicate pushing stability; small misalignments spin objects or break contact.
  • Segmentation reliance: We used colored boxes and HSV thresholds; generalizing to varied textures and lighting needs stronger zero-shot instance segmentation.
  • Depth noise and sensing dead zones: Consumer depth cameras drop measurements near the robot and on shiny/flat tops; filtering helps but is imperfect.
  • Constrained teacher trade-off: Limiting the teacher slightly reduces teacher optimality. It’s a good trade for distillability, but there may be a sweet spot still unexplored.

Required resources:

  • Parallel simulation (hundreds to thousands of envs) for PPO and online distillation.
  • RGB-D camera on the robot and light-weight segmentation on a CPU/GPU.
  • Some engineering for depth denoising and safe velocity scaling on real hardware.

When not to use:

  • Ultra-cluttered, maze-like scenes that require long-term memory and global planning beyond local views.
  • Slippery or highly deformable objects where pushing dynamics are too unpredictable.
  • Environments where RGB-based instance separation is unreliable and no substitute (e.g., LiDAR, better segmentation) is available.

Open questions:

  • Memory and belief: Can we combine the object-centric latent with a recurrent unit (GRU/LSTM) to track invisible objects and corridors?
  • Better perception: How well does it work with modern zero-shot segmentation (e.g., SAM2) in the wild, without color aids?
  • Richer contact feedback: Would adding inexpensive tactile/force cues stabilize pushing on non-cuboids?
  • Multi-robot cooperation: Can multiple egocentric students share relational latents to co-push large objects?
  • Planning in latent space: Can we plan multi-step maneuvers directly over the object-centric latent with guarantees?

06Conclusion & Future Work

Three-sentence summary: EgoPush teaches a mobile robot to rearrange multiple objects using only egocentric vision by thinking in object-to-object relationships instead of global maps. A constrained teacher learns with camera-like visibility and stage-wise, time-decayed rewards, then distills its behavior and relational understanding into a camera-depth student. This design beats classical and pixel-RL baselines in sim and transfers zero-shot to a real TurtleBot, achieving strong success with simple sensing.

Main achievement: Showing that a carefully constrained teacher, object-centric relational representations, and stage-timed rewards together make long-horizon, contact-rich rearrangement learnable and imitable from egocentric vision alone.

Future directions: Add memory to the object-centric latent with a recurrent model to handle long occlusions; replace color-thresholding with robust zero-shot segmentation; incorporate light tactile feedback for stable pushing on varied shapes; extend to multi-robot cooperation and latent-space planning.

Why remember this: EgoPush flips the script—from ā€œmaps firstā€ to ā€œrelations first,ā€ from ā€œomniscient teacherā€ to ā€œimitable teacher,ā€ and from ā€œone giant rewardā€ to ā€œtimed stages.ā€ That recipe makes real-world, camera-only rearrangement practical, nudging everyday robots closer to useful, reliable helpers.

Practical Applications

  • Home tidying: Push toys or boxes into bins or neat patterns without needing a full house map.
  • Warehouse sorting: Nudge parcels into lanes or staging areas when grippers aren’t available.
  • Hospital logistics: Clear hallways by pushing carts or bins slightly aside to unblock paths.
  • Retail restocking: Align boxes on the floor or low shelves when picking is unnecessary.
  • Event setup: Arrange barriers or stools into lines or crosses quickly with a mobile robot.
  • Factory flow: Decongest workcells by redistributing totes through gentle pushes.
  • Search-and-rescue staging: Organize supplies into accessible clusters in dynamic, dusty environments.
  • Service robots: Form queues of items (like trays or containers) in cafeterias or kitchens.
  • Education/competitions: Teach pushing and arrangement tasks with simple sensors and robust behavior.
#egocentric perception#non-prehensile manipulation#object-centric representation#teacher-student distillation#constrained teacher RL#stage-wise rewards#temporal credit assignment#relational distillation#keypoints to depth#sim-to-real transfer#mobile robot pushing#multi-object rearrangement#RL from pixels#active perception