Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis

Dayou Li1*, Lulin Liu1,2*, Bangya Liu3, Shijie Zhou4, Jiu Feng5, Ziqi Lu6,
Minghui Zheng1, Chenyu You7, Zhiwen Fan1

1Texas A&M University     2University of Minnesota
3University of Wisconsin-Madison     4University of California, Los Angeles
5University of Texas at Austin     6Amazon
7State University of New York at Stony Brook
Teaser image

Qualitative comparison between a baseline world model and EgoHOI. Both methods start from the same first frame and are not given privileged future object states. The baseline relies on text-only guidance, while EgoHOI integrates physics-informed embeddings distilled from 3D estimates into the generative rollout process. As a result, EgoHOI better preserves ego-motion consistency, kinematic fidelity, and object integrity under viewpoint changes, producing more physically plausible and contact-consistent hand-object interactions over time.

Abstract

To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human–Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
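No reference implementation is given here, but the distillation idea in the abstract can be pictured with a minimal PyTorch sketch. The module name, dimensions, and the simple L2 objective below are illustrative assumptions, not the paper's actual design:

```python
import torch
import torch.nn as nn

class PhysicsEmbedder(nn.Module):
    """Hypothetical encoder: maps per-frame 3D estimates (hand pose,
    object pose, contact flags) to physics-informed embeddings."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.GELU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, states_3d: torch.Tensor) -> torch.Tensor:
        # states_3d: (B, T, in_dim) flattened 3D estimates per frame
        return self.net(states_3d)

def physics_distillation_loss(rollout_feats: torch.Tensor,
                              physics_emb: torch.Tensor) -> torch.Tensor:
    # Pull the world model's latent rollout features toward the
    # physics embeddings; a plain L2 proxy for the distillation
    # objective that regularizes rollouts toward valid dynamics.
    return torch.mean((rollout_feats - physics_emb) ** 2)
```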

Pipeline

Pipeline overview
We formulate EgoHOI as an egocentric world model that represents frames with a latent internal state and predicts action-driven transitions with a DiT backbone. Physics-informed embeddings distilled from reconstruction-based 3D priors, together with the first-frame object appearance, are integrated into the latent dynamics via lightweight adapters, enabling realistic hand–object interactions, geometry-consistent ego-motion, and stable object identity under viewpoint changes.
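As a concrete picture of the adapter mechanism, here is a minimal PyTorch sketch. The class name, the cross-attention formulation, and the token shapes are our assumptions for illustration; the paper's actual adapters may differ:

```python
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    """Hypothetical lightweight adapter: injects physics-informed
    embeddings and first-frame appearance tokens into DiT latents
    via cross-attention with a residual connection."""
    def __init__(self, latent_dim: int, cond_dim: int, n_heads: int = 8):
        super().__init__()
        # latent_dim must be divisible by n_heads
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(
            latent_dim, n_heads, kdim=cond_dim, vdim=cond_dim,
            batch_first=True,
        )

    def forward(self, latents: torch.Tensor,
                cond_tokens: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, latent_dim) DiT tokens at the current step
        # cond_tokens: (B, M, cond_dim) physics + appearance tokens
        attended, _ = self.attn(self.norm(latents), cond_tokens, cond_tokens)
        return latents + attended  # residual keeps the backbone path intact
```

Keeping the injection residual means a pretrained DiT backbone could stay frozen while only the adapters are trained, which is a common motivation for this kind of lightweight conditioning.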

Qualitative Comparison with Baselines

We compare EgoHOI against four baselines spanning three model families. Wan serves as a strong diffusion backbone for generic video generation; Cosmos 2B and Cosmos 14B serve as world-model baselines at two parameter scales; and Uni3C is included as an additional comparison. For all four models, we start from the officially released checkpoints and follow the official post-training configurations.

Video gallery: eight scenes (Scene 1 through Scene 8), each showing side-by-side rollouts from GT, Wan, Cosmos 2B, Cosmos 14B, Uni3C, and Ours.