Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis
Abstract
To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human–Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
Pipeline
Qualitative Comparison with Baselines
We compare EgoHOI against four baselines. Wan is a strong diffusion backbone that serves as a generic video generator. Cosmos 2B and Cosmos 14B are world model baselines at two parameter scales, and Uni3C is included as an additional comparison model. For all four models, we start from the officially released checkpoints and follow the official post-training configurations.
Scenes 1 to 8 (video comparison grid, one row per scene). Panels in each row: GT, Wan, Cosmos 2B, Cosmos 14B, Uni3C, Ours.