Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis

Dayou Li1*, Lulin Liu1,2*, Bangya Liu3, Shijie Zhou4, Jiu Feng5, Ziqi Lu6,
Minghui Zheng1, Chenyu You7, Zhiwen Fan1

1Texas A&M University     2University of Minnesota
3University of Wisconsin-Madison     4University of California, Los Angeles
5University of Texas at Austin     6Amazon
7State University of New York at Stony Brook
Teaser image

Qualitative comparison between a baseline world model and EgoHOI. Both methods start from the same first frame and are not given privileged future object states. The baseline relies on text-only guidance, while EgoHOI integrates physics-informed embeddings distilled from 3D estimates into the generative rollout process. As a result, EgoHOI better preserves ego-motion consistency, kinematic fidelity, and object integrity under viewpoint changes, producing more physically plausible and contact-consistent hand-object interactions over time.

Abstract

To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human–Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
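No reference implementation is given here, but the distillation idea in the abstract can be pictured with a minimal PyTorch sketch. The module name, dimensions, and the simple L2 objective below are illustrative assumptions, not the paper's actual design:

```python
import torch
import torch.nn as nn

class PhysicsEmbedder(nn.Module):
    """Hypothetical encoder: maps per-frame 3D estimates (hand pose,
    object pose, contact flags) to physics-informed embeddings."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.GELU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, states_3d: torch.Tensor) -> torch.Tensor:
        # states_3d: (B, T, in_dim) flattened 3D estimates per frame
        return self.net(states_3d)

def physics_distillation_loss(rollout_feats: torch.Tensor,
                              physics_emb: torch.Tensor) -> torch.Tensor:
    # Pull the world model's latent rollout features toward the
    # physics embeddings; a plain L2 proxy for the distillation
    # objective that regularizes rollouts toward valid dynamics.
    return torch.mean((rollout_feats - physics_emb) ** 2)
```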

Pipeline

Pipeline overview
We formulate EgoHOI as an egocentric world model that represents frames with a latent internal state and predicts action-driven transitions with a DiT backbone. Physics-informed embeddings distilled from reconstruction-based 3D priors, together with the first-frame object appearance, are integrated into the latent dynamics via lightweight adapters, enabling realistic hand–object interactions, geometry-consistent ego-motion, and stable object identity under viewpoint changes.
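As a concrete picture of the adapter mechanism, here is a minimal PyTorch sketch. The class name, the cross-attention formulation, and the token shapes are our assumptions for illustration; the paper's actual adapters may differ:

```python
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    """Hypothetical lightweight adapter: injects physics-informed
    embeddings and first-frame appearance tokens into DiT latents
    via cross-attention with a residual connection."""
    def __init__(self, latent_dim: int, cond_dim: int, n_heads: int = 8):
        super().__init__()
        # latent_dim must be divisible by n_heads
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(
            latent_dim, n_heads, kdim=cond_dim, vdim=cond_dim,
            batch_first=True,
        )

    def forward(self, latents: torch.Tensor,
                cond_tokens: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, latent_dim) DiT tokens at the current step
        # cond_tokens: (B, M, cond_dim) physics + appearance tokens
        attended, _ = self.attn(self.norm(latents), cond_tokens, cond_tokens)
        return latents + attended  # residual keeps the backbone path intact
```

Keeping the injection residual means a pretrained DiT backbone could stay frozen while only the adapters are trained, which is a common motivation for this kind of lightweight conditioning.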

Qualitative Comparison with Baselines

We compare EgoHOI against four baselines spanning three model families. Wan serves as a strong diffusion backbone for generic video generation; Cosmos 2B and Cosmos 14B serve as world-model baselines at two parameter scales; and Uni3C is included as an additional comparison. For all four models, we start from the officially released checkpoints and follow the official post-training configurations.

Video gallery: eight scenes (Scene 1 through Scene 8), each showing side-by-side rollouts from GT, Wan, Cosmos 2B, Cosmos 14B, Uni3C, and Ours.