ComPose: When to Trust Hands for Object Pose Tracking

¹GIST, ²Yonsei University, ³DGIST
In collaboration with Yonsei RLLab
†Corresponding author

Abstract

Reconstructing object motion from video is a key component of embodied AI and robot manipulation. Although diverse approaches to object pose tracking have been studied, they rely heavily on strong external priors, such as depth data or 3D templates, and remain highly vulnerable to severe occlusion by hand grasps even when explicit masks are used. In this work, we present ComPose, a 6DoF object tracking framework designed for hand-aware object pose estimation from RGB video. Rather than treating the hand purely as an occluder, our method uses hand motion as a complementary cue for object tracking. Specifically, we recover a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline: ComPose adaptively selects informative hand joints, combines object- and hand-derived cues for motion estimation, and refines the resulting object motion using visible geometric evidence and a learned correction. We further enforce temporal consistency over both rotation and translation, yielding stable 3D object trajectories without any external smoothing.

Method

ComPose combines object and hand cues from foundation models within a unified tracking pipeline. It (i) adaptively selects informative hand joints, (ii) combines object- and hand-derived cues for motion estimation, and (iii) refines the resulting object motion using visible geometric evidence and a learned correction. Temporal consistency is enforced over both rotation and translation, producing stable 3D object trajectories without external smoothing.
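The three steps and the temporal-consistency term can be made concrete with a toy sketch. This is not the ComPose implementation: the confidence threshold, the visibility-weighted blend, the use of mean joint displacement as a hand-derived translation (which assumes a rigid grasp with little rotation), and every function and parameter name below are assumptions introduced purely for illustration.

```python
import math


def select_joints(joint_conf, threshold=0.5):
    """Indices of hand joints whose confidence passes a (hypothetical) threshold."""
    return [i for i, c in enumerate(joint_conf) if c >= threshold]


def mean_displacement(prev_pts, curr_pts, indices):
    """Average 3D displacement of the selected joints between two frames.

    Returns None when no joint was selected, so the caller can fall back
    to the object cue alone.
    """
    if not indices:
        return None
    sums = [0.0, 0.0, 0.0]
    for i in indices:
        for k in range(3):
            sums[k] += curr_pts[i][k] - prev_pts[i][k]
    return [s / len(indices) for s in sums]


def fuse_translation(obj_delta, hand_delta, obj_visibility):
    """Blend object- and hand-derived translation deltas.

    obj_visibility is clamped to [0, 1]: 1 trusts the object cue fully,
    0 trusts the hand cue fully (an assumed weighting, for illustration).
    """
    if hand_delta is None:
        return list(obj_delta)
    w = min(max(obj_visibility, 0.0), 1.0)
    return [w * o + (1.0 - w) * h for o, h in zip(obj_delta, hand_delta)]


def rotation_angle(q_a, q_b):
    """Geodesic angle in radians between two unit quaternions (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q_a, q_b)))
    return 2.0 * math.acos(min(dot, 1.0))


def temporal_penalty(quats, translations, rot_weight=1.0, trans_weight=1.0):
    """Sum of squared frame-to-frame changes in rotation and translation."""
    total = 0.0
    for t in range(1, len(quats)):
        total += rot_weight * rotation_angle(quats[t - 1], quats[t]) ** 2
        total += trans_weight * sum(
            (a - b) ** 2 for a, b in zip(translations[t], translations[t - 1])
        )
    return total
```

In this sketch, a heavily occluded object (low `obj_visibility`) pushes the fused translation toward the hand-derived estimate, while a trajectory that jitters in either rotation or translation receives a larger `temporal_penalty`. A real system would estimate full rigid motion rather than translation alone.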

BibTeX

@article{shin2026compose,
  author  = {Shin, Jisu and Lee, Junoh and Lee, JunGyu and Bae, Inhwan and Lee, Dohyeon and Im, Hokyun and Lee, Youngwoon and Jeon, Hae-Gon},
  title   = {ComPose: When to Trust Hands for Object Pose Tracking},
  journal = {arXiv preprint arXiv:2605.23523},
  year    = {2026},
}
