Text-Guided Diffusion with Spectral Convolution for 3D Human Pose Estimation

Shi, Liyuan; Wu, Suping; Yang, Sheng; Qiu, Weibin; Qiang, Dong; Zhao, Jiarui

dc.contributor.author	Shi, Liyuan	en_US
dc.contributor.author	Wu, Suping	en_US
dc.contributor.author	Yang, Sheng	en_US
dc.contributor.author	Qiu, Weibin	en_US
dc.contributor.author	Qiang, Dong	en_US
dc.contributor.author	Zhao, Jiarui	en_US
dc.contributor.editor	Christie, Marc	en_US
dc.contributor.editor	Pietroni, Nico	en_US
dc.contributor.editor	Wang, Yu-Shuen	en_US
dc.date.accessioned	2025-10-07T05:03:10Z
dc.date.available	2025-10-07T05:03:10Z
dc.date.issued	2025
dc.description.abstract	Although significant progress has been made in monocular video-based 3D human pose estimation, existing methods lack guidance from fine-grained, high-level prior knowledge such as action semantics and camera viewpoints. This makes accurate pose reconstruction difficult when visual features are severely missing, i.e., under complex occlusion. We observe that 3D human pose estimation is fundamentally an inverse problem, and propose a motion-semantics-based diffusion (MS-Diff) framework that combines high-level motion semantics with spectral feature regularization to eliminate interference noise in complex scenes and improve estimation accuracy. Specifically, we design a Multimodal Diffusion Interaction (MDI) module that incorporates motion semantics, including action categories and camera viewpoints, into the diffusion process, establishing semantic-visual feature alignment through a cross-modal mechanism to resolve pose ambiguities and handle occlusions effectively. Additionally, we introduce a Spectral Convolutional Regularization (SCR) module that performs adaptive filtering in the frequency domain to selectively suppress noise components. Extensive experiments on the large-scale public datasets Human3.6M and MPI-INF-3DHP demonstrate that our method achieves state-of-the-art performance.	en_US
dc.description.number	7
dc.description.sectionheaders	Digital Human
dc.description.seriesinformation	Computer Graphics Forum
dc.description.volume	44
dc.identifier.doi	10.1111/cgf.70263
dc.identifier.issn	1467-8659
dc.identifier.pages	10 pages
dc.identifier.uri	https://doi.org/10.1111/cgf.70263
dc.identifier.uri	https://diglib.eg.org/handle/10.1111/cgf70263
dc.publisher	The Eurographics Association and John Wiley & Sons Ltd.	en_US
dc.subject	CCS Concepts: Computing methodologies → Activity recognition and understanding
dc.subject	Computing methodologies → Activity recognition and understanding
dc.title	Text-Guided Diffusion with Spectral Convolution for 3D Human Pose Estimation	en_US

Files

Original bundle

Name: cgf70263.pdf
Size: 1.35 MB
Format: Adobe Portable Document Format

Collections

44-Issue 7