Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows
Monocular 3D human pose and shape estimation is challenging due to the many degrees of freedom of the human body and the difficulty of acquiring training data for large-scale supervised learning in complex visual scenes, where humans of diverse shape and appearance appear against complex backgrounds, in a variety of poses, and are partially occluded or involved in interactions. Essential to learning are the use of effective 3D human priors and the ability to work under weak supervision, at scale, by exploiting, to the largest extent possible, the detailed human body semantics in images. In this paper we present new priors as well as large-scale weakly supervised models for 3D human pose and shape estimation. Key to our formulation are new latent normalizing flow representations, as well as fully differentiable, structurally sensitive, semantic body part alignment (re-projection) loss functions that ensure consistent estimates and sharp feedback signals for learning. In extensive experiments using both motion capture datasets such as CMU, Human3.6M, 3DPW, or AMASS, and repositories such as COCO, we show that our proposed methods outperform existing counterparts, supporting the construction of an increasingly accurate family of models based on large-scale training with unlabeled image data.