Geometric Pose Affordance: 3D Human Pose with Scene Constraints


Zhe Wang, Liyan Chen, Shaurya Rathore, Daeyun Shin, Charless Fowlkes




Abstract

Full 3D estimation of human pose from a single image remains a challenging task despite many recent advances. In this paper, we explore the hypothesis that strong prior information about scene geometry can be used to improve pose estimation accuracy. To test this hypothesis empirically, we have assembled a novel Geometric Pose Affordance dataset, consisting of multi-view imagery of people interacting with a variety of rich 3D environments. We utilized a commercial motion capture system to collect gold-standard estimates of pose and construct accurate geometric 3D CAD models of the scene itself. To inject prior knowledge of scene constraints into existing frameworks for pose estimation from images, we introduce a novel, view-based representation of scene geometry, a multi-layer depth map, which employs multi-hit ray tracing to concisely encode multiple surface entry and exit points along each camera view ray direction. We propose two different mechanisms for integrating multi-layer depth information into pose estimation: first, as encoded ray features used as input when lifting 2D pose to full 3D, and second, as a differentiable loss that encourages learned models to favor geometrically consistent pose estimates. We show experimentally that these techniques can improve the accuracy of 3D pose estimates, particularly in the presence of occlusion and complex scene geometry.

Acknowledgements

This project is supported by NSF grants IIS-1813785, IIS-1618806, IIS-1253538, CNS-1730158 and a hardware donation from NVIDIA. Zhe Wang personally thanks Shu Kong and Minhaeng Lee for helpful discussions, John Crawford and Fabio Paolizzo for providing support in the motion capture studio, and all the UCI friends who contributed to the dataset collection.

Description of the video.

  • The video shows the full-resolution video and the cropped video with ground-truth joints/markers overlaid. It also shows the subject ID (anonymized here), take name, camera name, video time ID, mocap time ID, bone length (which is constant over time), velocity, number of valid markers, invisible joints, and invisible markers (the VICON system uses 53 markers and 34 joints). The video is sampled from 10 clips (from 5 camera views) of the 'Action' and 'Motion' sets with the same subject. Legend: hollow circles: occluded joints; solid dots: non-occluded joints; dotted lines: partially/completely occluded body parts; solid lines: non-occluded body parts.

Data & Visualization

  • a. Mocap setting; b. RGB image; c. Scene meshes and human skeleton in Maya; d-f. Corresponding first three layers of the multi-layer depth map representation of scene geometry. d corresponds to a traditional depth map, recording the depth of the first surface of scene geometry from the same camera view as b. e records the depth at which the multi-hit ray leaves the first layer of objects (e.g., the back side of the boxes). f is the depth at which the multi-hit view ray hits a third surface (e.g., the floor behind the box).
  • All subjects and the subsets of capture scripts. The semantic actions of the Action Set are taken from a subset of Human3.6M, namely Direction, Discussion, Writing, Greeting, Phoning, Photo, Posing and Walk Dog, to provide a connection for comparisons between our dataset and the de facto standard benchmark. The Motion Set includes poses with a more dynamic range of motion, such as running, side-to-side jumping, rotating, jumping over, and improvised poses from subjects. The Interaction Set mainly consists of close interactions between poses and object boundaries to provide ground truth for modeling affordance in 3D. There are three main poses in this group: Sitting, Touching, and Standing on, corresponding to the typical affordance relations Sittable, Reachable, and Walkable, respectively.
  • All the scenes from the camera 2 view.
  • All the scene meshes (shown in MeshLab).
  • Five camera views of a single scene and the corresponding multi-layer depth maps. Here the depth value is displayed as disparity (1/depth) for visualization purposes.
  • Statistics on the relationship between joints and scene geometry. Left: percentage of joints that are closest to each layer of the multi-layer depth map; Right: percentage of frames by number of occluded joints.
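
The multi-layer depth map in the captions above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy scene made of axis-aligned boxes rather than the dataset's meshes, a camera at the origin looking along +z, and ray directions of the form (x, y, 1) so that the ray parameter equals depth. The names `ray_box_hits`, `multi_layer_depth` and `disparity` are hypothetical.

```python
import math

def ray_box_hits(direction, box_min, box_max):
    """Slab test for a ray starting at the origin.

    Returns (t_enter, t_exit), or None if the ray misses the box.
    """
    t_near, t_far = 0.0, math.inf
    for d, lo, hi in zip(direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this slab: the origin must lie inside it.
            if not (lo <= 0.0 <= hi):
                return None
            continue
        t0, t1 = lo / d, hi / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near, t_far

def multi_layer_depth(direction, boxes, num_layers=3):
    """Collect every surface entry and exit along one view ray, nearest first.

    Layers beyond the last hit are filled with infinity (no surface).
    """
    hits = []
    for box_min, box_max in boxes:
        span = ray_box_hits(direction, box_min, box_max)
        if span is not None:
            hits.extend(span)
    hits.sort()
    hits += [math.inf] * num_layers
    return hits[:num_layers]

def disparity(depth):
    """1/depth, as used for display; empty layers map to 0."""
    return 0.0 if math.isinf(depth) else 1.0 / depth

# Toy scene (assumed): a box in front of a back wall.
boxes = [
    ((-0.5, -0.5, 2.0), (0.5, 0.5, 3.0)),   # box
    ((-5.0, -5.0, 5.0), (5.0, 5.0, 5.2)),   # back wall
]
center_ray = multi_layer_depth((0.0, 0.0, 1.0), boxes)
side_ray = multi_layer_depth((0.8, 0.0, 1.0), boxes)
```

For the central ray, the first layer is the front of the box (the traditional depth map), the second is where the ray leaves the box, and the third is the wall behind it. The side ray misses the box, so its first layer is the wall and its third layer is empty.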

Results & Analysis

  • Flowchart of our method.
  • Results on the close2geometry subset.
  • Visualization of the input images with the ground-truth pose (GT) overlaid in the same view (blue marks the right side of the skeleton and red marks the left side). Columns 2-4 show the first 3 layers of the multi-layer depth map. Column 5 shows the baseline prediction overlaid on the 1st layer of the multi-layer depth map, while column 6 shows our ResNet-F prediction. In these examples, the geometry input and constraint help the model give better predictions. The red rectangles highlight where the baseline model gives predictions that violate the geometry or are not as good as those of the full model. Legend: hollow circles: occluded joints; solid dots: non-occluded joints; dotted lines: partially/completely occluded body parts; solid lines: non-occluded body parts.
  • More visualization.
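
The abstract describes a differentiable loss that favors geometrically consistent poses, but this page does not give its formula. The following is one plausible form, offered only as a sketch under stated assumptions: layers alternate between surface entry and exit depths along a joint's view ray, and a joint is penalized by its distance to the nearest bounding surface when its depth falls inside an occupied interval. The function names `geometry_penalty` and `pose_penalty` are hypothetical, and plain Python stands in for an autodiff framework.

```python
import math

def geometry_penalty(joint_depth, layers):
    """Penalty for one joint given the depth layers along its view ray.

    Assumption: layers[0], layers[2], ... are surface entries and
    layers[1], layers[3], ... are the matching exits. The penalty is the
    distance along the ray to the nearest surface of any solid interval
    that contains the joint, and zero in free space.
    """
    layers = list(layers)
    if len(layers) % 2 == 1:
        layers.append(math.inf)  # an unmatched entry never exits
    penalty = 0.0
    for enter, exit_ in zip(layers[0::2], layers[1::2]):
        if math.isinf(enter):
            break  # no more surfaces along this ray
        inside = min(joint_depth - enter, exit_ - joint_depth)
        penalty += max(0.0, inside)  # piecewise linear, so sub-differentiable
    return penalty

def pose_penalty(joint_depths, per_joint_layers):
    """Mean penalty over all joints of a pose."""
    total = sum(geometry_penalty(z, layers)
                for z, layers in zip(joint_depths, per_joint_layers))
    return total / len(joint_depths)

# Assumed layers for one ray: a box occupying depths 2 to 3, a wall from 5 to 5.2.
layers = [2.0, 3.0, 5.0, 5.2]
free_space = geometry_penalty(1.5, layers)   # in front of the box
inside_box = geometry_penalty(2.25, layers)  # 0.25 behind the box's front face
behind_box = geometry_penalty(4.0, layers)   # between the box and the wall
```

Because the penalty is zero in free space and grows linearly with penetration depth, its gradient pushes an offending joint toward the nearest surface along the ray, which is the behavior one would want from a geometric consistency term.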

Last Updated on 19th May, 2019