Geometric Pose Affordance: 3D Human Pose with Scene Constraints


Zhe Wang, Liyan Chen, Shaurya Rathore, Daeyun Shin, Charless Fowlkes




Abstract

Full 3D estimation of human pose from a single image remains a challenging task despite many recent advances. In this paper, we explore the hypothesis that strong prior information about scene geometry can be used to improve pose estimation accuracy. To tackle this question empirically, we have assembled a novel Geometric Pose Affordance (GPA) dataset, consisting of multi-view imagery of people interacting with a variety of rich 3D environments. We utilized a commercial motion capture system to collect gold-standard estimates of pose and constructed accurate geometric 3D CAD models of the scene itself. To inject prior knowledge of scene constraints into existing frameworks for pose estimation from images, we introduce a novel, view-based representation of scene geometry, a multi-layer depth map, which employs multi-hit ray tracing to concisely encode multiple surface entry and exit points along each camera view ray. We propose two mechanisms for integrating multi-layer depth information into pose estimation: first, as encoded ray features used when lifting 2D pose to full 3D, and second, as a differentiable loss that encourages learned models to favor geometrically consistent pose estimates. We show experimentally that these techniques can improve the accuracy of 3D pose estimates, particularly in the presence of occlusion and complex scene geometry.
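To make the second mechanism concrete, below is a minimal sketch of how such a differentiable geometry-consistency penalty could look. The alternating entry/exit layer layout, tensor shapes, padding convention, and function name are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch

def geometry_consistency_loss(joint_depth, ray_layers, eps=0.0):
    """
    Hinge penalty for predicted joints that land *inside* solid scene geometry.

    joint_depth : (B, J)     predicted depth of each joint along its view ray
    ray_layers  : (B, J, L)  multi-layer depth sampled at each joint's pixel;
                             layers are assumed to alternate entry/exit surfaces
                             (layer 0 = first entry, layer 1 = first exit, ...),
                             padded with +inf to an even L where fewer hits exist.
    Returns a scalar loss that is zero when every joint lies in free space.
    """
    entries = ray_layers[..., 0::2]           # (B, J, L/2) surface entry depths
    exits   = ray_layers[..., 1::2]           # (B, J, L/2) surface exit depths
    z = joint_depth.unsqueeze(-1)             # (B, J, 1)

    inside = (z > entries + eps) & (z < exits - eps)        # inside a solid interval
    # distance to the nearest surface of the violated interval (0 if outside)
    dist = torch.minimum(z - entries, exits - z).clamp(min=0.0)
    return (dist * inside.float()).sum(dim=-1).mean()
```

Because the penalty is a simple hinge on depths, it is differentiable almost everywhere and can be added to a standard joint-regression loss with a weighting factor.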

Acknowledgements

This project is supported by NSF grants IIS-1813785, IIS-1618806, IIS-1253538, CNS-1730158 and a hardware donation from NVIDIA. Zhe Wang personally thanks Shu Kong and Minhaeng Lee for helpful discussions, John Crawford and Fabio Paolizzo for support with the motion capture studio, and all the UCI friends who contributed to the dataset collection.

Description of the video.

  • The video shows the full-resolution video and the cropped video with ground-truth joint/marker overlays. We also show the subject id (anonymized here), take name, camera name, video frame id, mocap frame id, bone lengths (which are constant over time), velocity, number of valid markers, invisible joints, and invisible markers (the VICON system uses 53 markers and 34 joints). The video is sampled from 10 clips (from 5 camera views) of the 'Action' and 'Motion' sets, all with the same subject. Legend: hollow circles: occluded joints; solid dots: non-occluded joints; dotted lines: partially/completely occluded body parts; solid lines: non-occluded body parts.
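As a sanity check on the "bone lengths constant over time" overlay, a small sketch of how per-frame bone lengths could be computed from the 3D joints is shown below; the array layout and skeleton edge list are placeholders, not the dataset's actual format.

```python
import numpy as np

# Hypothetical layout: joints_3d has shape (T, 34, 3) -- T mocap frames, 34 joints (VICON).
# The edge list below is a placeholder; use the dataset's actual skeleton definition.
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3)]   # e.g. a hip -> knee -> ankle chain

def bone_lengths(joints_3d, edges=SKELETON_EDGES):
    """Return per-frame bone lengths, shape (T, len(edges))."""
    a = joints_3d[:, [i for i, _ in edges], :]
    b = joints_3d[:, [j for _, j in edges], :]
    return np.linalg.norm(a - b, axis=-1)

def check_constant(joints_3d, tol_mm=5.0):
    """For a rigid skeleton, each bone's length should drift less than tol_mm over time."""
    lengths = bone_lengths(joints_3d)
    drift = lengths.max(axis=0) - lengths.min(axis=0)
    return drift < tol_mm
```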

Dataset/ Pre-trained Model Download.

    We provide four image types in the directory: uncropped images, uncropped images with DeepLab masks, cropped images aligned with multi-layer depth maps, and cropped images with GrabCut applied to filter out background pixels. We also provide a notebook and metadata to make the data easier to access (see the loading sketch after this list). The files and their functions are listed below.
  • crop_md.tar: multi-layer depth map aligned with cropped images
  • Gaussian_cropped_images_greenbackground.tar.gz: cropped images with the human segmented by GrabCut (background filtered out)
  • Gaussian_cropped_images.tar.gz: cropped images
  • Gaussian_fullimg.tar.gz: full-size images
  • img_jpg_new_resnet101deeplabv3humanmask.tar.gz: full-size images with human masks from ResNet-101 DeepLabv3
  • public_check_data.ipynb: data access notebook
  • xyz_gpa12_mdp_cntind_crop_cam_c2g.json: metainfo
  • c2gimgid.npy: close2geometry image ids
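The loading sketch referenced above: a minimal example of how the files listed here might be accessed. The exact file paths, key names, and array shapes inside the archives are assumptions for illustration; public_check_data.ipynb is the authoritative reference.

```python
import json
import numpy as np
from PIL import Image

# Paths assume the archives above have been extracted in place.
META_JSON = "xyz_gpa12_mdp_cntind_crop_cam_c2g.json"
C2G_IDS   = "c2gimgid.npy"

# Metadata (camera parameters, joint annotations, crop info, ...).
# Inspect the JSON to see the actual key names.
with open(META_JSON) as f:
    meta = json.load(f)

# Image ids of frames where the subject is close to scene geometry.
close2geometry_ids = np.load(C2G_IDS)

# Example: open one cropped image and its aligned multi-layer depth map.
img = Image.open("Gaussian_cropped_images/000001.jpg")   # hypothetical filename
mld = np.load("crop_md/000001.npy")                       # hypothetical filename / storage format
print(img.size, mld.shape)   # depth map expected to have one channel per layer
```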

    Terms of usage and License:

    The code and the GPA dataset are supplied with no warranty, and the University of California, Irvine and the authors will not be held responsible for the correctness of the code and data. The code and the data will not be transferred to outside parties without the authors' permission and will be used only for research purposes. In particular, the code and the GPA dataset will not be included as part of any commercial software package or product of this institution. This work and dataset are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

    Citation

    If you find our dataset useful, please consider citing us: BibTex1, BibTex2

Data & Visualization

  • a. Mocap setting; b. RGB image; c. Scene meshes and human skeleton in Maya; d-f. the corresponding first three layers of the multi-layer depth map representation of scene geometry. d corresponds to a traditional depth map, recording the depth of the first scene surface from the same camera view as b. e records the depth at which the multi-hit view ray exits the first layer of objects (e.g., the backside of the boxes). f is the depth at which the multi-hit view ray hits a third surface (e.g., the floor behind the box). A ray-casting sketch of this representation follows this list.
  • All the subjects and a subset of the capture scripts. The semantic actions of the Action Set are constructed from a subset of Human3.6M, namely Direction, Discussion, Writing, Greeting, Phoning, Photo, Posing and Walk Dog, to provide a connection for comparisons between our dataset and the de facto standard benchmark. The Motion Set includes poses with a more dynamic range of motion, such as running, side-to-side jumping, rotating, jumping over, and improvised poses from the subjects. The Interaction Set mainly consists of close interactions between poses and object boundaries to provide ground truth for modeling affordance in 3D. There are three main pose types in this group: Sitting, Touching, and Standing on, corresponding to the typical affordance relations Sittable, Reachable, and Walkable.
  • All the scenes from the view of camera 2.
  • All the scene meshes (shown in MeshLab).
  • Five camera views of a single scene and the corresponding multi-layer depth maps. Here the depth values are displayed as disparity (1/depth) for visualization purposes.
  • Statistics of the relationship between joints and scene geometry. Left: distribution of which multi-layer depth layer each joint is closest to; Right: distribution of the number of occluded joints.
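The ray-casting sketch mentioned above: a compact illustration, using trimesh, of multi-hit ray casting for the multi-layer depth representation, together with the per-joint statistics (closest layer, occlusion). Depths here are distances along the ray rather than z-depths, and all names and shapes are illustrative assumptions rather than the dataset's actual generation pipeline.

```python
import numpy as np
import trimesh

def multilayer_depth(scene_mesh, cam_origin, ray_dirs, num_layers=3):
    """
    Multi-hit ray casting: for every camera ray, record the depth of the first
    `num_layers` surface intersections (entry *and* exit points), padded with inf.

    scene_mesh : trimesh.Trimesh for the scene geometry
    cam_origin : (3,) camera center in world coordinates
    ray_dirs   : (N, 3) unit view-ray directions (one per pixel)
    Returns    : (N, num_layers) depths along each ray
    """
    origins = np.tile(cam_origin, (len(ray_dirs), 1))
    locs, idx_ray, _ = scene_mesh.ray.intersects_location(
        origins, ray_dirs, multiple_hits=True)

    depths = np.full((len(ray_dirs), num_layers), np.inf)
    hit_dist = np.linalg.norm(locs - origins[idx_ray], axis=1)
    for r in range(len(ray_dirs)):
        d = np.sort(hit_dist[idx_ray == r])[:num_layers]
        depths[r, :len(d)] = d
    return depths

def joint_layer_and_occlusion(joint_depth, ray_layers):
    """joint_depth: (N,) depth of each joint along its ray; ray_layers: (N, L)."""
    closest_layer = np.argmin(np.abs(ray_layers - joint_depth[:, None]), axis=-1)
    occluded = joint_depth > ray_layers[:, 0]   # joint lies behind the first hit surface
    return closest_layer, occluded
```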

Results & Analysis

  • Flowchart of our method.
  • Results on close2geometry subset.
  • Visualization of the input images overlaid with the ground-truth pose in the same view (GT; blue denotes the right side of the skeleton, red the left). Columns 2-4 show the first three layers of the multi-layer depth map. Column 5 is the baseline prediction overlaid on the first layer of the multi-layer depth map, while column 6 is our ResNet-F prediction. The figures show that the geometry input and constraint help the model produce better predictions. The red rectangles highlight cases where the baseline model's predictions violate geometry or are worse than those of the full model. A small plotting sketch of the joint/limb legend follows this list. Legend: hollow circles: occluded joints; solid dots: non-occluded joints; dotted lines: partially/completely occluded body parts; solid lines: non-occluded body parts.
  • More visualization.
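The plotting sketch referenced above: a minimal matplotlib illustration of the joint/limb legend (hollow markers and dotted lines for occluded joints and body parts). The edge list and colors are placeholders, not the dataset's 34-joint skeleton or the figures' exact color scheme.

```python
import matplotlib.pyplot as plt

# Placeholder skeleton edges; replace with the dataset's 34-joint connectivity.
EDGES = [(0, 1), (1, 2), (2, 3)]

def draw_skeleton(ax, joints_2d, occluded, edges=EDGES):
    """
    joints_2d : (J, 2) pixel coordinates
    occluded  : (J,) boolean flag per joint
    Legend: hollow circles = occluded joints, solid dots = visible joints,
            dotted lines = partially/fully occluded parts, solid lines = visible parts.
    """
    for i, j in edges:
        style = ':' if (occluded[i] or occluded[j]) else '-'
        ax.plot([joints_2d[i, 0], joints_2d[j, 0]],
                [joints_2d[i, 1], joints_2d[j, 1]], style, color='tab:blue')
    for k, (x, y) in enumerate(joints_2d):
        if occluded[k]:
            ax.plot(x, y, 'o', mfc='none', mec='tab:red')   # hollow circle
        else:
            ax.plot(x, y, 'o', color='tab:red')              # solid dot
```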

Last Updated on 19th May, 2019