The remarkable success of large-scale pre-training followed by task-specific fine-tuning has established this approach as a standard technique in language modeling. Computer vision is gradually adopting similarly large data scales for pre-training. The emergence of large datasets such as LAION-5B, Instagram-3.5B, JFT-300M, LVD-142M, Visual Genome, and YFCC100M has enabled the exploration of data corpora far beyond the scope of traditional benchmarks. Notable works in this field include DINOv2, MAWS, and AIM. DINOv2 achieves state-of-the-art self-supervised features by scaling the iBOT method on the curated LVD-142M dataset. MAWS studies the scaling of Masked Autoencoders (MAE) on billions of images. AIM investigates the scalability of autoregressive visual pre-training for vision transformers. In contrast to these methods, which mostly focus on general image pre-training or zero-shot image classification, Sapiens takes an explicitly human-centric approach: its models are pre-trained on a vast collection of human images and then fine-tuned for a variety of human-related tasks. Meanwhile, large-scale 3D human digitization remains an important goal in computer vision.
While significant progress has been made within controlled or studio environments, challenges remain in extending these methods to unconstrained, in-the-wild settings. Addressing these challenges calls for models that can perform multiple fundamental tasks, such as keypoint estimation, body part segmentation, depth estimation, and surface normal prediction, on images captured in natural environments. In this work, Sapiens aims to develop models for these essential human vision tasks that generalize to such in-the-wild data. Currently, the largest publicly available language models contain over 100 billion parameters, while more commonly used language models contain around 7 billion. In contrast, Vision Transformers (ViT), despite sharing a similar architecture, have not been scaled to this extent. Notable efforts in this direction include a dense ViT-4B trained on both text and images and techniques for the stable training of ViT-22B, yet commonly used vision backbones still range from 300M to 600M parameters and are primarily pre-trained at roughly 224-pixel image resolution. Similarly, existing transformer-based image generation models such as DiT use fewer than 700M parameters and operate on highly compressed latent spaces. To fill this gap, Sapiens introduces a collection of large-scale, high-resolution ViT models natively pre-trained at 1024-pixel image resolution on millions of human images.
Sapiens provides a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body part segmentation, depth estimation, and surface normal prediction. Sapiens models natively support 1K high-resolution inference and can be adapted to individual tasks by simply fine-tuning models pre-trained on over 300 million in-the-wild human images. Sapiens observes that, for the same computational budget, self-supervised pre-training on a curated dataset of human images significantly improves performance across a range of human-centric tasks. The resulting models show remarkable generalization to real data, even when labeled data is scarce or fully synthetic. The simple model design also enables scalability: model performance improves across tasks as the number of parameters scales from 0.3 to 2 billion. Sapiens consistently outperforms existing baselines on a range of human-centric benchmarks, achieving significant improvements over previous state-of-the-art results, including gains of 7.6 mAP on Humans-5K (pose), 17.1 mIoU on Humans-2K (part segmentation), 22.4% relative RMSE on Hi4D (depth), and 53.5% relative angular error on THuman2 (surface normal).
Recent years have seen remarkable progress in generating photorealistic humans in 2D and 3D. The success of these methods hinges on the robust estimation of various assets, such as 2D keypoints, fine-grained body part segmentation, depth, and surface normals. However, robust and accurate estimation of these assets remains an active research area, and the complex systems built to boost performance on individual tasks often hinder wider adoption. Furthermore, obtaining accurate ground truth annotations in real-world settings is notoriously difficult to scale. The goal of Sapiens is to provide a unified framework and models for inferring these assets in the wild, enabling a wide range of human-centric applications for everyone.
Sapiens argues that such human-centric models must meet three criteria: generalization, broad applicability, and high fidelity. Generalization ensures robustness to unseen conditions, allowing the model to perform consistently across different environments. Broad applicability denotes the model’s versatility, making it suitable for a wide range of tasks with minimal modification. High fidelity refers to the model’s ability to produce accurate, high-resolution outputs, essential for high-fidelity human generation tasks. This paper details the development of a family of models (collectively called Sapiens) that embody these attributes.
Following these insights, Sapiens leverages large datasets and scalable model architectures, which are key to generalization. To achieve broad applicability, Sapiens adopts a pre-train-then-fine-tune approach, allowing adaptation to specific tasks with minimal adjustments after pre-training. This approach raises an important question: what kind of data is most effective for pre-training? Given computational limits, should the focus be on collecting as many human images as possible, or is it preferable to pre-train on a less curated set to better reflect real-world variability? Existing methods largely overlook the distribution of pre-training data in the context of downstream tasks. To investigate the impact of the pre-training data distribution on human-specific tasks, Sapiens collects the Humans-300M dataset, featuring 300 million diverse human images. These unlabeled images are used to pre-train a family of vision transformers from scratch, with parameter counts ranging from 300 million to 2 billion.
Among the various self-supervised methods for learning general-purpose visual features from large datasets, Sapiens chooses the masked autoencoder (MAE) approach for its simplicity and efficiency in pre-training. Compared with contrastive or multi-inference strategies, MAE’s single-pass inference model can process a larger volume of images with the same computational resources. To increase fidelity, and in contrast to previous methods, Sapiens raises the native input resolution for pre-training to 1024 pixels, resulting in an approximately 4x increase in FLOPs over the largest existing vision backbone. Each model is pre-trained on 1.2 trillion tokens. For fine-tuning on human-centric tasks, Sapiens uses a consistent encoder-decoder architecture: the encoder is initialized with weights from pre-training, while the decoder, a lightweight task-specific head, is initialized randomly, and both components are then fine-tuned end-to-end. Sapiens focuses on four main tasks: 2D pose estimation, body part segmentation, depth estimation, and normal estimation, as shown in the following figure.
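To make this pre-train-then-fine-tune recipe concrete, the sketch below shows the general pattern under stated assumptions: a stand-in ViT encoder whose weights would come from the MAE checkpoint, a lightweight randomly initialized head (here with K = 308 output heatmaps for pose), and end-to-end AdamW fine-tuning. The class, checkpoint path, and head layout are illustrative placeholders, not Sapiens’ actual implementation.

```python
import torch
import torch.nn as nn

class TinyViTStandIn(nn.Module):
    """Stand-in for the pre-trained ViT encoder (illustrative only).

    A real encoder would be a full vision transformer whose weights are loaded
    from the MAE pre-training checkpoint before fine-tuning.
    """
    def __init__(self, embed_dim: int = 1024, patch_size: int = 16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.patch_embed(x)  # (B, embed_dim, H/16, W/16) patch-level features

encoder = TinyViTStandIn()
# encoder.load_state_dict(torch.load("mae_pretrain.pth"))  # placeholder checkpoint path

head = nn.Sequential(                    # lightweight, randomly initialized task head
    nn.ConvTranspose2d(1024, 256, kernel_size=4, stride=4),
    nn.GELU(),
    nn.Conv2d(256, 308, kernel_size=1),  # e.g. K = 308 keypoint heatmaps for pose
)

# Both components are fine-tuned end-to-end.
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()),
    lr=1e-4, weight_decay=0.1,
)

images = torch.randn(1, 3, 1024, 768)    # 4:3 aspect ratio used during fine-tuning
heatmaps = head(encoder(images))         # -> (1, 308, 256, 192)
```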
Consistent with previous studies, Sapiens confirms that label quality has a significant impact on a model’s real-world performance. Public benchmarks often contain noisy labels, providing inconsistent supervision during fine-tuning. At the same time, fine-grained and accurate annotations are crucial to closely match Sapiens’ primary goal of 3D human digitization. To this end, Sapiens proposes a substantially denser set of 2D whole-body keypoints for pose estimation and a detailed class vocabulary for body part segmentation, going beyond the scope of previous datasets. Specifically, Sapiens introduces a comprehensive collection of 308 keypoints covering the body, hands, feet, and face. In addition, Sapiens extends the segmentation class vocabulary to 28 classes, covering body parts such as hair, tongue, teeth, upper/lower lips, and torso. To ensure annotation quality and consistency, as well as a high degree of automation, Sapiens uses a multi-view capture setup to collect pose and segmentation annotations. Sapiens also uses human-centric synthetic data for depth and normal estimation, leveraging 600 detailed scans from RenderPeople to generate high-resolution depth maps and surface normals. Sapiens demonstrates that the combination of large-scale domain-specific pre-training and limited but high-quality annotations yields robust real-world generalization. Overall, the Sapiens method demonstrates an effective strategy for developing highly accurate discriminative models that can run in real-world scenarios without the need to collect costly and diverse sets of annotations.
Sapiens: Methods and Architecture
Sapiens employs a masked autoencoder (MAE) approach for pre-training: the model is trained to reconstruct the original human image from partial observations. Like all autoencoders, Sapiens’ model has an encoder that maps the visible patches to a latent representation and a decoder that reconstructs the original image from this latent representation. The pre-training dataset consists of images containing one or more humans, each resized to a fixed square resolution. Similar to ViT, each image is divided into non-overlapping regular patches of a fixed patch size. A subset of these patches is randomly selected and masked, while the rest remain visible. The ratio of masked to visible patches (the masking ratio) is kept constant during training.
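The patchify-and-mask step described above can be sketched as follows; the 16-pixel patch size and 1024 x 1024 input match the training details given later, while the 75% masking ratio and the helper’s interface are illustrative assumptions rather than Sapiens’ exact configuration.

```python
import torch

def random_masking(images: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.75):
    """Split images into non-overlapping patches and randomly mask a fixed fraction.

    images: (B, 3, H, W) with H and W divisible by patch_size.
    Returns the visible patches and a binary mask (1 = masked, 0 = visible).
    """
    B, C, H, W = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)

    num_patches = patches.shape[1]
    num_keep = int(num_patches * (1.0 - mask_ratio))

    # Random permutation per image; the first num_keep patches stay visible.
    ids_shuffle = torch.rand(B, num_patches).argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    mask = torch.ones(B, num_patches)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask

x = torch.randn(2, 3, 1024, 1024)
visible, mask = random_masking(x)   # 4096 patches per image; 1024 remain visible at 75% masking
```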
Sapiens’ model generalizes across a range of image characteristics, including scale, cropping, the age and ethnicity of subjects, and the number of subjects. Each patch token covers 0.02% of the image area, compared to 0.4% in standard ViTs, a 16x reduction that provides the model with fine-grained inter-token reasoning. Even as the masking ratio increases to 95%, Sapiens’ model produces plausible reconstructions of human anatomy from the remaining visible patches. The following image shows reconstructions by Sapiens’ pre-trained model on unseen human images.
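As a quick sanity check on these patch-area figures (assuming the standard-ViT baseline corresponds to a 16 x 16 token grid, e.g. a 256-pixel input with 16-pixel patches):

```python
# Sapiens: 1024 / 16 = 64 tokens per side -> 4096 tokens per image
sapiens_fraction = (16 / 1024) ** 2          # ~0.00024, i.e. ~0.02% of the image per token

# Standard ViT baseline (assumed): a 16 x 16 token grid -> 256 tokens per image
standard_fraction = 1 / 256                  # ~0.0039, i.e. ~0.4% of the image per token

print(standard_fraction / sapiens_fraction)  # 16.0 -> the quoted 16x reduction
```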
For pre-training data, Sapiens draws on a large proprietary dataset of roughly 1 billion in-the-wild images, focusing exclusively on images containing humans. Pre-processing discards images with watermarks, text, artistic depictions, or unnatural elements. Sapiens then filters the remaining images with an off-the-shelf person bounding-box detector, retaining those with a detection score above 0.9 and bounding box dimensions above 300 pixels. Over 248 million images in the resulting dataset contain multiple subjects.
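A sketch of this filtering logic is shown below; the detector output format and the helper itself are hypothetical, while the 0.9 score and 300-pixel thresholds come from the description above.

```python
from typing import Dict, List

def keep_image(detections: List[Dict], min_score: float = 0.9, min_box_px: int = 300) -> bool:
    """Keep an image only if it contains at least one confident, sufficiently large person box.

    `detections` is assumed to look like
    {"label": "person", "score": 0.97, "box": (x1, y1, x2, y2)} per entry,
    as produced by an off-the-shelf person detector (interface is illustrative).
    """
    for det in detections:
        if det["label"] != "person" or det["score"] < min_score:
            continue
        x1, y1, x2, y2 = det["box"]
        if (x2 - x1) >= min_box_px and (y2 - y1) >= min_box_px:
            return True
    return False

# Example: a single confident 400 x 600 person detection passes the filter.
sample = [{"label": "person", "score": 0.95, "box": (100, 50, 500, 650)}]
assert keep_image(sample)
```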
2D Pose Estimation
For the pose estimator P, the Sapiens framework fine-tunes the encoder and decoder across multiple skeletons, including the standard K = 17 and K = 133 keypoint conventions as well as a new, highly detailed skeleton with K = 308 keypoints, as shown in the following figure.
Compared to existing formats with at most 68 facial keypoints, Sapiens’ annotation scheme includes 243 facial keypoints, including representative points around the eyes, lips, nose, and ears, designed to capture subtle details of real-world facial expressions. Using these keypoints, 1 million images at 4K resolution from an indoor capture setup were manually annotated. As with the other dense prediction tasks, Sapiens sets the decoder output channels of the normal estimator N to 3, corresponding to the xyz components of the normal vector at each pixel. The synthetic data generated from the RenderPeople scans also serves as supervision for surface normal estimation.
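Since the normal head simply outputs three channels per pixel, its supervision against the synthetic ground truth can be sketched as below; the unit-length normalization and plain L1 objective are illustrative choices, not necessarily Sapiens’ exact loss.

```python
import torch
import torch.nn.functional as F

def surface_normal_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Illustrative objective for a 3-channel surface normal head.

    pred, target: (B, 3, H, W); target normals come from rendered synthetic scans.
    """
    pred_unit = F.normalize(pred, dim=1)          # constrain each pixel's xyz to unit length
    return (pred_unit - target).abs().mean()      # simple per-component L1

pred = torch.randn(2, 3, 256, 256)                # raw head output
target = F.normalize(torch.randn(2, 3, 256, 256), dim=1)
loss = surface_normal_loss(pred, target)
```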
Sapiens: Experiments and Results
Sapiens-2B is pre-trained with PyTorch for 18 days on 1024 A100 GPUs. Sapiens uses the AdamW optimizer for all experiments. The learning schedule includes a short linear warm-up, followed by cosine annealing for pre-training and linear decay for fine-tuning. All models are pre-trained from scratch at a resolution of 1024 × 1024 with a patch size of 16. For fine-tuning, input images are resized to a 4:3 aspect ratio, i.e., 1024 × 768. Sapiens applies standard augmentations such as cropping, scaling, flipping, and photometric distortion. For the segmentation, depth, and normal prediction tasks, random backgrounds from non-human COCO images are added. Importantly, Sapiens uses differential learning rates to preserve generalization, with lower learning rates for the early layers and progressively higher learning rates for subsequent layers. The per-layer learning rate decay is set to 0.85 and the encoder weight decay to 0.1.
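This layer-wise learning rate scheme can be set up with AdamW parameter groups as sketched below; the helper, the stand-in blocks, and the base learning rate are illustrative, but the 0.85 per-layer decay and 0.1 weight decay match the values above.

```python
import torch
import torch.nn as nn

def layerwise_lr_groups(blocks, base_lr: float = 1e-4, decay: float = 0.85, weight_decay: float = 0.1):
    """Build parameter groups where earlier layers get geometrically smaller learning rates.

    `blocks` is an ordered list of encoder layers (earliest first); the deepest
    layer receives `base_lr`, and each step toward the input multiplies it by `decay`.
    """
    num_layers = len(blocks)
    groups = []
    for i, block in enumerate(blocks):
        lr = base_lr * (decay ** (num_layers - 1 - i))
        groups.append({"params": block.parameters(), "lr": lr, "weight_decay": weight_decay})
    return groups

# Example with stand-in blocks: the first layer's lr is 0.85^23 ≈ 0.024x that of the last layer.
blocks = [nn.Linear(8, 8) for _ in range(24)]
optimizer = torch.optim.AdamW(layerwise_lr_groups(blocks))
```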
The design specifications of Sapiens are detailed in the following table. Following established practice for scaling vision transformers, Sapiens prioritizes scaling models by width rather than depth. Notably, the Sapiens-0.3B model is architecturally similar to the traditional ViT-Large, but requires roughly 20x more FLOPs because of its higher input resolution.
Sapiens is fine-tuned for pose estimation of the face, body, feet, and hands (K = 308) using these high-fidelity annotations. Training uses a set of 1 million images, and evaluation uses a test set of 5K images called Humans-5K. Evaluation follows a top-down approach: Sapiens uses an off-the-shelf detector to obtain bounding boxes and then performs single-person pose inference on each crop. The following table compares the Sapiens models with existing methods for whole-body pose estimation. All methods are evaluated on the 114 keypoints common to Sapiens’ 308-keypoint vocabulary and COCO-WholeBody’s 133-keypoint vocabulary. Sapiens-0.6B outperforms the current state of the art, DWPose-l, by +2.8 AP. Unlike DWPose, which relies on a complex student-teacher framework with task-tailored feature distillation, Sapiens employs a general encoder-decoder architecture with extensive human-centric pre-training.
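For reference, the top-down protocol can be sketched as follows, with `detector` and `pose_model` as hypothetical callables standing in for the off-the-shelf person detector and the fine-tuned pose model.

```python
import torch
import torch.nn.functional as F

def topdown_pose(image: torch.Tensor, detector, pose_model, score_thresh: float = 0.5):
    """Illustrative top-down inference: detect people, crop, then estimate keypoints per person.

    image: (3, H, W) tensor; `detector` returns [{"score": float, "box": (x1, y1, x2, y2)}, ...]
    and `pose_model` maps a (1, 3, 1024, 768) crop to (1, 308, H', W') keypoint heatmaps.
    Both interfaces are assumptions for this sketch.
    """
    results = []
    for det in detector(image):
        if det["score"] < score_thresh:
            continue
        x1, y1, x2, y2 = map(int, det["box"])
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)        # single-person crop
        crop = F.interpolate(crop, size=(1024, 768), mode="bilinear", align_corners=False)
        results.append({"box": det["box"], "heatmaps": pose_model(crop)})
    return results
```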
Interestingly, at comparable parameter counts, Sapiens models outperform their counterparts. For example, Sapiens-0.3B outperforms ViTPose+-L by +5.6 AP, and Sapiens-0.6B outperforms ViTPose+-H by +7.9 AP. Within the Sapiens family, there is a direct correlation between model size and performance. Sapiens-2B establishes a new state of the art with 61.1 AP, a significant improvement of +7.6 AP over the prior state of the art. Despite being fine-tuned on annotations from an indoor capture studio, Sapiens shows robust generalization to real-world scenarios, as shown in the following figure.
Sapiens is fine-tuned and evaluated for body part segmentation using the 28-class vocabulary. The training set consists of 100K images, and the test set, Humans-2K, consists of 2K images. Sapiens is compared against existing body part segmentation methods fine-tuned on the same training set, using the pre-trained checkpoints suggested by each method as initialization. Similar to pose estimation, Sapiens generalizes well for segmentation, as shown in the following table.
Interestingly, the smallest model, Sapiens-0.3B, outperforms existing state-of-the-art segmentation methods such as Mask2Former and DeepLabV3+ by 12.6 mIoU thanks to its higher resolution and extensive human-centric pre-training. Moreover, increasing the model size further improves segmentation performance: Sapiens-2B achieves the best results with 81.2 mIoU and 89.4 mAcc on the test set. The following figure shows qualitative results from the Sapiens models.
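For reference on how the reported metrics are defined, mIoU and mAcc over the 28-class vocabulary can be computed from a confusion matrix as below; this is the standard formulation, not Sapiens-specific evaluation code.

```python
import numpy as np

def miou_macc(pred: np.ndarray, gt: np.ndarray, num_classes: int = 28):
    """Mean IoU and mean class accuracy over classes present in the ground truth."""
    conf = np.bincount(
        gt.ravel() * num_classes + pred.ravel(), minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)        # rows: ground truth, cols: prediction
    tp = np.diag(conf).astype(np.float64)
    gt_count = conf.sum(axis=1)
    union = gt_count + conf.sum(axis=0) - tp
    present = gt_count > 0                     # score only classes that appear in the ground truth
    miou = (tp[present] / union[present]).mean()
    macc = (tp[present] / gt_count[present]).mean()
    return miou, macc

# Toy example: a perfect prediction yields mIoU = mAcc = 1.0.
gt = np.random.randint(0, 28, size=(64, 64))
print(miou_macc(gt, gt))
```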
Conclusion
Sapiens is a major step toward advancing human-centric vision models into the realm of foundation models. The Sapiens models exhibit strong generalization across a wide range of human-centric tasks. Their state-of-the-art performance is achieved through (i) extensive pre-training on a curated dataset specifically tuned for understanding humans, (ii) a scaled, high-resolution, large-capacity vision transformer backbone, and (iii) high-quality annotations from augmented studio and synthetic data. The Sapiens models have the potential to become a key building block for many downstream tasks, providing access to a high-quality vision backbone to a significantly broader portion of the community.