Image- and video-based 3D human recovery (i.e., pose and shape estimation) has achieved substantial progress.
However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity.
In this work, we obtain massive human sequences by playing a video game with automatically annotated 3D ground truths.
Specifically, we contribute GTA-Human, a mega-scale 3D human dataset generated with the GTA-V game engine, featuring a
highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain
four major insights. First, synthetic data provides a critical complement to real data, which is typically collected indoors.
In addition to investigating the domain gap, we discover that data mixture strategies are surprisingly effective.
A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin.
For video-based methods, GTA-Human is even on par with the in-domain training set. Second, the scale of the dataset matters.
The performance boost is closely related to the additional data available. A systematic study reveals the model's sensitivity
to data density from multiple key aspects. Third, the effectiveness of GTA-Human is also attributed to the rich collection of
strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fourth, the benefits
of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which
significant improvements are also observed. We hope our work can pave the way for scaling up 3D human recovery
to the real world.