New Paradigm for Any-Size Video Training 🚀: Let Data Choose its Parallel!
We introduce Data-Centric Parallel (DCP), a simple but effective approach that accelerates distributed training on videos of any size.
Unlike previous methods that fix training settings, DCP dynamically adjusts parallelism and other configurations at runtime, driven by the incoming data.
This method significantly reduces communication overhead and computational inefficiency, achieving up to a 2.1x speedup.
As an easy-to-use method, DCP can empower any video model and parallel method with minimal code changes.
Sora [1] and its successors (Movie Gen [2], Open-Sora [3], Vchitect [4]) have recently achieved remarkable progress in video generation. They train on variable-size videos (i.e., varying numbers of frames, aspect ratios, and resolutions), which offers two advantages: enhanced generation quality and flexible output video sizes [1].
We compare two common parallel methods for any-size video training in Figure 1. Bucket parallel [5] fixes the sequence parallel size based on the longest video and adjusts batch sizes for balance. Packed parallel [6] also fixes the sequence parallel size, but concatenates multiple videos into one batch.
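To make the contrast concrete, here is a toy Python sketch of the two baseline batching schemes. The helper names (`bucket_batches`, `packed_batches`) and the token counts are illustrative only, not taken from any actual implementation.

```python
# Toy sketch of the two baselines. Sequence lengths are token counts after
# patchification; both helpers are hypothetical illustrations.

from typing import List

def bucket_batches(seq_lens: List[int], batch_size: int) -> List[List[int]]:
    """Bucket parallel: group similar-length videos, pad to the longest in each batch."""
    ordered = sorted(seq_lens)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def packed_batches(seq_lens: List[int], capacity: int) -> List[List[int]]:
    """Packed parallel: greedily concatenate videos until a fixed token budget is full."""
    batches, cur, used = [], [], 0
    for n in sorted(seq_lens, reverse=True):
        if used + n > capacity and cur:
            batches.append(cur)
            cur, used = [], 0
        cur.append(n)
        used += n
    if cur:
        batches.append(cur)
    return batches

lens = [256, 1024, 4096, 512, 2048, 4096, 128]
print(bucket_batches(lens, batch_size=2))   # each batch padded to its longest video
print(packed_batches(lens, capacity=4096))  # videos concatenated up to the budget
```

In both sketches the sequence parallel size stays constant across batches, which is exactly the limitation discussed next.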
However, bucket parallel struggles with workload imbalance at small batch sizes and communication inefficiency for shorter videos, while packed parallel, despite better load balancing, incurs unnecessary communication overhead and requires complex implementation changes. Both approaches fail to address two critical challenges in any-size video training: the sequence parallelism required by long videos and the unbalanced workloads caused by diverse video sizes, due to their fixed settings and lack of data awareness.
To address these challenges, we propose Data-Centric Parallel, an innovative approach that transforms parallel training with data awareness. Instead of fixing all configurations during training, our method adaptively changes parallelism and other settings driven by the incoming data, significantly improving computational efficiency and reducing communication overhead.
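As a rough illustration of the core idea, the sketch below picks a sequence parallel (SP) size per sample from its token count rather than fixing one SP size for the whole run. The function `choose_sp_size` and its thresholds are hypothetical, not DCP's actual policy.

```python
# Minimal sketch: let the data choose its parallelism. Short clips stay on one
# GPU (no SP communication at all); long ones are sharded across more GPUs.

def choose_sp_size(seq_len: int, tokens_per_gpu: int, max_sp: int = 8) -> int:
    """Smallest power-of-two SP degree whose shards fit the per-GPU token budget."""
    sp = 1
    while sp < max_sp and seq_len // sp > tokens_per_gpu:
        sp *= 2
    return sp

for seq_len in (512, 8192, 65536):
    print(seq_len, "-> sp =", choose_sp_size(seq_len, tokens_per_gpu=16384))
# 512   -> sp = 1   (no sequence parallel communication)
# 8192  -> sp = 1
# 65536 -> sp = 4   (sharded across four GPUs)
```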
As shown in Figure 2, our method consists of two main stages: data profile and data-centric runtime. In the data profile stage, we classify videos by size and determine optimal settings for each sequence through a fast dual-layer profile. At runtime, we first enqueue videos into the batch until it is full, and then dynamically balance each sequence using two strategies.
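The following toy sketch mirrors this two-stage flow under simplifying assumptions: the bucket keys, the `Plan` fields, and the nearest-bucket lookup are invented for illustration and stand in for DCP's actual profiling and balancing logic.

```python
# Stage 1 (data profile): one fast measurement per size bucket, cached as a plan.
# Stage 2 (data-centric runtime): look up the plan as each video is enqueued.

from dataclasses import dataclass

@dataclass
class Plan:
    sp_size: int       # sequence parallel degree for this bucket
    ckpt_ratio: float  # fraction of blocks to recompute (see pyramid activation
                       # checkpointing below)

# Hypothetical profile keyed by (num_frames, resolution) buckets.
profile = {
    (16, 256):  Plan(sp_size=1, ckpt_ratio=0.0),  # short clip: no SP, no recompute
    (64, 512):  Plan(sp_size=2, ckpt_ratio=0.5),
    (128, 720): Plan(sp_size=8, ckpt_ratio=1.0),  # long clip: intra-node SP, full ckpt
}

def plan_for(num_frames: int, resolution: int) -> Plan:
    """Map an incoming video to its nearest profiled bucket."""
    key = min(profile, key=lambda k: abs(k[0] - num_frames) + abs(k[1] - resolution))
    return profile[key]

print(plan_for(60, 480))  # falls into the (64, 512) bucket
```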
As illustrated in Figure 3, video sizes and their memory consumption vary widely in any-size video training: short videos use less memory and long videos use more. Selective activation checkpointing is limited by the longer videos, which typically require multi-GPU or even inter-node sequence parallelism to reduce recomputation.
Building upon our data-centric parallel approach, we propose pyramid activation checkpointing, which adaptively applies different checkpointing ratios based on video size. This approach significantly improves throughput for shorter videos through less recomputation while keeping communication overhead for longer videos unchanged, and it is even more beneficial on datasets where shorter videos dominate.
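A minimal PyTorch sketch of this idea is shown below, assuming a linear ramp of the checkpointing ratio with sequence length; the actual schedule DCP uses may differ.

```python
# Size-dependent checkpointing: short sequences skip recomputation entirely,
# long ones checkpoint every block. The ramp endpoints here are illustrative.

import torch
from torch.utils.checkpoint import checkpoint

def ckpt_ratio(seq_len: int, short: int = 4096, long: int = 65536) -> float:
    """Linearly ramp the fraction of checkpointed blocks with sequence length."""
    t = (seq_len - short) / max(long - short, 1)
    return min(max(t, 0.0), 1.0)

def forward_blocks(blocks, x):
    """Checkpoint only the first `ratio` fraction of transformer blocks."""
    ratio = ckpt_ratio(x.shape[1])  # x: (batch, seq_len, hidden)
    n_ckpt = int(len(blocks) * ratio)
    for i, block in enumerate(blocks):
        if i < n_ckpt:
            x = checkpoint(block, x, use_reentrant=False)  # recompute in backward
        else:
            x = block(x)  # keep activations: faster, more memory
    return x

# Usage: at 8192 tokens the ratio is small, so few or no blocks are recomputed.
blocks = torch.nn.ModuleList(torch.nn.Linear(64, 64) for _ in range(12))
x = torch.randn(1, 8192, 64, requires_grad=True)
y = forward_blocks(blocks, x)
```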
@misc{zhang2024dcp,
    title={Training Any-Size Videos with Data-Centric Parallel},
    author={Geng Zhang and Xuanlei Zhao and Kai Wang and Yang You},
    year={2024},
}