Training Any-Size Videos with Data-Centric Parallel

National University of Singapore
(* and † indicate equal contribution and correspondence)

New Paradigm for Any-Size Video Training 🚀: Let Data Choose its Parallel! We introduce Data-Centric Parallel (DCP), a simple but effective approach to accelerate distributed training on any-size videos. Unlike previous methods that fix training settings, DCP dynamically adjusts parallelism and other configurations according to the incoming data at runtime. This significantly reduces communication overhead and computational inefficiency, achieving up to 2.1x speedup. As an easy-to-use method, DCP can empower any video model and parallel method with minimal code changes.



Video 1: Video demonstration of Data-Centric Parallel.

Method

Motivation - Reduce Imbalance and Communication Through a Data-Centric Approach

Sora [1] and its successors (Movie Gen [2], Open-Sora [3], Vchitect [4]) have recently achieved remarkable progress in video generation. They are trained on videos of variable sizes (i.e., varying numbers of frames, aspect ratios, and resolutions), which offers two advantages: enhanced generation quality and flexible output video sizes [1].


Figure 1: Comparison of parallel methods for any-size video training, including bucket parallel [5], packed parallel [6], and data-centric parallel (ours). \( D_i \) refers to the \( i \)-th device.

We compare two common parallel methods for any-size video training in Figure 1. Bucket parallel [5] fixes the sequence parallel size based on the longest video and adjusts batch sizes for balance. Packed parallel [6], while also fixing the sequence parallel size, concatenates multiple videos into one batch.

However, bucket parallel struggles with workload imbalance at small batch sizes and communication inefficiency for shorter videos, while packed parallel, despite better load balancing, incurs unnecessary communication overhead and requires complex implementation changes. Because of their fixed settings and lack of data awareness, both approaches fail to address two critical challenges in any-size video training: the sequence parallelism required for long videos and the workload imbalance caused by diverse video sizes.
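For concreteness, the toy sketch below contrasts how the two baselines form per-step workloads from a list of mixed-size videos. It is purely illustrative: the token counts, bucket edges, and pack budget are hypothetical, and neither function reflects the actual implementations of [5] or [6].

```python
# Toy illustration (hypothetical sizes and limits) of how bucket parallel and
# packed parallel form per-step workloads from mixed-size videos.
from typing import Dict, List

videos = [256, 1024, 4096, 512, 4096, 256]  # per-video token counts (hypothetical)

def bucket_parallel(videos: List[int], bucket_edges=(1024, 4096)) -> Dict[int, List[int]]:
    """Group videos of similar length into buckets; the whole run still uses one
    fixed sequence-parallel size chosen for the longest bucket."""
    buckets = {edge: [] for edge in bucket_edges}
    for v in videos:
        for edge in bucket_edges:
            if v <= edge:
                buckets[edge].append(v)
                break
    return buckets

def packed_parallel(videos: List[int], pack_budget: int = 4096) -> List[List[int]]:
    """Concatenate videos into packs up to a fixed token budget; the
    sequence-parallel size is likewise fixed for the whole run."""
    packs, cur = [], []
    for v in sorted(videos, reverse=True):
        if cur and sum(cur) + v > pack_budget:
            packs.append(cur)
            cur = []
        cur.append(v)
    if cur:
        packs.append(cur)
    return packs

print(bucket_parallel(videos))  # similar-length groups, but batches can be tiny
print(packed_parallel(videos))  # well-packed, but every pack pays the fixed SP cost
```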


Data-Centric Parallel - Let Data Choose its Parallel


Figure 2: Overview of Data-Centric Parallel (DCP). \( frame \) - number of frames, \( ar \) - aspect ratio, \( res \) - resolution, \( D_i\) - the \( i \)-th device, \( bs \) - batch size, \( grad\ acc \) - gradient accumulation steps, and \( sp \) - sequence parallel size.

To address these challenges, we propose Data-Centric Parallel (DCP), an approach that brings data awareness to parallel training. Instead of fixing all configurations throughout training, our method adaptively changes parallel and other settings driven by the incoming data, significantly improving computational efficiency and reducing communication overhead.

As shown in Figure 2, our method consists of two main stages: data profiling and data-centric runtime. In the data profiling stage, we classify videos by size and determine the optimal settings for each sequence through a fast dual-layer profile. At runtime, we first enqueue videos into the batch until it is full, and then dynamically balance the sequences using two strategies (a sketch follows the list):

  • Intra-data balance: adjusts the intrinsic settings of each sequence (e.g., batch size, sequence parallel size).
  • Inter-data balance: balances work across multiple sequences (e.g., via gradient accumulation) without changing their intrinsic settings.
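To make the two stages concrete, the sketch below shows one way the profiling and balancing logic could look. It is a hypothetical illustration, not the released implementation: the per-device memory budget, the micro-batch cap, and the cost proxy are assumed stand-ins for the fast dual-layer profile.

```python
# Hypothetical sketch of DCP-style profiling and balancing; the memory budget,
# micro-batch cap, and cost proxy are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Plan:
    tokens: int    # video length in tokens
    sp: int        # sequence-parallel size for this sequence
    bs: int        # micro-batch size
    grad_acc: int = 1

def profile(tokens: int, mem_budget: int = 8192, max_bs: int = 4) -> Plan:
    """Data profile + intra-data balance: pick the smallest sequence-parallel
    size whose shard fits the per-device budget, then enlarge the micro-batch
    for short videos (capped by a hypothetical dataloader limit)."""
    sp = 1
    while tokens // sp > mem_budget:
        sp *= 2
    bs = min(max_bs, max(1, mem_budget // (tokens // sp)))
    return Plan(tokens=tokens, sp=sp, bs=bs)

def inter_data_balance(plans: List[Plan]) -> List[Plan]:
    """Inter-data balance: equalize per-device work across the sequences of a
    global batch via gradient accumulation, leaving sp and bs untouched."""
    cost = [p.tokens * p.bs // p.sp for p in plans]  # toy per-device work proxy
    target = max(cost)
    for p, c in zip(plans, cost):
        p.grad_acc = max(1, round(target / c))
    return plans

batch = [512, 2048, 16384]  # hypothetical video sizes queued into one batch
print(inter_data_balance([profile(t) for t in batch]))
```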

Pyramid Activation Checkpointing - Adaptive Recompute for Any-Size Videos


Figure 3: Demonstration of pyramid activation checkpointing. We apply different checkpointing ratios to videos of different sizes, and trade sequence parallelism for the extra memory required when checkpointing less.

As illustrated in Figure 3, video sizes and their memory consumption vary widely in any-size video training: short videos use less memory and long videos use more. Selective activation checkpointing is limited by the longer videos, which typically require multi-GPU or even intra-node sequence parallelism when recomputation is reduced.

Building upon our data-centric parallel approach, we propose pyramid activation checkpointing, which adaptively applies different checkpointing ratios based on video size. This significantly improves throughput for shorter videos by recomputing less, while keeping communication overhead unchanged for longer videos; the benefit is even larger on datasets dominated by shorter videos.
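A minimal PyTorch sketch of the idea is shown below. The size thresholds, checkpointing ratios, and the toy transformer are assumptions for illustration, not the paper's configuration.

```python
# Hypothetical sketch of pyramid activation checkpointing in PyTorch:
# shorter videos recompute fewer blocks, longer videos keep full checkpointing.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def ckpt_ratio(tokens: int) -> float:
    """Map a video's token count to the fraction of blocks to checkpoint
    (thresholds and ratios are illustrative)."""
    if tokens <= 512:
        return 0.0   # short video: memory is plentiful, skip recompute
    if tokens <= 2048:
        return 0.5   # medium video: checkpoint half of the blocks
    return 1.0       # long video: checkpoint everything

class PyramidCkptTransformer(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n_ckpt = int(round(ckpt_ratio(x.shape[1]) * len(self.blocks)))
        for i, blk in enumerate(self.blocks):
            if self.training and i < n_ckpt:
                x = checkpoint(blk, x, use_reentrant=False)  # recompute in backward
            else:
                x = blk(x)  # keep activations, no recompute
        return x

model = PyramidCkptTransformer().train()
model(torch.randn(2, 256, 256)).mean().backward()   # short clip: no checkpointing
model(torch.randn(1, 4096, 256)).mean().backward()  # long clip: full checkpointing
```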

Evaluations

End-to-End Speedups - How Fast Can DCP Train?

Figure 4: End-to-end speedups of DCP compared with other parallel methods.

Figure 4 reports end-to-end training speedups of DCP against bucket parallel and packed parallel on any-size video workloads. Driven by its data-aware parallelism and balancing strategies, DCP achieves up to 2.1x speedup over these baselines.

Imbalance Analysis - How Imbalanced Is Computation with DCP?


Figure 5: Imbalance analysis of DCP.

Parallel Analysis - Why Should We Determine Parallel Size by Data?


Figure 6: Parallel analysis of DCP.

Recompute Analysis - When Can Pyramid Activation Checkpointing Help?


Figure 7: Recompute analysis of DCP.

Related Works

1. Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. "Video Generation Models as World Simulators." 2024.
2. Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas et al. "Movie Gen: A Cast of Media Foundation Models." arXiv, 2024.
3. Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. "Open-Sora: Democratizing Efficient Video Production for All." GitHub, 2024.
4. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." ICML, 2024.
5. Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner et al. "Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution." NeurIPS, 2024.
6. Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, and Yang You. "DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers." arXiv, 2024.
7. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models." arXiv, 2023.
8. Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang. "Accelerating the Training of Large Language Models Using Efficient Activation Rematerialization and Optimal Hybrid Parallelism." ATC, 2024.
9. Jiannan Wang, Jiarui Fang, Aoyu Li, and PengCheng Yang. "PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models." arXiv, 2024.
10. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv, 2019.
11. Jiarui Fang and Shangchun Zhao. "A Unified Sequence Parallelism Approach for Long Context Generative AI." arXiv, 2024.
12. Shenggui Li, Fuzhao Xue, Yongbin Li, and Yang You. "Sequence Parallelism: Long Sequence Training from System Perspective." ACL, 2021.

BibTeX

@misc{zhang2024dcp,
  title={Training Any-Size Videos with Data-Centric Parallel},
  author={Geng Zhang and Xuanlei Zhao and Kai Wang and Yang You},
  year={2024},
}