Variable-length sequence training is widely used across tasks such as video generation (Sora [1], Movie Gen [2], Open-Sora [3], HunyuanVideo [4]), text generation (Llama-3 [5]), and scientific computing (AlphaFold [6]). This strategy offers two advantages: enhanced generation quality and flexible output sizes [1].
We compare two common parallel methods for variable-length sequence training in Figure 1. Bucket parallel [7] fixes the sequence-parallel size based on the longest sequence and adjusts batch sizes for balance. Packed parallel [8] also fixes the sequence-parallel size, but concatenates multiple sequences into one batch.
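To make the contrast concrete, the two batching schemes can be sketched as follows. This is an illustrative sketch under assumed simplifications (sequences represented as token lists, a per-batch token budget), not the implementation of either method:

```python
def bucket_batches(seqs, max_tokens_per_batch):
    """Bucket parallel: group sequences of equal length and shrink the
    batch size as sequences grow, keeping the parallel size fixed."""
    buckets = {}
    for s in sorted(seqs, key=len):
        buckets.setdefault(len(s), []).append(s)
    batches = []
    for length, group in buckets.items():
        per_batch = max(1, max_tokens_per_batch // length)
        for i in range(0, len(group), per_batch):
            batches.append(group[i:i + per_batch])
    return batches

def packed_batches(seqs, max_tokens_per_batch):
    """Packed parallel: concatenate sequences into one batch until the
    token budget is exhausted, then start a new batch."""
    batches, cur, cur_tokens = [], [], 0
    for s in seqs:
        if cur and cur_tokens + len(s) > max_tokens_per_batch:
            batches.append(cur)
            cur, cur_tokens = [], 0
        cur.append(s)
        cur_tokens += len(s)
    if cur:
        batches.append(cur)
    return batches
```

Both keep the sequence-parallel size fixed; they differ only in how sequences are grouped, which is exactly where the imbalance and communication trade-offs discussed below come from.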
However, bucket parallel struggles with workload imbalance at small batch sizes and communication inefficiency for shorter sequences, while packed parallel, despite better load balancing, incurs unnecessary communication overhead and requires complex implementation changes. Because of their fixed settings and lack of data awareness, both approaches fail to address two critical challenges in variable-length sequence training: the sequence parallelism that long sequences require, and the unbalanced workloads that diverse sequence sizes create.
To address these challenges, we propose Data-Centric Parallel (DCP), an approach that brings data awareness to parallel training. Instead of fixing all configurations throughout training, our method adaptively changes the parallel and other settings driven by the incoming data, significantly improving computational efficiency and reducing communication overhead.
As shown in Figure 2, our method consists of two main stages: data profiling and data-centric runtime. In the data profiling stage, we classify sequences by size and determine optimal settings for each class through a fast dual-layer profile. At runtime, we first enqueue sequences into the batch until it is full, and then dynamically balance each sequence using the two strategies described below.
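The runtime stage described above can be sketched minimally as follows. The data structures are assumptions for illustration: `profile` maps a sequence-size class to its profiled best setting (sequence-parallel size, checkpointing ratio), and the class boundaries are hypothetical values:

```python
def size_class(seq_len, boundaries=(1024, 4096, 16384)):
    """Classify a sequence by length into a profiled size bucket."""
    for i, b in enumerate(boundaries):
        if seq_len <= b:
            return i
    return len(boundaries)

def plan_batch(seq_lens, profile, token_budget):
    """Enqueue sequences until the batch is full, then attach the
    per-class settings found in the profiling stage."""
    batch, used = [], 0
    for n in seq_lens:
        if batch and used + n > token_budget:
            break  # batch is full; remaining sequences wait
        sp_size, ckpt_ratio = profile[size_class(n)]
        batch.append({"len": n, "sp_size": sp_size, "ckpt": ckpt_ratio})
        used += n
    return batch
```

The key design point is that the settings travel with the data: each sequence enters the batch already annotated with the parallel size and checkpointing ratio its size class was profiled with.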
As illustrated in Figure 3, sequence sizes vary widely in variable-length sequence training, with short sequences using less memory and long sequences using more. Selective activation checkpointing is limited by the longer sequences, which require multi-GPU or even inter-node sequence parallelism if recomputation is to be reduced.
Based on DCP, we propose Pyramid Activation Checkpointing (PAC), which adaptively applies different checkpointing ratios based on sequence size. This approach significantly improves throughput for shorter sequences through less recomputation, while keeping the communication overhead of longer sequences unchanged; the benefit is largest in datasets where shorter sequences dominate.
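The pyramid idea reduces to a monotone mapping from sequence length to recomputation ratio. The sketch below uses illustrative thresholds and ratios, not the paper's profiled values:

```python
def pac_ratio(seq_len, levels=((2048, 0.0), (8192, 0.25), (32768, 0.5))):
    """Return the fraction of layers to recompute for a given length:
    short sequences recompute nothing, longer ones progressively more."""
    for max_len, ratio in levels:
        if seq_len <= max_len:
            return ratio
    return 1.0  # full activation checkpointing for the longest sequences
```

Because the ratio only ever increases with length, memory stays bounded for long sequences while short sequences, which are the majority in many datasets, skip most recomputation.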
Optimizing variable-length sequence training is challenging because it involves balancing multiple factors rather than communication alone. We introduce the ICC (Imbalance-Communication-Computation) metric, which captures the three fundamental factors that determine overall system performance: workload imbalance, communication overhead, and computation degradation. It can be formally expressed as:
$$ ICC(\lambda) = {\eta_{imb}(\lambda)}^{\alpha} \cdot {\eta_{comm}(\lambda)}^{\beta} \cdot {\eta_{comp}(\lambda)}^{\gamma} $$ $$ \eta_{imb}(\lambda) = \frac{T_{total}(\lambda)}{T_{idle}(\lambda)},\ \ \eta_{comm}(\lambda) = \frac{T_{comp}(\lambda)}{T_{comm}(\lambda)},\ \ \eta_{comp}(\lambda) = \frac{T_{optimal}(\lambda)}{T_{degrad}(\lambda)} $$
where \( \lambda \) is the sequence distribution, \( T_{total} \) is the total time, \( T_{idle} \) is the idle time, \( T_{comp} \) is the computation time, \( T_{comm} \) is the communication time, \( T_{optimal} \) is the optimal computation time, and \( T_{degrad} \) is the degraded computation time.
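The formula above transcribes directly into code. This is a term-by-term implementation of the ICC expression, with the exponents \( \alpha, \beta, \gamma \) exposed as weights:

```python
def icc(t_total, t_idle, t_comp, t_comm, t_optimal, t_degrad,
        alpha=1.0, beta=1.0, gamma=1.0):
    """ICC = eta_imb^alpha * eta_comm^beta * eta_comp^gamma,
    where each eta follows the definitions in the text."""
    eta_imb = t_total / t_idle        # workload imbalance term
    eta_comm = t_comp / t_comm        # communication overhead term
    eta_comp = t_optimal / t_degrad   # computation degradation term
    return (eta_imb ** alpha) * (eta_comm ** beta) * (eta_comp ** gamma)
```

Each ratio grows as its corresponding overhead shrinks, so configurations with higher ICC are preferred.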
Based on the ICC metric, we propose a roofline performance model to characterize and analyze the training performance of variable-length sequences, as illustrated in Figure 4. By optimizing ICC, we can achieve the best performance for each method.
In practice, we use simple heuristic algorithms to approximate the best solution, achieving nearly the same performance improvements.
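One heuristic of the kind mentioned above is a direct search over a small set of candidate parallel sizes, scoring each with the ICC metric. The `score` function here is a hypothetical stand-in for the profiled cost model, which the paper does not specify:

```python
def best_parallel_size(seq_len, score, candidates=(1, 2, 4, 8)):
    """Pick the candidate sequence-parallel size with the highest score,
    e.g. the ICC metric evaluated under a profiled cost model."""
    return max(candidates, key=lambda sp: score(seq_len, sp))
```

Since the candidate set is tiny (parallel sizes are powers of two bounded by the node size), exhaustively scoring it is cheap enough to run per size class at profiling time.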
We conduct experiments on NVIDIA H100 GPUs connected by NVLink and InfiniBand. The evaluation is performed on Open-Sora v1.2 (1.1B). The sequence distribution follows common video sizes and variations. We use the bucket parallel method from Open-Sora as the baseline.
We measure the throughput of DCP across different sequence-length distributions on 32 NVIDIA H100 GPUs. The results demonstrate significant performance improvements: when short sequences are predominant, DCP achieves speedups of up to 2.48x. Notably, even in scenarios dominated by long sequences, DCP maintains substantial gains, with speedups of 1.28x.
Figure 6 illustrates how the imbalance ratio changes as the number of GPUs increases. The baseline method shows faster growth in imbalance, reaching 16.4% when scaled to 32 GPUs. In contrast, DCP maintains the lowest imbalance ratio and exhibits slower growth as the GPU count increases.
Figure 7 illustrates the optimal parallel size and performance improvements across different sequence types. As expected, the optimal parallel size correlates strongly with total sequence length. Notably, while one might assume smaller parallel sizes are always optimal, larger parallel sizes can sometimes improve computational efficiency and achieve better workload balance.
Figure 8 shows the speedup of Pyramid Activation Checkpointing. Our results show that it achieves significant speedup across most sequence lengths, with particularly notable performance gains for shorter sequences, which dominate most datasets.
@misc{zhang2024dcp,
title={Training Variable Sequences with Data-Centric Parallel},
author={Geng Zhang and Xuanlei Zhao and Kai Wang and Yang You},
year={2024},
}