Variable-length sequence training is widely used across tasks such as video generation (Sora [1], Movie Gen [2], Open-Sora [3], HunyuanVideo [4]), text generation (Llama-3 [5]), and scientific computing (AlphaFold [6]). This strategy offers two advantages: enhanced generation quality and flexible output sizes [1].
We compare two common parallel methods for variable-length sequence training in Figure 1. Bucket parallel [7] fixes the sequence-parallel size based on the longest sequence and adjusts batch sizes for balance. Packed parallel [8] also fixes the sequence-parallel size, but concatenates multiple sequences into one batch.
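To make the contrast concrete, the two batching schemes can be sketched as follows. This is an illustrative sketch under assumed simplifications (sequences represented as token lists, a per-batch token budget), not the implementation of either method:

```python
def bucket_batches(seqs, max_tokens_per_batch):
    """Bucket parallel: group sequences of equal length and shrink the
    batch size as sequences grow, keeping the parallel size fixed."""
    buckets = {}
    for s in sorted(seqs, key=len):
        buckets.setdefault(len(s), []).append(s)
    batches = []
    for length, group in buckets.items():
        per_batch = max(1, max_tokens_per_batch // length)
        for i in range(0, len(group), per_batch):
            batches.append(group[i:i + per_batch])
    return batches

def packed_batches(seqs, max_tokens_per_batch):
    """Packed parallel: concatenate sequences into one batch until the
    token budget is exhausted, then start a new batch."""
    batches, cur, cur_tokens = [], [], 0
    for s in seqs:
        if cur and cur_tokens + len(s) > max_tokens_per_batch:
            batches.append(cur)
            cur, cur_tokens = [], 0
        cur.append(s)
        cur_tokens += len(s)
    if cur:
        batches.append(cur)
    return batches
```

Both keep the sequence-parallel size fixed; they differ only in how sequences are grouped, which is exactly where the imbalance and communication trade-offs discussed below come from.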
However, bucket parallel struggles with workload imbalance at small batch sizes and communication inefficiency for shorter sequences, while packed parallel, despite better load balancing, incurs unnecessary communication overhead and requires complex implementation changes. Because of their fixed settings and lack of data awareness, both approaches fail to address two critical challenges in variable-length sequence training: the sequence parallelism that long sequences require, and the unbalanced workloads that diverse sequence sizes create.
To address these challenges, we propose Data-Centric Parallel (DCP), an approach that brings data awareness to parallel training. Instead of fixing all configurations throughout training, our method adaptively changes the parallel and other settings driven by the incoming data, significantly improving computational efficiency and reducing communication overhead.
As shown in Figure 2, our method consists of two main stages: data profiling and data-centric runtime. In the data profiling stage, we classify sequences by size and determine optimal settings for each class through a fast dual-layer profile. At runtime, we first enqueue sequences into the batch until it is full, and then dynamically balance each sequence using the two strategies described below.
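The runtime stage described above can be sketched minimally as follows. The data structures are assumptions for illustration: `profile` maps a sequence-size class to its profiled best setting (sequence-parallel size, checkpointing ratio), and the class boundaries are hypothetical values:

```python
def size_class(seq_len, boundaries=(1024, 4096, 16384)):
    """Classify a sequence by length into a profiled size bucket."""
    for i, b in enumerate(boundaries):
        if seq_len <= b:
            return i
    return len(boundaries)

def plan_batch(seq_lens, profile, token_budget):
    """Enqueue sequences until the batch is full, then attach the
    per-class settings found in the profiling stage."""
    batch, used = [], 0
    for n in seq_lens:
        if batch and used + n > token_budget:
            break  # batch is full; remaining sequences wait
        sp_size, ckpt_ratio = profile[size_class(n)]
        batch.append({"len": n, "sp_size": sp_size, "ckpt": ckpt_ratio})
        used += n
    return batch
```

The key design point is that the settings travel with the data: each sequence enters the batch already annotated with the parallel size and checkpointing ratio its size class was profiled with.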
As illustrated in Figure 3, sequence sizes vary widely in variable-length sequence training, with short sequences using less memory and long sequences using more. Selective activation checkpointing is limited by the longer sequences, which require multi-GPU or even inter-node sequence parallelism if recomputation is to be reduced.
Based on DCP, we propose Pyramid Activation Checkpointing (PAC), which adaptively applies different checkpointing ratios based on sequence size. This approach significantly improves throughput for shorter sequences through less recomputation, while keeping the communication overhead of longer sequences unchanged; the benefit is largest in datasets where shorter sequences dominate.
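The pyramid idea reduces to a monotone mapping from sequence length to recomputation ratio. The sketch below uses illustrative thresholds and ratios, not the paper's profiled values:

```python
def pac_ratio(seq_len, levels=((2048, 0.0), (8192, 0.25), (32768, 0.5))):
    """Return the fraction of layers to recompute for a given length:
    short sequences recompute nothing, longer ones progressively more."""
    for max_len, ratio in levels:
        if seq_len <= max_len:
            return ratio
    return 1.0  # full activation checkpointing for the longest sequences
```

Because the ratio only ever increases with length, memory stays bounded for long sequences while short sequences, which are the majority in many datasets, skip most recomputation.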
Optimizing variable-length sequence training is challenging because it involves balancing multiple factors rather than communication alone. We introduce the ICC (Imbalance-Communication-Computation) metric, which captures the three fundamental factors that determine overall system performance: workload imbalance, communication overhead, and computation degradation. It can be formally expressed as:
$$ ICC(\lambda) = {\eta_{imb}(\lambda)}^{\alpha} \cdot {\eta_{comm}(\lambda)}^{\beta} \cdot {\eta_{comp}(\lambda)}^{\gamma} $$ $$ \eta_{imb}(\lambda) = \frac{T_{total}(\lambda)}{T_{idle}(\lambda)},\ \ \eta_{comm}(\lambda) = \frac{T_{comp}(\lambda)}{T_{comm}(\lambda)},\ \ \eta_{comp}(\lambda) = \frac{T_{optimal}(\lambda)}{T_{degrad}(\lambda)} $$
where \( \lambda \) is the sequence distribution, \( T_{total} \) is the total time, \( T_{idle} \) is the idle time, \( T_{comp} \) is the computation time, \( T_{comm} \) is the communication time, \( T_{optimal} \) is the optimal computation time, and \( T_{degrad} \) is the degraded computation time.
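The formula above transcribes directly into code. This is a term-by-term implementation of the ICC expression, with the exponents \( \alpha, \beta, \gamma \) exposed as weights:

```python
def icc(t_total, t_idle, t_comp, t_comm, t_optimal, t_degrad,
        alpha=1.0, beta=1.0, gamma=1.0):
    """ICC = eta_imb^alpha * eta_comm^beta * eta_comp^gamma,
    where each eta follows the definitions in the text."""
    eta_imb = t_total / t_idle        # workload imbalance term
    eta_comm = t_comp / t_comm        # communication overhead term
    eta_comp = t_optimal / t_degrad   # computation degradation term
    return (eta_imb ** alpha) * (eta_comm ** beta) * (eta_comp ** gamma)
```

Each ratio grows as its corresponding overhead shrinks, so configurations with higher ICC are preferred.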
Based on the ICC metric, we propose a roofline performance model to characterize and analyze the training performance of variable-length sequences, as illustrated in Figure 4. By optimizing ICC, we can achieve the best performance for each method.
In practice, we use simple heuristic algorithms to approximate the best solution, achieving nearly the same performance improvements.
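One heuristic of the kind mentioned above is a direct search over a small set of candidate parallel sizes, scoring each with the ICC metric. The `score` function here is a hypothetical stand-in for the profiled cost model, which the paper does not specify:

```python
def best_parallel_size(seq_len, score, candidates=(1, 2, 4, 8)):
    """Pick the candidate sequence-parallel size with the highest score,
    e.g. the ICC metric evaluated under a profiled cost model."""
    return max(candidates, key=lambda sp: score(seq_len, sp))
```

Since the candidate set is tiny (parallel sizes are powers of two bounded by the node size), exhaustively scoring it is cheap enough to run per size class at profiling time.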
We conduct experiments on NVIDIA H100 GPUs connected by NVLink and InfiniBand. The evaluation is performed on Open-Sora v1.2 (1.1B). The sequence distribution follows common video sizes and variations. We use the bucket parallel method from Open-Sora as the baseline.
We measure the throughput of DCP across different sequence-length distributions on 32 NVIDIA H100 GPUs. The results demonstrate significant performance improvements: when short sequences are predominant, DCP achieves speedups of up to 2.48x. Notably, even in scenarios dominated by long sequences, DCP maintains substantial gains, with speedups of 1.28x.
Figure 6 illustrates how the imbalance ratio changes as the number of GPUs increases. The baseline method shows faster growth in imbalance, reaching 16.4% when scaled to 32 GPUs. In contrast, DCP maintains the lowest imbalance ratio and exhibits slower growth as the GPU count increases.
Figure 7 illustrates the optimal parallel size and performance improvements across different sequence types. As expected, the optimal parallel size correlates strongly with total sequence length. Notably, while one might assume smaller parallel sizes are always optimal, larger parallel sizes can sometimes improve computational efficiency and achieve better workload balance.
Figure 8 shows the speedup of Pyramid Activation Checkpointing. Our results show that it achieves significant speedup across most sequence lengths, with particularly notable performance gains for shorter sequences, which dominate most datasets.
@misc{zhang2024dcp,
title={Training Variable Sequences with Data-Centric Parallel},
author={Geng Zhang and Xuanlei Zhao and Kai Wang and Yang You},
year={2024},
}