Diffusion Transformers (DiTs) have opened a new era of video generation [1][2][3]. Despite these advances, existing models still have difficulty capturing crucial details. Video enhancement is an intuitive remedy, with two goals: (1) maintaining consistency across frames; (2) improving visual quality.
Temporal attention plays a crucial role in ensuring consistency among frames, which in turn helps preserve details. To better understand its effect, we visualize the temporal attention patterns across various blocks.
Our visualizations reveal a key observation: cross-frame attention weights (the non-diagonal entries of the temporal attention map) are significantly lower than those along the diagonal, which may lead to inconsistency among frames. Could we improve video quality by making better use of temporal attention?
Consistency among frames is analogous to consistency among tokens in LLMs. In LLMs, a temperature parameter (tau) applied before the softmax controls the attention distribution, balancing focused and diverse token selection [1][2][3].
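To make the role of tau concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The function name `softmax_with_temperature` and the example logits are illustrative assumptions, not taken from any particular model.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits / tau, where tau is the temperature."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-softmax scores of one frame toward three frames.
logits = [4.0, 1.0, 0.5]

sharp = softmax_with_temperature(logits, tau=0.5)  # low temperature: focused
flat = softmax_with_temperature(logits, tau=2.0)   # high temperature: more spread out
```

With the lower temperature, almost all of the weight falls on the largest logit; with the higher temperature, the smaller logits receive a noticeably larger share.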
Inspired by the above analysis, we are the first to find that the temperature of temporal attention determines the intensity of cross-frame correlation, with higher values letting each frame attend to a broader temporal context. We therefore adjust the temporal attention outputs as a training-free enhancement that can be applied directly to existing video models.
Specifically, we design an Enhance Block as a parallel branch. This branch computes the average of the non-diagonal elements of the temporal attention map as the cross-frame intensity (CFI). An enhance temperature parameter is then multiplied by the CFI, and the product is used to enhance the temporal attention output.
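The CFI computation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the attention map is a plain square list of lists (frames by frames), the names `cross_frame_intensity` and `enhance_output` are hypothetical, and the product of the temperature and the CFI is assumed to act as a scalar multiplier on the output.

```python
def cross_frame_intensity(attn):
    """Mean of the non-diagonal entries of a square temporal attention map."""
    n = len(attn)
    if n < 2 or any(len(row) != n for row in attn):
        raise ValueError("expected a square map with at least two frames")
    off_diag = [attn[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(off_diag) / len(off_diag)

def enhance_output(output, attn, enhance_temperature):
    """Scale the attention output by enhance_temperature * CFI.

    Assumption: the product is applied as a plain scalar multiplier.
    """
    scale = enhance_temperature * cross_frame_intensity(attn)
    return [scale * v for v in output]

# Toy 3-frame attention map with strong diagonal (intra-frame) weights.
attn = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
cfi = cross_frame_intensity(attn)
enhanced = enhance_output([1.0, 2.0], attn, enhance_temperature=15.0)
```

In this toy map every non-diagonal weight is 0.1, so the CFI is about 0.1 and a temperature of 15 scales the output by roughly 1.5.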
The experimental results reveal significant improvements in video quality across all tested models. In the HunyuanVideo results, for example, the enhanced version shows superior contrast and clarity, most noticeably in the more realistic wheels and charging station.
Increasing the temperature leads to more details and creativity. However, extremely high temperatures produce illogical content and video distortion.
@misc{luo2024Enhance-A-Video,
title={Enhance-A-Video: Better Generated Video for Free},
author={Yang Luo and Xuanlei Zhao and Mengzhao Chen and Kaipeng Zhang and Wenqi Shao and Kai Wang and Zhangyang Wang and Yang You},
year={2024},
}