Enhance-A-Video: Better Generated Video for Free

Yang Luo¹, Xuanlei Zhao¹, Mengzhao Chen², Kaipeng Zhang², Wenqi Shao²†, Kai Wang¹†, Zhangyang Wang³, Yang You¹

¹National University of Singapore ²Shanghai AI Lab ³The University of Texas at Austin

(† indicates corresponding author)

Improving Your Generated Video for Free!

Enhance-A-Video

Diffusion Transformers (DiTs) have opened a new era of video generation [1][2][3]. Despite these advances, existing models still have difficulty capturing crucial details. Video enhancement is an intuitive remedy, with two goals: (1) maintaining consistency and (2) improving visual quality.

Temporal attention plays a crucial role in ensuring consistency among frames, which in turn helps preserve details. To better understand its effect, we visualize the temporal attention patterns across several blocks.

Temporal attention maps for layers 2, 14, and 26.

Our visualizations reveal a key observation: attention weights among frames (non-diagonal) are significantly lower than those along the diagonal, which may lead to inconsistency among frames. Could we improve video quality by utilizing temporal attention?
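The observation above can be checked numerically on any square temporal attention map by comparing the mean diagonal weight with the mean non-diagonal weight. The sketch below is only illustrative: the function name `diag_offdiag_means` and the 3-frame map are made up for this example and do not come from any model, and plain Python lists are used to keep it dependency-free.

```python
def diag_offdiag_means(attn):
    """Return (mean diagonal weight, mean non-diagonal weight) of a square map.

    attn[i][j] is the weight frame i places on frame j.
    """
    n = len(attn)
    if n < 2 or any(len(row) != n for row in attn):
        raise ValueError("expected a square map with at least 2 frames")
    diag_sum = sum(attn[i][i] for i in range(n))
    off_sum = sum(attn[i][j] for i in range(n) for j in range(n) if i != j)
    return diag_sum / n, off_sum / (n * (n - 1))


# Hypothetical 3-frame attention map; each row sums to 1.
toy_attn = [
    [0.75, 0.125, 0.125],
    [0.125, 0.75, 0.125],
    [0.125, 0.125, 0.75],
]

diag_mean, offdiag_mean = diag_offdiag_means(toy_attn)
print(diag_mean, offdiag_mean)  # 0.75 0.125
```

In this toy map each frame attends mostly to itself, which mirrors the diagonal-dominant pattern described above.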

The consistency among frames is similar to that among tokens in LLMs. In LLMs, a temperature parameter (τ) applied before the softmax controls how sharp the resulting distribution is, balancing focused against diverse token selection [4][5][6].
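To make the role of the temperature concrete, here is a minimal sketch of a softmax with a pre-softmax temperature. It is a generic illustration, not code from any particular LLM or video model; the function name `softmax_with_temperature` and the example scores are assumptions made for this example.

```python
import math


def softmax_with_temperature(scores, tau=1.0):
    """Softmax over scores / tau; a larger tau gives a flatter distribution."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [s / tau for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores: the first entry dominates.
scores = [4.0, 1.0, 1.0, 1.0]
sharp = softmax_with_temperature(scores, tau=1.0)
flat = softmax_with_temperature(scores, tau=4.0)

# With the higher temperature, weight moves from the dominant entry to the others.
print(sharp[0] > flat[0], flat[1] > sharp[1])  # True True
```

Raising τ spreads probability mass over more entries, which is the behavior the analogy to cross-frame attention relies on.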

Inspired by this analysis, we are the first to find that the temperature of temporal attention determines the intensity of cross-frame correlation, with higher values letting each frame attend to a broader temporal context. We therefore adjust the temporal attention outputs as a training-free enhancement that can be applied directly to existing video models.

Specifically, we design an Enhance Block as a parallel branch. This branch computes the average of the non-diagonal elements of the temporal attention map as the cross-frame intensity (CFI). The CFI is then multiplied by an enhanced temperature parameter, and the product is used to enhance the temporal attention output.
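The Enhance Block described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the released implementation: the names `cross_frame_intensity` and `enhance_output` are hypothetical, the product of temperature and CFI is applied as a plain multiplicative factor on the output, and the clamp to a minimum of 1 (so the branch never attenuates the output) is an assumption added for this example.

```python
def cross_frame_intensity(attn):
    """Mean of the non-diagonal entries of a square temporal attention map."""
    n = len(attn)
    if n < 2 or any(len(row) != n for row in attn):
        raise ValueError("expected a square map with at least 2 frames")
    off_sum = sum(attn[i][j] for i in range(n) for j in range(n) if i != j)
    return off_sum / (n * (n - 1))


def enhance_output(attn_output, attn, enhance_temperature):
    """Scale the temporal attention output by enhance_temperature * CFI."""
    cfi = cross_frame_intensity(attn)
    # Assumption for this sketch: never scale the output below its original value.
    factor = max(1.0, enhance_temperature * cfi)
    return [[factor * value for value in row] for row in attn_output]


# Hypothetical 3-frame attention map (rows sum to 1) and a toy attention output
# with one 2-dimensional feature vector per frame.
toy_attn = [
    [0.75, 0.125, 0.125],
    [0.125, 0.75, 0.125],
    [0.125, 0.125, 0.75],
]
toy_output = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

enhanced = enhance_output(toy_output, toy_attn, enhance_temperature=16.0)
print(enhanced)  # [[2.0, 4.0], [6.0, 8.0], [10.0, 12.0]]
```

Here the CFI is 0.125, so a temperature of 16 doubles the output, while a temperature of 4 would give a product of 0.5 and leave the output unchanged because of the assumed clamp. Since the branch only reads the attention map and rescales the output, a sketch like this needs no retraining of the underlying model.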

Evaluations

Qualitative Results

HunyuanVideo

CogVideoX-2B

Open-Sora v1.2

The experimental results reveal significant improvements across all tested models. In the HunyuanVideo results, the enhanced version shows better contrast and clarity, most noticeably in the more realistic wheels and charging station.

Temperature Analysis

Increasing the temperature yields more detail and creativity. However, extremely high temperatures lead to illogical content and video distortion.

Related works

Brooks, Tim, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang and Aditya Ramesh. “Video generation models as world simulators.” OpenAI Research (2024).

Kong, Weijie, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jia-Liang Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fan Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Peng-Yu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhen Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yang-Dan Tao, Qinglin Lu, Songtao Liu, Daquan Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang and Caesar Zhong. “HunyuanVideo: A Systematic Framework For Large Video Generative Models.” (2024).

Yang, Zhuoyi, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong and Jie Tang. “CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer.” ArXiv abs/2408.06072 (2024).

Wang, Pei-Hsin, Sheng-Iou Hsieh, Shih-Chieh Chang, Yu-Ting Chen, Jia-Yu Pan, Wei Wei and Da-Chang Juan. “Contextual Temperature for Language Modeling.” ArXiv abs/2012.13575 (2019).

Wang, Chi, Susan Liu and Ahmed Hassan Awadallah. “Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference.” ArXiv abs/2303.04673 (2023).

Renze, Matthew and Erhan Guven. “The Effect of Sampling Temperature on Problem Solving in Large Language Models.” ArXiv abs/2402.05201 (2024).