FRAG: Frequency Adapting Group for Diffusion Video Editing

Sunjae Yoon
Gwanhyeong Koo
Geonwoo Kim
Chang D. Yoo
Korea Advanced Institute of Science and Technology (KAIST)
ICML 2024
[Paper]
[Code]


Abstract

In video editing, the hallmark of a quality edit lies in its consistent and unobtrusive adjustment. A modification, once integrated, must be smooth and subtle, preserving the natural flow and aligning seamlessly with the original vision. Our primary focus is therefore on overcoming the current challenges in high-quality editing, so that each edit enhances the final product without disrupting its intended essence. However, quality deterioration such as blurring and flickering is routinely observed in recent diffusion video editing systems. We confirm that this deterioration often stems from high-frequency leak: the diffusion model fails to accurately synthesize high-frequency components during the denoising process. To this end, we devise the Frequency Adapting Group (FRAG), which enhances video quality in terms of consistency and fidelity by introducing a novel receptive field branch that preserves high-frequency components during the denoising process. FRAG operates in a model-agnostic manner without additional training, and we validate its effectiveness on video editing benchmarks (i.e., TGVE, DAVIS).


Illustration of video quality deterioration, divided into two distinct categories: (a) content blur and (b) content flicker. For comparison, we present our results in (c).



Observations and Proposed Approach


(a) shows experimental observations of latent noise reconstruction at low and high frequencies: high-frequency components are synthesized later in the denoising process than low-frequency components. (b) illustrates the denoising used in previous video diffusion methods, and (c) illustrates our proposed denoising with the receptive field branch, which uses the Frequency Adapting Group (FRAG) to enhance editing quality.
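As a rough illustration of the low/high-frequency split referred to in (a), the Python sketch below separates a single 2D latent channel into low- and high-frequency parts with an ideal (hard-edged) FFT mask and reports the share of energy in the high band. It is not the authors' implementation: the function names, the `cutoff` value, and the choice of NumPy are assumptions made only for illustration.

```python
import numpy as np


def split_frequencies(latent, cutoff=0.25):
    """Split a 2D array into low- and high-frequency parts.

    `cutoff` is a radius in normalized frequency (cycles per sample);
    it is a hypothetical parameter, not a value taken from the paper.
    """
    height, width = latent.shape
    # Normalized frequency of every FFT bin along each axis.
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(latent)
    # The mask is symmetric in frequency, so the inverse transforms are
    # real up to floating-point error; `.real` drops that residue.
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


def high_frequency_ratio(latent, cutoff=0.25):
    """Fraction of the signal energy that lies above `cutoff`."""
    _, high = split_frequencies(latent, cutoff)
    total = float(np.sum(latent ** 2))
    if total == 0.0:
        return 0.0
    return float(np.sum(high ** 2)) / total
```

Under these assumptions, evaluating `high_frequency_ratio` on intermediate latents at successive denoising steps is one simple way to check when high-frequency content appears relative to low-frequency content.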


Frequency Adapting Group (FRAG)


Illustration of the Frequency Adapting Group (FRAG). FRAG takes the latent noise \( z_t \) at step \( t \) and produces a receptive field \( g_t \), referred to as the temporal group. \( g_t \) guides the denoising UNet to adaptively synthesize frequency components according to the frequency variations of the latent noise during the denoising process. FRAG consists of (a) frequency adaptive refinement, which enhances the visual quality of attributes within the latent noise, and (b) temporal grouping, which clusters latent noise frames to build \( g_t \).
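The caption above does not spell out how latent frames are clustered, so the following Python sketch shows only one simple way such a grouping could look: consecutive frames are merged into the same group while their cosine similarity to the previous frame stays above a threshold. The function names, the similarity measure, and the `threshold` value are all assumptions for illustration and should not be read as the procedure used in FRAG.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened frames (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_consecutive_frames(frames, threshold=0.9):
    """Group frame indices into runs of mutually similar neighbours.

    A frame joins the current group when it is similar enough to the
    frame just before it; otherwise it starts a new group.
    `threshold` is a hypothetical value, not one from the paper.
    """
    groups = []
    for index, frame in enumerate(frames):
        if groups:
            previous = frames[groups[-1][-1]]
            if cosine_similarity(previous, frame) >= threshold:
                groups[-1].append(index)
                continue
        groups.append([index])
    return groups
```

In this sketch each returned group is a list of frame indices; a group of this kind plays the role of a temporal receptive field in the sense that the frames in it would be processed together.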


Qualitative Results



Paper and Supplementary Material

Sunjae Yoon, Gwanhyeong Koo, Geonwoo Kim and Chang D. Yoo
FRAG: Frequency Adapting Group for Diffusion Video Editing
International Conference on Machine Learning (ICML) 2024
(hosted on arXiv)


[Bibtex]


Acknowledgements

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-01381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments) and partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).