FreeAnimate

Network Architecture

Overview of FreeAnimate. We introduce a network architecture that combines a U-Net with ControlNet, offering an efficient way to incorporate the content and structure of the reference image into the U-Net without relying on a CLIP image encoder or an Appearance Net. Specifically, in addition to the base Stable Diffusion (SD) model, our method has three key components: the Control Branch, Inversion-Boosted Attention, and Reference-Anchored Self-Attention.

Control Branch

The Control Branch is responsible for effectively incorporating pose guidance into the generation process. It leverages a pre-trained ControlNet to encode the pose sequence and injects this information into the U-Net. Unlike traditional methods that require additional training or fine-tuning, our approach utilizes ControlNet as a plug-and-play module, enabling the model to directly align generated frames with the provided pose sequence. This setup avoids the need for large-scale training and ensures that the generated video follows the desired pose sequence while maintaining computational efficiency.
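As a rough illustration of what "injecting" pose information can look like, the toy sketch below adds ControlNet-style residuals to the skip and middle features of a frozen U-Net. The function name `inject_control_residuals`, the `scale` parameter, and the tensor shapes are assumptions made for this example; they are not taken from the FreeAnimate code.

```python
import numpy as np

def inject_control_residuals(skip_feats, mid_feat, ctrl_skip_res, ctrl_mid_res, scale=1.0):
    """Add pose-conditioned residuals to frozen U-Net features (toy sketch).

    skip_feats / ctrl_skip_res: lists of arrays with matching shapes.
    mid_feat / ctrl_mid_res: arrays with matching shapes.
    scale: hypothetical conditioning strength.
    """
    if len(skip_feats) != len(ctrl_skip_res):
        raise ValueError("one control residual is expected per skip feature")
    new_skips = [f + scale * r for f, r in zip(skip_feats, ctrl_skip_res)]
    new_mid = mid_feat + scale * ctrl_mid_res
    return new_skips, new_mid

# Toy features standing in for U-Net activations and ControlNet outputs.
rng = np.random.default_rng(0)
skips = [rng.normal(size=(4, 8, 8)), rng.normal(size=(8, 4, 4))]
mid = rng.normal(size=(8, 4, 4))
ctrl_skips = [np.ones_like(f) for f in skips]
ctrl_mid = np.ones_like(mid)

out_skips, out_mid = inject_control_residuals(skips, mid, ctrl_skips, ctrl_mid, scale=0.5)
```

Because the U-Net weights stay untouched and only additive residuals change, the control branch can be attached or detached without retraining, which is the plug-and-play property described above.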

Inversion-Boosted Attention

Inversion-Boosted Attention (IBA) improves temporal consistency by leveraging attention maps extracted from preview frames during DDIM inversion. These maps refine both self- and cross-attention, enhancing video coherence and structural stability.

Inspired by FateZero, IBA uses DDIM inversion attention maps to guide denoising. At each step \( t \), we store the self-attention maps \( \left[\textit{s}_{t}^{\text{pre}}\right]_{t=1}^{T} \) and cross-attention maps \( \left[\textit{c}_{t}^{\text{pre}}\right]_{t=1}^{T} \) as follows:

\[ z_{T},\left[\textit{s}_{t}^{\text{pre}}\right]_{t=1}^{T},\left[\textit{c}_{t}^{\text{pre}}\right]_{t=1}^{T}= \operatorname{DDIM-INV}\left(z_{0}\right), \]

where \( \operatorname{DDIM-INV} \) represents the DDIM inversion process. During denoising, these attention maps refine the attention computation:

\[ \text{SELF-ATT}= \operatorname{Softmax} \left(\frac{Q_{s}^{\text{pre}} {K_{s}^{\text{pre}}}^T}{\sqrt{d}}\right) \cdot V_{s} =\textit{s}_{t}^{\text{pre}} \cdot V_{s}, \]

\[ \text{CROSS-ATT}= \operatorname{Softmax} \left(\frac{Q_{c}^{\text{pre}} {K_{c}^{\text{pre}}}^T}{\sqrt{d}}\right) \cdot V_{c} =\textit{c}_{t}^{\text{pre}} \cdot V_{c}. \]

Here, \( Q_{s}^{\text{pre}}, K_{s}^{\text{pre}}, V_{s} \) and \( Q_{c}^{\text{pre}}, K_{c}^{\text{pre}}, V_{c} \) denote the query, key, and value projections for self- and cross-attention, respectively, with \( d \) being the attention feature dimension.
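The two equations above amount to caching a softmax attention map during inversion and multiplying it with freshly computed values during denoising. The snippet below is a minimal NumPy sketch of that idea for a single attention layer and a single step; the names `attention_map` and `stored_maps`, the token count, and the feature size are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(q, k):
    """Softmax(Q K^T / sqrt(d)) for token matrices of shape (n, d)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))

rng = np.random.default_rng(0)
n_tokens, d = 6, 4

# Inversion pass: queries and keys come from the preview frame.
q_pre = rng.normal(size=(n_tokens, d))
k_pre = rng.normal(size=(n_tokens, d))
stored_maps = {}            # hypothetical cache keyed by step t
t = 10
stored_maps[t] = attention_map(q_pre, k_pre)

# Denoising pass: reuse the cached map, but use the current values.
v_cur = rng.normal(size=(n_tokens, d))
out = stored_maps[t] @ v_cur
```

In a full model one such map would be cached per layer, per head, and per step, so memory grows with all three; the sketch ignores this bookkeeping.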

By leveraging precomputed self- and cross-attention maps, IBA helps preserve motion integrity and spatial structure while reducing artifacts. These maps act as guidance signals, improving both temporal coherence and structural alignment during generation.
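For readers unfamiliar with the inversion pass itself, here is a hedged sketch of deterministic DDIM inversion that also collects the per-step attention maps. The noise predictor `toy_eps_model` is a stand-in invented for this example (a real pipeline would call the U-Net and read the maps from its attention layers), and the noise schedule is arbitrary.

```python
import numpy as np

def ddim_invert(z0, alphas_cumprod, eps_model):
    """Deterministic DDIM inversion from z0 towards noise.

    alphas_cumprod[0] is the least noisy level, alphas_cumprod[-1] the noisiest.
    eps_model(z, t) must return (predicted_noise, self_attn_map, cross_attn_map).
    """
    z = z0
    self_maps, cross_maps = [], []
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps, s_map, c_map = eps_model(z, t)
        self_maps.append(s_map)
        cross_maps.append(c_map)
        # Predict the clean latent, then re-noise it to the next level.
        x0_pred = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return z, self_maps, cross_maps

def toy_eps_model(z, t):
    """Hypothetical noise predictor returning uniform attention maps."""
    n = z.shape[0]
    eps = 0.1 * z
    s_map = np.full((n, n), 1.0 / n)   # self-attention: tokens x tokens
    c_map = np.full((n, 3), 1.0 / 3)   # cross-attention: tokens x 3 prompt tokens
    return eps, s_map, c_map

rng = np.random.default_rng(0)
z0 = rng.normal(size=(6, 4))
alphas = np.linspace(1.0, 0.1, 5)      # arbitrary schedule with T = 4 steps
z_T, self_maps, cross_maps = ddim_invert(z0, alphas, toy_eps_model)
```

The returned triple mirrors the left-hand side of the inversion equation: the final latent plus one self-attention and one cross-attention map per step.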

Reference-Anchored Self-Attention

Reference-Anchored Self-Attention (RA-SA) improves temporal consistency by anchoring frames to a reference image. By integrating both the current and reference latents, RA-SA enhances identity preservation throughout the video.

Specifically, \( \text{SELF-ATTENTION}(Q, K, V) \) for the latent code \( z^{i}_{t} \) of frame \( i \) at time step \( t \) is computed as:

\[ Q = W^{Q} z^{i}_{t}, \quad K = W^{K}\left[z^{i}_{t} ; z^{a}_{t}\right], \quad V = W^{V}\left[z^{i}_{t} ; z^{a}_{t}\right], \]

where \( W^Q \), \( W^K \), and \( W^V \) are projection matrices from the U-Net, and \( \left[\cdot\,;\cdot\right] \) denotes concatenation. \( z^{i}_{t} \) and \( z^{a}_{t} \) represent the latents of the current and anchor frames, respectively. While **Pixel2Video** and **FateZero** set \( a \) to \( 1 \) and \( \left\lfloor \frac{N}{2} \right\rfloor \), respectively, we use the reference image \( I_{\text{ref}} \) as the anchor frame for improved alignment.
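The projection above can be sketched in a few lines of NumPy. This is an illustrative single-head version under assumed shapes; the function name `reference_anchored_self_attention` and the random weights are placeholders, and tokens are stored as rows, so the projections are written `z @ W` rather than `W z`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_anchored_self_attention(z_cur, z_ref, w_q, w_k, w_v):
    """Queries from the current frame; keys and values from [current ; anchor].

    z_cur, z_ref: (hw, c) token matrices. Returns (output, attention_map).
    """
    q = z_cur @ w_q                                  # (hw, c)
    kv_in = np.concatenate([z_cur, z_ref], axis=0)   # (2hw, c)
    k = kv_in @ w_k
    v = kv_in @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (hw, 2hw)
    return attn @ v, attn

rng = np.random.default_rng(0)
hw, c = 4, 8                                         # e.g. a 2x2 latent with 8 channels
z_cur = rng.normal(size=(hw, c))
z_ref = rng.normal(size=(hw, c))
w_q, w_k, w_v = (rng.normal(size=(c, c)) for _ in range(3))

out, attn = reference_anchored_self_attention(z_cur, z_ref, w_q, w_k, w_v)
```

Each query row now distributes its attention over twice as many tokens, half of which belong to the anchor, which is how appearance from the reference can flow into every frame.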

During DDIM inversion, RA-SA replaces standard self-attention, which changes the shape of the self-attention map from \( \mathbb{R}^{hw \times hw} \) to \( \mathbb{R}^{hw \times 2hw} \). During denoising, the query and key features come from the attention maps stored during DDIM inversion, while the value features are computed on the fly from the current and reference latents:

\[ \text{SELF-ATT} = \textit{s}_{t}^{\text{pre}} \cdot V, \quad V = W^{V}\left[z^{i}_{t} ; z^{a}_{t}\right], \quad \textit{s}_{t}^{\text{pre}} \in \mathbb{R}^{hw \times 2hw}. \]
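Read together with the description above, the denoising step can be interpreted as applying a stored \( hw \times 2hw \) map to values built from the concatenated current and reference latents. The sketch below follows that interpretation; it is an assumption about how the pieces fit, not the authors' implementation, and `ra_sa_denoise` is a hypothetical name.

```python
import numpy as np

def ra_sa_denoise(stored_map, z_cur, z_ref, w_v):
    """Apply an inversion-time attention map to freshly computed values.

    stored_map: (hw, 2hw) map cached during DDIM inversion.
    z_cur, z_ref: (hw, c) latents of the current frame and the anchor.
    """
    kv_in = np.concatenate([z_cur, z_ref], axis=0)   # (2hw, c)
    if stored_map.shape[1] != kv_in.shape[0]:
        raise ValueError("stored map width must equal the number of value tokens")
    v = kv_in @ w_v
    return stored_map @ v

rng = np.random.default_rng(0)
hw, c = 4, 8
z_cur = rng.normal(size=(hw, c))
z_ref = rng.normal(size=(hw, c))
w_v = rng.normal(size=(c, c))

# A uniform map stands in for a cached inversion map in this toy example.
stored_map = np.full((hw, 2 * hw), 1.0 / (2 * hw))
out = ra_sa_denoise(stored_map, z_cur, z_ref, w_v)
```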

FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising
