Animate Your Motion: Turning Still Images into Dynamic Videos

Mingxiao Li, Bo Wan, Marie-Francine Moens, Tinne Tuytelaars
KU Leuven
*Indicates Equal Contribution

Abstract

In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs so that they more accurately reflect user intentions. Traditional efforts predominantly focus on employing either semantic cues, such as images or depth maps, or motion-based conditions, such as moving sketches or object bounding boxes. Semantic inputs offer rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as demonstrated in Fig. 1. To this end, we introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions, promoting synergy between the different modalities. For model training, we separate the conditions of the two modalities and introduce a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence.
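As a hedged illustration of the two-stage pipeline, the sketch below freezes the pre-trained backbone and toggles which conditioning layers receive gradients in each stage. The module name substrings (object_gated_self_attn, input_conv, image_gated_cross_attn) are assumptions chosen to mirror the components named in Fig. 1, not identifiers from the authors' released code.

# Minimal PyTorch sketch of a two-stage training schedule
# (assumed module names; the actual implementation may differ).
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    # Freeze everything, including the pre-trained text-to-video backbone.
    for param in model.parameters():
        param.requires_grad = False
    # Stage 1: train only the object-gated self-attention (box conditions).
    # Stage 2: train the input convolution and the image-gated
    # cross-attention (image conditions).
    if stage == 1:
        keys = ("object_gated_self_attn",)
    else:
        keys = ("input_conv", "image_gated_cross_attn")
    for name, param in model.named_parameters():
        if any(k in name for k in keys):
            param.requires_grad = True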

Figure 1: Illustration of our proposed model. Designed for conditional video generation, it handles three control signals: images, bounding box sequences, and text. It builds on a pre-trained text-to-video framework enriched with an object-gated self-attention layer, an image-gated cross-attention layer, and a zero-initialized input convolution layer. These additions allow the model to adapt to bounding box and image conditions through a two-stage training process: the first stage trains the object-gated self-attention, and the second trains the input convolution and image-gated cross-attention layers.
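The gating pattern referenced in the caption can be sketched as follows. This is a minimal, hypothetical PyTorch rendering of a gated conditioning layer, assuming the common design in which a learnable scalar gate is initialized to zero so the new layer starts as an identity and leaves the pre-trained backbone's outputs untouched at the beginning of training; it is not the authors' code, and the class and argument names are illustrative.

# Minimal sketch of a zero-initialized gated conditioning layer
# (hypothetical names; shown for the cross-attention case, but the same
# gating applies to the object-gated self-attention variant).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention whose output is scaled by tanh(gate), gate init 0."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-initialized gate: at step 0 the layer contributes nothing,
        # preserving the pre-trained backbone's behavior.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) video tokens; cond: (B, M, dim) condition tokens
        # (e.g., image features here, or box/object tokens in the
        # self-attention variant).
        attn_out, _ = self.attn(self.norm(x), cond, cond)
        return x + torch.tanh(self.gate) * attn_out

The zero-initialized input convolution mentioned in the caption follows the same principle: its weights start at zero so the extra input channels only gradually influence the backbone as training progresses.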

BibTeX

@article{li2024animate,
  title={Animate Your Motion: Turning Still Images into Dynamic Videos},
  author={Li, Mingxiao and Wan, Bo and Moens, Marie-Francine and Tuytelaars, Tinne},
  journal={arXiv preprint arXiv:2403.10179},
  year={2024}
}