EchoMimic Series
EchoMimicV1: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning.
GitHub
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation.
GitHub
EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation.
GitHub
Abstract
EchoMimicV3 makes human animation Faster, Higher in quality, and Stronger in generalization, while bringing various tasks together in one model. EchoMimicV3 is an efficient framework that unifies multi-task and multi-modal human animation. At its core lies a threefold design: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a novel training and inference strategy. The Soup-of-Tasks paradigm leverages multi-task mask inputs and a counter-intuitive task-allocation strategy to achieve multi-task gains without multi-model pains. Meanwhile, the Soup-of-Modals paradigm introduces a Coupled-Decoupled Multi-Modal Cross Attention module to inject multi-modal conditions, complemented by a Timestep Phase-aware Multi-Modal Allocation mechanism that dynamically modulates the multi-modal mixture. In addition, we propose Negative Direct Preference Optimization and Phase-aware Negative Classifier-Free Guidance, which ensure stable training and inference. Extensive experiments and analyses demonstrate that EchoMimicV3, with a minimal model size of 1.3 billion parameters, achieves competitive performance in both quantitative and qualitative evaluations.
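To make the Soup-of-Modals idea concrete, below is a minimal, hypothetical PyTorch sketch of a coupled-decoupled multi-modal cross-attention block with a timestep phase-aware mixture. It is not the released EchoMimicV3 implementation: the module names, tensor shapes, and the linear phase-gating heuristic are all illustrative assumptions.

```python
# Illustrative sketch only; names, shapes, and the phase-gating heuristic
# are assumptions, not the released EchoMimicV3 code.
import torch
import torch.nn as nn


class CoupledDecoupledCrossAttention(nn.Module):
    """One shared (coupled) cross-attention over concatenated conditions,
    plus per-modality (decoupled) cross-attentions whose outputs are mixed
    by timestep phase-aware weights."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.coupled_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def phase_weights(t: torch.Tensor, num_steps: int) -> torch.Tensor:
        # Assumed heuristic: the early (noisy) phase leans on text/layout
        # conditions, the late (refinement) phase leans on audio detail.
        phase = t.float() / max(num_steps - 1, 1)       # 0 = start, 1 = end
        return torch.stack([1.0 - phase, phase], dim=-1)  # (B, 2): [w_text, w_audio]

    def forward(self, x, audio_tokens, text_tokens, t, num_steps):
        # Coupled path: attend once over all conditions jointly.
        cond = torch.cat([audio_tokens, text_tokens], dim=1)
        coupled, _ = self.coupled_attn(x, cond, cond)

        # Decoupled paths: one cross-attention per modality.
        a_out, _ = self.audio_attn(x, audio_tokens, audio_tokens)
        t_out, _ = self.text_attn(x, text_tokens, text_tokens)

        # Timestep phase-aware mixture of the decoupled outputs.
        w = self.phase_weights(t, num_steps)             # (B, 2)
        mixed = w[:, 0, None, None] * t_out + w[:, 1, None, None] * a_out
        return x + coupled + mixed
```

Under these assumptions, the coupled path provides a single joint view of all conditions, while the decoupled paths let each modality's influence be re-weighted per denoising phase without retraining separate single-modal models.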