chunyu-li committed · verified
Commit 23aa716 · 1 Parent(s): b2e6665

Update README.md

Files changed (1): README.md (+65 -4)
README.md CHANGED
@@ -2,11 +2,72 @@
 license: openrail++
 library_name: diffusers
 tags:
-- video-to-video
+- lipsync
+- video editing
+pipeline_tag: video-to-video
 ---
 
-# The checkpoints of LatentSync 1.5
-

Paper: https://arxiv.org/abs/2412.09262

Code: https://github.com/bytedance/LatentSync

# What's new in LatentSync 1.5?

1. Add temporal layer: Our previous claim that the [temporal layer](https://arxiv.org/abs/2307.04725) severely impairs lip-sync accuracy was incorrect; the issue was actually caused by a bug in the code implementation. We have corrected our [paper](https://arxiv.org/abs/2412.09262) and updated the code. After incorporating the temporal layer, LatentSync 1.5 demonstrates significantly improved temporal consistency compared to version 1.0.

2. Improve performance on Chinese videos: many GitHub issues reported poor performance on Chinese videos, so we added Chinese data to the training of the new model version.

3. Reduce the VRAM requirement of the stage2 training to **20 GB** through the following optimizations (minimal sketches of each follow this list):

    1. Implement gradient checkpointing in U-Net, VAE, SyncNet, and VideoMAE.
    2. Replace xFormers with PyTorch's native implementation of FlashAttention-2.
    3. Clear the CUDA cache after loading checkpoints.
    4. The stage2 training only requires training the temporal layer and audio cross-attention layer, which significantly reduces the VRAM requirement compared to the previous full-parameter fine-tuning.

    Now you can train LatentSync on a single **RTX 3090**! Start the stage2 training with `configs/unet/stage2_efficient.yaml`.

4. Other code optimizations:

    1. Remove the dependency on xFormers and Triton.
    2. Upgrade the diffusers version to `0.32.2`.
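
A minimal sketch of optimization 3.1, gradient checkpointing. diffusers U-Nets and VAEs expose a one-line switch; for custom modules such as SyncNet or VideoMAE the same effect comes from `torch.utils.checkpoint`. The `CheckpointedEncoder` wrapper below is illustrative, not the repo's actual class:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def enable_checkpointing(unet, vae) -> None:
    # diffusers models ship with a built-in switch that recomputes
    # activations in the backward pass instead of storing them.
    unet.enable_gradient_checkpointing()
    vae.enable_gradient_checkpointing()

class CheckpointedEncoder(nn.Module):
    """Illustrative wrapper applying checkpointing to a stack of blocks
    (e.g. a SyncNet or VideoMAE encoder)."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside `block` are recomputed during backward,
            # trading extra compute for lower peak VRAM.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```
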
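For optimization 3.2, a hedged sketch of swapping an xFormers call for `F.scaled_dot_product_attention`, which dispatches to FlashAttention-2 on supported GPUs in recent PyTorch releases; the `sdpa_kernel` context manager (PyTorch >= 2.3) is only needed to pin the backend:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Previously: xformers.ops.memory_efficient_attention(q, k, v).
    # Note the layout difference: xFormers takes (batch, seq, heads, dim),
    # while SDPA expects (batch, heads, seq, dim), so a transpose may be
    # needed at the call site.
    return F.scaled_dot_product_attention(q, k, v)

if torch.cuda.is_available():
    q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
    # Restrict dispatch to the FlashAttention kernel to verify it is used.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        out = attention(q, k, v)
```
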
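Optimization 3.3 is small but effective: loading a checkpoint leaves the CPU copy of the weights and cached allocator blocks alive. A sketch, with an illustrative checkpoint path:

```python
import gc
import torch

def load_weights(model: torch.nn.Module, ckpt_path: str) -> None:
    state_dict = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state_dict)
    # Drop the CPU copy and release cached GPU blocks so the freed
    # memory is visible before training allocates activations.
    del state_dict
    gc.collect()
    torch.cuda.empty_cache()
```
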
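And for optimization 3.4, a sketch of freezing everything except the temporal and audio cross-attention layers. The substrings below follow common diffusers/AnimateDiff naming conventions (`motion_modules` for temporal layers, `attn2` for cross-attention) and are assumptions, not the repo's exact parameter names:

```python
import torch.nn as nn

def freeze_for_stage2(unet: nn.Module,
                      trainable_keys=("motion_modules", "attn2")) -> None:
    trainable = frozen = 0
    for name, param in unet.named_parameters():
        # Only parameters whose names match a trainable key keep gradients.
        param.requires_grad = any(key in name for key in trainable_keys)
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    print(f"trainable: {trainable:,} params | frozen: {frozen:,} params")
```
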
## LatentSync 1.5 Demo

<table class="center">
  <tr style="font-weight: bolder; text-align: center;">
    <td width="50%"><b>Original video</b></td>
    <td width="50%"><b>Lip-synced video</b></td>
  </tr>
  <tr>
    <td>
      <video src="https://github.com/user-attachments/assets/b0c8d1da-3fdc-4946-9800-1b2fd0ef9c7f" controls preload></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/25dd1733-44c7-42fe-805a-d612d4bc30e0" controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <video src="https://github.com/user-attachments/assets/4e48e501-64b4-4b4f-a69c-ed18dd987b1f" controls preload></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/e690d91b-9fe5-4323-a60e-2b7f546f01bc" controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <video src="https://github.com/user-attachments/assets/e84e2c13-1deb-41f7-8382-048ba1922b71" controls preload></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/5a5ba09f-590b-4eb3-8dfb-a199d8d1e276" controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <video src="https://github.com/user-attachments/assets/11e4b2b6-64f4-4617-b005-059209fcaea5" controls preload></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/38437475-3c90-4d08-b540-c8e819e93e0d" controls preload></video>
    </td>
  </tr>
</table>