yangxiaoyu6 committed
Commit 2ad1ea3 · 1 Parent(s): f16e6f5
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. exp_md1000/checkpoint-508000.pt +3 -0
  2. exp_md1000/inference_audio_tagging/log-decode-epoch-10-avg-1-use-averaged-model-2023-12-16-11-16-11 +52 -0
  3. exp_md1000/inference_audio_tagging/log-decode-epoch-10-avg-1-use-averaged-model-2023-12-17-23-29-50 +41 -0
  4. exp_md1000/inference_audio_tagging/log-decode-epoch-11-avg-1-use-averaged-model-2023-12-16-11-14-26 +44 -0
  5. exp_md1000/inference_audio_tagging/log-decode-epoch-11-avg-1-use-averaged-model-2023-12-17-23-28-06 +44 -0
  6. exp_md1000/inference_audio_tagging/log-decode-epoch-12-avg-1-use-averaged-model-2023-12-16-11-12-42 +52 -0
  7. exp_md1000/inference_audio_tagging/log-decode-epoch-12-avg-1-use-averaged-model-2023-12-17-23-26-23 +42 -0
  8. exp_md1000/inference_audio_tagging/log-decode-epoch-13-avg-1-use-averaged-model-2023-12-16-11-10-58 +52 -0
  9. exp_md1000/inference_audio_tagging/log-decode-epoch-13-avg-1-use-averaged-model-2023-12-17-23-24-39 +49 -0
  10. exp_md1000/inference_audio_tagging/log-decode-epoch-14-avg-1-use-averaged-model-2023-12-16-11-09-06 +44 -0
  11. exp_md1000/inference_audio_tagging/log-decode-epoch-14-avg-1-use-averaged-model-2023-12-17-23-22-56 +49 -0
  12. exp_md1000/inference_audio_tagging/log-decode-epoch-15-avg-1-use-averaged-model-2023-12-17-23-21-16 +44 -0
  13. exp_md1000/inference_audio_tagging/log-decode-epoch-16-avg-1-use-averaged-model-2023-12-17-23-19-35 +45 -0
  14. exp_md1000/inference_audio_tagging/log-decode-epoch-17-avg-1-use-averaged-model-2023-12-17-23-17-44 +47 -0
  15. exp_md1000/inference_audio_tagging/log-decode-epoch-2-avg-1-use-averaged-model-2023-12-12-22-43-22 +52 -0
  16. exp_md1000/inference_audio_tagging/log-decode-epoch-20-avg-1-use-averaged-model-2023-12-20-11-15-49 +47 -0
  17. exp_md1000/inference_audio_tagging/log-decode-epoch-20-avg-2-use-averaged-model-2023-12-20-11-14-13 +41 -0
  18. exp_md1000/inference_audio_tagging/log-decode-epoch-20-avg-3-use-averaged-model-2023-12-20-11-12-38 +51 -0
  19. exp_md1000/inference_audio_tagging/log-decode-epoch-20-avg-4-use-averaged-model-2023-12-20-11-10-58 +42 -0
  20. exp_md1000/inference_audio_tagging/log-decode-epoch-21-avg-1-use-averaged-model-2023-12-20-11-09-20 +41 -0
  21. exp_md1000/inference_audio_tagging/log-decode-epoch-21-avg-2-use-averaged-model-2023-12-20-11-07-43 +49 -0
  22. exp_md1000/inference_audio_tagging/log-decode-epoch-21-avg-3-use-averaged-model-2023-12-20-11-06-15 +54 -0
  23. exp_md1000/inference_audio_tagging/log-decode-epoch-21-avg-4-use-averaged-model-2023-12-20-11-04-58 +48 -0
  24. exp_md1000/inference_audio_tagging/log-decode-epoch-22-avg-1-use-averaged-model-2023-12-20-11-03-23 +43 -0
  25. exp_md1000/inference_audio_tagging/log-decode-epoch-22-avg-2-use-averaged-model-2023-12-20-11-01-48 +45 -0
  26. exp_md1000/inference_audio_tagging/log-decode-epoch-22-avg-3-use-averaged-model-2023-12-20-11-00-13 +46 -0
  27. exp_md1000/inference_audio_tagging/log-decode-epoch-22-avg-4-use-averaged-model-2023-12-20-10-58-33 +46 -0
  28. exp_md1000/inference_audio_tagging/log-decode-epoch-23-avg-1-use-averaged-model-2023-12-20-10-56-58 +46 -0
  29. exp_md1000/inference_audio_tagging/log-decode-epoch-23-avg-2-use-averaged-model-2023-12-20-10-55-22 +48 -0
  30. exp_md1000/inference_audio_tagging/log-decode-epoch-23-avg-3-use-averaged-model-2023-12-20-10-53-47 +42 -0
  31. exp_md1000/inference_audio_tagging/log-decode-epoch-23-avg-4-use-averaged-model-2023-12-20-10-52-08 +45 -0
  32. exp_md1000/inference_audio_tagging/log-decode-epoch-24-avg-1-use-averaged-model-2023-12-20-10-50-32 +46 -0
  33. exp_md1000/inference_audio_tagging/log-decode-epoch-24-avg-2-use-averaged-model-2023-12-20-10-48-56 +53 -0
  34. exp_md1000/inference_audio_tagging/log-decode-epoch-24-avg-3-use-averaged-model-2023-12-20-10-47-17 +49 -0
  35. exp_md1000/inference_audio_tagging/log-decode-epoch-24-avg-4-use-averaged-model-2023-12-20-10-45-41 +44 -0
  36. exp_md1000/inference_audio_tagging/log-decode-epoch-25-avg-1-use-averaged-model-2023-12-20-10-44-03 +49 -0
  37. exp_md1000/inference_audio_tagging/log-decode-epoch-25-avg-2-use-averaged-model-2023-12-20-10-42-27 +45 -0
  38. exp_md1000/inference_audio_tagging/log-decode-epoch-25-avg-3-use-averaged-model-2023-12-20-10-40-52 +48 -0
  39. exp_md1000/inference_audio_tagging/log-decode-epoch-25-avg-4-use-averaged-model-2023-12-20-10-39-15 +43 -0
  40. exp_md1000/inference_audio_tagging/log-decode-epoch-26-avg-1-use-averaged-model-2023-12-21-10-09-44 +46 -0
  41. exp_md1000/inference_audio_tagging/log-decode-epoch-26-avg-2-use-averaged-model-2023-12-21-10-08-04 +54 -0
  42. exp_md1000/inference_audio_tagging/log-decode-epoch-26-avg-3-use-averaged-model-2023-12-21-10-06-27 +48 -0
  43. exp_md1000/inference_audio_tagging/log-decode-epoch-26-avg-4-use-averaged-model-2023-12-21-10-04-46 +47 -0
  44. exp_md1000/inference_audio_tagging/log-decode-epoch-27-avg-1-use-averaged-model-2023-12-21-10-03-10 +40 -0
  45. exp_md1000/inference_audio_tagging/log-decode-epoch-27-avg-2-use-averaged-model-2023-12-21-10-01-31 +43 -0
  46. exp_md1000/inference_audio_tagging/log-decode-epoch-27-avg-3-use-averaged-model-2023-12-21-09-59-54 +51 -0
  47. exp_md1000/inference_audio_tagging/log-decode-epoch-27-avg-4-use-averaged-model-2023-12-21-09-58-17 +44 -0
  48. exp_md1000/inference_audio_tagging/log-decode-epoch-28-avg-1-use-averaged-model-2023-12-21-09-56-41 +50 -0
  49. exp_md1000/inference_audio_tagging/log-decode-epoch-28-avg-2-use-averaged-model-2023-12-21-09-55-03 +43 -0
  50. exp_md1000/inference_audio_tagging/log-decode-epoch-28-avg-3-use-averaged-model-2023-12-21-09-53-24 +48 -0
exp_md1000/checkpoint-508000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:acf1156c3cee982c5ee5cac2c00c2426e23089ed26799e8562666b4717e31dd0
+ size 1055775720
exp_md1000/inference_audio_tagging/log-decode-epoch-10-avg-1-use-averaged-model-2023-12-16-11-16-11 ADDED
@@ -0,0 +1,52 @@
1
+ 2023-12-16 11:16:11,042 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-16 11:16:11,042 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 10, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-10-avg-1-use-averaged-model'}
3
+ 2023-12-16 11:16:11,042 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-16 11:16:11,380 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 9 (excluded) to 10
5
+ 2023-12-16 11:16:18,495 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-16 11:16:18,495 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-16 11:16:18,534 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-16 11:16:18,860 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-16 11:16:23,534 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-16 11:16:26,912 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-16 11:16:30,010 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-16 11:16:30,160 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5210, 2.6728, 2.4574, 2.1405], device='cuda:0')
13
+ 2023-12-16 11:16:33,239 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-16 11:16:33,495 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2524, 2.2834, 2.4219, 2.5243], device='cuda:0')
15
+ 2023-12-16 11:16:36,487 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-16 11:16:39,916 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-16 11:16:42,482 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7736, 2.9035, 2.7684, 2.3896], device='cuda:0')
18
+ 2023-12-16 11:16:43,175 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-16 11:16:46,464 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-16 11:16:46,721 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4621, 2.1213, 2.4143, 2.9950, 1.6090, 2.1749, 2.9151, 1.7644],
21
+ device='cuda:0')
22
+ 2023-12-16 11:16:49,661 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
23
+ 2023-12-16 11:16:52,903 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-16 11:16:56,121 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-16 11:16:59,309 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
26
+ 2023-12-16 11:16:59,407 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5433, 3.8063, 3.2982, 3.7440], device='cuda:0')
27
+ 2023-12-16 11:17:02,527 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-16 11:17:05,067 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5219, 2.2802, 2.7563, 3.0207, 1.8218, 2.2484, 2.9468, 2.1817],
29
+ device='cuda:0')
30
+ 2023-12-16 11:17:05,732 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
31
+ 2023-12-16 11:17:06,345 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7913, 2.7647, 1.4067, 2.8494, 2.1108, 2.3063, 2.4163, 2.7820],
32
+ device='cuda:0')
33
+ 2023-12-16 11:17:08,911 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
34
+ 2023-12-16 11:17:10,961 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1413, 3.3117, 3.5794, 3.5279], device='cuda:0')
35
+ 2023-12-16 11:17:11,967 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1854, 3.5568, 3.5235, 3.4249], device='cuda:0')
36
+ 2023-12-16 11:17:12,182 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
37
+ 2023-12-16 11:17:15,388 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
38
+ 2023-12-16 11:17:15,500 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7799, 2.4910, 3.3156, 2.6991, 2.4647, 2.6907, 3.3613, 3.2189],
39
+ device='cuda:0')
40
+ 2023-12-16 11:17:18,558 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
41
+ 2023-12-16 11:17:19,122 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1523, 3.3334, 3.5373, 3.2937], device='cuda:0')
42
+ 2023-12-16 11:17:21,693 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
43
+ 2023-12-16 11:17:24,904 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
44
+ 2023-12-16 11:17:28,270 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
45
+ 2023-12-16 11:17:31,538 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
46
+ 2023-12-16 11:17:34,712 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
47
+ 2023-12-16 11:17:37,941 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
48
+ 2023-12-16 11:17:41,130 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
49
+ 2023-12-16 11:17:44,388 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
50
+ 2023-12-16 11:17:44,717 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
51
+ 2023-12-16 11:17:46,093 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45236330139257647
52
+ 2023-12-16 11:17:46,093 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-10-avg-1-use-averaged-model-2023-12-17-23-29-50 ADDED
@@ -0,0 +1,41 @@
1
+ 2023-12-17 23:29:50,417 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-17 23:29:50,417 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 10, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-10-avg-1-use-averaged-model'}
3
+ 2023-12-17 23:29:50,417 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-17 23:29:50,767 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 9 (excluded) to 10
5
+ 2023-12-17 23:30:01,625 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-17 23:30:01,625 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-17 23:30:01,683 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-17 23:30:02,046 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-17 23:30:06,545 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-17 23:30:09,953 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-17 23:30:13,210 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-17 23:30:16,486 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-17 23:30:19,643 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-17 23:30:23,038 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
15
+ 2023-12-17 23:30:26,235 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
16
+ 2023-12-17 23:30:29,438 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
17
+ 2023-12-17 23:30:32,640 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
18
+ 2023-12-17 23:30:36,051 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
19
+ 2023-12-17 23:30:39,256 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
20
+ 2023-12-17 23:30:40,282 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8774, 4.7591, 5.0748, 4.9864], device='cuda:0')
21
+ 2023-12-17 23:30:41,423 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5223, 3.7254, 3.2381, 3.7806], device='cuda:0')
22
+ 2023-12-17 23:30:42,432 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
23
+ 2023-12-17 23:30:45,523 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
24
+ 2023-12-17 23:30:48,701 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
25
+ 2023-12-17 23:30:51,922 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
26
+ 2023-12-17 23:30:55,205 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
27
+ 2023-12-17 23:30:58,392 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
28
+ 2023-12-17 23:31:01,586 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
29
+ 2023-12-17 23:31:04,630 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
30
+ 2023-12-17 23:31:07,819 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
31
+ 2023-12-17 23:31:10,456 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5159, 2.4499, 2.7984, 3.0320, 3.2389, 2.6127, 2.4140, 2.4274],
32
+ device='cuda:0')
33
+ 2023-12-17 23:31:11,125 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
34
+ 2023-12-17 23:31:14,348 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
35
+ 2023-12-17 23:31:17,538 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
36
+ 2023-12-17 23:31:20,806 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
37
+ 2023-12-17 23:31:24,056 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
38
+ 2023-12-17 23:31:27,201 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
39
+ 2023-12-17 23:31:27,580 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
40
+ 2023-12-17 23:31:29,147 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.451453628110838
41
+ 2023-12-17 23:31:29,147 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-11-avg-1-use-averaged-model-2023-12-16-11-14-26 ADDED
@@ -0,0 +1,44 @@
1
+ 2023-12-16 11:14:26,772 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-16 11:14:26,772 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 11, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-11-avg-1-use-averaged-model'}
3
+ 2023-12-16 11:14:26,772 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-16 11:14:27,146 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 10 (excluded) to 11
5
+ 2023-12-16 11:14:38,884 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-16 11:14:38,884 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-16 11:14:38,933 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-16 11:14:39,260 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-16 11:14:44,193 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-16 11:14:47,660 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-16 11:14:51,010 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-16 11:14:54,068 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-16 11:14:55,528 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1801, 3.4630, 3.6947, 3.1291], device='cuda:0')
14
+ 2023-12-16 11:14:57,305 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-16 11:15:00,729 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-16 11:15:03,957 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
17
+ 2023-12-16 11:15:07,134 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
18
+ 2023-12-16 11:15:10,378 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
19
+ 2023-12-16 11:15:13,601 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
20
+ 2023-12-16 11:15:15,099 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0439, 5.9129, 6.0155, 6.0727], device='cuda:0')
21
+ 2023-12-16 11:15:16,803 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
22
+ 2023-12-16 11:15:19,993 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
23
+ 2023-12-16 11:15:23,195 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
24
+ 2023-12-16 11:15:26,384 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
25
+ 2023-12-16 11:15:29,617 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
26
+ 2023-12-16 11:15:32,879 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
27
+ 2023-12-16 11:15:35,926 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
28
+ 2023-12-16 11:15:39,190 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
29
+ 2023-12-16 11:15:42,407 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
30
+ 2023-12-16 11:15:45,627 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
31
+ 2023-12-16 11:15:49,129 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
32
+ 2023-12-16 11:15:52,275 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
33
+ 2023-12-16 11:15:53,051 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7504, 2.8340, 1.4434, 2.7144, 2.3212, 2.2933, 2.3869, 2.6434],
34
+ device='cuda:0')
35
+ 2023-12-16 11:15:55,452 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
36
+ 2023-12-16 11:15:57,422 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8542, 2.5852, 3.2373, 2.6513, 2.5947, 2.8500, 3.3345, 3.1918],
37
+ device='cuda:0')
38
+ 2023-12-16 11:15:58,576 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
39
+ 2023-12-16 11:16:01,749 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
40
+ 2023-12-16 11:16:01,800 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0553, 5.9314, 6.0187, 6.0771], device='cuda:0')
41
+ 2023-12-16 11:16:05,023 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
42
+ 2023-12-16 11:16:05,380 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
43
+ 2023-12-16 11:16:06,961 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4544754061215298
44
+ 2023-12-16 11:16:06,961 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-11-avg-1-use-averaged-model-2023-12-17-23-28-06 ADDED
@@ -0,0 +1,44 @@
1
+ 2023-12-17 23:28:06,842 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-17 23:28:06,842 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 11, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-11-avg-1-use-averaged-model'}
3
+ 2023-12-17 23:28:06,842 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-17 23:28:07,251 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 10 (excluded) to 11
5
+ 2023-12-17 23:28:18,422 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-17 23:28:18,422 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-17 23:28:18,464 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-17 23:28:18,793 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-17 23:28:23,463 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-17 23:28:26,779 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-17 23:28:30,071 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-17 23:28:31,809 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4529, 2.5963, 2.7864, 2.9300, 3.2214, 2.5717, 2.3455, 2.5523],
13
+ device='cuda:0')
14
+ 2023-12-17 23:28:33,346 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
15
+ 2023-12-17 23:28:36,663 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-17 23:28:39,965 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-17 23:28:43,185 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-17 23:28:46,356 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-17 23:28:49,613 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-17 23:28:50,189 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8591, 2.5462, 3.2469, 2.7765, 2.5323, 2.9186, 3.4246, 3.2531],
21
+ device='cuda:0')
22
+ 2023-12-17 23:28:50,972 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7630, 3.0151, 2.7297, 2.4916], device='cuda:0')
23
+ 2023-12-17 23:28:52,943 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-17 23:28:56,147 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-17 23:28:59,349 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
26
+ 2023-12-17 23:29:02,489 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
27
+ 2023-12-17 23:29:05,779 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
28
+ 2023-12-17 23:29:07,671 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7351, 2.9057, 2.7316, 2.5065], device='cuda:0')
29
+ 2023-12-17 23:29:08,835 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
30
+ 2023-12-17 23:29:12,195 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
31
+ 2023-12-17 23:29:15,409 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
32
+ 2023-12-17 23:29:18,624 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
33
+ 2023-12-17 23:29:21,809 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
34
+ 2023-12-17 23:29:25,051 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
35
+ 2023-12-17 23:29:28,465 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
36
+ 2023-12-17 23:29:31,498 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
37
+ 2023-12-17 23:29:34,698 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
38
+ 2023-12-17 23:29:37,858 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
39
+ 2023-12-17 23:29:38,004 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.3312, 2.4974, 2.4951, 2.2906], device='cuda:0')
40
+ 2023-12-17 23:29:41,114 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
41
+ 2023-12-17 23:29:44,376 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
42
+ 2023-12-17 23:29:44,744 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
43
+ 2023-12-17 23:29:46,084 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4548011587857737
44
+ 2023-12-17 23:29:46,084 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-12-avg-1-use-averaged-model-2023-12-16-11-12-42 ADDED
@@ -0,0 +1,52 @@
1
+ 2023-12-16 11:12:42,792 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-16 11:12:42,792 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 12, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-12-avg-1-use-averaged-model'}
3
+ 2023-12-16 11:12:42,793 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-16 11:12:43,136 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 11 (excluded) to 12
5
+ 2023-12-16 11:12:54,338 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-16 11:12:54,339 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-16 11:12:54,380 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-16 11:12:54,703 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-16 11:12:59,477 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-16 11:13:02,909 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-16 11:13:06,222 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-16 11:13:09,248 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8441, 2.6795, 3.2704, 2.6782, 2.6707, 3.1470, 3.4700, 3.2698],
13
+ device='cuda:0')
14
+ 2023-12-16 11:13:09,452 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
15
+ 2023-12-16 11:13:12,662 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-16 11:13:16,149 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-16 11:13:19,266 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-16 11:13:22,488 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-16 11:13:25,707 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-16 11:13:29,111 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
21
+ 2023-12-16 11:13:30,294 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8724, 5.3363, 5.1755, 4.8046], device='cuda:0')
22
+ 2023-12-16 11:13:32,320 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
23
+ 2023-12-16 11:13:35,312 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6637, 2.2560, 2.6339, 3.1661, 1.8498, 2.2992, 3.1535, 2.2405],
24
+ device='cuda:0')
25
+ 2023-12-16 11:13:35,544 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
26
+ 2023-12-16 11:13:37,665 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0157, 5.8355, 5.9711, 6.0474], device='cuda:0')
27
+ 2023-12-16 11:13:38,748 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-16 11:13:41,780 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
29
+ 2023-12-16 11:13:45,099 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
30
+ 2023-12-16 11:13:48,390 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
31
+ 2023-12-16 11:13:51,624 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
32
+ 2023-12-16 11:13:54,812 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
33
+ 2023-12-16 11:13:55,008 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8809, 5.3578, 5.1343, 4.9156], device='cuda:0')
34
+ 2023-12-16 11:13:55,591 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8197, 2.7291, 1.7550, 2.8628, 2.2478, 2.3156, 2.4243, 2.8382],
35
+ device='cuda:0')
36
+ 2023-12-16 11:13:58,026 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
37
+ 2023-12-16 11:14:01,001 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5793, 3.9147, 3.3932, 3.8216], device='cuda:0')
38
+ 2023-12-16 11:14:01,193 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
39
+ 2023-12-16 11:14:04,456 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
40
+ 2023-12-16 11:14:04,862 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8979, 5.4297, 5.0280, 4.9030], device='cuda:0')
41
+ 2023-12-16 11:14:07,635 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
42
+ 2023-12-16 11:14:10,832 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
43
+ 2023-12-16 11:14:11,108 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5916, 2.5938, 2.7861, 2.9899, 3.2056, 2.5096, 2.3297, 2.5132],
44
+ device='cuda:0')
45
+ 2023-12-16 11:14:14,072 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
46
+ 2023-12-16 11:14:16,257 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5526, 2.4856, 2.7824, 2.9715, 3.3147, 2.6220, 2.3419, 2.4864],
47
+ device='cuda:0')
48
+ 2023-12-16 11:14:17,292 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
49
+ 2023-12-16 11:14:20,524 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
50
+ 2023-12-16 11:14:20,902 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
51
+ 2023-12-16 11:14:22,493 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4577013025780574
52
+ 2023-12-16 11:14:22,493 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-12-avg-1-use-averaged-model-2023-12-17-23-26-23 ADDED
@@ -0,0 +1,42 @@
1
+ 2023-12-17 23:26:23,282 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-17 23:26:23,282 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 12, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-12-avg-1-use-averaged-model'}
3
+ 2023-12-17 23:26:23,283 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-17 23:26:23,679 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 11 (excluded) to 12
5
+ 2023-12-17 23:26:34,633 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-17 23:26:34,633 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-17 23:26:34,675 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-17 23:26:35,005 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-17 23:26:39,555 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-17 23:26:43,106 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-17 23:26:46,377 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-17 23:26:49,658 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-17 23:26:52,847 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-17 23:26:56,065 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8608, 2.8019, 3.5906, 3.3596, 2.8548, 3.1551, 3.5738, 3.7224],
15
+ device='cuda:0')
16
+ 2023-12-17 23:26:56,311 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-17 23:26:59,583 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-17 23:27:00,038 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1469, 4.4152, 3.9901, 3.7648], device='cuda:0')
19
+ 2023-12-17 23:27:02,832 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-17 23:27:06,073 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
21
+ 2023-12-17 23:27:09,405 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
22
+ 2023-12-17 23:27:12,452 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
23
+ 2023-12-17 23:27:15,664 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
24
+ 2023-12-17 23:27:18,879 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
25
+ 2023-12-17 23:27:22,068 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
26
+ 2023-12-17 23:27:22,509 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7253, 2.7557, 3.0135, 3.0240, 3.5809, 2.5115, 2.5949, 2.8846],
27
+ device='cuda:0')
28
+ 2023-12-17 23:27:25,306 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
29
+ 2023-12-17 23:27:28,651 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
30
+ 2023-12-17 23:27:31,862 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
31
+ 2023-12-17 23:27:34,900 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
32
+ 2023-12-17 23:27:38,124 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
33
+ 2023-12-17 23:27:41,348 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
34
+ 2023-12-17 23:27:44,702 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
35
+ 2023-12-17 23:27:47,903 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
36
+ 2023-12-17 23:27:51,135 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
37
+ 2023-12-17 23:27:54,334 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
38
+ 2023-12-17 23:27:57,457 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
39
+ 2023-12-17 23:28:00,792 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
40
+ 2023-12-17 23:28:01,156 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
41
+ 2023-12-17 23:28:02,527 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.2781067236991598
42
+ 2023-12-17 23:28:02,528 INFO [inference_audio_tagging.py:456] Done
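The mAP value reported at the end of each run is a macro-average of per-class average precision over the 527 AudioSet event classes. A minimal sketch of how such a score could be computed from the collected logits, assuming multi-hot targets and sklearn's average_precision_score (illustrative only, not the actual inference_audio_tagging.py code):

import numpy as np
import torch
from sklearn.metrics import average_precision_score

def audioset_map(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (num_clips, 527) raw scores; targets: (num_clips, 527) multi-hot labels.
    scores = logits.sigmoid().cpu().numpy()
    labels = targets.cpu().numpy()
    # Macro-average the per-class average precision, skipping classes with no positives.
    aps = [
        average_precision_score(labels[:, c], scores[:, c])
        for c in range(labels.shape[1])
        if labels[:, c].any()
    ]
    return float(np.mean(aps))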
exp_md1000/inference_audio_tagging/log-decode-epoch-13-avg-1-use-averaged-model-2023-12-16-11-10-58 ADDED
@@ -0,0 +1,52 @@
1
+ 2023-12-16 11:10:58,355 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-16 11:10:58,356 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 13, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-13-avg-1-use-averaged-model'}
3
+ 2023-12-16 11:10:58,357 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-16 11:10:58,778 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 12 (excluded) to 13
5
+ 2023-12-16 11:11:10,324 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-16 11:11:10,324 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-16 11:11:10,363 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-16 11:11:10,764 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-16 11:11:15,713 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-16 11:11:18,465 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6206, 2.6302, 1.6210, 2.6323, 2.1096, 2.0903, 2.4797, 1.7368],
11
+ device='cuda:0')
12
+ 2023-12-16 11:11:19,188 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
13
+ 2023-12-16 11:11:22,477 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
14
+ 2023-12-16 11:11:25,652 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
15
+ 2023-12-16 11:11:28,899 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-16 11:11:29,685 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7388, 3.0629, 2.7570, 2.3678], device='cuda:0')
17
+ 2023-12-16 11:11:32,377 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
18
+ 2023-12-16 11:11:32,701 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8405, 3.1562, 2.8892, 2.4944], device='cuda:0')
19
+ 2023-12-16 11:11:33,002 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7811, 2.6914, 3.3685, 2.9962, 2.6195, 3.0731, 3.5432, 3.2820],
20
+ device='cuda:0')
21
+ 2023-12-16 11:11:34,611 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8417, 2.9550, 1.8520, 3.0415, 2.3736, 2.3618, 2.3703, 2.8060],
22
+ device='cuda:0')
23
+ 2023-12-16 11:11:35,620 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
24
+ 2023-12-16 11:11:38,855 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
25
+ 2023-12-16 11:11:42,104 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
26
+ 2023-12-16 11:11:43,025 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8265, 2.6179, 3.3585, 2.7822, 2.5353, 2.7973, 3.3672, 3.3002],
27
+ device='cuda:0')
28
+ 2023-12-16 11:11:45,396 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
29
+ 2023-12-16 11:11:48,555 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
30
+ 2023-12-16 11:11:51,809 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
31
+ 2023-12-16 11:11:52,508 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0624, 5.9122, 6.0261, 6.0811], device='cuda:0')
32
+ 2023-12-16 11:11:54,983 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
33
+ 2023-12-16 11:11:58,190 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
34
+ 2023-12-16 11:12:01,413 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
35
+ 2023-12-16 11:12:04,693 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
36
+ 2023-12-16 11:12:04,858 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6210, 2.8047, 2.8528, 3.0416, 3.1964, 2.5386, 2.3448, 2.4252],
37
+ device='cuda:0')
38
+ 2023-12-16 11:12:07,763 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
39
+ 2023-12-16 11:12:11,019 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
40
+ 2023-12-16 11:12:14,176 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
41
+ 2023-12-16 11:12:16,955 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0465, 5.8816, 6.0177, 6.0763], device='cuda:0')
42
+ 2023-12-16 11:12:17,372 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
43
+ 2023-12-16 11:12:20,749 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
44
+ 2023-12-16 11:12:23,924 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
45
+ 2023-12-16 11:12:27,025 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
46
+ 2023-12-16 11:12:30,318 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
47
+ 2023-12-16 11:12:30,367 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0385, 5.8681, 6.0077, 6.0708], device='cuda:0')
48
+ 2023-12-16 11:12:33,538 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
49
+ 2023-12-16 11:12:36,817 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
50
+ 2023-12-16 11:12:37,171 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
51
+ 2023-12-16 11:12:38,573 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45971156327397766
52
+ 2023-12-16 11:12:38,573 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-13-avg-1-use-averaged-model-2023-12-17-23-24-39 ADDED
@@ -0,0 +1,49 @@
1
+ 2023-12-17 23:24:39,832 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-17 23:24:39,832 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 13, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-13-avg-1-use-averaged-model'}
3
+ 2023-12-17 23:24:39,832 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-17 23:24:40,171 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 12 (excluded) to 13
5
+ 2023-12-17 23:24:50,945 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-17 23:24:50,946 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-17 23:24:50,988 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-17 23:24:51,317 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-17 23:24:55,876 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-17 23:24:57,048 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.7727, 5.7054, 5.7883, 5.8980], device='cuda:0')
11
+ 2023-12-17 23:24:59,129 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
12
+ 2023-12-17 23:25:02,423 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-17 23:25:05,628 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-17 23:25:08,854 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-17 23:25:12,400 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-17 23:25:15,663 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1630, 3.4439, 3.7554, 2.9572], device='cuda:0')
17
+ 2023-12-17 23:25:15,671 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-17 23:25:18,763 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-17 23:25:21,964 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-17 23:25:22,347 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0495, 5.8543, 5.9979, 6.0665], device='cuda:0')
21
+ 2023-12-17 23:25:25,361 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
22
+ 2023-12-17 23:25:28,537 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
23
+ 2023-12-17 23:25:31,458 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.7908, 4.7507, 5.0573, 4.8997], device='cuda:0')
24
+ 2023-12-17 23:25:31,742 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
25
+ 2023-12-17 23:25:32,218 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0395, 5.8716, 6.0053, 6.0718], device='cuda:0')
26
+ 2023-12-17 23:25:34,925 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
27
+ 2023-12-17 23:25:38,141 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
28
+ 2023-12-17 23:25:38,212 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8242, 2.6219, 3.3391, 2.7454, 2.5288, 2.8860, 3.4190, 3.2471],
29
+ device='cuda:0')
30
+ 2023-12-17 23:25:41,225 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
31
+ 2023-12-17 23:25:44,552 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
32
+ 2023-12-17 23:25:47,686 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8385, 4.7867, 5.0635, 4.9064], device='cuda:0')
33
+ 2023-12-17 23:25:47,780 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
34
+ 2023-12-17 23:25:48,822 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5503, 3.8728, 4.0503, 3.8089], device='cuda:0')
35
+ 2023-12-17 23:25:50,978 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
36
+ 2023-12-17 23:25:51,674 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8765, 5.3496, 5.2040, 4.6921], device='cuda:0')
37
+ 2023-12-17 23:25:54,211 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
38
+ 2023-12-17 23:25:57,411 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
39
+ 2023-12-17 23:26:00,787 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
40
+ 2023-12-17 23:26:00,914 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7330, 3.0166, 2.7434, 2.3292], device='cuda:0')
41
+ 2023-12-17 23:26:03,926 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
42
+ 2023-12-17 23:26:07,122 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
43
+ 2023-12-17 23:26:10,107 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7994, 3.0886, 2.8453, 2.5028], device='cuda:0')
44
+ 2023-12-17 23:26:10,307 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
45
+ 2023-12-17 23:26:13,527 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
46
+ 2023-12-17 23:26:16,758 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
47
+ 2023-12-17 23:26:17,107 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
48
+ 2023-12-17 23:26:18,455 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45971156327397766
49
+ 2023-12-17 23:26:18,455 INFO [inference_audio_tagging.py:456] Done
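Each run averages checkpoints over an epoch range before evaluation (here "from 12 (excluded) to 13"). A simplified sketch of plain checkpoint averaging in PyTorch, assuming checkpoints that store weights under a "model" key; icefall's --use-averaged-model option actually combines the running parameter averages stored in the checkpoints, so treat this only as an illustration of the idea:

import torch

def average_checkpoints(paths):
    # Accumulate parameters from each checkpoint, then divide by the count.
    avg = torch.load(paths[0], map_location="cpu")["model"]
    for path in paths[1:]:
        state = torch.load(path, map_location="cpu")["model"]
        for name in avg:
            avg[name] += state[name]
    for name in avg:
        if avg[name].is_floating_point():
            avg[name] /= len(paths)
    return avg

# Hypothetical usage: average_checkpoints(["exp/epoch-12.pt", "exp/epoch-13.pt"])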
exp_md1000/inference_audio_tagging/log-decode-epoch-14-avg-1-use-averaged-model-2023-12-16-11-09-06 ADDED
@@ -0,0 +1,44 @@
1
+ 2023-12-16 11:09:06,614 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-16 11:09:06,614 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 14, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-14-avg-1-use-averaged-model'}
3
+ 2023-12-16 11:09:06,616 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-16 11:09:06,986 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 13 (excluded) to 14
5
+ 2023-12-16 11:09:23,124 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-16 11:09:23,125 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-16 11:09:23,172 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-16 11:09:23,499 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-16 11:09:28,353 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-16 11:09:31,728 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-16 11:09:35,018 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-16 11:09:38,351 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-16 11:09:41,516 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-16 11:09:44,932 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
15
+ 2023-12-16 11:09:48,198 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
16
+ 2023-12-16 11:09:51,453 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
17
+ 2023-12-16 11:09:54,772 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
18
+ 2023-12-16 11:09:56,031 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5066, 4.0692, 3.9361, 3.5372], device='cuda:0')
19
+ 2023-12-16 11:09:58,153 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
20
+ 2023-12-16 11:09:58,919 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5245, 4.0821, 4.0162, 3.6717], device='cuda:0')
21
+ 2023-12-16 11:10:01,418 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
22
+ 2023-12-16 11:10:04,645 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
23
+ 2023-12-16 11:10:07,891 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
24
+ 2023-12-16 11:10:11,158 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
25
+ 2023-12-16 11:10:14,397 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
26
+ 2023-12-16 11:10:17,684 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5453, 4.0373, 3.8961, 3.6057], device='cuda:0')
27
+ 2023-12-16 11:10:18,447 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
28
+ 2023-12-16 11:10:20,271 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0396, 5.8880, 5.9961, 6.0650], device='cuda:0')
29
+ 2023-12-16 11:10:21,546 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
30
+ 2023-12-16 11:10:26,442 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
31
+ 2023-12-16 11:10:29,665 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
32
+ 2023-12-16 11:10:33,125 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
33
+ 2023-12-16 11:10:36,673 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
34
+ 2023-12-16 11:10:40,215 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
35
+ 2023-12-16 11:10:43,371 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9214, 5.4075, 5.1889, 4.9160], device='cuda:0')
36
+ 2023-12-16 11:10:43,839 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
37
+ 2023-12-16 11:10:46,897 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
38
+ 2023-12-16 11:10:49,066 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9950, 3.7325, 3.5088, 3.1377], device='cuda:0')
39
+ 2023-12-16 11:10:49,186 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
40
+ 2023-12-16 11:10:52,232 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8437, 3.1539, 3.0004, 2.5049], device='cuda:0')
41
+ 2023-12-16 11:10:52,444 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
42
+ 2023-12-16 11:10:52,767 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
43
+ 2023-12-16 11:10:54,174 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46230051044278675
44
+ 2023-12-16 11:10:54,174 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-14-avg-1-use-averaged-model-2023-12-17-23-22-56 ADDED
@@ -0,0 +1,49 @@
1
+ 2023-12-17 23:22:56,036 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-17 23:22:56,036 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 14, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-14-avg-1-use-averaged-model'}
3
+ 2023-12-17 23:22:56,036 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-17 23:22:56,377 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 13 (excluded) to 14
5
+ 2023-12-17 23:23:07,150 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-17 23:23:07,150 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-17 23:23:07,204 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-17 23:23:07,596 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-17 23:23:12,525 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-17 23:23:15,879 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-17 23:23:19,175 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-17 23:23:22,295 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-17 23:23:25,515 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-17 23:23:29,027 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
15
+ 2023-12-17 23:23:32,299 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
16
+ 2023-12-17 23:23:35,566 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
17
+ 2023-12-17 23:23:38,803 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
18
+ 2023-12-17 23:23:40,619 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8155, 2.6589, 3.3488, 2.9107, 2.6552, 2.8798, 3.4854, 3.3850],
19
+ device='cuda:0')
20
+ 2023-12-17 23:23:42,118 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
21
+ 2023-12-17 23:23:45,182 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
22
+ 2023-12-17 23:23:48,418 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
23
+ 2023-12-17 23:23:51,641 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
24
+ 2023-12-17 23:23:54,837 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
25
+ 2023-12-17 23:23:56,409 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7078, 2.4078, 2.9249, 3.1283, 1.8151, 2.3111, 3.0584, 2.3371],
26
+ device='cuda:0')
27
+ 2023-12-17 23:23:58,097 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
28
+ 2023-12-17 23:23:59,831 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0723, 3.9054, 3.5491, 3.2173], device='cuda:0')
29
+ 2023-12-17 23:24:00,458 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8155, 2.8085, 1.5572, 2.9974, 2.4917, 2.5216, 2.4534, 2.7490],
30
+ device='cuda:0')
31
+ 2023-12-17 23:24:01,456 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
32
+ 2023-12-17 23:24:04,534 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
33
+ 2023-12-17 23:24:07,778 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
34
+ 2023-12-17 23:24:10,937 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
35
+ 2023-12-17 23:24:14,168 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
36
+ 2023-12-17 23:24:17,182 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8517, 2.5889, 3.4547, 2.8183, 2.5989, 3.0641, 3.4985, 3.2982],
37
+ device='cuda:0')
38
+ 2023-12-17 23:24:17,551 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
39
+ 2023-12-17 23:24:20,733 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
40
+ 2023-12-17 23:24:23,651 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8528, 4.8219, 5.0693, 4.8882], device='cuda:0')
41
+ 2023-12-17 23:24:23,924 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
42
+ 2023-12-17 23:24:24,828 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1883, 3.4354, 3.7097, 2.9845], device='cuda:0')
43
+ 2023-12-17 23:24:27,088 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
44
+ 2023-12-17 23:24:28,782 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5551, 3.8626, 3.4812, 3.9426], device='cuda:0')
45
+ 2023-12-17 23:24:30,256 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
46
+ 2023-12-17 23:24:33,571 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
47
+ 2023-12-17 23:24:33,939 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
48
+ 2023-12-17 23:24:35,521 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46230051044278675
49
+ 2023-12-17 23:24:35,521 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-15-avg-1-use-averaged-model-2023-12-17-23-21-16 ADDED
@@ -0,0 +1,44 @@
1
+ 2023-12-17 23:21:16,310 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-17 23:21:16,310 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 15, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-15-avg-1-use-averaged-model'}
3
+ 2023-12-17 23:21:16,311 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-17 23:21:16,656 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 14 (excluded) to 15
5
+ 2023-12-17 23:21:23,802 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-17 23:21:23,802 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-17 23:21:23,842 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-17 23:21:24,163 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-17 23:21:28,493 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-17 23:21:31,932 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-17 23:21:35,249 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-17 23:21:38,513 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-17 23:21:39,751 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6689, 2.6391, 3.2550, 2.7387, 2.4567, 2.7300, 3.4494, 3.1553],
14
+ device='cuda:0')
15
+ 2023-12-17 23:21:41,807 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-17 23:21:45,269 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-17 23:21:48,463 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-17 23:21:51,729 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-17 23:21:54,915 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-17 23:21:58,322 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
21
+ 2023-12-17 23:22:01,498 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
22
+ 2023-12-17 23:22:04,678 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
23
+ 2023-12-17 23:22:07,162 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5693, 3.9012, 3.4947, 4.0123], device='cuda:0')
24
+ 2023-12-17 23:22:07,873 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
25
+ 2023-12-17 23:22:11,035 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
26
+ 2023-12-17 23:22:14,239 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
27
+ 2023-12-17 23:22:15,487 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7912, 2.6078, 3.3948, 2.8221, 2.5661, 2.8951, 3.4449, 3.3860],
28
+ device='cuda:0')
29
+ 2023-12-17 23:22:17,599 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
30
+ 2023-12-17 23:22:20,823 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
31
+ 2023-12-17 23:22:24,028 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
32
+ 2023-12-17 23:22:27,241 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
33
+ 2023-12-17 23:22:30,510 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
34
+ 2023-12-17 23:22:33,764 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
35
+ 2023-12-17 23:22:36,940 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
36
+ 2023-12-17 23:22:40,202 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
37
+ 2023-12-17 23:22:41,974 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4278, 2.6357, 2.5193, 2.6546], device='cuda:0')
38
+ 2023-12-17 23:22:43,423 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
39
+ 2023-12-17 23:22:46,647 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
40
+ 2023-12-17 23:22:48,210 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5850, 4.0296, 4.0943, 4.1047], device='cuda:0')
41
+ 2023-12-17 23:22:50,072 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
42
+ 2023-12-17 23:22:50,417 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
43
+ 2023-12-17 23:22:51,812 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4622097627138241
44
+ 2023-12-17 23:22:51,812 INFO [inference_audio_tagging.py:456] Done
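The zipformer.py:1877 lines in these logs are periodic diagnostics reporting the entropy of self-attention weights, one value per attention head (higher entropy means a head spreads its attention more uniformly). A small illustrative sketch of such a per-head entropy, assuming attention weights that already sum to one over the last dimension (not the actual zipformer implementation):

import torch

def attn_weights_entropy(attn: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
    # attn: (num_heads, tgt_len, src_len); each row is a softmax distribution.
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (num_heads, tgt_len)
    return ent.mean(dim=-1)  # one entropy value per head

# Example: 4 heads attending over 10 positions.
attn = torch.softmax(torch.randn(4, 10, 10), dim=-1)
print(attn_weights_entropy(attn))  # tensor of shape (4,)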
exp_md1000/inference_audio_tagging/log-decode-epoch-16-avg-1-use-averaged-model-2023-12-17-23-19-35 ADDED
@@ -0,0 +1,45 @@
1
+ 2023-12-17 23:19:35,397 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-17 23:19:35,398 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 16, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-16-avg-1-use-averaged-model'}
3
+ 2023-12-17 23:19:35,398 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-17 23:19:35,854 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 15 (excluded) to 16
5
+ 2023-12-17 23:19:43,711 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-17 23:19:43,711 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-17 23:19:43,753 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-17 23:19:44,078 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-17 23:19:48,828 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-17 23:19:52,260 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-17 23:19:55,536 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-17 23:19:57,397 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8712, 5.7286, 5.8277, 5.9227], device='cuda:0')
13
+ 2023-12-17 23:19:58,820 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-17 23:20:02,065 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-17 23:20:05,123 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9101, 5.3893, 5.1791, 4.8233], device='cuda:0')
16
+ 2023-12-17 23:20:05,525 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-17 23:20:07,680 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8862, 5.3655, 5.2603, 4.8280], device='cuda:0')
18
+ 2023-12-17 23:20:08,747 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-17 23:20:11,929 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-17 23:20:12,981 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1125, 4.6258, 4.9307, 4.0854], device='cuda:0')
21
+ 2023-12-17 23:20:13,072 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1336, 3.4281, 3.7244, 3.1000], device='cuda:0')
22
+ 2023-12-17 23:20:15,213 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
23
+ 2023-12-17 23:20:16,005 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9685, 3.6017, 3.6151, 3.0689], device='cuda:0')
24
+ 2023-12-17 23:20:18,546 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
25
+ 2023-12-17 23:20:21,767 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
26
+ 2023-12-17 23:20:24,983 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
27
+ 2023-12-17 23:20:28,213 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-17 23:20:31,403 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
29
+ 2023-12-17 23:20:34,628 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
30
+ 2023-12-17 23:20:38,033 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
31
+ 2023-12-17 23:20:41,239 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
32
+ 2023-12-17 23:20:42,097 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0922, 4.6414, 4.9812, 4.1098], device='cuda:0')
33
+ 2023-12-17 23:20:43,250 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5520, 4.0059, 3.9394, 3.6814], device='cuda:0')
34
+ 2023-12-17 23:20:44,483 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
35
+ 2023-12-17 23:20:47,651 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
36
+ 2023-12-17 23:20:50,843 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
37
+ 2023-12-17 23:20:54,141 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
38
+ 2023-12-17 23:20:57,342 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
39
+ 2023-12-17 23:21:00,481 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
40
+ 2023-12-17 23:21:03,763 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
41
+ 2023-12-17 23:21:06,962 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
42
+ 2023-12-17 23:21:10,271 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
43
+ 2023-12-17 23:21:10,598 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
44
+ 2023-12-17 23:21:12,039 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.462699455535075
45
+ 2023-12-17 23:21:12,040 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-17-avg-1-use-averaged-model-2023-12-17-23-17-44 ADDED
@@ -0,0 +1,47 @@
1
+ 2023-12-17 23:17:44,539 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-17 23:17:44,539 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 17, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-17-avg-1-use-averaged-model'}
3
+ 2023-12-17 23:17:44,539 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-17 23:17:44,917 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 16 (excluded) to 17
5
+ 2023-12-17 23:18:01,399 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-17 23:18:01,399 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-17 23:18:01,509 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-17 23:18:01,832 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-17 23:18:07,997 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-17 23:18:10,536 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.3648, 2.3429, 2.7070, 2.6881], device='cuda:0')
11
+ 2023-12-17 23:18:11,569 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
12
+ 2023-12-17 23:18:14,788 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-17 23:18:15,204 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9962, 4.5660, 4.8826, 4.2075], device='cuda:0')
14
+ 2023-12-17 23:18:18,016 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
15
+ 2023-12-17 23:18:21,335 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-17 23:18:22,039 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0233, 3.6055, 3.6331, 3.1745], device='cuda:0')
17
+ 2023-12-17 23:18:24,646 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
18
+ 2023-12-17 23:18:27,866 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-17 23:18:29,339 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6779, 3.3861, 3.4279, 2.9028], device='cuda:0')
20
+ 2023-12-17 23:18:31,124 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
21
+ 2023-12-17 23:18:34,380 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
22
+ 2023-12-17 23:18:37,416 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1411, 3.3739, 3.7549, 3.0014], device='cuda:0')
23
+ 2023-12-17 23:18:37,709 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-17 23:18:38,885 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0407, 5.8648, 5.9982, 6.0660], device='cuda:0')
25
+ 2023-12-17 23:18:40,527 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0659, 5.8764, 6.0168, 6.0651], device='cuda:0')
26
+ 2023-12-17 23:18:40,955 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
27
+ 2023-12-17 23:18:44,063 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
28
+ 2023-12-17 23:18:47,263 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
29
+ 2023-12-17 23:18:50,487 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
30
+ 2023-12-17 23:18:53,713 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
31
+ 2023-12-17 23:18:55,759 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5775, 3.9719, 4.0440, 4.0744], device='cuda:0')
32
+ 2023-12-17 23:18:57,124 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
33
+ 2023-12-17 23:19:00,400 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
34
+ 2023-12-17 23:19:02,017 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1377, 3.4231, 3.6488, 3.0986], device='cuda:0')
35
+ 2023-12-17 23:19:03,610 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
36
+ 2023-12-17 23:19:03,825 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0425, 5.8562, 5.9956, 6.0623], device='cuda:0')
37
+ 2023-12-17 23:19:06,736 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
38
+ 2023-12-17 23:19:10,017 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
39
+ 2023-12-17 23:19:13,304 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
40
+ 2023-12-17 23:19:16,499 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
41
+ 2023-12-17 23:19:19,771 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
42
+ 2023-12-17 23:19:22,998 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
43
+ 2023-12-17 23:19:26,229 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
44
+ 2023-12-17 23:19:29,539 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
45
+ 2023-12-17 23:19:29,902 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
46
+ 2023-12-17 23:19:31,283 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4645884881071456
47
+ 2023-12-17 23:19:31,283 INFO [inference_audio_tagging.py:456] Done
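[Editor's note] The "Calculating the averaged model over epoch range from 16 (excluded) to 17" line above is icefall's checkpoint-averaging step: with --epoch 17 and --avg 1 the evaluated weights cover the range (16, 17], i.e. just the last epoch. The sketch below shows plain state-dict averaging for illustration only; icefall itself reconstructs the average from cumulative statistics stored in the checkpoints, and the file names used here are assumptions, not this repository's layout.

    import torch

    def average_checkpoints(paths):
        """Average the 'model' state dicts of several checkpoints (hypothetical paths)."""
        avg = None
        for p in paths:
            state = torch.load(p, map_location="cpu")["model"]
            if avg is None:
                avg = {k: v.clone().float() for k, v in state.items()}
            else:
                for k in avg:
                    avg[k] += state[k].float()
        for k in avg:
            avg[k] /= len(paths)
        return avg

    # e.g. an --epoch 17 --avg 2 style average would combine epochs 16 and 17:
    # avg_state = average_checkpoints(["epoch-16.pt", "epoch-17.pt"])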
exp_md1000/inference_audio_tagging/log-decode-epoch-2-avg-1-use-averaged-model-2023-12-12-22-43-22 ADDED
@@ -0,0 +1,52 @@
1
+ 2023-12-12 22:43:22,868 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-12 22:43:22,868 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 2, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': False, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-2-avg-1-use-averaged-model'}
3
+ 2023-12-12 22:43:22,868 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-12 22:43:23,206 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 1 (excluded) to 2
5
+ 2023-12-12 22:43:39,254 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-12 22:43:39,254 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-12 22:43:39,298 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-12 22:43:39,625 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-12 22:43:44,491 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-12 22:43:47,997 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-12 22:43:48,431 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.2781, 1.7209, 2.4840, 1.7088, 1.7826, 2.2468, 2.5414, 1.9708],
12
+ device='cuda:0')
13
+ 2023-12-12 22:43:48,821 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9351, 2.5774, 2.6949, 2.2926], device='cuda:0')
14
+ 2023-12-12 22:43:49,332 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2447, 3.7221, 3.4814, 3.5065], device='cuda:0')
15
+ 2023-12-12 22:43:50,328 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5049, 2.1718, 1.4794, 1.8619, 1.8127, 2.1946, 1.8917, 1.0870],
16
+ device='cuda:0')
17
+ 2023-12-12 22:43:51,152 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
18
+ 2023-12-12 22:43:54,325 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
19
+ 2023-12-12 22:43:57,467 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
20
+ 2023-12-12 22:44:00,708 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
21
+ 2023-12-12 22:44:02,835 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.7926, 5.5741, 4.8881, 4.5260], device='cuda:0')
22
+ 2023-12-12 22:44:03,904 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
23
+ 2023-12-12 22:44:07,073 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
24
+ 2023-12-12 22:44:10,283 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
25
+ 2023-12-12 22:44:13,541 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
26
+ 2023-12-12 22:44:16,749 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
27
+ 2023-12-12 22:44:19,745 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
28
+ 2023-12-12 22:44:22,901 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
29
+ 2023-12-12 22:44:24,140 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0704, 4.0871, 4.8838, 4.0770], device='cuda:0')
30
+ 2023-12-12 22:44:24,638 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4496, 2.0000, 2.4979, 2.4214, 3.1685, 2.4483, 2.1309, 1.7911],
31
+ device='cuda:0')
32
+ 2023-12-12 22:44:26,146 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
33
+ 2023-12-12 22:44:29,263 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
34
+ 2023-12-12 22:44:32,581 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
35
+ 2023-12-12 22:44:35,805 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
36
+ 2023-12-12 22:44:38,914 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
37
+ 2023-12-12 22:44:39,215 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6566, 2.2531, 1.3809, 1.9639, 2.0197, 2.4008, 2.0048, 2.6583],
38
+ device='cuda:0')
39
+ 2023-12-12 22:44:39,992 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4076, 1.9106, 2.3999, 1.5667, 1.6214, 2.1884, 2.5347, 1.6207],
40
+ device='cuda:0')
41
+ 2023-12-12 22:44:41,978 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
42
+ 2023-12-12 22:44:45,137 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
43
+ 2023-12-12 22:44:48,375 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
44
+ 2023-12-12 22:44:49,678 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3490, 3.4776, 3.0601, 2.8546], device='cuda:0')
45
+ 2023-12-12 22:44:51,557 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
46
+ 2023-12-12 22:44:54,665 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
47
+ 2023-12-12 22:44:57,783 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
48
+ 2023-12-12 22:45:00,898 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
49
+ 2023-12-12 22:45:04,075 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
50
+ 2023-12-12 22:45:04,389 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
51
+ 2023-12-12 22:45:05,708 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.3488509535462445
52
+ 2023-12-12 22:45:05,709 INFO [inference_audio_tagging.py:456] Done
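[Editor's note] The final "mAP for audioset eval is: ..." figure is the macro-averaged average precision over the 527 AudioSet event classes, computed from the collected per-clip logits against the multi-hot ground-truth labels. A minimal illustration with scikit-learn follows; the array names and shapes are assumptions, not the inference_audio_tagging.py code:

    import numpy as np
    from sklearn.metrics import average_precision_score

    def compute_map(logits: np.ndarray, labels: np.ndarray) -> float:
        """logits, labels: (num_clips, 527); labels are 0/1 multi-hot."""
        scores = 1.0 / (1.0 + np.exp(-logits))  # per-class sigmoid scores
        # AP is computed per class, then averaged over classes ("macro")
        return average_precision_score(labels, scores, average="macro")

Computed this way, the numbers are directly comparable across these logs, e.g. 0.349 at epoch 2 versus roughly 0.465 at epochs 17 and 20.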
exp_md1000/inference_audio_tagging/log-decode-epoch-20-avg-1-use-averaged-model-2023-12-20-11-15-49 ADDED
@@ -0,0 +1,47 @@
1
+ 2023-12-20 11:15:49,244 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:15:49,244 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 20, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-20-avg-1-use-averaged-model'}
3
+ 2023-12-20 11:15:49,244 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:15:49,594 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 19 (excluded) to 20
5
+ 2023-12-20 11:15:55,394 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:15:55,395 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:15:55,443 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:15:55,766 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:16:00,133 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:16:03,418 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 11:16:03,470 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0587, 5.8758, 6.0222, 6.0633], device='cuda:0')
12
+ 2023-12-20 11:16:06,623 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 11:16:09,484 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.7960, 3.5778, 3.4734, 2.9477], device='cuda:0')
14
+ 2023-12-20 11:16:09,764 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
15
+ 2023-12-20 11:16:12,462 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7869, 2.4548, 2.7755, 3.2493, 2.0257, 2.3097, 3.0176, 2.3115],
16
+ device='cuda:0')
17
+ 2023-12-20 11:16:12,963 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
18
+ 2023-12-20 11:16:16,230 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
19
+ 2023-12-20 11:16:19,406 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
20
+ 2023-12-20 11:16:22,450 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
21
+ 2023-12-20 11:16:25,596 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
22
+ 2023-12-20 11:16:26,037 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5786, 3.9979, 4.0635, 4.0928], device='cuda:0')
23
+ 2023-12-20 11:16:28,915 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-20 11:16:32,041 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-20 11:16:32,374 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8531, 4.8571, 5.0605, 4.9368], device='cuda:0')
26
+ 2023-12-20 11:16:35,123 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
27
+ 2023-12-20 11:16:36,557 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4886, 2.5395, 2.7653, 2.8470], device='cuda:0')
28
+ 2023-12-20 11:16:38,214 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
29
+ 2023-12-20 11:16:41,287 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
30
+ 2023-12-20 11:16:44,320 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
31
+ 2023-12-20 11:16:47,676 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
32
+ 2023-12-20 11:16:50,816 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
33
+ 2023-12-20 11:16:53,839 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
34
+ 2023-12-20 11:16:56,942 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
35
+ 2023-12-20 11:17:00,111 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
36
+ 2023-12-20 11:17:03,460 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
37
+ 2023-12-20 11:17:06,547 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
38
+ 2023-12-20 11:17:09,550 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
39
+ 2023-12-20 11:17:11,392 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5503, 2.6183, 2.8413, 2.9061], device='cuda:0')
40
+ 2023-12-20 11:17:12,692 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
41
+ 2023-12-20 11:17:15,850 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
42
+ 2023-12-20 11:17:16,624 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8476, 3.1237, 1.9008, 3.0684, 2.7817, 2.5808, 2.8058, 2.8088],
43
+ device='cuda:0')
44
+ 2023-12-20 11:17:19,097 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
45
+ 2023-12-20 11:17:19,357 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
46
+ 2023-12-20 11:17:20,868 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46579473748266215
47
+ 2023-12-20 11:17:20,868 INFO [inference_audio_tagging.py:456] Done
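[Editor's note] The recurring "zipformer.py:1877 ... attn_weights_entropy" lines are periodic diagnostics of how peaked the self-attention distributions are; lower entropy means attention concentrated on fewer frames. The sketch below shows the general idea under the assumption that the weights form a softmax distribution over the last dimension; it is not Zipformer's actual code:

    import torch

    def attn_weights_entropy(attn: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
        """attn: (batch, num_heads, query_len, key_len), rows sum to 1 over key_len.
        Returns the mean attention entropy per head, in nats."""
        ent = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy of each query's distribution
        return ent.mean(dim=(0, 2))                     # average over batch and query positions

    # attn = torch.softmax(torch.randn(8, 4, 50, 60), dim=-1)
    # attn_weights_entropy(attn)  # -> 4 values, one per head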
exp_md1000/inference_audio_tagging/log-decode-epoch-20-avg-2-use-averaged-model-2023-12-20-11-14-13 ADDED
@@ -0,0 +1,41 @@
1
+ 2023-12-20 11:14:13,484 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:14:13,485 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 20, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-20-avg-2-use-averaged-model'}
3
+ 2023-12-20 11:14:13,485 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:14:13,827 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 18 (excluded) to 20
5
+ 2023-12-20 11:14:19,655 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:14:19,656 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:14:19,698 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:14:20,026 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:14:24,035 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9203, 5.3414, 5.2214, 5.1403], device='cuda:0')
10
+ 2023-12-20 11:14:24,128 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
11
+ 2023-12-20 11:14:27,460 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
12
+ 2023-12-20 11:14:30,551 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 11:14:33,666 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 11:14:36,849 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 11:14:39,793 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2241, 3.5685, 3.7197, 3.5844], device='cuda:0')
16
+ 2023-12-20 11:14:40,248 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-20 11:14:43,400 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-20 11:14:46,521 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-20 11:14:49,642 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-20 11:14:52,953 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
21
+ 2023-12-20 11:14:56,050 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
22
+ 2023-12-20 11:14:59,124 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
23
+ 2023-12-20 11:15:02,259 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
24
+ 2023-12-20 11:15:02,729 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0997, 4.7574, 4.9451, 4.2907], device='cuda:0')
25
+ 2023-12-20 11:15:05,322 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
26
+ 2023-12-20 11:15:08,400 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
27
+ 2023-12-20 11:15:11,580 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
28
+ 2023-12-20 11:15:14,736 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
29
+ 2023-12-20 11:15:17,815 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
30
+ 2023-12-20 11:15:20,943 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
31
+ 2023-12-20 11:15:24,059 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
32
+ 2023-12-20 11:15:27,274 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
33
+ 2023-12-20 11:15:28,649 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0648, 5.9090, 6.0408, 6.0810], device='cuda:0')
34
+ 2023-12-20 11:15:30,418 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
35
+ 2023-12-20 11:15:33,479 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
36
+ 2023-12-20 11:15:36,526 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
37
+ 2023-12-20 11:15:39,624 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
38
+ 2023-12-20 11:15:42,818 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
39
+ 2023-12-20 11:15:43,171 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
40
+ 2023-12-20 11:15:44,679 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46630018129644646
41
+ 2023-12-20 11:15:44,679 INFO [inference_audio_tagging.py:456] Done
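[Editor's note] Every run reports "Number of model parameters: 64264454"; the count does not depend on the epoch or --avg setting, since averaging only changes the values of the weights, not the architecture. Counting is the usual PyTorch one-liner (generic, not quoted from the script):

    import torch.nn as nn

    def num_parameters(model: nn.Module) -> int:
        return sum(p.numel() for p in model.parameters())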
exp_md1000/inference_audio_tagging/log-decode-epoch-20-avg-3-use-averaged-model-2023-12-20-11-12-38 ADDED
@@ -0,0 +1,51 @@
1
+ 2023-12-20 11:12:38,087 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:12:38,087 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 20, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-20-avg-3-use-averaged-model'}
3
+ 2023-12-20 11:12:38,087 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:12:38,441 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 17 (excluded) to 20
5
+ 2023-12-20 11:12:44,362 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:12:44,362 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:12:44,400 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:12:44,724 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:12:48,681 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:12:52,057 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 11:12:55,208 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 11:12:56,784 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7202, 2.9382, 2.7268, 2.3074], device='cuda:0')
13
+ 2023-12-20 11:12:57,367 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1991, 3.6340, 3.6528, 3.4666], device='cuda:0')
14
+ 2023-12-20 11:12:58,368 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
15
+ 2023-12-20 11:13:01,614 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-20 11:13:03,283 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7873, 3.2038, 2.6936, 2.4815], device='cuda:0')
17
+ 2023-12-20 11:13:04,943 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
18
+ 2023-12-20 11:13:07,547 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9283, 5.4295, 5.3096, 5.0338], device='cuda:0')
19
+ 2023-12-20 11:13:08,095 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
20
+ 2023-12-20 11:13:11,246 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
21
+ 2023-12-20 11:13:13,276 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2507, 3.5745, 3.6873, 3.6339], device='cuda:0')
22
+ 2023-12-20 11:13:14,232 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
23
+ 2023-12-20 11:13:16,755 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5270, 2.4076, 2.7398, 2.6861], device='cuda:0')
24
+ 2023-12-20 11:13:17,517 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
25
+ 2023-12-20 11:13:20,666 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
26
+ 2023-12-20 11:13:23,787 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
27
+ 2023-12-20 11:13:27,035 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-20 11:13:28,948 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5757, 4.0239, 4.0702, 4.1350], device='cuda:0')
29
+ 2023-12-20 11:13:30,094 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
30
+ 2023-12-20 11:13:33,112 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
31
+ 2023-12-20 11:13:36,304 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
32
+ 2023-12-20 11:13:39,351 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
33
+ 2023-12-20 11:13:42,450 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
34
+ 2023-12-20 11:13:43,142 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7057, 2.8258, 2.8479, 3.1643, 3.1328, 2.6308, 2.4105, 2.3977],
35
+ device='cuda:0')
36
+ 2023-12-20 11:13:45,524 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
37
+ 2023-12-20 11:13:48,612 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
38
+ 2023-12-20 11:13:50,061 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2168, 3.4838, 3.4873, 3.6768], device='cuda:0')
39
+ 2023-12-20 11:13:51,882 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
40
+ 2023-12-20 11:13:55,027 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
41
+ 2023-12-20 11:13:56,737 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0968, 4.6907, 4.9517, 4.2245], device='cuda:0')
42
+ 2023-12-20 11:13:57,388 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0873, 5.9365, 6.0423, 6.0898], device='cuda:0')
43
+ 2023-12-20 11:13:58,067 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
44
+ 2023-12-20 11:14:01,149 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
45
+ 2023-12-20 11:14:02,047 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5756, 4.0228, 4.1209, 4.1007], device='cuda:0')
46
+ 2023-12-20 11:14:04,156 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
47
+ 2023-12-20 11:14:06,837 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0603, 5.9023, 6.0413, 6.0788], device='cuda:0')
48
+ 2023-12-20 11:14:07,312 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
49
+ 2023-12-20 11:14:07,615 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
50
+ 2023-12-20 11:14:09,156 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4656765929328333
51
+ 2023-12-20 11:14:09,156 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-20-avg-4-use-averaged-model-2023-12-20-11-10-58 ADDED
@@ -0,0 +1,42 @@
1
+ 2023-12-20 11:10:58,438 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:10:58,438 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 20, 'iter': 0, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-20-avg-4-use-averaged-model'}
3
+ 2023-12-20 11:10:58,439 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:10:58,786 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 16 (excluded) to 20
5
+ 2023-12-20 11:11:08,991 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:11:08,991 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:11:09,034 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:11:09,357 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:11:13,190 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:11:16,568 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 11:11:19,747 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 11:11:22,891 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-20 11:11:26,021 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-20 11:11:29,351 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
15
+ 2023-12-20 11:11:32,549 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
16
+ 2023-12-20 11:11:35,645 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
17
+ 2023-12-20 11:11:38,754 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
18
+ 2023-12-20 11:11:42,103 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
19
+ 2023-12-20 11:11:42,236 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1985, 3.3550, 3.7877, 3.0826], device='cuda:0')
20
+ 2023-12-20 11:11:45,206 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
21
+ 2023-12-20 11:11:48,337 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
22
+ 2023-12-20 11:11:51,429 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
23
+ 2023-12-20 11:11:54,519 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
24
+ 2023-12-20 11:11:57,334 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8573, 2.6727, 3.4476, 2.8746, 2.5684, 3.0583, 3.5229, 3.3946],
25
+ device='cuda:0')
26
+ 2023-12-20 11:11:57,705 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
27
+ 2023-12-20 11:12:00,951 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
28
+ 2023-12-20 11:12:03,927 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
29
+ 2023-12-20 11:12:07,067 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
30
+ 2023-12-20 11:12:10,121 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
31
+ 2023-12-20 11:12:13,218 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
32
+ 2023-12-20 11:12:16,398 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
33
+ 2023-12-20 11:12:19,581 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
34
+ 2023-12-20 11:12:22,716 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
35
+ 2023-12-20 11:12:25,878 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
36
+ 2023-12-20 11:12:28,899 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
37
+ 2023-12-20 11:12:31,041 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1377, 3.3589, 3.7661, 3.0158], device='cuda:0')
38
+ 2023-12-20 11:12:32,131 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
39
+ 2023-12-20 11:12:32,415 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9104, 3.5115, 3.5077, 3.3621], device='cuda:0')
40
+ 2023-12-20 11:12:32,524 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
41
+ 2023-12-20 11:12:34,022 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46546306823219463
42
+ 2023-12-20 11:12:34,022 INFO [inference_audio_tagging.py:456] Done
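[Editor's note] With one decoding log per (epoch, avg) combination, the quickest way to compare settings is to pull the final "mAP for audioset eval is:" line out of each file. A small hypothetical helper follows; the directory name and glob pattern are assumptions based on the file names in this commit:

    import re
    from pathlib import Path

    MAP_RE = re.compile(r"mAP for audioset eval is: ([0-9.]+)")

    def collect_map(log_dir: str = "exp_md1000/inference_audio_tagging") -> dict:
        results = {}
        for path in sorted(Path(log_dir).glob("log-decode-epoch-*")):
            m = MAP_RE.search(path.read_text())
            if m:
                results[path.name] = float(m.group(1))
        return results

    # for name, value in sorted(collect_map().items(), key=lambda kv: -kv[1]):
    #     print(f"{value:.4f}  {name}")

For the files shown here this would surface, e.g., 0.4663 for epoch 20 / avg 2 versus 0.3489 for epoch 2 / avg 1.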
exp_md1000/inference_audio_tagging/log-decode-epoch-21-avg-1-use-averaged-model-2023-12-20-11-09-20 ADDED
@@ -0,0 +1,41 @@
1
+ 2023-12-20 11:09:20,094 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:09:20,095 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 21, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-21-avg-1-use-averaged-model'}
3
+ 2023-12-20 11:09:20,095 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:09:20,462 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 20 (excluded) to 21
5
+ 2023-12-20 11:09:26,164 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:09:26,165 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:09:26,206 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:09:26,542 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:09:30,709 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:09:34,084 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 11:09:37,231 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 11:09:40,265 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-20 11:09:41,465 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.4669, 3.9527, 4.0458, 4.0524], device='cuda:0')
14
+ 2023-12-20 11:09:43,416 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 11:09:46,782 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-20 11:09:49,923 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
17
+ 2023-12-20 11:09:53,089 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
18
+ 2023-12-20 11:09:56,232 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
19
+ 2023-12-20 11:09:59,455 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
20
+ 2023-12-20 11:10:02,598 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
21
+ 2023-12-20 11:10:02,889 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8581, 4.8685, 5.0724, 4.9401], device='cuda:0')
22
+ 2023-12-20 11:10:04,010 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1044, 3.3565, 3.8013, 3.0911], device='cuda:0')
23
+ 2023-12-20 11:10:05,612 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
24
+ 2023-12-20 11:10:08,702 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
25
+ 2023-12-20 11:10:11,886 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
26
+ 2023-12-20 11:10:12,607 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5534, 3.9425, 3.9592, 3.7140], device='cuda:0')
27
+ 2023-12-20 11:10:15,071 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
28
+ 2023-12-20 11:10:18,375 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
29
+ 2023-12-20 11:10:24,397 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
30
+ 2023-12-20 11:10:27,575 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
31
+ 2023-12-20 11:10:30,657 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
32
+ 2023-12-20 11:10:33,764 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
33
+ 2023-12-20 11:10:36,909 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
34
+ 2023-12-20 11:10:40,100 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
35
+ 2023-12-20 11:10:43,215 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
36
+ 2023-12-20 11:10:46,322 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
37
+ 2023-12-20 11:10:49,351 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
38
+ 2023-12-20 11:10:52,524 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
39
+ 2023-12-20 11:10:52,856 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
40
+ 2023-12-20 11:10:54,268 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46595983733119856
41
+ 2023-12-20 11:10:54,268 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-21-avg-2-use-averaged-model-2023-12-20-11-07-43 ADDED
@@ -0,0 +1,49 @@
1
+ 2023-12-20 11:07:43,911 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:07:43,911 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 21, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-21-avg-2-use-averaged-model'}
3
+ 2023-12-20 11:07:43,911 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:07:44,303 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 19 (excluded) to 21
5
+ 2023-12-20 11:07:50,283 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:07:50,283 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:07:50,328 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:07:50,657 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:07:54,794 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:07:58,088 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 11:08:01,319 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 11:08:04,318 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-20 11:08:06,380 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5313, 2.7711, 2.9608, 2.9445], device='cuda:0')
14
+ 2023-12-20 11:08:07,472 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 11:08:10,782 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-20 11:08:13,858 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
17
+ 2023-12-20 11:08:14,222 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8633, 2.6707, 3.3580, 2.7944, 2.5826, 2.8619, 3.4880, 3.3999],
18
+ device='cuda:0')
19
+ 2023-12-20 11:08:16,915 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-20 11:08:20,081 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
21
+ 2023-12-20 11:08:22,233 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1426, 3.3460, 3.6452, 3.0659], device='cuda:0')
22
+ 2023-12-20 11:08:23,074 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1518, 3.3482, 3.7365, 3.0685], device='cuda:0')
23
+ 2023-12-20 11:08:23,327 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-20 11:08:26,506 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-20 11:08:29,180 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5371, 3.9603, 3.9117, 3.6375], device='cuda:0')
26
+ 2023-12-20 11:08:29,490 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
27
+ 2023-12-20 11:08:32,614 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-20 11:08:35,745 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
29
+ 2023-12-20 11:08:38,887 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
30
+ 2023-12-20 11:08:39,665 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7374, 2.9497, 2.8714, 3.2335, 3.2574, 2.7122, 2.5683, 2.5181],
31
+ device='cuda:0')
32
+ 2023-12-20 11:08:40,349 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5781, 4.0453, 4.1008, 4.0297], device='cuda:0')
33
+ 2023-12-20 11:08:42,130 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
34
+ 2023-12-20 11:08:45,233 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
35
+ 2023-12-20 11:08:48,378 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
36
+ 2023-12-20 11:08:50,160 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2859, 3.7938, 3.7872, 3.6188], device='cuda:0')
37
+ 2023-12-20 11:08:51,485 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
38
+ 2023-12-20 11:08:54,558 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
39
+ 2023-12-20 11:08:57,717 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
40
+ 2023-12-20 11:09:00,861 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
41
+ 2023-12-20 11:09:04,025 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
42
+ 2023-12-20 11:09:04,894 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8867, 2.7355, 3.5004, 3.0148, 2.6714, 3.1555, 3.5046, 3.3658],
43
+ device='cuda:0')
44
+ 2023-12-20 11:09:07,113 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
45
+ 2023-12-20 11:09:10,297 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
46
+ 2023-12-20 11:09:13,477 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
47
+ 2023-12-20 11:09:13,830 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
48
+ 2023-12-20 11:09:15,297 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46654886810101026
49
+ 2023-12-20 11:09:15,297 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-21-avg-3-use-averaged-model-2023-12-20-11-06-15 ADDED
@@ -0,0 +1,54 @@
1
+ 2023-12-20 11:06:15,887 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:06:15,888 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 21, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-21-avg-3-use-averaged-model'}
3
+ 2023-12-20 11:06:15,888 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:06:16,231 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 18 (excluded) to 21
5
+ 2023-12-20 11:06:19,787 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:06:19,788 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:06:19,829 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:06:20,157 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:06:22,846 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:06:25,625 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 11:06:26,455 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8559, 5.1792, 5.1477, 4.9557], device='cuda:0')
12
+ 2023-12-20 11:06:27,866 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6291, 4.6759, 4.8443, 4.7830], device='cuda:0')
13
+ 2023-12-20 11:06:28,176 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
14
+ 2023-12-20 11:06:30,702 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
15
+ 2023-12-20 11:06:33,119 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-20 11:06:35,195 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2547, 3.8186, 3.6821, 3.4924], device='cuda:0')
17
+ 2023-12-20 11:06:35,784 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
18
+ 2023-12-20 11:06:38,237 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-20 11:06:39,634 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6096, 4.0606, 4.0649, 4.1089], device='cuda:0')
20
+ 2023-12-20 11:06:40,715 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
21
+ 2023-12-20 11:06:43,846 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
22
+ 2023-12-20 11:06:47,088 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
23
+ 2023-12-20 11:06:47,518 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5092, 4.0175, 4.0198, 3.7575], device='cuda:0')
24
+ 2023-12-20 11:06:49,965 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8793, 4.8429, 5.0769, 4.9786], device='cuda:0')
25
+ 2023-12-20 11:06:50,247 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
26
+ 2023-12-20 11:06:52,735 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6897, 2.9779, 2.9252, 3.1259, 3.2087, 2.6627, 2.4916, 2.4102],
27
+ device='cuda:0')
28
+ 2023-12-20 11:06:53,293 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
29
+ 2023-12-20 11:06:53,439 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5648, 2.6656, 2.7513, 2.9670], device='cuda:0')
30
+ 2023-12-20 11:06:56,273 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
31
+ 2023-12-20 11:06:59,164 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8141, 4.8171, 5.0680, 4.9154], device='cuda:0')
32
+ 2023-12-20 11:06:59,445 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
33
+ 2023-12-20 11:07:02,568 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
34
+ 2023-12-20 11:07:04,348 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5598, 3.9808, 3.8979, 3.7287], device='cuda:0')
35
+ 2023-12-20 11:07:05,404 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0855, 5.9111, 6.0316, 6.0730], device='cuda:0')
36
+ 2023-12-20 11:07:05,843 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
37
+ 2023-12-20 11:07:08,778 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1505, 3.3695, 3.7438, 3.0835], device='cuda:0')
38
+ 2023-12-20 11:07:08,970 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
39
+ 2023-12-20 11:07:12,039 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
40
+ 2023-12-20 11:07:15,100 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
41
+ 2023-12-20 11:07:18,232 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
42
+ 2023-12-20 11:07:19,967 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9959, 3.7315, 3.4118, 3.1316], device='cuda:0')
43
+ 2023-12-20 11:07:21,439 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
44
+ 2023-12-20 11:07:24,519 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
45
+ 2023-12-20 11:07:27,031 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7929, 2.6129, 2.7436, 3.3087, 2.0227, 2.3036, 3.0193, 2.3700],
46
+ device='cuda:0')
47
+ 2023-12-20 11:07:27,683 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
48
+ 2023-12-20 11:07:28,106 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8415, 4.8176, 5.0712, 4.9051], device='cuda:0')
49
+ 2023-12-20 11:07:30,772 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
50
+ 2023-12-20 11:07:33,925 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
51
+ 2023-12-20 11:07:37,141 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
52
+ 2023-12-20 11:07:37,510 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
53
+ 2023-12-20 11:07:39,128 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4666887532823904
54
+ 2023-12-20 11:07:39,128 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-21-avg-4-use-averaged-model-2023-12-20-11-04-58 ADDED
@@ -0,0 +1,48 @@
1
+ 2023-12-20 11:04:58,087 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:04:58,087 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 21, 'iter': 0, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-21-avg-4-use-averaged-model'}
3
+ 2023-12-20 11:04:58,087 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:04:58,428 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 17 (excluded) to 21
5
+ 2023-12-20 11:05:06,321 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:05:06,321 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:05:06,367 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:05:06,696 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:05:09,713 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:05:11,707 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8163, 5.6989, 5.8234, 5.9072], device='cuda:0')
11
+ 2023-12-20 11:05:12,223 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
12
+ 2023-12-20 11:05:14,790 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 11:05:17,229 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 11:05:19,542 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 11:05:20,449 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2205, 3.7597, 3.8391, 3.5481], device='cuda:0')
16
+ 2023-12-20 11:05:21,873 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-20 11:05:22,001 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0613, 5.8908, 6.0345, 6.0804], device='cuda:0')
18
+ 2023-12-20 11:05:24,061 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-20 11:05:26,379 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-20 11:05:28,792 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
21
+ 2023-12-20 11:05:31,178 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
22
+ 2023-12-20 11:05:33,592 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
23
+ 2023-12-20 11:05:35,966 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
24
+ 2023-12-20 11:05:38,212 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
25
+ 2023-12-20 11:05:40,101 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5342, 4.0219, 3.9738, 3.7552], device='cuda:0')
26
+ 2023-12-20 11:05:40,610 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
27
+ 2023-12-20 11:05:42,148 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5757, 4.0700, 3.6656, 3.9691], device='cuda:0')
28
+ 2023-12-20 11:05:43,099 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
29
+ 2023-12-20 11:05:45,686 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
30
+ 2023-12-20 11:05:48,085 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
31
+ 2023-12-20 11:05:49,580 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8391, 4.8322, 5.0839, 4.9953], device='cuda:0')
32
+ 2023-12-20 11:05:50,100 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0561, 5.9113, 6.0405, 6.0826], device='cuda:0')
33
+ 2023-12-20 11:05:50,465 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
34
+ 2023-12-20 11:05:52,828 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
35
+ 2023-12-20 11:05:54,441 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7527, 3.1270, 2.7678, 2.5458], device='cuda:0')
36
+ 2023-12-20 11:05:55,239 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
37
+ 2023-12-20 11:05:57,858 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
38
+ 2023-12-20 11:05:58,021 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1303, 4.7748, 4.8765, 4.2490], device='cuda:0')
39
+ 2023-12-20 11:06:00,368 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
40
+ 2023-12-20 11:06:02,645 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
41
+ 2023-12-20 11:06:02,931 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8392, 3.0961, 1.9076, 2.8978, 2.6866, 2.6739, 2.8477, 2.8210],
42
+ device='cuda:0')
43
+ 2023-12-20 11:06:05,083 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
44
+ 2023-12-20 11:06:07,443 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
45
+ 2023-12-20 11:06:09,933 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
46
+ 2023-12-20 11:06:10,178 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
47
+ 2023-12-20 11:06:11,583 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4661239957194168
48
+ 2023-12-20 11:06:11,584 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-22-avg-1-use-averaged-model-2023-12-20-11-03-23 ADDED
@@ -0,0 +1,43 @@
1
+ 2023-12-20 11:03:23,760 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:03:23,760 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 22, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-22-avg-1-use-averaged-model'}
3
+ 2023-12-20 11:03:23,760 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:03:24,127 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 21 (excluded) to 22
5
+ 2023-12-20 11:03:29,935 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:03:29,935 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:03:29,976 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:03:30,303 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:03:34,133 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:03:36,386 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.4276, 3.9535, 3.9566, 4.0588], device='cuda:0')
11
+ 2023-12-20 11:03:37,445 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
12
+ 2023-12-20 11:03:40,572 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 11:03:43,654 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 11:03:44,120 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6774, 2.9198, 1.7264, 2.8067, 2.7028, 2.4774, 2.6967, 2.2030],
15
+ device='cuda:0')
16
+ 2023-12-20 11:03:46,848 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
17
+ 2023-12-20 11:03:50,165 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
18
+ 2023-12-20 11:03:53,287 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-20 11:03:53,649 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8710, 2.6711, 3.3951, 2.7988, 2.5262, 2.7845, 3.5050, 3.3978],
20
+ device='cuda:0')
21
+ 2023-12-20 11:03:56,285 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
22
+ 2023-12-20 11:03:59,383 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
23
+ 2023-12-20 11:04:02,635 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-20 11:04:05,690 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-20 11:04:08,802 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
26
+ 2023-12-20 11:04:11,885 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
27
+ 2023-12-20 11:04:14,941 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
28
+ 2023-12-20 11:04:17,987 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
29
+ 2023-12-20 11:04:21,237 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
30
+ 2023-12-20 11:04:24,252 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
31
+ 2023-12-20 11:04:27,309 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
32
+ 2023-12-20 11:04:30,473 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
33
+ 2023-12-20 11:04:33,579 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
34
+ 2023-12-20 11:04:36,823 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
35
+ 2023-12-20 11:04:40,021 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
36
+ 2023-12-20 11:04:43,160 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
37
+ 2023-12-20 11:04:43,299 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2861, 3.6977, 3.8188, 3.6654], device='cuda:0')
38
+ 2023-12-20 11:04:46,123 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
39
+ 2023-12-20 11:04:49,202 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
40
+ 2023-12-20 11:04:52,383 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
41
+ 2023-12-20 11:04:52,709 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
42
+ 2023-12-20 11:04:54,112 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46692434034463237
43
+ 2023-12-20 11:04:54,112 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-22-avg-2-use-averaged-model-2023-12-20-11-01-48 ADDED
@@ -0,0 +1,45 @@
1
+ 2023-12-20 11:01:48,820 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:01:48,820 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 22, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-22-avg-2-use-averaged-model'}
3
+ 2023-12-20 11:01:48,820 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:01:49,177 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 20 (excluded) to 22
5
+ 2023-12-20 11:01:54,709 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:01:54,709 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:01:54,749 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:01:55,077 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:01:59,054 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:02:02,334 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 11:02:03,056 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5889, 4.0579, 3.6889, 4.0691], device='cuda:0')
12
+ 2023-12-20 11:02:05,474 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 11:02:07,772 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6229, 2.3290, 2.7873, 3.2877, 2.0903, 2.3765, 2.9185, 2.2264],
14
+ device='cuda:0')
15
+ 2023-12-20 11:02:08,637 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
16
+ 2023-12-20 11:02:11,852 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
17
+ 2023-12-20 11:02:15,098 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
18
+ 2023-12-20 11:02:18,240 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-20 11:02:21,216 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-20 11:02:24,350 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
21
+ 2023-12-20 11:02:27,694 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
22
+ 2023-12-20 11:02:28,885 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8015, 2.4540, 2.8953, 3.2423, 2.0630, 2.4692, 3.1323, 2.3155],
23
+ device='cuda:0')
24
+ 2023-12-20 11:02:30,824 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-20 11:02:33,044 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0602, 5.8825, 6.0415, 6.0733], device='cuda:0')
26
+ 2023-12-20 11:02:33,969 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
27
+ 2023-12-20 11:02:37,102 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-20 11:02:40,189 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
29
+ 2023-12-20 11:02:43,125 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8894, 5.3494, 5.2495, 5.1433], device='cuda:0')
30
+ 2023-12-20 11:02:43,242 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
31
+ 2023-12-20 11:02:46,313 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
32
+ 2023-12-20 11:02:49,461 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
33
+ 2023-12-20 11:02:52,593 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
34
+ 2023-12-20 11:02:55,727 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
35
+ 2023-12-20 11:02:58,823 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
36
+ 2023-12-20 11:03:02,047 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
37
+ 2023-12-20 11:03:05,221 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
38
+ 2023-12-20 11:03:08,317 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
39
+ 2023-12-20 11:03:11,348 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
40
+ 2023-12-20 11:03:14,469 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2283, 3.3411, 3.7874, 3.0444], device='cuda:0')
41
+ 2023-12-20 11:03:14,488 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
42
+ 2023-12-20 11:03:17,646 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
43
+ 2023-12-20 11:03:17,949 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
44
+ 2023-12-20 11:03:19,331 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46689134536883886
45
+ 2023-12-20 11:03:19,331 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-22-avg-3-use-averaged-model-2023-12-20-11-00-13 ADDED
@@ -0,0 +1,46 @@
1
+ 2023-12-20 11:00:13,479 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 11:00:13,479 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 22, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-22-avg-3-use-averaged-model'}
3
+ 2023-12-20 11:00:13,479 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 11:00:13,829 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 19 (excluded) to 22
5
+ 2023-12-20 11:00:19,382 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 11:00:19,383 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 11:00:19,425 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 11:00:19,751 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 11:00:24,057 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 11:00:27,349 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 11:00:29,427 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9781, 3.3984, 3.5079, 3.0788], device='cuda:0')
12
+ 2023-12-20 11:00:30,561 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 11:00:33,629 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 11:00:36,688 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 11:00:40,007 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-20 11:00:41,404 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0949, 3.5328, 3.5815, 3.4514], device='cuda:0')
17
+ 2023-12-20 11:00:43,151 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-20 11:00:44,488 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8228, 4.4640, 4.7091, 4.0549], device='cuda:0')
19
+ 2023-12-20 11:00:46,301 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-20 11:00:48,834 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0733, 5.8752, 6.0271, 6.0803], device='cuda:0')
21
+ 2023-12-20 11:00:49,425 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
22
+ 2023-12-20 11:00:52,680 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
23
+ 2023-12-20 11:00:55,825 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
24
+ 2023-12-20 11:00:58,918 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
25
+ 2023-12-20 11:01:02,058 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
26
+ 2023-12-20 11:01:05,134 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
27
+ 2023-12-20 11:01:05,444 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2757, 3.6134, 3.8948, 3.6891], device='cuda:0')
28
+ 2023-12-20 11:01:08,174 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
29
+ 2023-12-20 11:01:11,387 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
30
+ 2023-12-20 11:01:14,399 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
31
+ 2023-12-20 11:01:17,471 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
32
+ 2023-12-20 11:01:20,609 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
33
+ 2023-12-20 11:01:23,695 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
34
+ 2023-12-20 11:01:26,281 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8308, 2.9800, 1.9636, 2.8692, 2.6939, 2.7260, 2.9187, 2.9849],
35
+ device='cuda:0')
36
+ 2023-12-20 11:01:27,105 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
37
+ 2023-12-20 11:01:30,218 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
38
+ 2023-12-20 11:01:30,980 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1491, 3.2663, 3.7722, 3.0070], device='cuda:0')
39
+ 2023-12-20 11:01:33,238 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
40
+ 2023-12-20 11:01:36,325 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
41
+ 2023-12-20 11:01:39,397 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
42
+ 2023-12-20 11:01:40,257 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8013, 3.0848, 2.9766, 2.4625], device='cuda:0')
43
+ 2023-12-20 11:01:42,655 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
44
+ 2023-12-20 11:01:43,008 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
45
+ 2023-12-20 11:01:44,551 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46691350631619744
46
+ 2023-12-20 11:01:44,551 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-22-avg-4-use-averaged-model-2023-12-20-10-58-33 ADDED
@@ -0,0 +1,46 @@
1
+ 2023-12-20 10:58:33,848 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:58:33,849 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 22, 'iter': 0, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-22-avg-4-use-averaged-model'}
3
+ 2023-12-20 10:58:33,849 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:58:34,207 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 18 (excluded) to 22
5
+ 2023-12-20 10:58:44,701 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:58:44,701 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:58:44,742 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:58:45,066 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:58:49,090 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:58:52,384 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:58:55,490 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:58:58,680 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-20 10:59:00,447 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5860, 4.0352, 4.1142, 4.1435], device='cuda:0')
14
+ 2023-12-20 10:59:01,787 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 10:59:05,129 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-20 10:59:06,530 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6664, 2.5878, 3.2258, 2.6873, 2.5104, 2.7937, 3.4016, 3.1781],
17
+ device='cuda:0')
18
+ 2023-12-20 10:59:08,267 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-20 10:59:11,405 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-20 10:59:14,438 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
21
+ 2023-12-20 10:59:17,747 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
22
+ 2023-12-20 10:59:20,786 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
23
+ 2023-12-20 10:59:23,933 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
24
+ 2023-12-20 10:59:26,978 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
25
+ 2023-12-20 10:59:30,108 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
26
+ 2023-12-20 10:59:32,661 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8223, 2.9395, 1.8217, 2.9546, 2.6430, 2.4790, 2.6158, 2.8845],
27
+ device='cuda:0')
28
+ 2023-12-20 10:59:33,206 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
29
+ 2023-12-20 10:59:36,441 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
30
+ 2023-12-20 10:59:39,098 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
31
+ 2023-12-20 10:59:42,267 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
32
+ 2023-12-20 10:59:45,324 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
33
+ 2023-12-20 10:59:48,372 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6056, 3.0822, 2.6287, 2.2330], device='cuda:0')
34
+ 2023-12-20 10:59:48,408 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
35
+ 2023-12-20 10:59:51,696 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
36
+ 2023-12-20 10:59:54,834 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
37
+ 2023-12-20 10:59:57,917 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
38
+ 2023-12-20 11:00:01,030 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1599, 3.3898, 3.7804, 3.0739], device='cuda:0')
39
+ 2023-12-20 11:00:01,045 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
40
+ 2023-12-20 11:00:02,734 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5706, 4.0026, 4.1196, 4.1421], device='cuda:0')
41
+ 2023-12-20 11:00:03,388 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8578, 3.1636, 2.9859, 2.6449], device='cuda:0')
42
+ 2023-12-20 11:00:04,184 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
43
+ 2023-12-20 11:00:07,335 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
44
+ 2023-12-20 11:00:07,665 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
45
+ 2023-12-20 11:00:09,097 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46697526055281713
46
+ 2023-12-20 11:00:09,097 INFO [inference_audio_tagging.py:456] Done
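Each of these decode logs begins by reporting "Calculating the averaged model over epoch range from N (excluded) to M". As a rough illustration of that step, the sketch below simply averages the parameters of the checkpoints saved for those epochs; it is a minimal, generic version under the assumption of `epoch-N.pt` files holding a `model` state_dict, not the repository's actual averaging routine.

```python
# Minimal sketch of averaging model weights over an epoch range
# (start excluded, end included), mirroring the log line
# "Calculating the averaged model over epoch range from 18 (excluded) to 22".
# File layout and the "model" key are assumptions, not the actual icefall API.
from pathlib import Path
import torch

def average_checkpoints(exp_dir: str, start: int, end: int) -> dict:
    """Average the state_dicts of epoch-{start+1}.pt .. epoch-{end}.pt."""
    avg_state = None
    for epoch in range(start + 1, end + 1):
        ckpt = torch.load(Path(exp_dir) / f"epoch-{epoch}.pt", map_location="cpu")
        state = ckpt["model"]
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    n = end - start  # number of checkpoints that were summed
    return {k: v / n for k, v in avg_state.items()}
```

In the run above, `avg=4` at epoch 22 therefore corresponds to pooling the checkpoints of epochs 19 through 22.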
exp_md1000/inference_audio_tagging/log-decode-epoch-23-avg-1-use-averaged-model-2023-12-20-10-56-58 ADDED
@@ -0,0 +1,46 @@
1
+ 2023-12-20 10:56:58,159 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:56:58,159 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 23, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-23-avg-1-use-averaged-model'}
3
+ 2023-12-20 10:56:58,159 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:56:58,498 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 22 (excluded) to 23
5
+ 2023-12-20 10:57:04,134 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:57:04,134 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:57:04,181 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:57:04,533 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:57:08,872 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:57:12,191 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:57:15,356 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:57:17,813 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6549, 2.8269, 2.8222, 2.9814], device='cuda:0')
13
+ 2023-12-20 10:57:18,472 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 10:57:21,556 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 10:57:24,840 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-20 10:57:27,902 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
17
+ 2023-12-20 10:57:28,989 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5725, 2.7842, 2.7884, 2.9472], device='cuda:0')
18
+ 2023-12-20 10:57:31,109 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-20 10:57:34,244 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-20 10:57:36,702 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8813, 2.6819, 3.4513, 2.8761, 2.6020, 3.0253, 3.5391, 3.2965],
21
+ device='cuda:0')
22
+ 2023-12-20 10:57:37,528 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
23
+ 2023-12-20 10:57:40,642 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
24
+ 2023-12-20 10:57:43,756 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
25
+ 2023-12-20 10:57:45,637 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0437, 3.6673, 3.6405, 3.3147], device='cuda:0')
26
+ 2023-12-20 10:57:46,725 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
27
+ 2023-12-20 10:57:49,840 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
28
+ 2023-12-20 10:57:52,025 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0745, 5.8795, 6.0085, 6.0708], device='cuda:0')
29
+ 2023-12-20 10:57:52,872 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
30
+ 2023-12-20 10:57:56,028 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
31
+ 2023-12-20 10:57:59,087 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
32
+ 2023-12-20 10:57:59,387 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7933, 3.0623, 1.8310, 2.8659, 2.7469, 2.6966, 2.7873, 3.0126],
33
+ device='cuda:0')
34
+ 2023-12-20 10:58:02,244 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
35
+ 2023-12-20 10:58:02,482 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5067, 2.6426, 2.9523, 2.8566], device='cuda:0')
36
+ 2023-12-20 10:58:05,388 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
37
+ 2023-12-20 10:58:08,555 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
38
+ 2023-12-20 10:58:11,558 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
39
+ 2023-12-20 10:58:14,691 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
40
+ 2023-12-20 10:58:17,867 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
41
+ 2023-12-20 10:58:20,997 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
42
+ 2023-12-20 10:58:24,154 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
43
+ 2023-12-20 10:58:27,424 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
44
+ 2023-12-20 10:58:27,784 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
45
+ 2023-12-20 10:58:29,254 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4672294856773102
46
+ 2023-12-20 10:58:29,255 INFO [inference_audio_tagging.py:456] Done
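The single figure each run ends with, e.g. "mAP for audioset eval is: 0.4672…", is the mean average precision over the 527 AudioSet event classes. A hedged sketch of that metric, with placeholder logits and labels standing in for the script's collected tensors, might look like this:

```python
# Hedged sketch: macro mAP over 527 AudioSet classes, as typically computed
# for multi-label audio tagging. The arrays below are random placeholders.
import numpy as np
from sklearn.metrics import average_precision_score

num_clips, num_events = 15060, 527   # roughly the eval size seen in these logs, 527 classes
rng = np.random.default_rng(0)
logits = rng.normal(size=(num_clips, num_events))        # collected audio logits
labels = rng.random((num_clips, num_events)) < 0.02      # multi-hot ground truth

scores = 1.0 / (1.0 + np.exp(-logits))                   # sigmoid of the logits
# average="macro" takes the mean of the per-class average precisions, i.e. mAP.
mAP = average_precision_score(labels, scores, average="macro")
print(f"mAP: {mAP:.4f}")
```

Because average precision is rank-based and the sigmoid is monotonic, applying it does not change the score; it is kept here only so the inputs read as probabilities.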
exp_md1000/inference_audio_tagging/log-decode-epoch-23-avg-2-use-averaged-model-2023-12-20-10-55-22 ADDED
@@ -0,0 +1,48 @@
1
+ 2023-12-20 10:55:22,770 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:55:22,770 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 23, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-23-avg-2-use-averaged-model'}
3
+ 2023-12-20 10:55:22,770 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:55:23,124 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 21 (excluded) to 23
5
+ 2023-12-20 10:55:28,910 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:55:28,910 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:55:28,968 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:55:29,302 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:55:33,399 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:55:36,737 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:55:39,908 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:55:42,244 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5448, 2.6940, 2.6991, 2.9517, 3.1957, 2.5538, 2.2816, 2.3894],
13
+ device='cuda:0')
14
+ 2023-12-20 10:55:42,267 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6805, 2.7956, 1.8296, 2.7913, 2.5967, 2.6133, 2.6505, 2.1995],
15
+ device='cuda:0')
16
+ 2023-12-20 10:55:43,106 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
17
+ 2023-12-20 10:55:44,133 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0485, 5.8610, 6.0046, 6.0689], device='cuda:0')
18
+ 2023-12-20 10:55:46,309 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
19
+ 2023-12-20 10:55:49,467 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
20
+ 2023-12-20 10:55:52,043 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8505, 2.4890, 2.9270, 3.4757, 2.2964, 2.5200, 3.2618, 2.4451],
21
+ device='cuda:0')
22
+ 2023-12-20 10:55:52,487 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
23
+ 2023-12-20 10:55:55,625 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
24
+ 2023-12-20 10:55:58,765 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
25
+ 2023-12-20 10:56:02,126 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
26
+ 2023-12-20 10:56:05,241 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
27
+ 2023-12-20 10:56:06,109 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0639, 3.4447, 3.7432, 3.3268], device='cuda:0')
28
+ 2023-12-20 10:56:08,282 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
29
+ 2023-12-20 10:56:11,342 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
30
+ 2023-12-20 10:56:14,438 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
31
+ 2023-12-20 10:56:17,421 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
32
+ 2023-12-20 10:56:20,781 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
33
+ 2023-12-20 10:56:23,830 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
34
+ 2023-12-20 10:56:26,888 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
35
+ 2023-12-20 10:56:30,068 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
36
+ 2023-12-20 10:56:33,202 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
37
+ 2023-12-20 10:56:36,415 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
38
+ 2023-12-20 10:56:39,415 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
39
+ 2023-12-20 10:56:41,085 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1648, 3.3316, 3.7046, 3.0234], device='cuda:0')
40
+ 2023-12-20 10:56:42,524 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
41
+ 2023-12-20 10:56:45,645 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
42
+ 2023-12-20 10:56:47,560 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8917, 2.7313, 3.3804, 2.7867, 2.6712, 3.1485, 3.5576, 3.4627],
43
+ device='cuda:0')
44
+ 2023-12-20 10:56:48,714 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
45
+ 2023-12-20 10:56:51,949 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
46
+ 2023-12-20 10:56:52,316 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
47
+ 2023-12-20 10:56:53,837 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46715505120687295
48
+ 2023-12-20 10:56:53,838 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-23-avg-3-use-averaged-model-2023-12-20-10-53-47 ADDED
@@ -0,0 +1,42 @@
1
+ 2023-12-20 10:53:47,222 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:53:47,222 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 23, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-23-avg-3-use-averaged-model'}
3
+ 2023-12-20 10:53:47,222 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:53:47,571 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 20 (excluded) to 23
5
+ 2023-12-20 10:53:53,105 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:53:53,106 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:53:53,144 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:53:53,466 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:53:57,468 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:54:00,150 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8106, 3.4244, 3.4437, 3.0550], device='cuda:0')
11
+ 2023-12-20 10:54:00,815 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
12
+ 2023-12-20 10:54:03,138 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.4775, 3.9828, 3.6405, 3.9825], device='cuda:0')
13
+ 2023-12-20 10:54:03,975 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
14
+ 2023-12-20 10:54:07,155 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
15
+ 2023-12-20 10:54:10,305 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-20 10:54:13,707 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-20 10:54:15,312 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7149, 2.9078, 2.9069, 3.2008, 3.2296, 2.7076, 2.4560, 2.6394],
18
+ device='cuda:0')
19
+ 2023-12-20 10:54:16,737 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
20
+ 2023-12-20 10:54:19,838 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
21
+ 2023-12-20 10:54:22,945 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
22
+ 2023-12-20 10:54:23,173 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0843, 5.8989, 6.0467, 6.0766], device='cuda:0')
23
+ 2023-12-20 10:54:26,346 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-20 10:54:29,456 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-20 10:54:32,483 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
26
+ 2023-12-20 10:54:35,538 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
27
+ 2023-12-20 10:54:38,716 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
28
+ 2023-12-20 10:54:41,764 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
29
+ 2023-12-20 10:54:44,942 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
30
+ 2023-12-20 10:54:48,064 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
31
+ 2023-12-20 10:54:51,135 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
32
+ 2023-12-20 10:54:54,289 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
33
+ 2023-12-20 10:54:57,419 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
34
+ 2023-12-20 10:55:00,729 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
35
+ 2023-12-20 10:55:03,874 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
36
+ 2023-12-20 10:55:07,012 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
37
+ 2023-12-20 10:55:10,066 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
38
+ 2023-12-20 10:55:13,158 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
39
+ 2023-12-20 10:55:16,427 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
40
+ 2023-12-20 10:55:16,793 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
41
+ 2023-12-20 10:55:18,230 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46723830824330675
42
+ 2023-12-20 10:55:18,230 INFO [inference_audio_tagging.py:456] Done
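Since the commit adds one such log per (epoch, avg) combination, the final scores are easiest to compare after pulling them out of the filenames and the "mAP for audioset eval is:" lines. The helper below is illustrative only; the directory path and glob pattern are assumptions based on the filenames in this commit.

```python
# Illustrative helper: collect the final mAP from each decode log and sort
# the (epoch, avg) settings by score. The path is an assumption.
import re
from pathlib import Path

LOG_DIR = Path("exp_md1000/inference_audio_tagging")
name_re = re.compile(r"log-decode-epoch-(\d+)-avg-(\d+)-use-averaged-model")
map_re = re.compile(r"mAP for audioset eval is: ([0-9.]+)")

results = []
for log_file in LOG_DIR.glob("log-decode-epoch-*-avg-*-use-averaged-model-*"):
    m = name_re.search(log_file.name)
    scores = map_re.findall(log_file.read_text(errors="ignore"))
    if m and scores:
        results.append((int(m.group(1)), int(m.group(2)), float(scores[-1])))

for epoch, avg, map_score in sorted(results, key=lambda r: -r[2]):
    print(f"epoch={epoch:2d} avg={avg}  mAP={map_score:.5f}")
```

Run from the repository root, this prints the decode settings ranked by eval mAP.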
exp_md1000/inference_audio_tagging/log-decode-epoch-23-avg-4-use-averaged-model-2023-12-20-10-52-08 ADDED
@@ -0,0 +1,45 @@
1
+ 2023-12-20 10:52:08,101 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:52:08,101 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 23, 'iter': 0, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-23-avg-4-use-averaged-model'}
3
+ 2023-12-20 10:52:08,101 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:52:08,444 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 19 (excluded) to 23
5
+ 2023-12-20 10:52:17,768 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:52:17,769 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:52:17,819 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:52:18,147 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:52:21,909 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:52:25,279 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:52:28,451 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:52:31,597 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-20 10:52:34,788 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-20 10:52:38,065 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
15
+ 2023-12-20 10:52:41,210 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
16
+ 2023-12-20 10:52:44,272 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
17
+ 2023-12-20 10:52:47,414 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
18
+ 2023-12-20 10:52:50,651 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
19
+ 2023-12-20 10:52:53,885 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
20
+ 2023-12-20 10:52:56,041 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7994, 2.3686, 2.6597, 3.1798, 1.9373, 2.3890, 3.0216, 2.2980],
21
+ device='cuda:0')
22
+ 2023-12-20 10:52:57,039 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
23
+ 2023-12-20 10:52:57,930 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8509, 3.0907, 2.1191, 2.8986, 2.7374, 2.6884, 2.6836, 2.9336],
24
+ device='cuda:0')
25
+ 2023-12-20 10:53:00,221 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
26
+ 2023-12-20 10:53:03,313 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
27
+ 2023-12-20 10:53:06,445 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
28
+ 2023-12-20 10:53:09,589 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
29
+ 2023-12-20 10:53:12,697 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
30
+ 2023-12-20 10:53:15,770 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
31
+ 2023-12-20 10:53:18,920 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
32
+ 2023-12-20 10:53:19,514 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8414, 3.0596, 1.8992, 3.0082, 2.7119, 2.6764, 2.5959, 2.7645],
33
+ device='cuda:0')
34
+ 2023-12-20 10:53:22,083 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
35
+ 2023-12-20 10:53:25,265 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
36
+ 2023-12-20 10:53:28,466 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
37
+ 2023-12-20 10:53:30,739 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1052, 4.6517, 5.0213, 4.1650], device='cuda:0')
38
+ 2023-12-20 10:53:31,613 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
39
+ 2023-12-20 10:53:34,595 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
40
+ 2023-12-20 10:53:34,843 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1247, 4.7627, 4.9785, 4.3122], device='cuda:0')
41
+ 2023-12-20 10:53:37,666 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
42
+ 2023-12-20 10:53:40,856 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
43
+ 2023-12-20 10:53:41,223 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
44
+ 2023-12-20 10:53:42,991 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46717957666901516
45
+ 2023-12-20 10:53:42,992 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-24-avg-1-use-averaged-model-2023-12-20-10-50-32 ADDED
@@ -0,0 +1,46 @@
1
+ 2023-12-20 10:50:32,664 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:50:32,664 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 24, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-24-avg-1-use-averaged-model'}
3
+ 2023-12-20 10:50:32,664 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:50:33,024 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 23 (excluded) to 24
5
+ 2023-12-20 10:50:38,859 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:50:38,859 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:50:38,899 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:50:39,227 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:50:43,340 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:50:46,640 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:50:49,866 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:50:52,840 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-20 10:50:56,049 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-20 10:50:59,360 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
15
+ 2023-12-20 10:51:02,553 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
16
+ 2023-12-20 10:51:05,714 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
17
+ 2023-12-20 10:51:08,834 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
18
+ 2023-12-20 10:51:12,165 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
19
+ 2023-12-20 10:51:14,136 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8888, 5.3275, 5.1870, 4.9690], device='cuda:0')
20
+ 2023-12-20 10:51:15,188 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
21
+ 2023-12-20 10:51:18,199 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
22
+ 2023-12-20 10:51:21,231 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
23
+ 2023-12-20 10:51:24,354 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
24
+ 2023-12-20 10:51:27,473 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
25
+ 2023-12-20 10:51:30,700 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
26
+ 2023-12-20 10:51:33,493 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5962, 4.0835, 4.1275, 4.2003], device='cuda:0')
27
+ 2023-12-20 10:51:33,826 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
28
+ 2023-12-20 10:51:36,953 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
29
+ 2023-12-20 10:51:37,364 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1005, 4.7436, 4.9370, 4.2737], device='cuda:0')
30
+ 2023-12-20 10:51:39,167 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0381, 3.6689, 3.6961, 3.3511], device='cuda:0')
31
+ 2023-12-20 10:51:39,978 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
32
+ 2023-12-20 10:51:43,190 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
33
+ 2023-12-20 10:51:46,405 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
34
+ 2023-12-20 10:51:48,852 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9100, 2.7452, 3.3747, 2.9567, 2.7228, 3.0491, 3.5594, 3.4461],
35
+ device='cuda:0')
36
+ 2023-12-20 10:51:49,529 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
37
+ 2023-12-20 10:51:52,648 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
38
+ 2023-12-20 10:51:55,375 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5347, 3.9743, 4.0591, 3.7593], device='cuda:0')
39
+ 2023-12-20 10:51:55,800 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
40
+ 2023-12-20 10:51:57,135 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8227, 2.6648, 2.8151, 3.4215, 1.9542, 2.3104, 3.0903, 2.4416],
41
+ device='cuda:0')
42
+ 2023-12-20 10:51:58,856 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
43
+ 2023-12-20 10:52:01,998 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
44
+ 2023-12-20 10:52:02,332 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
45
+ 2023-12-20 10:52:03,751 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46763480126476387
46
+ 2023-12-20 10:52:03,751 INFO [inference_audio_tagging.py:456] Done
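After "Finish collecting audio logits", each clip has one logit per event class (`num_events` is 527 in the logged config). A minimal, hypothetical sketch of turning such logits into tag predictions, with random tensors and generic class names standing in for the real outputs and label map:

```python
# Hedged sketch: from per-clip logits to audio tags. 527 matches 'num_events'
# in the logged config; the tensors and class names are placeholders.
import torch

num_events = 527
logits = torch.randn(4, num_events)      # stand-in for one decoded batch
probs = logits.sigmoid()                 # independent per-class probabilities

predicted = probs > 0.5                  # simple thresholding, or:
topk_prob, topk_idx = probs.topk(k=3, dim=-1)   # report the top-k events per clip
for clip, (p, idx) in enumerate(zip(topk_prob, topk_idx)):
    tags = ", ".join(f"class_{i.item()}:{v.item():.2f}" for v, i in zip(p, idx))
    print(f"clip {clip}: {tags}")
```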
exp_md1000/inference_audio_tagging/log-decode-epoch-24-avg-2-use-averaged-model-2023-12-20-10-48-56 ADDED
@@ -0,0 +1,53 @@
1
+ 2023-12-20 10:48:56,881 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:48:56,881 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 24, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-24-avg-2-use-averaged-model'}
3
+ 2023-12-20 10:48:56,881 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:48:57,308 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 22 (excluded) to 24
5
+ 2023-12-20 10:49:02,719 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:49:02,719 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:49:02,758 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:49:03,089 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:49:07,346 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:49:10,270 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3551, 3.7676, 3.8025, 3.8615], device='cuda:0')
11
+ 2023-12-20 10:49:10,648 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
12
+ 2023-12-20 10:49:13,838 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 10:49:15,506 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7862, 2.8688, 1.8684, 2.8396, 2.6485, 2.6592, 2.7715, 2.9141],
14
+ device='cuda:0')
15
+ 2023-12-20 10:49:16,449 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7167, 2.9045, 2.9003, 3.1714, 3.1905, 2.6751, 2.4910, 2.5156],
16
+ device='cuda:0')
17
+ 2023-12-20 10:49:17,008 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
18
+ 2023-12-20 10:49:20,047 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
19
+ 2023-12-20 10:49:22,135 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5926, 4.0174, 4.0387, 4.1476], device='cuda:0')
20
+ 2023-12-20 10:49:23,319 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
21
+ 2023-12-20 10:49:26,468 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
22
+ 2023-12-20 10:49:29,665 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
23
+ 2023-12-20 10:49:32,869 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
24
+ 2023-12-20 10:49:36,198 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
25
+ 2023-12-20 10:49:37,368 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2635, 3.6282, 3.6999, 3.6278], device='cuda:0')
26
+ 2023-12-20 10:49:39,355 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
27
+ 2023-12-20 10:49:42,378 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
28
+ 2023-12-20 10:49:45,383 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
29
+ 2023-12-20 10:49:48,496 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
30
+ 2023-12-20 10:49:51,639 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
31
+ 2023-12-20 10:49:54,909 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
32
+ 2023-12-20 10:49:57,099 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8074, 4.8366, 5.0518, 4.9116], device='cuda:0')
33
+ 2023-12-20 10:49:58,040 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
34
+ 2023-12-20 10:50:01,125 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
35
+ 2023-12-20 10:50:04,238 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
36
+ 2023-12-20 10:50:04,485 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.7982, 4.8037, 5.0578, 4.9325], device='cuda:0')
37
+ 2023-12-20 10:50:07,212 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1707, 3.3783, 3.8437, 3.0453], device='cuda:0')
38
+ 2023-12-20 10:50:07,400 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
39
+ 2023-12-20 10:50:07,546 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5504, 2.6988, 2.8989, 2.8307], device='cuda:0')
40
+ 2023-12-20 10:50:10,528 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
41
+ 2023-12-20 10:50:13,624 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
42
+ 2023-12-20 10:50:14,223 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6606, 2.5859, 2.7178, 2.9232], device='cuda:0')
43
+ 2023-12-20 10:50:16,647 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0197, 3.7185, 3.7132, 3.2520], device='cuda:0')
44
+ 2023-12-20 10:50:16,760 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2589, 3.7117, 3.6402, 3.5176], device='cuda:0')
45
+ 2023-12-20 10:50:16,789 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
46
+ 2023-12-20 10:50:19,673 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9095, 5.3919, 5.1990, 4.9975], device='cuda:0')
47
+ 2023-12-20 10:50:19,952 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
48
+ 2023-12-20 10:50:23,072 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
49
+ 2023-12-20 10:50:23,639 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0236, 3.6324, 3.5681, 3.1481], device='cuda:0')
50
+ 2023-12-20 10:50:26,380 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
51
+ 2023-12-20 10:50:26,662 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
52
+ 2023-12-20 10:50:28,266 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4678751339833697
53
+ 2023-12-20 10:50:28,266 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-24-avg-3-use-averaged-model-2023-12-20-10-47-17 ADDED
@@ -0,0 +1,49 @@
1
+ 2023-12-20 10:47:17,045 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:47:17,045 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 24, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-24-avg-3-use-averaged-model'}
3
+ 2023-12-20 10:47:17,045 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:47:17,395 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 21 (excluded) to 24
5
+ 2023-12-20 10:47:22,988 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:47:22,988 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:47:23,030 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:47:23,354 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:47:27,319 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:47:30,701 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:47:32,042 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7532, 3.1476, 3.7085, 2.9582], device='cuda:0')
12
+ 2023-12-20 10:47:33,837 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 10:47:36,979 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 10:47:38,212 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9734, 4.6247, 4.9158, 4.2518], device='cuda:0')
15
+ 2023-12-20 10:47:40,292 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-20 10:47:43,683 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-20 10:47:49,526 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-20 10:47:52,787 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-20 10:47:55,921 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-20 10:47:57,826 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6977, 2.9882, 2.9382, 3.1370, 3.1896, 2.6716, 2.4757, 2.5104],
21
+ device='cuda:0')
22
+ 2023-12-20 10:47:59,253 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
23
+ 2023-12-20 10:48:00,872 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9249, 2.7196, 3.4739, 3.1477, 2.7351, 3.1189, 3.5510, 3.4769],
24
+ device='cuda:0')
25
+ 2023-12-20 10:48:01,218 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2793, 3.7572, 3.7132, 3.6564], device='cuda:0')
26
+ 2023-12-20 10:48:02,462 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
27
+ 2023-12-20 10:48:05,679 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
28
+ 2023-12-20 10:48:07,716 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6402, 2.6255, 2.9543, 2.9100], device='cuda:0')
29
+ 2023-12-20 10:48:08,859 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
30
+ 2023-12-20 10:48:11,855 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
31
+ 2023-12-20 10:48:14,751 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7077, 3.2850, 2.7264, 2.3151], device='cuda:0')
32
+ 2023-12-20 10:48:15,124 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
33
+ 2023-12-20 10:48:18,503 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
34
+ 2023-12-20 10:48:21,734 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
35
+ 2023-12-20 10:48:24,854 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
36
+ 2023-12-20 10:48:28,081 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
37
+ 2023-12-20 10:48:28,839 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1248, 3.3595, 3.8671, 3.0543], device='cuda:0')
38
+ 2023-12-20 10:48:31,312 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
39
+ 2023-12-20 10:48:34,525 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
40
+ 2023-12-20 10:48:37,632 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
41
+ 2023-12-20 10:48:40,764 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
42
+ 2023-12-20 10:48:43,501 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5844, 3.9695, 3.5028, 3.9544], device='cuda:0')
43
+ 2023-12-20 10:48:43,877 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
44
+ 2023-12-20 10:48:46,990 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
45
+ 2023-12-20 10:48:47,261 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2336, 3.7827, 3.7901, 3.7196], device='cuda:0')
46
+ 2023-12-20 10:48:50,098 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
47
+ 2023-12-20 10:48:50,480 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
48
+ 2023-12-20 10:48:51,980 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4677813299273917
49
+ 2023-12-20 10:48:51,980 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-24-avg-4-use-averaged-model-2023-12-20-10-45-41 ADDED
@@ -0,0 +1,44 @@
1
+ 2023-12-20 10:45:41,331 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:45:41,331 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 24, 'iter': 0, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-24-avg-4-use-averaged-model'}
3
+ 2023-12-20 10:45:41,331 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:45:41,687 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 20 (excluded) to 24
5
+ 2023-12-20 10:45:47,672 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:45:47,672 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:45:47,713 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:45:48,038 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:45:52,270 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:45:55,603 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:45:58,064 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5422, 3.9444, 4.0078, 3.6638], device='cuda:0')
12
+ 2023-12-20 10:45:58,714 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
13
+ 2023-12-20 10:46:01,921 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 10:46:02,232 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6676, 3.0229, 2.6733, 2.2383], device='cuda:0')
15
+ 2023-12-20 10:46:05,112 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-20 10:46:08,517 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-20 10:46:11,587 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-20 10:46:14,763 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-20 10:46:17,868 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-20 10:46:21,122 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
21
+ 2023-12-20 10:46:24,276 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
22
+ 2023-12-20 10:46:25,509 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9611, 3.4158, 3.4273, 3.1896], device='cuda:0')
23
+ 2023-12-20 10:46:27,356 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
24
+ 2023-12-20 10:46:30,495 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
25
+ 2023-12-20 10:46:33,521 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
26
+ 2023-12-20 10:46:36,478 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
27
+ 2023-12-20 10:46:39,585 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
28
+ 2023-12-20 10:46:42,701 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
29
+ 2023-12-20 10:46:44,071 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0002, 3.5787, 3.7091, 3.3797], device='cuda:0')
30
+ 2023-12-20 10:46:45,796 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
31
+ 2023-12-20 10:46:48,943 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
32
+ 2023-12-20 10:46:51,884 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0760, 5.9018, 6.0406, 6.0741], device='cuda:0')
33
+ 2023-12-20 10:46:52,019 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
34
+ 2023-12-20 10:46:55,266 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
35
+ 2023-12-20 10:46:58,336 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
36
+ 2023-12-20 10:47:01,382 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
37
+ 2023-12-20 10:47:04,531 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
38
+ 2023-12-20 10:47:07,400 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8208, 2.8864, 1.8047, 2.8759, 2.6535, 2.6208, 2.7454, 2.8134],
39
+ device='cuda:0')
40
+ 2023-12-20 10:47:07,628 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
41
+ 2023-12-20 10:47:10,862 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
42
+ 2023-12-20 10:47:11,160 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
43
+ 2023-12-20 10:47:12,839 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4678342733391697
44
+ 2023-12-20 10:47:12,839 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-25-avg-1-use-averaged-model-2023-12-20-10-44-03 ADDED
@@ -0,0 +1,49 @@
1
+ 2023-12-20 10:44:03,655 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:44:03,655 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 25, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-25-avg-1-use-averaged-model'}
3
+ 2023-12-20 10:44:03,655 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:44:04,014 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 24 (excluded) to 25
5
+ 2023-12-20 10:44:10,539 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:44:10,540 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:44:10,596 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:44:10,925 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:44:14,773 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:44:18,064 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:44:21,320 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:44:21,852 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.7406, 4.7232, 4.8807, 4.8243], device='cuda:0')
13
+ 2023-12-20 10:44:23,359 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6484, 2.9601, 2.9909, 3.1589, 3.2593, 2.7683, 2.4617, 2.4030],
14
+ device='cuda:0')
15
+ 2023-12-20 10:44:24,527 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
16
+ 2023-12-20 10:44:27,693 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
17
+ 2023-12-20 10:44:27,814 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8447, 2.5617, 2.8367, 3.4541, 1.9861, 2.4823, 3.1923, 2.4085],
18
+ device='cuda:0')
19
+ 2023-12-20 10:44:31,252 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
20
+ 2023-12-20 10:44:32,273 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8479, 4.8294, 5.0751, 4.9540], device='cuda:0')
21
+ 2023-12-20 10:44:34,488 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
22
+ 2023-12-20 10:44:37,672 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
23
+ 2023-12-20 10:44:40,777 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
24
+ 2023-12-20 10:44:42,899 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7354, 2.9335, 3.0479, 3.1602, 3.2293, 2.7738, 2.5081, 2.4577],
25
+ device='cuda:0')
26
+ 2023-12-20 10:44:44,048 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6345, 2.7496, 2.8130, 3.0798], device='cuda:0')
27
+ 2023-12-20 10:44:44,260 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
28
+ 2023-12-20 10:44:47,361 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
29
+ 2023-12-20 10:44:50,460 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
30
+ 2023-12-20 10:44:53,676 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
31
+ 2023-12-20 10:44:56,756 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
32
+ 2023-12-20 10:44:59,667 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5948, 4.0466, 3.4720, 3.9710], device='cuda:0')
33
+ 2023-12-20 10:44:59,918 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
34
+ 2023-12-20 10:45:03,151 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
35
+ 2023-12-20 10:45:06,028 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1141, 4.7818, 4.9903, 4.1771], device='cuda:0')
36
+ 2023-12-20 10:45:06,305 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
37
+ 2023-12-20 10:45:09,450 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
38
+ 2023-12-20 10:45:12,540 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
39
+ 2023-12-20 10:45:15,653 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
40
+ 2023-12-20 10:45:19,023 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
41
+ 2023-12-20 10:45:22,126 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
42
+ 2023-12-20 10:45:22,527 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8982, 5.3673, 5.2870, 5.1760], device='cuda:0')
43
+ 2023-12-20 10:45:25,242 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
44
+ 2023-12-20 10:45:28,509 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
45
+ 2023-12-20 10:45:31,515 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
46
+ 2023-12-20 10:45:34,760 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
47
+ 2023-12-20 10:45:35,146 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
48
+ 2023-12-20 10:45:36,584 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46796178277007094
49
+ 2023-12-20 10:45:36,584 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-25-avg-2-use-averaged-model-2023-12-20-10-42-27 ADDED
@@ -0,0 +1,45 @@
1
+ 2023-12-20 10:42:27,103 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:42:27,103 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 25, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-25-avg-2-use-averaged-model'}
3
+ 2023-12-20 10:42:27,104 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:42:27,455 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 23 (excluded) to 25
5
+ 2023-12-20 10:42:33,987 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:42:33,987 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:42:34,038 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:42:34,371 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:42:38,261 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:42:41,595 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:42:44,723 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:42:47,856 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-20 10:42:48,845 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.7527, 3.2765, 3.3133, 2.8836], device='cuda:0')
14
+ 2023-12-20 10:42:51,061 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 10:42:54,410 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-20 10:42:57,585 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
17
+ 2023-12-20 10:43:00,675 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
18
+ 2023-12-20 10:43:02,721 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7943, 3.2720, 2.7736, 2.4867], device='cuda:0')
19
+ 2023-12-20 10:43:03,019 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8945, 2.7767, 3.5104, 3.0030, 2.7343, 3.0379, 3.6371, 3.3575],
20
+ device='cuda:0')
21
+ 2023-12-20 10:43:03,177 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5923, 4.0459, 3.9618, 4.1360], device='cuda:0')
22
+ 2023-12-20 10:43:03,780 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
23
+ 2023-12-20 10:43:07,151 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-20 10:43:08,121 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2074, 3.3094, 3.8335, 3.0254], device='cuda:0')
25
+ 2023-12-20 10:43:10,296 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
26
+ 2023-12-20 10:43:13,449 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
27
+ 2023-12-20 10:43:16,509 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-20 10:43:19,590 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
29
+ 2023-12-20 10:43:22,711 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
30
+ 2023-12-20 10:43:25,921 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
31
+ 2023-12-20 10:43:29,045 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
32
+ 2023-12-20 10:43:32,006 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5982, 3.9976, 4.0571, 4.1327], device='cuda:0')
33
+ 2023-12-20 10:43:32,102 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
34
+ 2023-12-20 10:43:32,323 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5829, 4.0430, 3.5962, 4.0216], device='cuda:0')
35
+ 2023-12-20 10:43:35,308 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
36
+ 2023-12-20 10:43:38,462 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
37
+ 2023-12-20 10:43:41,742 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
38
+ 2023-12-20 10:43:44,926 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
39
+ 2023-12-20 10:43:47,974 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
40
+ 2023-12-20 10:43:51,055 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
41
+ 2023-12-20 10:43:54,205 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
42
+ 2023-12-20 10:43:57,434 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
43
+ 2023-12-20 10:43:57,759 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
44
+ 2023-12-20 10:43:59,295 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46852371371488843
45
+ 2023-12-20 10:43:59,295 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-25-avg-3-use-averaged-model-2023-12-20-10-40-52 ADDED
@@ -0,0 +1,48 @@
1
+ 2023-12-20 10:40:52,203 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:40:52,203 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 25, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-25-avg-3-use-averaged-model'}
3
+ 2023-12-20 10:40:52,203 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:40:52,617 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 22 (excluded) to 25
5
+ 2023-12-20 10:40:58,282 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:40:58,283 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:40:58,325 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:40:58,654 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:41:02,813 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:41:06,112 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:41:09,269 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:41:12,130 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1242, 3.6203, 3.6501, 3.5436], device='cuda:0')
13
+ 2023-12-20 10:41:12,428 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 10:41:14,021 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6019, 4.0095, 3.4257, 3.8906], device='cuda:0')
15
+ 2023-12-20 10:41:15,575 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
16
+ 2023-12-20 10:41:18,967 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
17
+ 2023-12-20 10:41:22,145 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
18
+ 2023-12-20 10:41:25,288 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-20 10:41:25,365 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.4024, 3.9211, 3.5338, 3.8752], device='cuda:0')
20
+ 2023-12-20 10:41:28,491 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
21
+ 2023-12-20 10:41:31,599 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
22
+ 2023-12-20 10:41:34,743 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
23
+ 2023-12-20 10:41:37,364 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
24
+ 2023-12-20 10:41:38,193 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6002, 4.0276, 3.6396, 4.0477], device='cuda:0')
25
+ 2023-12-20 10:41:38,543 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7701, 3.1188, 2.9718, 3.2385, 3.2177, 2.7184, 2.5733, 2.7313],
26
+ device='cuda:0')
27
+ 2023-12-20 10:41:40,493 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-20 10:41:43,639 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
29
+ 2023-12-20 10:41:46,743 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
30
+ 2023-12-20 10:41:49,931 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
31
+ 2023-12-20 10:41:52,935 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
32
+ 2023-12-20 10:41:55,952 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
33
+ 2023-12-20 10:41:59,034 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
34
+ 2023-12-20 10:42:00,531 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5261, 3.9898, 4.1448, 3.7577], device='cuda:0')
35
+ 2023-12-20 10:42:02,176 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
36
+ 2023-12-20 10:42:05,470 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
37
+ 2023-12-20 10:42:05,665 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8968, 4.8723, 5.0816, 4.8382], device='cuda:0')
38
+ 2023-12-20 10:42:07,523 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8191, 4.8660, 5.0645, 4.9789], device='cuda:0')
39
+ 2023-12-20 10:42:08,326 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7500, 3.0832, 2.9960, 3.1950, 3.3016, 2.7407, 2.5224, 2.4873],
40
+ device='cuda:0')
41
+ 2023-12-20 10:42:08,567 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
42
+ 2023-12-20 10:42:11,709 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
43
+ 2023-12-20 10:42:14,776 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
44
+ 2023-12-20 10:42:17,813 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
45
+ 2023-12-20 10:42:20,933 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
46
+ 2023-12-20 10:42:21,237 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
47
+ 2023-12-20 10:42:22,749 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46854661978980017
48
+ 2023-12-20 10:42:22,750 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-25-avg-4-use-averaged-model-2023-12-20-10-39-15 ADDED
@@ -0,0 +1,43 @@
1
+ 2023-12-20 10:39:15,687 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-20 10:39:15,688 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 25, 'iter': 0, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-25-avg-4-use-averaged-model'}
3
+ 2023-12-20 10:39:15,688 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-20 10:39:16,039 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 21 (excluded) to 25
5
+ 2023-12-20 10:39:21,651 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-20 10:39:21,651 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-20 10:39:21,703 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-20 10:39:22,030 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-20 10:39:26,230 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-20 10:39:29,556 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-20 10:39:32,725 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-20 10:39:32,807 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9127, 4.5355, 4.8060, 4.1871], device='cuda:0')
13
+ 2023-12-20 10:39:35,748 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-20 10:39:39,008 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-20 10:39:42,367 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-20 10:39:45,619 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
17
+ 2023-12-20 10:39:45,670 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0721, 5.8896, 6.0316, 6.0736], device='cuda:0')
18
+ 2023-12-20 10:39:48,751 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
19
+ 2023-12-20 10:39:51,897 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
20
+ 2023-12-20 10:39:55,298 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
21
+ 2023-12-20 10:39:58,348 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
22
+ 2023-12-20 10:40:01,326 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
23
+ 2023-12-20 10:40:04,488 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
24
+ 2023-12-20 10:40:07,552 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
25
+ 2023-12-20 10:40:07,944 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3410, 3.6728, 3.7610, 3.6947], device='cuda:0')
26
+ 2023-12-20 10:40:10,624 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
27
+ 2023-12-20 10:40:14,020 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
28
+ 2023-12-20 10:40:17,092 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
29
+ 2023-12-20 10:40:20,269 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
30
+ 2023-12-20 10:40:23,369 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
31
+ 2023-12-20 10:40:26,379 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
32
+ 2023-12-20 10:40:29,713 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
33
+ 2023-12-20 10:40:32,819 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
34
+ 2023-12-20 10:40:35,961 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
35
+ 2023-12-20 10:40:37,733 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0667, 5.8529, 6.0123, 6.0612], device='cuda:0')
36
+ 2023-12-20 10:40:38,755 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0811, 5.8904, 6.0285, 6.0771], device='cuda:0')
37
+ 2023-12-20 10:40:39,163 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
38
+ 2023-12-20 10:40:39,321 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1344, 3.2885, 3.8087, 2.9868], device='cuda:0')
39
+ 2023-12-20 10:40:42,384 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
40
+ 2023-12-20 10:40:45,696 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
41
+ 2023-12-20 10:40:46,059 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
42
+ 2023-12-20 10:40:47,627 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46822492011825256
43
+ 2023-12-20 10:40:47,627 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-26-avg-1-use-averaged-model-2023-12-21-10-09-44 ADDED
@@ -0,0 +1,46 @@
1
+ 2023-12-21 10:09:44,364 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-21 10:09:44,364 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 26, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-26-avg-1-use-averaged-model'}
+ 2023-12-21 10:09:44,364 INFO [inference_audio_tagging.py:324] About to create model
+ 2023-12-21 10:09:44,713 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 25 (excluded) to 26
+ 2023-12-21 10:09:51,952 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
+ 2023-12-21 10:09:51,953 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
+ 2023-12-21 10:09:52,014 INFO [kd_datamodule.py:534] About to create dev dataset
+ 2023-12-21 10:09:52,364 INFO [kd_datamodule.py:555] About to create dev dataloader
+ 2023-12-21 10:09:56,954 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
+ 2023-12-21 10:10:00,423 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
+ 2023-12-21 10:10:03,626 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
+ 2023-12-21 10:10:06,708 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
+ 2023-12-21 10:10:09,936 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
+ 2023-12-21 10:10:12,588 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0104, 4.7009, 4.9490, 4.2600], device='cuda:0')
+ 2023-12-21 10:10:13,319 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
+ 2023-12-21 10:10:16,467 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
+ 2023-12-21 10:10:19,662 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
+ 2023-12-21 10:10:22,809 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
+ 2023-12-21 10:10:23,297 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8222, 4.8494, 5.0612, 4.9851], device='cuda:0')
+ 2023-12-21 10:10:26,143 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
+ 2023-12-21 10:10:29,236 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
+ 2023-12-21 10:10:32,234 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
+ 2023-12-21 10:10:35,006 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5908, 4.0796, 4.1266, 4.1072], device='cuda:0')
+ 2023-12-21 10:10:35,399 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
+ 2023-12-21 10:10:38,183 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7551, 2.8724, 2.9775, 3.1511, 3.2957, 2.6778, 2.4982, 2.5637],
+ device='cuda:0')
+ 2023-12-21 10:10:38,597 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
+ 2023-12-21 10:10:39,247 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5636, 3.8586, 4.0116, 3.8199], device='cuda:0')
+ 2023-12-21 10:10:41,658 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
+ 2023-12-21 10:10:44,998 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
+ 2023-12-21 10:10:48,174 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
+ 2023-12-21 10:10:51,269 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
+ 2023-12-21 10:10:53,153 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2941, 3.7671, 3.8608, 3.7132], device='cuda:0')
+ 2023-12-21 10:10:54,522 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
+ 2023-12-21 10:10:57,513 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
+ 2023-12-21 10:10:59,751 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0644, 5.8480, 6.0137, 6.0666], device='cuda:0')
+ 2023-12-21 10:11:00,766 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
+ 2023-12-21 10:11:03,885 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
+ 2023-12-21 10:11:07,046 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
+ 2023-12-21 10:11:10,229 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
+ 2023-12-21 10:11:11,098 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5812, 3.9835, 4.1119, 4.1392], device='cuda:0')
+ 2023-12-21 10:11:13,356 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
+ 2023-12-21 10:11:16,640 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
+ 2023-12-21 10:11:16,927 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
+ 2023-12-21 10:11:18,559 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4682307600852252
+ 2023-12-21 10:11:18,560 INFO [inference_audio_tagging.py:456] Done
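Note: the mAP reported at the end of each of these logs is a macro-average of per-class average precision over the 527 AudioSet event classes. A minimal sketch of how such a number can be recomputed from the collected logits and multi-hot targets is given below; it is illustrative only (it assumes scikit-learn is available, and compute_map is a hypothetical helper, not the repository's API).

import numpy as np
from sklearn.metrics import average_precision_score

def compute_map(logits: np.ndarray, targets: np.ndarray) -> float:
    # logits: (num_cuts, 527) raw scores; targets: (num_cuts, 527) multi-hot labels.
    aps = []
    for c in range(targets.shape[1]):
        if targets[:, c].sum() == 0:
            # A class with no positives in the eval set has undefined AP; skip it.
            continue
        aps.append(average_precision_score(targets[:, c], logits[:, c]))
    # Macro-average over the classes that have at least one positive example.
    return float(np.mean(aps))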
exp_md1000/inference_audio_tagging/log-decode-epoch-26-avg-2-use-averaged-model-2023-12-21-10-08-04 ADDED
@@ -0,0 +1,54 @@
+ 2023-12-21 10:08:04,487 INFO [inference_audio_tagging.py:316] Evaluation started
+ 2023-12-21 10:08:04,487 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 26, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-26-avg-2-use-averaged-model'}
+ 2023-12-21 10:08:04,487 INFO [inference_audio_tagging.py:324] About to create model
+ 2023-12-21 10:08:04,842 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 24 (excluded) to 26
+ 2023-12-21 10:08:11,670 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
+ 2023-12-21 10:08:11,670 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
+ 2023-12-21 10:08:11,712 INFO [kd_datamodule.py:534] About to create dev dataset
+ 2023-12-21 10:08:12,036 INFO [kd_datamodule.py:555] About to create dev dataloader
+ 2023-12-21 10:08:18,223 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
+ 2023-12-21 10:08:19,224 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7747, 3.1252, 2.8471, 3.2428, 3.2816, 2.7313, 2.5164, 2.7262],
+ device='cuda:0')
+ 2023-12-21 10:08:21,646 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
+ 2023-12-21 10:08:23,185 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1466, 3.6017, 3.7274, 3.4732], device='cuda:0')
+ 2023-12-21 10:08:24,723 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
+ 2023-12-21 10:08:25,906 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5412, 3.8553, 4.0265, 3.8036], device='cuda:0')
+ 2023-12-21 10:08:26,986 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.7513, 4.7607, 4.9647, 4.8572], device='cuda:0')
+ 2023-12-21 10:08:27,875 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
+ 2023-12-21 10:08:31,066 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
+ 2023-12-21 10:08:34,389 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
+ 2023-12-21 10:08:34,661 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1225, 3.2481, 3.7342, 3.0375], device='cuda:0')
+ 2023-12-21 10:08:37,535 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
+ 2023-12-21 10:08:40,662 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
+ 2023-12-21 10:08:41,108 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2563, 3.6291, 3.8419, 3.6304], device='cuda:0')
+ 2023-12-21 10:08:43,881 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
+ 2023-12-21 10:08:47,290 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
+ 2023-12-21 10:08:49,651 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6078, 2.6298, 2.8203, 3.1245], device='cuda:0')
+ 2023-12-21 10:08:50,364 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
+ 2023-12-21 10:08:52,860 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8480, 4.8740, 5.0857, 5.0089], device='cuda:0')
+ 2023-12-21 10:08:53,664 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
+ 2023-12-21 10:08:56,869 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
+ 2023-12-21 10:09:00,048 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
+ 2023-12-21 10:09:03,216 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
+ 2023-12-21 10:09:06,491 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
+ 2023-12-21 10:09:09,595 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
+ 2023-12-21 10:09:12,388 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8729, 2.6922, 3.4986, 2.9640, 2.6369, 3.0005, 3.5436, 3.3496],
+ device='cuda:0')
+ 2023-12-21 10:09:12,802 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
+ 2023-12-21 10:09:15,931 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
+ 2023-12-21 10:09:18,982 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
+ 2023-12-21 10:09:22,368 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
+ 2023-12-21 10:09:25,543 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
+ 2023-12-21 10:09:28,634 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
+ 2023-12-21 10:09:30,594 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8398, 2.4404, 2.8863, 3.4149, 2.1525, 2.4322, 3.1655, 2.4614],
+ device='cuda:0')
+ 2023-12-21 10:09:31,157 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8333, 3.4095, 2.7902, 2.6259], device='cuda:0')
+ 2023-12-21 10:09:31,764 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
+ 2023-12-21 10:09:34,913 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
+ 2023-12-21 10:09:36,651 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0493, 5.8395, 6.0072, 6.0639], device='cuda:0')
+ 2023-12-21 10:09:38,091 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
+ 2023-12-21 10:09:38,347 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4995, 2.7430, 1.8274, 2.6845, 2.5754, 2.3457, 2.6209, 2.1624],
+ device='cuda:0')
+ 2023-12-21 10:09:38,448 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
+ 2023-12-21 10:09:39,915 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4689302397877799
+ 2023-12-21 10:09:39,915 INFO [inference_audio_tagging.py:456] Done
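Note: the only difference between the avg-1/avg-2/avg-3/avg-4 runs in these logs is how many trailing epoch checkpoints are combined before decoding, as reported by the "epoch range from N (excluded) to M" line. A rough sketch of plain state-dict averaging over such a range follows; it is illustrative only, since the script itself uses icefall's use_averaged_model machinery (which averages running model averages rather than raw checkpoints), and the "model" key layout is an assumption.

import torch

def average_epoch_checkpoints(paths):
    # paths: e.g. ["exp/epoch-25.pt", "exp/epoch-26.pt"] for an epoch-26, avg-2 style model.
    avg = None
    for p in paths:
        # Assumes each checkpoint stores its parameters under the "model" key.
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.detach().clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    # Uniform average of the selected checkpoints.
    return {k: v / len(paths) for k, v in avg.items()}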
exp_md1000/inference_audio_tagging/log-decode-epoch-26-avg-3-use-averaged-model-2023-12-21-10-06-27 ADDED
@@ -0,0 +1,48 @@
+ 2023-12-21 10:06:27,219 INFO [inference_audio_tagging.py:316] Evaluation started
+ 2023-12-21 10:06:27,220 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 26, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-26-avg-3-use-averaged-model'}
+ 2023-12-21 10:06:27,220 INFO [inference_audio_tagging.py:324] About to create model
+ 2023-12-21 10:06:27,587 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 23 (excluded) to 26
+ 2023-12-21 10:06:33,944 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
+ 2023-12-21 10:06:33,944 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
+ 2023-12-21 10:06:33,988 INFO [kd_datamodule.py:534] About to create dev dataset
+ 2023-12-21 10:06:34,314 INFO [kd_datamodule.py:555] About to create dev dataloader
+ 2023-12-21 10:06:38,669 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
+ 2023-12-21 10:06:42,034 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
+ 2023-12-21 10:06:42,734 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5608, 3.9583, 4.0339, 3.7556], device='cuda:0')
+ 2023-12-21 10:06:45,224 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
+ 2023-12-21 10:06:48,431 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
+ 2023-12-21 10:06:50,582 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5502, 3.8334, 4.0239, 3.8760], device='cuda:0')
+ 2023-12-21 10:06:51,585 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
+ 2023-12-21 10:06:55,080 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
+ 2023-12-21 10:06:57,323 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3048, 3.7751, 3.8590, 3.7009], device='cuda:0')
+ 2023-12-21 10:06:58,299 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
+ 2023-12-21 10:07:01,434 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
+ 2023-12-21 10:07:02,201 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7745, 2.9140, 2.0035, 2.8298, 2.6247, 2.8467, 2.7518, 3.0069],
+ device='cuda:0')
+ 2023-12-21 10:07:04,569 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
+ 2023-12-21 10:07:07,883 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
+ 2023-12-21 10:07:10,976 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
+ 2023-12-21 10:07:14,190 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
+ 2023-12-21 10:07:16,239 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0872, 4.7674, 4.9864, 4.1322], device='cuda:0')
+ 2023-12-21 10:07:17,446 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
+ 2023-12-21 10:07:20,589 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
+ 2023-12-21 10:07:23,762 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
+ 2023-12-21 10:07:26,814 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6632, 2.8210, 2.9482, 3.0180], device='cuda:0')
+ 2023-12-21 10:07:27,015 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
+ 2023-12-21 10:07:29,490 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5902, 4.0582, 3.5101, 4.0841], device='cuda:0')
+ 2023-12-21 10:07:30,166 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
+ 2023-12-21 10:07:30,382 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1128, 4.7023, 5.0088, 4.1128], device='cuda:0')
+ 2023-12-21 10:07:33,257 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
+ 2023-12-21 10:07:36,288 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
+ 2023-12-21 10:07:39,412 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
+ 2023-12-21 10:07:42,617 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6485, 2.6128, 2.9208, 2.9382], device='cuda:0')
+ 2023-12-21 10:07:42,643 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
+ 2023-12-21 10:07:45,796 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
+ 2023-12-21 10:07:47,556 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0978, 4.7686, 4.9777, 4.2449], device='cuda:0')
+ 2023-12-21 10:07:48,883 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
+ 2023-12-21 10:07:52,017 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
+ 2023-12-21 10:07:55,084 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
+ 2023-12-21 10:07:58,255 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
+ 2023-12-21 10:07:58,540 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
+ 2023-12-21 10:08:00,276 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46871671578548435
+ 2023-12-21 10:08:00,276 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-26-avg-4-use-averaged-model-2023-12-21-10-04-46 ADDED
@@ -0,0 +1,47 @@
+ 2023-12-21 10:04:46,570 INFO [inference_audio_tagging.py:316] Evaluation started
+ 2023-12-21 10:04:46,571 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 26, 'iter': 0, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-26-avg-4-use-averaged-model'}
+ 2023-12-21 10:04:46,571 INFO [inference_audio_tagging.py:324] About to create model
+ 2023-12-21 10:04:46,917 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 22 (excluded) to 26
+ 2023-12-21 10:04:56,907 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
+ 2023-12-21 10:04:56,908 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
+ 2023-12-21 10:04:56,951 INFO [kd_datamodule.py:534] About to create dev dataset
+ 2023-12-21 10:04:57,273 INFO [kd_datamodule.py:555] About to create dev dataloader
+ 2023-12-21 10:05:01,873 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
+ 2023-12-21 10:05:05,390 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
+ 2023-12-21 10:05:05,684 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9093, 5.3915, 5.2996, 5.0347], device='cuda:0')
+ 2023-12-21 10:05:08,488 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
+ 2023-12-21 10:05:11,614 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
+ 2023-12-21 10:05:12,462 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0845, 3.2483, 3.7607, 2.9707], device='cuda:0')
+ 2023-12-21 10:05:14,835 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
+ 2023-12-21 10:05:18,118 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
+ 2023-12-21 10:05:21,250 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
+ 2023-12-21 10:05:24,417 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
+ 2023-12-21 10:05:27,655 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
+ 2023-12-21 10:05:30,394 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9172, 5.3680, 5.2351, 5.0528], device='cuda:0')
+ 2023-12-21 10:05:31,037 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
+ 2023-12-21 10:05:31,780 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9756, 3.6607, 3.6230, 3.5251], device='cuda:0')
+ 2023-12-21 10:05:32,467 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8184, 4.8431, 5.0762, 4.9683], device='cuda:0')
+ 2023-12-21 10:05:34,155 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
+ 2023-12-21 10:05:35,451 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0612, 5.8611, 6.0046, 6.0729], device='cuda:0')
+ 2023-12-21 10:05:35,556 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5869, 2.6869, 2.7413, 3.0225], device='cuda:0')
+ 2023-12-21 10:05:37,351 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
+ 2023-12-21 10:05:40,465 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
+ 2023-12-21 10:05:43,539 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
+ 2023-12-21 10:05:46,733 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
+ 2023-12-21 10:05:49,928 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
+ 2023-12-21 10:05:52,943 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
+ 2023-12-21 10:05:56,096 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
+ 2023-12-21 10:05:59,192 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
+ 2023-12-21 10:06:02,270 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
+ 2023-12-21 10:06:05,524 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
+ 2023-12-21 10:06:08,566 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
+ 2023-12-21 10:06:10,252 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5743, 4.0152, 3.4849, 3.9127], device='cuda:0')
+ 2023-12-21 10:06:11,305 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8363, 4.8446, 5.0875, 4.9781], device='cuda:0')
+ 2023-12-21 10:06:11,683 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
+ 2023-12-21 10:06:11,945 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0735, 4.7463, 5.0115, 4.2239], device='cuda:0')
+ 2023-12-21 10:06:14,759 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
+ 2023-12-21 10:06:17,840 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
+ 2023-12-21 10:06:21,144 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
+ 2023-12-21 10:06:21,549 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
+ 2023-12-21 10:06:22,952 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46883498968842685
+ 2023-12-21 10:06:22,952 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-27-avg-1-use-averaged-model-2023-12-21-10-03-10 ADDED
@@ -0,0 +1,40 @@
+ 2023-12-21 10:03:10,150 INFO [inference_audio_tagging.py:316] Evaluation started
+ 2023-12-21 10:03:10,151 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 27, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-27-avg-1-use-averaged-model'}
+ 2023-12-21 10:03:10,151 INFO [inference_audio_tagging.py:324] About to create model
+ 2023-12-21 10:03:10,504 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 26 (excluded) to 27
+ 2023-12-21 10:03:16,501 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
+ 2023-12-21 10:03:16,501 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
+ 2023-12-21 10:03:16,549 INFO [kd_datamodule.py:534] About to create dev dataset
+ 2023-12-21 10:03:16,879 INFO [kd_datamodule.py:555] About to create dev dataloader
+ 2023-12-21 10:03:21,241 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
+ 2023-12-21 10:03:23,250 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.7653, 4.8229, 4.9813, 4.8833], device='cuda:0')
+ 2023-12-21 10:03:24,668 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
+ 2023-12-21 10:03:27,823 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
+ 2023-12-21 10:03:30,950 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
+ 2023-12-21 10:03:34,016 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
+ 2023-12-21 10:03:37,275 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
+ 2023-12-21 10:03:40,517 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
+ 2023-12-21 10:03:43,629 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
+ 2023-12-21 10:03:46,824 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
+ 2023-12-21 10:03:50,047 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
+ 2023-12-21 10:03:53,206 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
+ 2023-12-21 10:03:56,362 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
+ 2023-12-21 10:03:56,425 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9008, 5.3781, 5.2745, 5.0700], device='cuda:0')
+ 2023-12-21 10:03:59,403 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
+ 2023-12-21 10:04:02,471 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
+ 2023-12-21 10:04:05,472 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
+ 2023-12-21 10:04:08,734 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
+ 2023-12-21 10:04:11,887 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
+ 2023-12-21 10:04:14,892 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
+ 2023-12-21 10:04:18,038 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
+ 2023-12-21 10:04:21,176 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
+ 2023-12-21 10:04:24,334 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
+ 2023-12-21 10:04:25,879 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5413, 3.9883, 3.4874, 3.9503], device='cuda:0')
+ 2023-12-21 10:04:27,476 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
+ 2023-12-21 10:04:30,453 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
+ 2023-12-21 10:04:33,586 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
+ 2023-12-21 10:04:36,762 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
+ 2023-12-21 10:04:39,957 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
+ 2023-12-21 10:04:40,332 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
+ 2023-12-21 10:04:41,958 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46914914064529467
+ 2023-12-21 10:04:41,958 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-27-avg-2-use-averaged-model-2023-12-21-10-01-31 ADDED
@@ -0,0 +1,43 @@
+ 2023-12-21 10:01:31,463 INFO [inference_audio_tagging.py:316] Evaluation started
+ 2023-12-21 10:01:31,464 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 27, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-27-avg-2-use-averaged-model'}
+ 2023-12-21 10:01:31,464 INFO [inference_audio_tagging.py:324] About to create model
+ 2023-12-21 10:01:31,838 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 25 (excluded) to 27
+ 2023-12-21 10:01:37,586 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
+ 2023-12-21 10:01:37,586 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
+ 2023-12-21 10:01:37,626 INFO [kd_datamodule.py:534] About to create dev dataset
+ 2023-12-21 10:01:37,954 INFO [kd_datamodule.py:555] About to create dev dataloader
+ 2023-12-21 10:01:42,673 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
+ 2023-12-21 10:01:46,057 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
+ 2023-12-21 10:01:49,251 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
+ 2023-12-21 10:01:52,430 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
+ 2023-12-21 10:01:55,646 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
+ 2023-12-21 10:01:59,105 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
+ 2023-12-21 10:02:02,306 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
+ 2023-12-21 10:02:03,790 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6468, 3.3425, 3.4066, 3.0997], device='cuda:0')
+ 2023-12-21 10:02:05,560 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
+ 2023-12-21 10:02:08,816 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
+ 2023-12-21 10:02:11,215 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5423, 4.0246, 4.1581, 4.1566], device='cuda:0')
+ 2023-12-21 10:02:12,260 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
+ 2023-12-21 10:02:15,595 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
+ 2023-12-21 10:02:18,830 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
+ 2023-12-21 10:02:21,974 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
+ 2023-12-21 10:02:25,032 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
+ 2023-12-21 10:02:28,089 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
+ 2023-12-21 10:02:31,337 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
+ 2023-12-21 10:02:34,437 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
+ 2023-12-21 10:02:37,557 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
+ 2023-12-21 10:02:40,065 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7578, 3.0141, 3.0550, 3.2143, 3.2802, 2.7740, 2.5490, 2.4100],
+ device='cuda:0')
+ 2023-12-21 10:02:40,801 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
+ 2023-12-21 10:02:40,829 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9156, 5.3364, 5.2101, 5.1832], device='cuda:0')
+ 2023-12-21 10:02:43,984 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
+ 2023-12-21 10:02:47,303 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
+ 2023-12-21 10:02:50,485 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
+ 2023-12-21 10:02:51,543 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6188, 2.5122, 2.8825, 3.0200], device='cuda:0')
+ 2023-12-21 10:02:53,671 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
+ 2023-12-21 10:02:56,818 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
+ 2023-12-21 10:03:00,050 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
+ 2023-12-21 10:03:03,378 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
+ 2023-12-21 10:03:03,742 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
+ 2023-12-21 10:03:05,219 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4693110646579182
+ 2023-12-21 10:03:05,219 INFO [inference_audio_tagging.py:456] Done
exp_md1000/inference_audio_tagging/log-decode-epoch-27-avg-3-use-averaged-model-2023-12-21-09-59-54 ADDED
@@ -0,0 +1,51 @@
+ 2023-12-21 09:59:54,624 INFO [inference_audio_tagging.py:316] Evaluation started
+ 2023-12-21 09:59:54,624 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 27, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-27-avg-3-use-averaged-model'}
3
+ 2023-12-21 09:59:54,624 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-21 09:59:54,995 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 24 (excluded) to 27
5
+ 2023-12-21 10:00:01,170 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-21 10:00:01,170 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-21 10:00:01,217 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-21 10:00:01,661 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-21 10:00:06,018 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-21 10:00:09,311 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-21 10:00:12,500 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-21 10:00:13,347 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1546, 3.2562, 3.7439, 3.0531], device='cuda:0')
13
+ 2023-12-21 10:00:15,632 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
14
+ 2023-12-21 10:00:18,804 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
15
+ 2023-12-21 10:00:22,130 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-21 10:00:24,569 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0700, 5.8802, 6.0203, 6.0673], device='cuda:0')
17
+ 2023-12-21 10:00:24,623 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5722, 3.9754, 4.1106, 4.1309], device='cuda:0')
18
+ 2023-12-21 10:00:25,297 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-21 10:00:26,233 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1059, 4.7218, 5.0172, 4.1710], device='cuda:0')
20
+ 2023-12-21 10:00:28,518 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
21
+ 2023-12-21 10:00:31,612 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
22
+ 2023-12-21 10:00:34,933 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
23
+ 2023-12-21 10:00:36,325 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8979, 3.4746, 2.9439, 2.6445], device='cuda:0')
24
+ 2023-12-21 10:00:38,018 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-21 10:00:41,232 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
26
+ 2023-12-21 10:00:43,419 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8177, 4.8439, 5.0677, 4.9932], device='cuda:0')
27
+ 2023-12-21 10:00:43,968 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0929, 4.8215, 5.0335, 4.1925], device='cuda:0')
28
+ 2023-12-21 10:00:44,148 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
29
+ 2023-12-21 10:00:47,285 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
30
+ 2023-12-21 10:00:50,446 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
31
+ 2023-12-21 10:00:53,685 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
32
+ 2023-12-21 10:00:56,464 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8925, 5.3322, 5.2908, 4.9654], device='cuda:0')
33
+ 2023-12-21 10:00:56,846 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
34
+ 2023-12-21 10:00:59,995 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
35
+ 2023-12-21 10:01:03,145 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
36
+ 2023-12-21 10:01:04,741 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5964, 4.0765, 4.1647, 4.1931], device='cuda:0')
37
+ 2023-12-21 10:01:06,225 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
38
+ 2023-12-21 10:01:07,219 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9080, 5.3474, 5.1997, 5.0847], device='cuda:0')
39
+ 2023-12-21 10:01:09,485 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
40
+ 2023-12-21 10:01:10,565 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0743, 5.8716, 6.0213, 6.0708], device='cuda:0')
41
+ 2023-12-21 10:01:12,566 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
42
+ 2023-12-21 10:01:15,649 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
43
+ 2023-12-21 10:01:18,831 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
44
+ 2023-12-21 10:01:20,657 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7766, 2.9509, 2.9689, 3.2262, 3.2520, 2.7654, 2.5121, 2.6241],
45
+ device='cuda:0')
46
+ 2023-12-21 10:01:22,017 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
47
+ 2023-12-21 10:01:24,092 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1148, 3.2627, 3.8485, 2.9889], device='cuda:0')
48
+ 2023-12-21 10:01:25,181 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
49
+ 2023-12-21 10:01:25,527 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
50
+ 2023-12-21 10:01:27,102 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4691632671962012
51
+ 2023-12-21 10:01:27,102 INFO [inference_audio_tagging.py:456] Done
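For reference, the mAP reported just above (0.4692 on the AudioSet eval set) is, in the standard AudioSet protocol, the macro average of per-class average precision over the 527 event classes. A minimal sketch of that computation, assuming the collected logits and multi-hot targets are available as NumPy arrays (the variable and function names here are illustrative, not the ones used by inference_audio_tagging.py):

import numpy as np
from sklearn.metrics import average_precision_score

def audioset_map(logits: np.ndarray, targets: np.ndarray) -> float:
    # logits:  (num_cuts, num_events) raw scores, roughly (15000+, 527) for this eval run
    # targets: (num_cuts, num_events) multi-hot ground-truth labels
    scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid; AP is rank-based, so this step is optional
    aps = []
    for c in range(targets.shape[1]):
        if targets[:, c].sum() == 0:         # skip classes with no positives in the eval set
            continue
        aps.append(average_precision_score(targets[:, c], scores[:, c]))
    return float(np.mean(aps))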
exp_md1000/inference_audio_tagging/log-decode-epoch-27-avg-4-use-averaged-model-2023-12-21-09-58-17 ADDED
@@ -0,0 +1,44 @@
1
+ 2023-12-21 09:58:17,626 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-21 09:58:17,626 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 27, 'iter': 0, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-27-avg-4-use-averaged-model'}
3
+ 2023-12-21 09:58:17,626 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-21 09:58:18,005 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 23 (excluded) to 27
5
+ 2023-12-21 09:58:24,304 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-21 09:58:24,304 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-21 09:58:24,375 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-21 09:58:24,773 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-21 09:58:29,135 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-21 09:58:30,798 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6543, 2.7635, 2.8834, 3.0904], device='cuda:0')
11
+ 2023-12-21 09:58:31,228 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5860, 4.0347, 3.5323, 3.9696], device='cuda:0')
12
+ 2023-12-21 09:58:32,434 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
13
+ 2023-12-21 09:58:35,564 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
14
+ 2023-12-21 09:58:36,191 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6483, 2.9546, 2.7025, 2.2927], device='cuda:0')
15
+ 2023-12-21 09:58:38,712 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
16
+ 2023-12-21 09:58:41,936 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
17
+ 2023-12-21 09:58:45,330 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
18
+ 2023-12-21 09:58:48,553 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-21 09:58:51,736 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-21 09:58:54,844 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
21
+ 2023-12-21 09:58:58,135 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
22
+ 2023-12-21 09:59:01,270 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
23
+ 2023-12-21 09:59:04,208 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
24
+ 2023-12-21 09:59:07,271 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
25
+ 2023-12-21 09:59:10,456 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
26
+ 2023-12-21 09:59:13,470 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
27
+ 2023-12-21 09:59:16,703 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
28
+ 2023-12-21 09:59:19,851 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
29
+ 2023-12-21 09:59:21,921 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9018, 5.3499, 5.2723, 5.1631], device='cuda:0')
30
+ 2023-12-21 09:59:22,946 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
31
+ 2023-12-21 09:59:25,987 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
32
+ 2023-12-21 09:59:29,190 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
33
+ 2023-12-21 09:59:32,122 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0032, 3.7156, 3.6279, 3.3448], device='cuda:0')
34
+ 2023-12-21 09:59:32,369 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
35
+ 2023-12-21 09:59:35,524 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
36
+ 2023-12-21 09:59:38,592 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
37
+ 2023-12-21 09:59:41,250 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6864, 3.2005, 2.8777, 2.3494], device='cuda:0')
38
+ 2023-12-21 09:59:41,795 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
39
+ 2023-12-21 09:59:42,107 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3005, 3.8172, 3.7213, 3.7009], device='cuda:0')
40
+ 2023-12-21 09:59:44,890 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
41
+ 2023-12-21 09:59:48,181 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
42
+ 2023-12-21 09:59:48,529 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
43
+ 2023-12-21 09:59:49,950 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4690633200378112
44
+ 2023-12-21 09:59:49,951 INFO [inference_audio_tagging.py:456] Done
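The "Calculating the averaged model over epoch range from 23 (excluded) to 27" step above corresponds to --epoch 27 --avg 4: the weights used for decoding are an average over the last four epochs. As a rough illustration only (with --use-averaged-model, icefall actually interpolates the running model_avg tensors stored in the checkpoints, so the real computation differs), a plain uniform average of epoch checkpoints can be sketched as follows; the paths and the "model" key are assumptions:

import torch

def average_checkpoints(ckpt_paths):
    # Uniformly average model parameters from several epoch checkpoints.
    avg = None
    for path in ckpt_paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(ckpt_paths) for k, v in avg.items()}

# e.g. epochs 24..27 for epoch=27, avg=4 ("from 23 (excluded) to 27"):
# averaged = average_checkpoints([f"exp/epoch-{e}.pt" for e in range(24, 28)])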
exp_md1000/inference_audio_tagging/log-decode-epoch-28-avg-1-use-averaged-model-2023-12-21-09-56-41 ADDED
@@ -0,0 +1,50 @@
1
+ 2023-12-21 09:56:41,107 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-21 09:56:41,107 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 28, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-28-avg-1-use-averaged-model'}
3
+ 2023-12-21 09:56:41,108 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-21 09:56:41,481 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 27 (excluded) to 28
5
+ 2023-12-21 09:56:47,339 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-21 09:56:47,339 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-21 09:56:47,385 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-21 09:56:47,797 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-21 09:56:51,805 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-21 09:56:55,255 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-21 09:56:58,375 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-21 09:57:01,580 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-21 09:57:04,709 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-21 09:57:06,374 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3329, 3.6721, 3.8099, 3.7312], device='cuda:0')
15
+ 2023-12-21 09:57:08,110 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
16
+ 2023-12-21 09:57:11,340 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
17
+ 2023-12-21 09:57:12,207 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6721, 2.9349, 2.9971, 3.1657, 3.3171, 2.6957, 2.5383, 2.4756],
18
+ device='cuda:0')
19
+ 2023-12-21 09:57:14,543 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
20
+ 2023-12-21 09:57:17,746 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
21
+ 2023-12-21 09:57:20,968 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
22
+ 2023-12-21 09:57:23,129 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6126, 4.0663, 3.5653, 4.0274], device='cuda:0')
23
+ 2023-12-21 09:57:24,177 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
24
+ 2023-12-21 09:57:26,666 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3179, 3.7019, 3.8489, 3.7341], device='cuda:0')
25
+ 2023-12-21 09:57:27,098 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
26
+ 2023-12-21 09:57:28,800 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0083, 3.5622, 3.4465, 3.3693], device='cuda:0')
27
+ 2023-12-21 09:57:30,236 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
28
+ 2023-12-21 09:57:30,910 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.1000, 4.8354, 4.9783, 4.3297], device='cuda:0')
29
+ 2023-12-21 09:57:33,330 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
30
+ 2023-12-21 09:57:36,456 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
31
+ 2023-12-21 09:57:39,655 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
32
+ 2023-12-21 09:57:42,792 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
33
+ 2023-12-21 09:57:45,903 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
34
+ 2023-12-21 09:57:47,450 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5514, 3.8452, 4.1110, 3.8486], device='cuda:0')
35
+ 2023-12-21 09:57:47,971 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5494, 3.8355, 4.0443, 3.7021], device='cuda:0')
36
+ 2023-12-21 09:57:49,001 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
37
+ 2023-12-21 09:57:52,085 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
38
+ 2023-12-21 09:57:54,728 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7289, 2.9894, 2.9983, 3.1778, 3.2674, 2.7101, 2.5714, 2.5275],
39
+ device='cuda:0')
40
+ 2023-12-21 09:57:55,449 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
41
+ 2023-12-21 09:57:58,614 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
42
+ 2023-12-21 09:58:00,560 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0713, 5.9004, 6.0332, 6.0730], device='cuda:0')
43
+ 2023-12-21 09:58:01,756 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
44
+ 2023-12-21 09:58:04,622 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8426, 4.8609, 5.0520, 4.9662], device='cuda:0')
45
+ 2023-12-21 09:58:04,893 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
46
+ 2023-12-21 09:58:08,062 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
47
+ 2023-12-21 09:58:11,270 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
48
+ 2023-12-21 09:58:11,594 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
49
+ 2023-12-21 09:58:13,061 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4681237133863272
50
+ 2023-12-21 09:58:13,061 INFO [inference_audio_tagging.py:456] Done
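The zipformer.py:1877 "attn_weights_entropy" lines interleaved with the progress messages are periodic diagnostics on the self-attention distributions (one value per head; higher entropy means flatter, less peaked attention). A small sketch of how such a statistic can be computed, assuming an attention-weight tensor of shape (num_heads, num_queries, num_keys) whose rows sum to 1 over the key axis:

import torch

def attention_entropy(attn_weights: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
    # Entropy (in nats) of each query's attention distribution, averaged per head.
    # Returns a tensor of shape (num_heads,), comparable to the values logged above.
    ent = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)  # (heads, queries)
    return ent.mean(dim=-1)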
exp_md1000/inference_audio_tagging/log-decode-epoch-28-avg-2-use-averaged-model-2023-12-21-09-55-03 ADDED
@@ -0,0 +1,43 @@
1
+ 2023-12-21 09:55:03,031 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-21 09:55:03,031 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 28, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-28-avg-2-use-averaged-model'}
3
+ 2023-12-21 09:55:03,032 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-21 09:55:03,414 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 26 (excluded) to 28
5
+ 2023-12-21 09:55:09,779 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-21 09:55:09,779 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-21 09:55:09,820 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-21 09:55:10,145 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-21 09:55:15,225 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-21 09:55:18,588 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
11
+ 2023-12-21 09:55:21,838 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
12
+ 2023-12-21 09:55:24,938 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
13
+ 2023-12-21 09:55:28,142 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
14
+ 2023-12-21 09:55:31,558 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
15
+ 2023-12-21 09:55:34,724 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
16
+ 2023-12-21 09:55:37,898 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
17
+ 2023-12-21 09:55:41,055 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
18
+ 2023-12-21 09:55:44,334 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
19
+ 2023-12-21 09:55:47,506 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
20
+ 2023-12-21 09:55:50,506 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
21
+ 2023-12-21 09:55:53,558 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
22
+ 2023-12-21 09:55:54,408 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5271, 3.7728, 4.0196, 3.7916], device='cuda:0')
23
+ 2023-12-21 09:55:56,659 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
24
+ 2023-12-21 09:55:59,804 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
25
+ 2023-12-21 09:56:03,098 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
26
+ 2023-12-21 09:56:06,366 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
27
+ 2023-12-21 09:56:09,415 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
28
+ 2023-12-21 09:56:10,216 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1280, 3.2450, 3.7466, 2.9804], device='cuda:0')
29
+ 2023-12-21 09:56:12,638 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
30
+ 2023-12-21 09:56:14,326 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([6.0492, 5.8010, 5.9952, 6.0608], device='cuda:0')
31
+ 2023-12-21 09:56:15,710 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
32
+ 2023-12-21 09:56:18,200 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2766, 3.6669, 3.9188, 3.5699], device='cuda:0')
33
+ 2023-12-21 09:56:19,033 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
34
+ 2023-12-21 09:56:22,229 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
35
+ 2023-12-21 09:56:25,380 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
36
+ 2023-12-21 09:56:26,234 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9108, 2.7723, 3.5391, 3.0054, 2.6941, 3.1459, 3.5539, 3.4188],
37
+ device='cuda:0')
38
+ 2023-12-21 09:56:28,480 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
39
+ 2023-12-21 09:56:31,534 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
40
+ 2023-12-21 09:56:34,747 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
41
+ 2023-12-21 09:56:35,088 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
42
+ 2023-12-21 09:56:36,529 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4694101768647912
43
+ 2023-12-21 09:56:36,529 INFO [inference_audio_tagging.py:456] Done
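The "Processed N cuts already" progress lines and the final "Finish collecting audio logits" message reflect a batched pass over the AudioSet eval dataloader that gathers one 527-dimensional logit vector per cut. A simplified sketch of such a loop, assuming a lhotse-style batch layout and a hypothetical forward_audio_tagging method (these names are illustrative, not the script's exact API):

import logging
import torch

@torch.no_grad()
def collect_audio_logits(model, dataloader, device="cuda"):
    model.eval()
    all_logits, num_cuts = [], 0
    for batch_idx, batch in enumerate(dataloader):
        feats = batch["inputs"].to(device)                      # (B, T, 80) fbank features
        feat_lens = batch["supervisions"]["num_frames"].to(device)
        logits = model.forward_audio_tagging(feats, feat_lens)  # (B, 527), assumed interface
        all_logits.append(logits.cpu())
        num_cuts += feats.size(0)
        if batch_idx % 20 == 0:
            logging.info(f"Processed {num_cuts} cuts already.")
    logging.info("Finish collecting audio logits")
    return torch.cat(all_logits)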
exp_md1000/inference_audio_tagging/log-decode-epoch-28-avg-3-use-averaged-model-2023-12-21-09-53-24 ADDED
@@ -0,0 +1,48 @@
1
+ 2023-12-21 09:53:24,600 INFO [inference_audio_tagging.py:316] Evaluation started
2
+ 2023-12-21 09:53:24,601 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD', 'icefall-git-sha1': 'a77761c2-dirty', 'icefall-git-date': 'Tue Nov 28 15:54:58 2023', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 28, 'iter': 0, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 768, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'full_libri': True, 'mini_libri': False, 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_audioset': False, 'audioset_subset': 'balanced', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': False, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 
'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_KD_960h_5fold+as_unbalanced+vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10_use_whisper_large-v3_1_scale_1.0_specaug0_musan0_with_task_ID_stop_earyl1_fixed_share_960h/inference_audio_tagging'), 'suffix': 'epoch-28-avg-3-use-averaged-model'}
3
+ 2023-12-21 09:53:24,601 INFO [inference_audio_tagging.py:324] About to create model
4
+ 2023-12-21 09:53:25,025 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 25 (excluded) to 28
5
+ 2023-12-21 09:53:31,667 INFO [inference_audio_tagging.py:421] Number of model parameters: 64264454
6
+ 2023-12-21 09:53:31,668 INFO [kd_datamodule.py:840] About to get the audioset eval cuts.
7
+ 2023-12-21 09:53:31,719 INFO [kd_datamodule.py:534] About to create dev dataset
8
+ 2023-12-21 09:53:32,041 INFO [kd_datamodule.py:555] About to create dev dataloader
9
+ 2023-12-21 09:53:36,902 INFO [inference_audio_tagging.py:289] Processed 60 cuts already.
10
+ 2023-12-21 09:53:39,635 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3811, 3.8806, 3.9729, 4.0154], device='cuda:0')
11
+ 2023-12-21 09:53:40,260 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6018, 2.7627, 2.8067, 3.0557, 3.2197, 2.5979, 2.4210, 2.1756],
12
+ device='cuda:0')
13
+ 2023-12-21 09:53:40,329 INFO [inference_audio_tagging.py:289] Processed 660 cuts already.
14
+ 2023-12-21 09:53:43,636 INFO [inference_audio_tagging.py:289] Processed 1260 cuts already.
15
+ 2023-12-21 09:53:46,784 INFO [inference_audio_tagging.py:289] Processed 1860 cuts already.
16
+ 2023-12-21 09:53:49,801 INFO [inference_audio_tagging.py:289] Processed 2460 cuts already.
17
+ 2023-12-21 09:53:53,194 INFO [inference_audio_tagging.py:289] Processed 3060 cuts already.
18
+ 2023-12-21 09:53:56,350 INFO [inference_audio_tagging.py:289] Processed 3660 cuts already.
19
+ 2023-12-21 09:53:58,193 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9162, 5.3229, 5.2482, 5.1000], device='cuda:0')
20
+ 2023-12-21 09:53:59,368 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0134, 3.5996, 3.5300, 3.2870], device='cuda:0')
21
+ 2023-12-21 09:53:59,565 INFO [inference_audio_tagging.py:289] Processed 4260 cuts already.
22
+ 2023-12-21 09:54:02,814 INFO [inference_audio_tagging.py:289] Processed 4860 cuts already.
23
+ 2023-12-21 09:54:06,093 INFO [inference_audio_tagging.py:289] Processed 5460 cuts already.
24
+ 2023-12-21 09:54:09,285 INFO [inference_audio_tagging.py:289] Processed 6060 cuts already.
25
+ 2023-12-21 09:54:12,324 INFO [inference_audio_tagging.py:289] Processed 6660 cuts already.
26
+ 2023-12-21 09:54:15,457 INFO [inference_audio_tagging.py:289] Processed 7260 cuts already.
27
+ 2023-12-21 09:54:18,097 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7259, 2.7775, 2.9434, 3.2222], device='cuda:0')
28
+ 2023-12-21 09:54:18,651 INFO [inference_audio_tagging.py:289] Processed 7860 cuts already.
29
+ 2023-12-21 09:54:21,741 INFO [inference_audio_tagging.py:289] Processed 8460 cuts already.
30
+ 2023-12-21 09:54:25,022 INFO [inference_audio_tagging.py:289] Processed 9060 cuts already.
31
+ 2023-12-21 09:54:28,214 INFO [inference_audio_tagging.py:289] Processed 9660 cuts already.
32
+ 2023-12-21 09:54:31,522 INFO [inference_audio_tagging.py:289] Processed 10260 cuts already.
33
+ 2023-12-21 09:54:33,164 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5012, 2.3977, 2.7764, 3.0070], device='cuda:0')
34
+ 2023-12-21 09:54:34,591 INFO [inference_audio_tagging.py:289] Processed 10860 cuts already.
35
+ 2023-12-21 09:54:37,729 INFO [inference_audio_tagging.py:289] Processed 11460 cuts already.
36
+ 2023-12-21 09:54:41,017 INFO [inference_audio_tagging.py:289] Processed 12060 cuts already.
37
+ 2023-12-21 09:54:41,675 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5633, 2.5668, 2.9754, 3.0932], device='cuda:0')
38
+ 2023-12-21 09:54:44,177 INFO [inference_audio_tagging.py:289] Processed 12660 cuts already.
39
+ 2023-12-21 09:54:45,091 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0978, 4.8442, 5.0076, 4.3171], device='cuda:0')
40
+ 2023-12-21 09:54:47,361 INFO [inference_audio_tagging.py:289] Processed 13260 cuts already.
41
+ 2023-12-21 09:54:49,574 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9288, 5.4167, 5.3015, 5.1472], device='cuda:0')
42
+ 2023-12-21 09:54:50,481 INFO [inference_audio_tagging.py:289] Processed 13860 cuts already.
43
+ 2023-12-21 09:54:53,638 INFO [inference_audio_tagging.py:289] Processed 14460 cuts already.
44
+ 2023-12-21 09:54:55,301 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0729, 3.8177, 3.6418, 3.5116], device='cuda:0')
45
+ 2023-12-21 09:54:56,921 INFO [inference_audio_tagging.py:289] Processed 15060 cuts already.
46
+ 2023-12-21 09:54:57,299 INFO [inference_audio_tagging.py:290] Finish collecting audio logits
47
+ 2023-12-21 09:54:58,709 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.46956851326074556
48
+ 2023-12-21 09:54:58,709 INFO [inference_audio_tagging.py:456] Done