For example, if your checkpoint folder looked like this:
```bash
$ ls -l output_dir/checkpoint-1/
-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
```
then, to reconstruct the fp32 weights from the DeepSpeed (ZeRO-2 or ZeRO-3) checkpoint subfolder `global_step1`, run the following command; it consolidates the sharded weights from multiple GPUs into a single `pytorch_model.bin` file.
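The exact arguments can vary between DeepSpeed versions, but a typical invocation of the bundled script, run from inside the checkpoint folder, looks like the sketch below (the script uses the `latest` file to locate the `global_step1` subfolder; check `python zero_to_fp32.py --help` for the options your copy supports):

```bash
cd output_dir/checkpoint-1
# Consolidate the sharded ZeRO checkpoint under global_step1/ (located via
# the `latest` file) into a single fp32 state_dict saved as pytorch_model.bin.
python zero_to_fp32.py . pytorch_model.bin
```

The resulting `pytorch_model.bin` is a regular PyTorch state dict, so it can be loaded without DeepSpeed installed.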