|
Setting ds_accelerator to cuda (auto detect) |
|
Unable to find hostfile, will proceed with training with local resources only. |
|
Detected CUDA_VISIBLE_DEVICES=0,1,2,3 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed. |
|
cmd = /data/jiongxiao_wang/anaconda3/envs/safe-rlhf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=56337 --module --enable_each_rank_log=None safe_rlhf.finetune --train_datasets alpaca --model_name_or_path huggyllama/llama-7b --max_length 512 --trust_remote_code True --epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 16 --gradient_checkpointing --learning_rate 2e-5 --lr_scheduler_type cosine --lr_warmup_ratio 0.03 --weight_decay 0.0 --seed 42 --output_dir /data/jiongxiao_wang/rlhf_attack/safe-rlhf/output/sft --log_type wandb --log_project Safe-RLHF-SFT --zero_stage 3 --bf16 True --tf32 True |
|
Setting ds_accelerator to cuda (auto detect) |
|
WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} |
|
nnodes=1, num_local_procs=4, node_rank=0 |
|
global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]}) |
|
dist_world_size=4 |
|
Setting CUDA_VISIBLE_DEVICES=0,1,2,3 |
|
Setting ds_accelerator to cuda (auto detect) |
|
Setting ds_accelerator to cuda (auto detect) |
|
Setting ds_accelerator to cuda (auto detect) |
|
Setting ds_accelerator to cuda (auto detect) |
|
cdb=None |
|
cdb=None |
|
Initializing TorchBackend in DeepSpeed with backend nccl |
|
cdb=None |
|
cdb=None |
|
Set logger level to WARNING. |
|
ninja: no work to do. |
|
Time to load fused_adam op: 0.14865803718566895 seconds |
|
Time to load fused_adam op: 0.2057504653930664 seconds |
|
Time to load fused_adam op: 0.20213913917541504 seconds |
|
Time to load fused_adam op: 0.2022261619567871 seconds |
|
Parameter Offload: Total persistent parameters: 266240 in 65 params |
|
***** Running training ***** |
|
Saving model to "/data/jiongxiao_wang/rlhf_attack/safe-rlhf/output/sft" ... |
|
Saving DeepSpeed Checkpoints... |
|
Converting DeepSpeed Checkpoints to Hugging Face format... |
|
Setting ds_accelerator to cuda (auto detect) |
|
Processing zero checkpoint './global_step609' |
|
Detected checkpoint of type zero stage 3, world_size: 4 |
|
Parsing checkpoint created by deepspeed==0.12.6 |
|
Reconstructed Trainable fp32 state dict with 291 params 6738423808 elements |
|
Saving fp32 state dict to pytorch_model.bin |
|
Model saved! |
|
Process 189883 exits successfully. |
|
Process 189885 exits successfully. |
|
Process 189884 exits successfully. |
|
Process 189882 exits successfully. |
|
|