By sharding the model parameters, gradients, and optimizer states across devices, and even offloading them to the CPU when they're inactive, FSDP can reduce the high memory cost of large-scale training. If you're interested in learning more, the following may be helpful:
- Follow along with the more in-depth Accelerate guide for FSDP.
- Read the Introducing PyTorch Fully Sharded Data Parallel (FSDP) API blog post.
- Read the Scaling PyTorch models on Cloud TPUs with FSDP blog post.
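
As a rough sketch of what the sharding and CPU offloading described above look like at the PyTorch level, the snippet below wraps a small placeholder model with FSDP using full sharding and parameter offloading. The model, learning rate, and launch setup (one process per GPU via `torchrun`) are illustrative assumptions, not part of the guides listed here.

```py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import CPUOffload, ShardingStrategy

# Assumes the script is launched with torchrun so each process owns one GPU.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in for a large transformer model.
model = torch.nn.Linear(1024, 1024).cuda()

# FULL_SHARD shards parameters, gradients, and optimizer states across ranks;
# CPUOffload moves the sharded parameters to CPU while they are not in use.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
)

# The optimizer must be created after wrapping so it sees the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```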