By sharding the model parameters, gradients, and optimizer states, and even offloading them to the CPU when they're inactive, FSDP can reduce the high cost of large-scale training. If you're interested in learning more, the following may be helpful (a minimal sketch of an FSDP setup follows the list):
- Follow along with the more in-depth Accelerate guide for FSDP.
- Read the Introducing PyTorch Fully Sharded Data Parallel (FSDP) API blog post.
- Read the Scaling PyTorch models on Cloud TPUs with FSDP blog post.
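As a rough illustration of what the guides above cover in depth, the snippet below is a minimal sketch of wrapping a model with PyTorch's FSDP API using full sharding and CPU offload. It assumes a multi-GPU launch with `torchrun` (so the distributed environment variables are set) and uses a small placeholder model; your actual model, sharding policy, and precision settings will differ.

```python
# Minimal FSDP sketch: full sharding plus CPU offload of parameters.
# Assumes launch via `torchrun --nproc_per_node=N script.py` so that
# RANK/WORLD_SIZE/LOCAL_RANK are set; the model below is a placeholder.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model; substitute your own nn.Module.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# FULL_SHARD shards parameters, gradients, and optimizer states across ranks;
# CPUOffload(offload_params=True) keeps the sharded parameters (and their
# gradients) on CPU when they are not needed for computation.
fsdp_model = FSDP(
    model,
    device_id=local_rank,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
)

# Build the optimizer after wrapping so it sees the flattened, sharded parameters.
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

# A training step looks the same as without FSDP.
inputs = torch.randn(8, 1024, device=local_rank)
loss = fsdp_model(inputs).sum()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```

In practice you would rarely call the FSDP constructor by hand like this; the Accelerate guide linked above shows how to express the same sharding and offload choices in a config file and let Accelerate build the wrapper for you.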