By sharding the model parameters, gradients, and optimizer states, and even offloading them to the CPU when they're inactive, FSDP can reduce the high cost of large-scale training. If you're interested in learning more, the following may be helpful (a minimal sketch of an FSDP setup follows the list):
- Follow along with the more in-depth Accelerate guide for FSDP.
- Read the Introducing PyTorch Fully Sharded Data Parallel (FSDP) API blog post.
- Read the Scaling PyTorch models on Cloud TPUs with FSDP blog post.
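As a rough illustration of what the guides above cover in depth, the snippet below is a minimal sketch of wrapping a model with PyTorch's FSDP API using full sharding and CPU offload. It assumes a multi-GPU launch with `torchrun` (so the distributed environment variables are set) and uses a small placeholder model; your actual model, sharding policy, and precision settings will differ.

```python
# Minimal FSDP sketch: full sharding plus CPU offload of parameters.
# Assumes launch via `torchrun --nproc_per_node=N script.py` so that
# RANK/WORLD_SIZE/LOCAL_RANK are set; the model below is a placeholder.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model; substitute your own nn.Module.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# FULL_SHARD shards parameters, gradients, and optimizer states across ranks;
# CPUOffload(offload_params=True) keeps the sharded parameters (and their
# gradients) on CPU when they are not needed for computation.
fsdp_model = FSDP(
    model,
    device_id=local_rank,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
)

# Build the optimizer after wrapping so it sees the flattened, sharded parameters.
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

# A training step looks the same as without FSDP.
inputs = torch.randn(8, 1024, device=local_rank)
loss = fsdp_model(inputs).sum()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```

In practice you would rarely call the FSDP constructor by hand like this; the Accelerate guide linked above shows how to express the same sharding and offload choices in a config file and let Accelerate build the wrapper for you.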