For example, choosing fp32 adds a small amount of overhead, but it ensures the reduction operation is accumulated in fp32. Once the result is ready, it is downcast to whichever half-precision dtype you're training in.
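To see why the accumulation dtype matters, here is a minimal sketch that simulates a sum reduction in pure Python. It is not the code path of any training framework: the helper names (`to_fp16`, `to_fp32`, `reduce_sum`) and the gradient values are hypothetical, chosen so that the rounding error is easy to follow. Half and single precision are emulated with the standard library `struct` module.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def reduce_sum(values, accumulate_cast):
    """Sum values, rounding the running total to the accumulation dtype after each add."""
    total = 0.0
    for v in values:
        total = accumulate_cast(total + v)
    return total


# Hypothetical per-rank gradient values for one parameter, already stored in fp16.
grads = [to_fp16(g) for g in [2048.0] + [0.5] * 8]

# Accumulate directly in fp16: near 2048 the spacing between fp16 values is 2,
# so every +0.5 is rounded away.
fp16_result = reduce_sum(grads, to_fp16)

# Accumulate in fp32, then downcast the finished sum to fp16 once.
fp32_result = to_fp16(reduce_sum(grads, to_fp32))

print(fp16_result)  # 2048.0
print(fp32_result)  # 2052.0
```

The exact sum is 2052. Accumulating in fp16 loses every small contribution, while accumulating in fp32 and downcasting once at the end preserves them, at the cost of moving and adding wider values during the reduction.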