ZeRO works in several stages: | |
ZeRO-1, optimizer state partioning across GPUs | |
ZeRO-2, gradient partitioning across GPUs | |
ZeRO-3, parameteter partitioning across GPUs | |
In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. |