# It has to be run before loading the model with
# AutoModelForSeq2SeqLM.from_pretrained(model_name); otherwise the model will
# first be loaded normally and only partitioned at forward time, which is less
# efficient and may fail when there is little CPU RAM.
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# Now a model can be loaded.