it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)

otherwise the model will first be loaded normally and only partitioned at forward time which is
less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
now a model can be loaded.