Unable to run the model in VLLM: KeyError: 'layers.14.mlp.gate.qweight'

#1
by fredericodeveloper - opened

I’d like to begin by expressing my sincere appreciation for taking the time to quantize this model to AWQ, and for reading this message.
When I attempt to run the model with vllm serve stelterlab/Qwen3-Coder-30B-A3B-Instruct-AWQ, the process exits with errors. Please find the full log below:

INFO 07-31 15:14:07 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-31 15:14:08 [gpu_model_runner.py:1921] Starting to load model stelterlab/Qwen3-Coder-30B-A3B-Instruct-AWQ...
INFO 07-31 15:14:08 [gpu_model_runner.py:1953] Loading model from scratch...
INFO 07-31 15:14:08 [cuda.py:305] Using Flash Attention backend on V1 engine.
INFO 07-31 15:14:08 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
ERROR 07-31 15:14:09 [core.py:667] EngineCore failed to start.
ERROR 07-31 15:14:09 [core.py:667] Traceback (most recent call last):
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 658, in run_engine_core
ERROR 07-31 15:14:09 [core.py:667]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-31 15:14:09 [core.py:667]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 474, in __init__
ERROR 07-31 15:14:09 [core.py:667]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 79, in __init__
ERROR 07-31 15:14:09 [core.py:667]     self.model_executor = executor_class(vllm_config)
ERROR 07-31 15:14:09 [core.py:667]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-31 15:14:09 [core.py:667]     self._init_executor()
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
ERROR 07-31 15:14:09 [core.py:667]     self.collective_rpc("load_model")
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 07-31 15:14:09 [core.py:667]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-31 15:14:09 [core.py:667]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2987, in run_method
ERROR 07-31 15:14:09 [core.py:667]     return func(*args, **kwargs)
ERROR 07-31 15:14:09 [core.py:667]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 212, in load_model
ERROR 07-31 15:14:09 [core.py:667]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1954, in load_model
ERROR 07-31 15:14:09 [core.py:667]     self.model = model_loader.load_model(
ERROR 07-31 15:14:09 [core.py:667]                  ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
ERROR 07-31 15:14:09 [core.py:667]     self.load_weights(model, model_config)
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 259, in load_weights
ERROR 07-31 15:14:09 [core.py:667]     loaded_weights = model.load_weights(
ERROR 07-31 15:14:09 [core.py:667]                      ^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 661, in load_weights
ERROR 07-31 15:14:09 [core.py:667]     return loader.load_weights(weights)
ERROR 07-31 15:14:09 [core.py:667]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
ERROR 07-31 15:14:09 [core.py:667]     autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 07-31 15:14:09 [core.py:667]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
ERROR 07-31 15:14:09 [core.py:667]     yield from self._load_module(prefix,
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
ERROR 07-31 15:14:09 [core.py:667]     loaded_params = module_load_weights(weights)
ERROR 07-31 15:14:09 [core.py:667]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-31 15:14:09 [core.py:667]   File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 534, in load_weights
ERROR 07-31 15:14:09 [core.py:667]     param = params_dict[name]
ERROR 07-31 15:14:09 [core.py:667]             ~~~~~~~~~~~^^^^^^
ERROR 07-31 15:14:09 [core.py:667] KeyError: 'layers.14.mlp.gate.qweight'
Process EngineCore_0:
Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 671, in run_engine_core
    raise e
  File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 658, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 474, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 79, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/root/.venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
    self.collective_rpc("load_model")
  File "/root/.venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2987, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 212, in load_model
    self.model_runner.load_model(eep_scale_up=eep_scale_up)
  File "/root/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1954, in load_model
    self.model = model_loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
    self.load_weights(model, model_config)
  File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 259, in load_weights
    loaded_weights = model.load_weights(
                     ^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 661, in load_weights
    return loader.load_weights(weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
    yield from self._load_module(prefix,
  File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
    loaded_params = module_load_weights(weights)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 534, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'layers.14.mlp.gate.qweight'
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]

[rank0]:[W731 15:14:10.198554999 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2253919) Traceback (most recent call last):
(APIServer pid=2253919)   File "/root/.venv/bin/vllm", line 10, in <module>
(APIServer pid=2253919)     sys.exit(main())
(APIServer pid=2253919)              ^^^^^^
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=2253919)     args.dispatch_function(args)
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 52, in cmd
(APIServer pid=2253919)     uvloop.run(run_server(args))
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=2253919)     return __asyncio.run(
(APIServer pid=2253919)            ^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2253919)     return runner.run(main)
(APIServer pid=2253919)            ^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2253919)     return self._loop.run_until_complete(task)
(APIServer pid=2253919)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=2253919)     return await main
(APIServer pid=2253919)            ^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1817, in run_server
(APIServer pid=2253919)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1837, in run_server_worker
(APIServer pid=2253919)     async with build_async_engine_client(
(APIServer pid=2253919)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2253919)     return await anext(self.gen)
(APIServer pid=2253919)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client
(APIServer pid=2253919)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2253919)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2253919)     return await anext(self.gen)
(APIServer pid=2253919)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 206, in build_async_engine_client_from_engine_args
(APIServer pid=2253919)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2253919)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 164, in from_vllm_config
(APIServer pid=2253919)     return cls(
(APIServer pid=2253919)            ^^^^
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
(APIServer pid=2253919)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2253919)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 100, in make_async_mp_client
(APIServer pid=2253919)     return AsyncMPClient(*client_args)
(APIServer pid=2253919)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 731, in __init__
(APIServer pid=2253919)     super().__init__(
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 420, in __init__
(APIServer pid=2253919)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=2253919)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2253919)   File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=2253919)     next(self.gen)
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=2253919)     wait_for_engine_startup(
(APIServer pid=2253919)   File "/root/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=2253919)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=2253919) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

I would greatly appreciate any insight you might have into the cause of this error and any suggestions on how to resolve it. It’s possible the issue is on my end, and I’d be grateful for any help.

I'm sorry. I get the same error with vLLM 0.9.2 and with the latest version (installed directly from git via pip).

It seems that although the quantization ran without errors on the "old" AutoAWQ, the MoE architecture does not like this form of quantization.
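For what it's worth, the KeyError points directly at the mismatch: vLLM keeps the per-layer MoE router (`mlp.gate`) as a plain unquantized linear layer, so its parameter dict only contains a `.weight` entry for it, while the AutoAWQ checkpoint ships a quantized `.qweight` tensor for the gate as well. A minimal sketch of the failing lookup (illustrative names, not the actual vLLM internals):

```python
# Sketch of why load_weights raises KeyError (illustrative names only).
# In vLLM the MoE router gate stays a plain nn.Linear, so params_dict
# holds ".weight" for it; AutoAWQ quantized the gate too, producing a
# ".qweight" checkpoint key that has no matching parameter.
params_dict = {
    "layers.14.mlp.gate.weight": "...",         # unquantized router gate
    "layers.14.mlp.experts.w13_weight": "...",  # fused expert weights
}

checkpoint_name = "layers.14.mlp.gate.qweight"  # what AutoAWQ wrote

try:
    param = params_dict[checkpoint_name]        # the lookup that fails
except KeyError as e:
    print(f"KeyError: {e}")                     # KeyError: 'layers.14.mlp.gate.qweight'
```

So a quantization recipe that leaves the router gates alone should avoid this particular failure.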

I did the upload before testing because I was confident it would run after the quantization process finished. I will try different tools (llm-compressor, autoround) tomorrow.

Someone has already done a successful quant with llm-compressor: https://huggingface.co/cpatonn/Qwen3-Coder-30B-A3B-Instruct-AWQ

I will try to reproduce that over the weekend for future versions.

I did a rerun with llm-compressor. This one does work with vLLM v0.9.1. There should not be much difference from cpatonn's version.
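If anyone retries with AutoAWQ or another tool, the likely fix is to exclude the MoE router gates from quantization via the tool's ignore/exclude list. A hypothetical matching rule (assuming Qwen3-MoE's module naming, where the router is `...mlp.gate` while the experts' own projections are named `gate_proj`/`up_proj`/`down_proj` and should still be quantized):

```python
import re

# Hypothetical exclude rule for a quantization recipe (assumption: module
# names follow Qwen3-MoE's layout). The router is "...mlp.gate"; the
# experts' "gate_proj" must NOT be caught, hence the "$" anchor.
ROUTER_GATE = re.compile(r"\.mlp\.gate$")

modules = [
    "model.layers.14.mlp.gate",                 # router -> keep unquantized
    "model.layers.14.mlp.experts.0.gate_proj",  # expert -> quantize
    "model.layers.14.mlp.experts.0.down_proj",  # expert -> quantize
]

to_quantize = [m for m in modules if not ROUTER_GATE.search(m)]
print(to_quantize)  # only the expert projections remain
```

The exact ignore-list syntax differs per tool, but the principle is the same: the router gate must stay a plain weight for vLLM to load it.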
