OOM with vllm==0.10.1 on GPU L40S

#7 by qingfu - opened

I'm deploying this model on an L40S with vllm==0.10.1, but it hits an OOM error. Here are the logs:

INFO 08-20 03:09:59 [default_loader.py:262] Loading weights took 2.66 seconds
INFO 08-20 03:10:00 [model_runner.py:1112] Model loading took 16.5557 GiB and 2.803817 seconds
ERROR 08-20 03:10:01 [engine.py:467] CUDA out of memory. Tried to allocate 33.75 GiB. GPU 0 has a total capacity of 44.40 GiB of which 26.88 GiB is free. Including non-PyTorch memory, this process has 17.52 GiB memory in use. Of the allocated memory 17.03 GiB is allocated by PyTorch, and 6.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR 08-20 03:10:01 [engine.py:467] Traceback (most recent call last):
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 455, in run_mp_engine
ERROR 08-20 03:10:01 [engine.py:467]     engine = MQLLMEngine.from_vllm_config(
ERROR 08-20 03:10:01 [engine.py:467]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1557, in inner
ERROR 08-20 03:10:01 [engine.py:467]     return fn(*args, **kwargs)
ERROR 08-20 03:10:01 [engine.py:467]            ^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 144, in from_vllm_config
ERROR 08-20 03:10:01 [engine.py:467]     return cls(
ERROR 08-20 03:10:01 [engine.py:467]            ^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 88, in __init__
ERROR 08-20 03:10:01 [engine.py:467]     self.engine = LLMEngine(*args, **kwargs)
ERROR 08-20 03:10:01 [engine.py:467]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 260, in __init__
ERROR 08-20 03:10:01 [engine.py:467]     self._initialize_kv_caches()
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 402, in _initialize_kv_caches
ERROR 08-20 03:10:01 [engine.py:467]     self.model_executor.determine_num_available_blocks())
ERROR 08-20 03:10:01 [engine.py:467]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 105, in determine_num_available_blocks
ERROR 08-20 03:10:01 [engine.py:467]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 08-20 03:10:01 [engine.py:467]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 08-20 03:10:01 [engine.py:467]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 08-20 03:10:01 [engine.py:467]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3007, in run_method
ERROR 08-20 03:10:01 [engine.py:467]     return func(*args, **kwargs)
ERROR 08-20 03:10:01 [engine.py:467]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-20 03:10:01 [engine.py:467]     return func(*args, **kwargs)
ERROR 08-20 03:10:01 [engine.py:467]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/worker/worker.py", line 257, in determine_num_available_blocks
ERROR 08-20 03:10:01 [engine.py:467]     self.model_runner.profile_run()
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-20 03:10:01 [engine.py:467]     return func(*args, **kwargs)
ERROR 08-20 03:10:01 [engine.py:467]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1204, in profile_run
ERROR 08-20 03:10:01 [engine.py:467]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1330, in _dummy_run
ERROR 08-20 03:10:01 [engine.py:467]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-20 03:10:01 [engine.py:467]     return func(*args, **kwargs)
ERROR 08-20 03:10:01 [engine.py:467]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1701, in execute_model
ERROR 08-20 03:10:01 [engine.py:467]     hidden_or_intermediate_states = model_executable(
ERROR 08-20 03:10:01 [engine.py:467]                                     ^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 08-20 03:10:01 [engine.py:467]     return self._call_impl(*args, **kwargs)
ERROR 08-20 03:10:01 [engine.py:467]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 08-20 03:10:01 [engine.py:467]     return forward_call(*args, **kwargs)
ERROR 08-20 03:10:01 [engine.py:467]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/model_executor/models/nemotron_h.py", line 596, in forward
ERROR 08-20 03:10:01 [engine.py:467]     self.mamba_cache = MambaCacheManager(self.vllm_config,
ERROR 08-20 03:10:01 [engine.py:467]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467]   File "/home/qingfu/py312/lib/python3.12/site-packages/vllm/model_executor/models/mamba_cache.py", line 50, in __init__
ERROR 08-20 03:10:01 [engine.py:467]     temporal_state = torch.empty(size=(num_mamba_layers, max_batch_size) +
ERROR 08-20 03:10:01 [engine.py:467]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-20 03:10:01 [engine.py:467] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 33.75 GiB. GPU 0 has a total capacity of 44.40 GiB of which 26.88 GiB is free. Including non-PyTorch memory, this process has 17.52 GiB memory in use. Of the allocated memory 17.03 GiB is allocated by PyTorch, and 6.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The only GPU I got this working on so far is RTX6000 Pro:

INFO 08-21 13:28:39 [worker.py:295] the current vLLM instance can use total_gpu_memory (94.97GiB) x gpu_memory_utilization (0.92) = 87.37GiB
INFO 08-21 13:28:39 [worker.py:295] model weights take 16.58GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 19.27GiB; the rest of the memory reserved for KV Cache is 51.40GiB.
INFO 08-21 13:28:40 [executor_base.py:114] # cuda blocks: 210550, # CPU blocks: 16384
INFO 08-21 13:28:40 [executor_base.py:119] Maximum concurrency for 8192 tokens per request: 411.23x
INFO 08-21 13:28:40 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 19/19 [00:09<00:00,  1.95it/s]
INFO 08-21 13:28:50 [model_runner.py:1535] Graph capturing finished in 10 secs, took 0.27 GiB

It looks like the model needs almost 40 GB for just the weights and activations, so 48 GB is probably not enough (I also got an OOM on my 2x3090).
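
Rough arithmetic from the logs above makes the failure clear. The numbers are copied straight from the error message; the comment on scaling is an assumption based on the mamba_cache.py line in the traceback:

```python
# Numbers copied from the OOM message above (GiB).
total_l40s  = 44.40   # total GPU capacity reported
in_use      = 17.52   # weights + non-PyTorch memory already resident
mamba_state = 33.75   # the single torch.empty() the profile run tried to allocate

print(f"required ~{in_use + mamba_state:.2f} GiB vs {total_l40s:.2f} GiB on the card")
# -> required ~51.27 GiB vs 44.40 GiB, so the profile run can never fit.

# mamba_cache.py allocates the temporal state as
#   torch.empty((num_mamba_layers, max_batch_size) + state_shape, ...)
# so the buffer grows linearly with the max batch size; lowering
# --max-num-seqs shrinks it proportionally (hence the workaround below).
```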

To get a vLLM build that works on Blackwell, try: uv pip install -U vllm==0.10.1.1 --torch-backend=cu128
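
A quick sanity check after installing, to confirm the CUDA 12.8 build was picked up (nothing vLLM-specific, just standard version queries):

```python
import torch
import vllm

print(vllm.__version__)               # expect 0.10.1.1
print(torch.version.cuda)             # expect a 12.8 build for the cu128 backend
print(torch.cuda.get_device_name(0))  # confirm the Blackwell card is visible
```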

EDIT: After playing with it for a bit: add --max-num-seqs 64 to the vllm command and the model loads fine on 48 GB!
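
For reference, a minimal sketch of the same workaround via the offline Python API (the model id is a placeholder; max_num_seqs and gpu_memory_utilization are standard vLLM engine arguments):

```python
from vllm import LLM, SamplingParams

# max_num_seqs caps the max batch size, which also bounds the Mamba
# temporal-state buffer that triggered the OOM during profiling.
llm = LLM(
    model="path/to/this-model",      # placeholder: use this repo's model id
    max_num_seqs=64,
    gpu_memory_utilization=0.9,      # illustrative value
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```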

Hi @qingfu @mike-ravkine

Thank you for sharing. Yes, we need to specify --max-num-seqs and lower it if the model doesn't fit on the GPU.
We'll add a note about this to the Model Card. Thank you for your support!
