Frequent interruptions during reasoning with vLLM 0.8.1
#9 opened by alwinzhang
Start command:
```bash
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 9005 \
    --max-model-len 65536 \
    --max-seq-len-to-capture 65536 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --served-model-name deepseek-v1 \
    --model /mnt/sfs_turbo/models/pytorch/DeepSeek-V3-0324-AWQ/
```
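
For reference, here is a minimal sketch of how the server is queried once it is up. It assumes the endpoint is reachable at http://localhost:9005/v1 and that no `--api-key` was set on the server, so the placeholder key `"EMPTY"` is accepted; the model name matches `--served-model-name` from the command above.

```python
# Minimal client sketch for the OpenAI-compatible endpoint started above.
# Assumptions: server reachable at localhost:9005, no API key enforced.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9005/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-v1",  # matches --served-model-name in the launch command
    messages=[{"role": "user", "content": "Hello, are you responding?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```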