Requesting information about hardware resources

#28
by Ishuks - opened

What hardware resources are needed to run this model? If I have a lower hardware configuration, how can I make sure the model runs on my system? I have tried to run the model, but there is a configuration issue.

To run Qwen/Qwen2.5-Coder-32B-Instruct effectively, significant hardware resources are typically required. Here's an overview of the recommended hardware and some strategies for running the model on lower-end systems:

Recommended Hardware

The ideal setup for running Qwen2.5-Coder-32B-Instruct includes the following (a rough memory estimate after the list shows where these figures come from):

  • A GPU with at least 24GB of VRAM, such as an NVIDIA GeForce RTX 3090[1]
  • Alternatively, a Mac with 48GB of RAM[1]
  • For optimal performance, NVIDIA A100 or H100 GPUs are recommended[2]
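
A quick back-of-the-envelope estimate (weights only; the KV cache and runtime overhead add several more GB on top) shows why these sizes matter:

```python
# Rough VRAM needed just to hold the weights of a 32B-parameter model.
params = 32e9

for name, bytes_per_param in [("FP16/BF16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name:>9}: ~{gib:.0f} GiB")

# FP16/BF16: ~60 GiB -> needs an 80GB A100/H100 or multiple GPUs
#     8-bit: ~30 GiB -> still too large for a single 24GB card
#     4-bit: ~15 GiB -> fits on a 24GB RTX 3090 with headroom for the KV cache
```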

Running on Lower Hardware Configurations

If you have a lower hardware configuration, you can still attempt to run the model with some adjustments:

Use Quantization

Quantization can significantly reduce the memory requirements:

  • Look into GPTQ, AWQ, or GGUF quantized versions of the model, which are provided by the Qwen team[5]
  • These quantized versions can run on GPUs with less VRAM (see the loading sketch below)
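
As a minimal loading sketch with Hugging Face transformers, assuming the official AWQ build is published under the repository id Qwen/Qwen2.5-Coder-32B-Instruct-AWQ (check the Qwen collection on the Hub for the exact names of the AWQ, GPTQ, and GGUF variants):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for the official AWQ build; adjust to the variant you use.
model_id = "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's quantized dtype
    device_map="auto",    # offload layers to CPU if the GPU runs out of VRAM
)

messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Note that AWQ and GPTQ checkpoints need their respective kernel packages installed, while the GGUF builds are intended for llama.cpp and Ollama rather than transformers.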

Layer-by-Layer Inference

For extremely limited hardware (e.g., a 4GB GPU):

  • Consider using a technique called layer-by-layer inference
  • This approach loads and processes one layer at a time, dramatically reducing VRAM usage[4]
  • An open-source project called AirLLM implements this technique for large models, including Qwen2.5; a usage sketch follows below[4]
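
The sketch below is based on the AirLLM README; the exact class names and arguments may differ between releases, so verify them against the project's documentation:

```python
from airllm import AutoModel

# Layer-by-layer loader: each transformer block is read from disk, run,
# and released in turn, so peak VRAM stays around one layer's worth of weights.
model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    compression="4bit",   # optional block-wise quantization to shrink per-layer loads
)

inputs = model.tokenizer(
    ["Write a function that checks whether a number is prime."],
    return_tensors="pt",
    truncation=True,
    max_length=128,
)

out = model.generate(
    inputs["input_ids"].cuda(),
    max_new_tokens=64,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(out.sequences[0]))
```

Expect generation to be very slow, since every token pass has to stream all layers from disk.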

Adjust Context Size

  • Reduce the context length so that the model and its KV cache fit into your available memory
  • This may require some configuration and tweaking; see the example below[1]
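
For example, when serving the model through Ollama, the per-request num_ctx option controls the context window (and therefore the size of the KV cache). The tag qwen2.5-coder:32b below is an assumption about how the model was pulled:

```python
import requests

# Ask a locally running Ollama server for a completion with a reduced context window.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:32b",   # assumed model tag
        "prompt": "Write a haiku about code review.",
        "stream": False,
        "options": {"num_ctx": 4096},   # smaller context -> smaller KV cache
    },
    timeout=600,
)
print(resp.json()["response"])
```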

Addressing Configuration Issues

If you're experiencing configuration issues:

  1. Ensure the Ollama service is properly exposed to the network:

    • On macOS: Set the environment variable with launchctl setenv OLLAMA_HOST "0.0.0.0"
    • On Linux: Edit the Ollama service file and add Environment="OLLAMA_HOST=0.0.0.0" under the [Service] section
    • On Windows: Add OLLAMA_HOST with value 0.0.0.0 to your environment variables[3]
  2. Verify model configuration in Dify:

    • Check settings for model name, server URL, and model UID[3]
  3. Review logs for specific error messages to identify the root cause of any issues[3]

  4. Test network accessibility:

    • Use tools like curl or ping to confirm the Ollama service is reachable from your system; a small Python check is sketched below[3]
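
As a small Python alternative to curl or ping (the host address below is hypothetical), you can confirm that the server answers and list the models it has pulled:

```python
import requests

host = "http://192.168.1.50:11434"  # hypothetical address of the machine running Ollama

try:
    # The Ollama server replies "Ollama is running" on its root path,
    # and /api/tags lists the locally available models.
    print(requests.get(host, timeout=5).text)
    for m in requests.get(f"{host}/api/tags", timeout=5).json().get("models", []):
        print("available:", m["name"])
except requests.ConnectionError as err:
    print("Ollama is not reachable - check OLLAMA_HOST and your firewall:", err)
```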

Remember that running such a large model on limited hardware may result in slower performance, making it more suitable for asynchronous tasks than for real-time applications[4].

Citations:
[1] https://www.reddit.com/r/LocalLLaMA/comments/1gp4g8a/hardware_requirements_to_run_qwen_25_32b/
[2] https://www.hyperstack.cloud/technical-resources/tutorials/deploying-and-using-qwen-2-5-coder-32b-instruct-on-hyperstack-a-quick-start-guide
[3] https://www.restack.io/p/dify-qwen-2-5-deployed-with-ollama-is-not-available-in-dify
[4] https://ai.gopubby.com/breakthrough-running-the-new-king-of-open-source-llms-qwen2-5-on-an-ancient-4gb-gpu-e4ebf4498230?gi=1aaf4f8b5aca
[5] https://qwen2.org/qwen2-5-coder-32b-instruct/
[6] https://huggingface.co/Qwen/Qwen2-7B-Instruct-GGUF/discussions/6
[7] https://news.ycombinator.com/item?id=42123909
[8] https://simonwillison.net/2024/Nov/12/qwen25-coder/
[9] https://qwenlm.github.io/blog/qwen2.5/

Thank you for the detailed response. I will invest in such hardware to run the model.

You're welcome!

I got the necessary hardware and will follow up here in a few days to report how it goes on the RTX 3090.

Hi!
I'm curious: what did you mean by 'Alternatively, a Mac with 48GB of RAM'? Can I use a Mac instead of a GPU to run local models?

Modern (latest-generation) Apple Silicon Macs have a Neural Engine plus an integrated GPU that shares unified memory, so as long as you have enough RAM you should be good to go!

Aha, that's interesting. So basically any Mac with an M4 and 48GB should do it? As I understand it, these chips don't have dedicated VRAM but share the RAM across the GPU and CPU? So as long as there's some overhead left to run the OS, and you're not running other things, it should work well.

Any Apple Silicon Mac will, and an M4 Pro is optimal. I think a base M4 will work, but if it doesn't, let me know. None of these workloads should fry a Mac, though, as the model loader will protect your computer from overloading. Let me know how it goes, and if you need any help with a quicker response (I'll be online again around 10:36 EDT (UTC-4)), email me.
