Hey fellow llama enthusiasts! Great to see that not all of Lemmy is AI-sceptical.
I’m in the process of upgrading my server with a bunch of GPUs. I’m really excited about the new Mistral / Magistral Small 3.2 models and would love to serve them for myself and a couple of friends. My research led me to vLLM, which roughly doubled inference speed compared to Ollama, at least for qwen3-32b-awq.
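For context, this is roughly how I’m launching it. The model path, GPU count, and context length here are just placeholders for my setup, not a recommendation:

```shell
# Serve an AWQ-quantized model across 2 GPUs via tensor parallelism.
# Model name, --tensor-parallel-size, and --max-model-len are placeholders
# for my own hardware; adjust them to yours.
vllm serve Qwen/Qwen3-32B-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 16384
```

This exposes an OpenAI-compatible API on port 8000 by default, which is what makes sharing it with friends straightforward.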
Now sadly, the most common quantization formats are either only partially supported in vLLM (GGUF), not supported at all (EXL2), or can’t be combined with multi-GPU inference through tensor parallelism (BNB). And especially for new models it’s hard to find pre-quantized versions in the more broadly supported formats (AWQ, GPTQ).
Do any of you face a similar problem? Do you quantize models yourselves? Are there any up-to-date guides you would recommend? Or did I completely overlook another, obvious solution?
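In case it helps to make the question concrete: self-quantizing to AWQ is what I’ve been looking at, roughly along these lines with AutoAWQ. This is just a sketch — the model paths are placeholders, the quant settings are the common AutoAWQ defaults rather than anything I’ve tuned, and I’m assuming AutoAWQ already supports the model’s architecture (brand-new architectures sometimes need a newer release):

```python
# Sketch: quantize a Hugging Face model to AWQ with AutoAWQ.
# Needs a GPU and the `autoawq` package; paths below are placeholders.

model_path = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"  # placeholder
quant_path = "./mistral-small-3.2-awq"                        # placeholder

# Common AutoAWQ defaults: 4-bit weights, group size 128
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

def main() -> None:
    # Deferred imports: autoawq pulls in torch/CUDA, so only import
    # when actually running the quantization.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Calibrates on AutoAWQ's default calibration data and
    # rewrites the weights in place.
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

if __name__ == "__main__":
    main()
```

The resulting folder should then load in vLLM with `--quantization awq`, assuming enough VRAM for the calibration pass in the first place.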
It feels like whatever I researched yesterday is already outdated today, since the landscape is evolving so rapidly.
Anyways, thank you for reading and sharing your thoughts or experience if you feel like it.
Thanks for the tip about kobold, didn’t know about that.
And yeah, I can understand that building your own rig might feel overwhelming at first, but there’s tons of information online that I’m sure will help you get there!