Hey fellow llama enthusiasts! Great to see that not all of lemmy is AI sceptical.

I’m in the process of upgrading my server with a bunch of GPUs. I’m really excited about the new Mistral / Magistral Small 3.2 models and would love to serve them for me and a couple of friends. My research led me to vLLM with which I was able to double inference speed compared to ollama at least for qwen3-32b-awq.

Now sadly, the most common quantization methods (GGUF, EXL, BNB) are either not fully (GGUF) or not at all (EXL) supported in vLLM, or multi-gpu inference thouth tensor parallelism is not supported (BNB). And especially for new models it’s hard to find pre-quantized models in different, more broadly supported formats (AWQ, GPTQ).

Does any of you guys face a similar problem? Do you quantize models yourself? Are there any up-to-date guides you would recommend? Or did I completely overlook another, obvious solution?

It feels like when I’ve researched something yesterday, it’s already outdated again today, since the landscape is so rapidly evolving.

Anyways, thank you for reading and sharing your thoughts or experience if you feel like it.

  • SmokeyDope@lemmy.worldM
    link
    fedilink
    English
    arrow-up
    2
    ·
    edit-2
    22 hours ago

    Thank you for deciding to engage with our community here! You’re in good company.

    Kobold just released a bunch of tools for quant making you may want to check out.

    Kcpp_tools

    I have not made my own quants. I usually just find whatever imatrix gguf bartowlski or the other top makers on HF release.

    I too am in the process of upgrading my homelab and opening up my model engine as a semi public service. The biggest performance gains ive found are using CUDA and loading everything in vram. So far just been working with my old nvidia 1070ti 8gb card.

    Havent tried vllm engine just kobold. I hear good things about vllm it will be something to look into sometime. I’m happy and comfortable with my model engine system as it got everything setup just the way I want is but I’m always open to performance optimization.

    If you havent already try running vllm with its CPU nicencess set to highest priority. If vllm can use flash attention try that too.

    I’m just enough of a computer nerd to get the gist of technical things and set everything up software/networking side. Bought a domain name, set up a web server and hardened it. Kobolds webui didnt come with https SSL/TLS cert handling so I needed to get a reverse proxy working to get the connection properly encrypted.

    I am really passionate about this even though so much of the technical nitty gritty under the hood behind models goes over my head. I was inspired enough to buy a p100 Tesla 16gb and try shoving it into an old gaming desktop which is my current homelab project. I dont have a lot of money so this was months of saving for the used server class GPU and the PSU to run it + the 1070ti 8gb I have later.

    The PC/server building hardware side scares me but I’m working on it. I’m not used to swapping parts out at all. when I tried to build my own PC a decade ago it didnt last long before something blew so there’s a bit of residual trauma there. I’m worried about things not fit right in the case, or destroying something or the the card not working and it all.

    Those are unhealthy worries when I’m trying to apply myself to this cutting edge stuff. I’m really trying to work past that anxiety and just try my best to install the stupid GPU. I figure if I fail I fail thats life it will be a learning experience either way.

    I want to document the upgrade process journey on my new self hosted site. I also want to open my kobold service to public use by fellow hobbyist. I’m not quite confident in sharing my domain on the public web though just yet I’m still cooking.

    • robber@lemmy.mlOP
      link
      fedilink
      English
      arrow-up
      2
      ·
      17 hours ago

      Thanks for the tip about kobold, didn’t know about that.

      And yeah, I can understand that building your own rig might feel overwhelming at first, but there’s tons of information online that I’m sure will help you get there!