• iceberg314@midwest.social
    14 hours ago

    That is why I like small, specialized, locally hosted AI. It runs acceptably fast and quiet on my gaming PC, it’s private, and I can feed it knowledge in small doses on specific topics and projects.

    • ctrl_alt_esc@lemmy.ml
      3 hours ago

      Which model do you use, and what are your specs? I ran a couple using an RTX 5060 with 16 GB, and it’s too slow to be usable for larger models, while the smaller ones are mostly useless.

      • iceberg314@midwest.social
        3 hours ago

        I also have a 5060 (Ti) with 16 GB of VRAM. I tend to use GPT-OSS:20B or Qwen3:14B with a context of ~30k. I have a custom system prompt in Open WebUI for the style of response I like. That takes up about 14 GB of my 16 GB of VRAM.

        But yeah, it is slower and not as “smart” as the cloud-based models. Still, I think the inconvenience of the speed and having to fact-check/test code is worth the privacy and environmental trade-offs.
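        For anyone wanting to reproduce a setup like this, here is a rough sketch using an Ollama Modelfile. The context size and system prompt text are assumptions for illustration (the comment only mentions “~30k context” and “a custom system prompt”), not the commenter’s actual config:

        ```shell
        # Hypothetical Modelfile: pin a ~30k context and a custom system prompt
        # onto GPT-OSS:20B, then run the derived model.
        cat > Modelfile <<'EOF'
        FROM gpt-oss:20b
        PARAMETER num_ctx 30720
        SYSTEM "Answer concisely and show your reasoning briefly."
        EOF

        ollama create my-gpt-oss -f Modelfile
        ollama run my-gpt-oss
        ```

        If VRAM runs tight at 16 GB, lowering `num_ctx` is usually the first knob to turn, since the KV cache grows with context length.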

        • Hexarei@beehaw.org
          2 hours ago

          I’ve had good success on similar hardware (5070 + more RAM) with GLM-4.7-Flash, using llama.cpp’s --cpu-moe flag: I can get up to 150k context with it at 20-ish tok/sec. I’ve also found it a lot better for agentic use than GPT-OSS; it puts in a much more in-depth reasoning effort, so while it spends more tokens, it seems worth it for the end result.
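          In case it helps anyone, a sketch of what that llama.cpp invocation could look like. The GGUF filename/quant and the exact context value are placeholders, not from my actual setup:

          ```shell
          # --cpu-moe keeps the MoE expert weights in system RAM while the
          # attention/shared layers are offloaded to the GPU (-ngl), which is
          # what makes a big context fit on a 16 GB-class card.
          llama-server \
            -m GLM-4.7-Flash-Q4_K_M.gguf \
            --n-gpu-layers 999 \
            --cpu-moe \
            --ctx-size 151552
          ```

          If that still overflows VRAM, `--n-cpu-moe N` lets you move only the experts of the first N layers to the CPU instead of all of them.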