some local llms are switching from 8-bit to 4-bit quantization for a performance boost. i've been digging into this and found some interesting stuff.
on one hand, keeping weights at 8-bit gives you the better quality of the two, but it can be heavy on memory ⚡
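quick napkin math on why the memory side matters (my own numbers for a hypothetical 7B-parameter model, not figures from the article) — weight storage alone, before activations and KV cache:

```python
# Rough weight-memory estimate for a hypothetical 7B-parameter model.
# Only counts the weights themselves; activations, KV cache and
# framework overhead come on top of this.
params = 7_000_000_000

for bits in (16, 8, 4):
    gb = params * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{bits}-bit weights: ~{gb:.1f} GB")
```

so going 8-bit → 4-bit roughly halves the weight footprint, which is often the difference between fitting on a consumer GPU or not.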
but then there's halving the bits for almost the same output quality - i'm talking about 4-bit quantization. turns out it saves a ton of VRAM and speeds things up without hurting performance too badly.
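if you're wondering what "4-bit quantization" actually does to the weights, here's a toy sketch of simple absmax (symmetric) round-to-nearest quantization. real libraries like bitsandbytes use fancier schemes (NF4, blockwise scales), so this is just the core idea, not what any particular runtime does:

```python
# Toy absmax quantization: map floats to small signed ints + one scale.
# This is a simplified sketch, not the scheme any specific library uses.

def quantize(weights, bits):
    """Quantize floats to signed ints representable in `bits` bits."""
    levels = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    q = [round(w / scale) for w in weights]  # round-to-nearest integer
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the ints and the scale."""
    return [x * scale for x in q]

weights = [0.42, -1.3, 0.07, 0.9, -0.55]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: max round-trip error = {err:.4f}")
```

the 4-bit round-trip error is noticeably bigger per weight, which is why quality drops a little - the surprising part is how well big models tolerate it in practice.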
i've tried both in my local voice-assistant setup and found the 8-bit version to be smoother, but not by much ⬆️
so what's your take? sticking with 8-bit or going light on memory like i did?
anyone else out there experimenting here, share some tips!
found this here:
https://www.sitepoint.com/quantized-local-llms-4bit-vs-8bit-analysis/?utm_source=rss