Effort Engine

From the blog Tao of Mac:
I’ve been pointing out for ages now that LLMs are barely optimized, so here’s another example of a possible inference speedup that seems very promising (it works somewhat like on-the-fly distillation). If this technique checks out and ends up implemented in mainstream tooling like ollama, it’s going to significantly lower compute and memory...
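
As I understand it, the trick is to perform only a tunable fraction of the multiplications in each matrix-vector product at inference time, skipping the ones that contribute least to the result. Here’s a minimal, purely illustrative sketch of that idea in Python/NumPy — the function name, the scoring, and the thresholding strategy are my own assumptions, not the engine’s actual kernels:

```python
import numpy as np

def effort_matvec(W, x, effort=0.3):
    """Approximate y = W @ x by keeping only the largest-magnitude products.

    `effort` is the fraction of multiplications whose contribution is kept
    (1.0 reproduces the exact result). Hypothetical sketch only: it scores
    every product up front to pick the mask, so it demonstrates the
    approximation quality, not the speedup. A real implementation would
    presumably pre-sort or bucket the weights so skipped products are never
    computed at all.
    """
    # Score each individual product |W_ij * x_j| (broadcast over columns).
    scores = np.abs(W) * np.abs(x)
    k = max(1, int(effort * scores.size))
    # Threshold at the k-th largest score and zero out everything below it.
    threshold = np.partition(scores.ravel(), -k)[-k]
    mask = scores >= threshold
    return (W * mask) @ x

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
x = rng.standard_normal(512)

exact = W @ x
approx = effort_matvec(W, x, effort=0.3)
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```

The interesting part is that `effort` becomes a dial you could turn per query (or even per layer), trading a little accuracy for a lot less arithmetic — which is presumably where the compute and memory savings would come from.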