This concerns a bug in llama.cpp, which struggles with converting fine-tuned llama3 models to GGUF. One of the devs responded:
Thanks for bringing this to the devs’ attention. It is sad for me to see this result. As it seems to be very personal I won’t ask you to share the gguf, but, if possible, could you try it on a different inference engine that also can load the gguf (like mistral.rs, which is based on candle instead of the ggml library), to see if the issue is the gguf format/conversion or the llama.cpp inference engine? If this is too complicated, ignore this reply. Thanks again for bringing this issue to our attention.
This seems to be the most cooperative exchange (perhaps that is the better word), since everyone is working toward the same goal of fixing the bug. It is also salient, as resolving it has the potential to fix inference for many models. Some of the suggestions:
- Was the temperature set to 0 when you ran the side-by-side test?
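
The temperature-0 question matters because greedy decoding makes both models deterministic, so any divergence in a side-by-side test points at the conversion or tokenizer rather than at random sampling. Below is a minimal sketch of such a comparison using llama-cpp-python and transformers; the model names and the .gguf path are placeholders, not values from the actual thread.

```python
# Sketch: greedy (temperature-0) side-by-side comparison between the original
# fine-tuned checkpoint and its GGUF conversion. Paths/names are placeholders.
from llama_cpp import Llama
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Explain what a GGUF file is in one sentence."

# Original fine-tuned model via transformers, greedy decoding (no sampling).
tok = AutoTokenizer.from_pretrained("your-org/your-finetuned-llama3")       # placeholder
hf_model = AutoModelForCausalLM.from_pretrained("your-org/your-finetuned-llama3")
inputs = tok(prompt, return_tensors="pt")
hf_out = hf_model.generate(**inputs, max_new_tokens=64, do_sample=False)
print("HF   :", tok.decode(hf_out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))

# GGUF conversion via llama.cpp, temperature 0 for deterministic decoding.
llm = Llama(model_path="your-finetuned-llama3.gguf")                        # placeholder
gguf_out = llm(prompt, max_tokens=64, temperature=0.0)
print("GGUF :", gguf_out["choices"][0]["text"])
```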
They did end up finding the bug: the llama3 instruct template was not being tokenized correctly. And while the entire resolution process is documented in the GitHub discussion, it was also communicated back to the Reddit thread. I don’t think the Reddit thread is as notable as the discussion on GitHub.
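
To illustrate the kind of check that surfaces this class of bug: the rendered llama3 instruct prompt can be tokenized with both the original Hugging Face tokenizer and the GGUF's embedded tokenizer, and the two token-id sequences compared. This is a rough sketch under assumed placeholder model names and file paths (not taken from the actual thread), using llama-cpp-python.

```python
# Sketch: compare how the llama3 instruct template is tokenized by the
# original tokenizer and by the GGUF's embedded tokenizer. Names/paths are
# placeholders.
from llama_cpp import Llama
from transformers import AutoTokenizer

messages = [{"role": "user", "content": "Hello, who are you?"}]

hf_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# Render the chat template to a plain string, then get its token ids.
rendered = hf_tok.apply_chat_template(messages, tokenize=False,
                                      add_generation_prompt=True)
hf_ids = hf_tok(rendered, add_special_tokens=False)["input_ids"]

llm = Llama(model_path="your-finetuned-llama3.gguf")                        # placeholder
# special=True lets llama.cpp parse tokens like <|start_header_id|> as single
# special tokens instead of plain text; add_bos=False avoids a double BOS,
# since the rendered template already starts with <|begin_of_text|>.
gguf_ids = llm.tokenize(rendered.encode("utf-8"), add_bos=False, special=True)

print("match:", list(hf_ids) == list(gguf_ids))
for i, (a, b) in enumerate(zip(hf_ids, gguf_ids)):
    if a != b:
        print(f"first mismatch at position {i}: HF={a} vs GGUF={b}")
        break
```

If the two sequences diverge on the template's special tokens, that points at the template/tokenizer handling in the conversion rather than at the model weights themselves.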