OP presents a hypothesis that quantizing the smaller 8B model in the Llama 3 series might harm it more than quantizing the larger 70B models. While the post reads a lot like a public service announcement, OP isn't sure of their anecdotal findings, so they posted to ask others to confirm whether their experience is replicable.
One comment confirms the OP's perception but admits that, without “proper statistics”, their findings are likewise hard to trust. The commenter noticed impaired instruction following in lower-quant Llama 8B models and writes that the impairment is especially pronounced when the context includes “rich and dense information”, while the model doesn't suffer as much with “low density information”.
One commenter offers a potential hypothesis for why this might be happening: Llama 3 was trained in BF16 rather than FP16, and BF16 has the same range (but not the same precision) as FP32. They further hypothesize that because BF16 behaves much like FP32, going from it to q8 is a more significant reduction than going from FP16, so Llama 3 models suffer more from quantization.
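To make the format difference behind this hypothesis concrete, here is a minimal sketch (PyTorch is assumed here; it is not mentioned in the thread) comparing the numeric properties of BF16 and FP16:

```python
import torch

# BF16 keeps FP32's 8-bit exponent (wide range) but has only 7 mantissa bits,
# while FP16 has a 5-bit exponent (narrow range) and 10 mantissa bits.
print(torch.finfo(torch.bfloat16))  # max ~3.4e38, coarse resolution
print(torch.finfo(torch.float16))   # max 65504,   finer resolution

# A value that fits comfortably in BF16/FP32 overflows FP16 entirely:
x = torch.tensor(1e5, dtype=torch.float32)
print(x.to(torch.bfloat16))  # close to 1e5 (rounded, but in range)
print(x.to(torch.float16))   # inf: outside FP16's representable range
```

This only illustrates the range-versus-precision trade-off the commenter is pointing at; it does not by itself show how those properties interact with q8 quantization.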
Others share useful papers on the exact topic under discussion, such as How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study, which reports benchmark scores across multiple Llama 3 quantization types.
Others suggest that models before Llama 3 were not making the best use of 16-bit precision, and that because Llama 3 does, people are seeing greater performance losses with its quantized versions.
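As a rough intuition for that claim, here is a toy, hypothetical sketch (again assuming PyTorch) of a symmetric 8-bit round trip; the more a weight distribution actually exploits fine-grained 16-bit precision, the more information coarse int8 rounding discards:

```python
import torch

def int8_round_trip(w: torch.Tensor) -> torch.Tensor:
    """Toy symmetric 8-bit quantization: map to int8 with one scale, then back."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q.to(torch.float32) * scale

# Small-magnitude weights (typical for LLM layers) after a round trip:
w = torch.randn(4096) * 0.02
err = (w - int8_round_trip(w)).abs().mean() / w.abs().mean()
print(f"mean relative round-trip error: {err:.4f}")
```

This is only a simplified per-tensor scheme for illustration, not the grouped/blocked quantization that real GGUF or GPTQ formats use, so it should be read as intuition rather than evidence.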
While much of this discussion may be speculation, sharing information to confirm each other's experiences may help people make better decisions about which quantization to use, and may even motivate rigorous experiments on the models.