Because they know that the problem they solved is also a problem for others in the community
For example, we observed two different implementations of a tool that automatically detects the best retrieval-augmented generation (RAG) pipeline given the user's database and needs. This is useful to the community because each component in a RAG pipeline, such as the document chunking strategy, retriever, and re-ranker, is subject to change and improvement over time, making it difficult for a beginner to find the exact ensemble that suits their needs. In one of these posts, the poster explicitly asked the community, "What do you want to make with RAG?": the project was ready for feedback at the time of posting, but the developer specifically needed someone to test it on a real-world scenario. Posting the project on r/LocalLLaMA thus provided a testbed for the efficacy of their tool, while giving the beta-tester the opportunity to have a RAG pipeline built for their need. Such win-win arrangements create conditions where meaningful collaboration can happen between developers and users, a pattern we observed across multiple posts.
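To make the pipeline components named above concrete, the sketch below lays out the stages such a pipeline-selection tool would need to tune. Every function name and strategy here is an illustrative placeholder under our own assumptions, not code from the posted tools.

```python
from typing import List

# Illustrative sketch of the RAG stages a pipeline-selection tool would tune
# (chunking, retrieval, re-ranking); names and strategies are placeholders.
def chunk(document: str, size: int = 500) -> List[str]:
    # One of many chunking strategies: fixed-size character windows.
    return [document[i:i + size] for i in range(0, len(document), size)]

def retrieve(query: str, chunks: List[str], k: int = 5) -> List[str]:
    # Retriever stand-in: naive keyword overlap instead of embedding search.
    return sorted(
        chunks,
        key=lambda c: sum(word in c.lower() for word in query.lower().split()),
        reverse=True,
    )[:k]

def rerank(query: str, candidates: List[str], k: int = 2) -> List[str]:
    # Re-ranker stand-in: a second, usually costlier scoring pass.
    return candidates[:k]

def build_context(query: str, document: str) -> str:
    # The selected chunks become the context handed to the language model.
    return "\n".join(rerank(query, retrieve(query, chunk(document))))
```

Swapping any one of these stages (a different chunk size, retriever, or re-ranker) changes end-to-end quality, which is why finding the right ensemble is hard for beginners.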
At times, posters are encouraged to make their projects open source. One developer posted an app that allows community members to run small language models (SLMs) locally on their phones, providing the community with a tool to test out new SLMs in a practical setting while also opening a channel for feedback. At the time of posting, the developer only offered the app on the iOS and Android app stores, but a commenter recommended that they make the project open source. This led the developer to publicly release the code on GitHub, and as of this writing, the PocketPal AI repository remains in active development with 15 contributors. As new contributors have added features such as import and export functions and a Chinese localization, the application has had the opportunity to expand beyond the developer's initial scope, which would not have been possible without the nudge.
Some developers of open stacks develop a reputation for their expertise on r/LocalLLaMA over time. For instance, the developers of Unsloth, a fine-tuning library for open-weight models, began their history on r/LocalLLaMA by posting the fine-tuning library itself, which achieved faster performance and lower memory usage than its competitors. Over time, they posted more informative content, such as reporting their finding that when fine-tuning certain models, some layers must be kept in their original precision to avoid significant losses in performance. Furthermore, they provide educational material on top of the repository itself, such as technical blogs and Jupyter notebook guides to fine-tuning a diverse range of models differing in size and architecture. This dedication to both tool access and knowledge has earned them a reputation as the community's fine-tuning experts, which at times grants them a privileged voice in the community, allowing them to host "ask me anything" (AMA) sessions where members ask questions spanning their journey to expertise, how to start fine-tuning, funding open-source projects, and expert takes on forward-looking AI speculation.
We noted themes of addressing shared and in-demand problems of the community through model releases as well. Many of the models shared by members on r/LocalLLaMA are "derivative models", which expand on existing open models to implement a new capability or make them more efficient. One such example is "community quants". Quantization reduces the precision of a model's weights, storing them in a lower-precision numerical format so that the quantized model approximates the original while requiring far less memory to run. Since quantized models are highly beneficial to those in the community who run models on their own machines, members who are "GPU rich" upload quantizations of new model releases for the community. At times, the community quants are more trusted than those provided by the model developers themselves, because community experts adopt state-of-the-art quantization methods more quickly.
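As a rough illustration of the trade-off described above (and not of any particular community quant recipe), the sketch below quantizes a weight matrix to 8-bit integers with a single scale factor, cutting its memory footprint to roughly a quarter of float32 at the cost of a small approximation error.

```python
import numpy as np

# Rough illustration of weight quantization, not a specific community quant
# format: map float32 weights to int8 values plus one per-tensor scale.
weights = np.random.randn(4096, 4096).astype(np.float32)

scale = np.abs(weights).max() / 127.0             # per-tensor scale factor
q_weights = np.round(weights / scale).astype(np.int8)

# At inference time, an approximation of the original weights is recovered.
approx = q_weights.astype(np.float32) * scale

print("float32 bytes:", weights.nbytes)           # ~67 MB
print("int8 bytes:   ", q_weights.nbytes)         # ~17 MB
print("mean abs err: ", float(np.abs(weights - approx).mean()))
```

Real quantization schemes used by the community (e.g., 4-bit formats with per-block scales) are more sophisticated, but the memory-versus-fidelity trade-off is the same.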
Quantizations are not the only type of derivative models. We found posters converting existing models into formats more compatible with the inference stacks most commonly used by the community, fine-tunes of existing models that efficiently apply code updates without regenerating the entire file, and models fine-tuned on a bespoke dataset of chain-of-thought responses from more performant reasoning models, compressed into a size that local machines can take advantage of.
Validation of models occurs through "model reviews". In the context of quantized models, people post comparisons of different quantization levels of the same model, either through benchmarks or qualitative analysis. When these quantized models became more suitable for use on local computers, there was considerable interest in the community in using them, but little data to help members understand exactly how much performance they were losing at each level of quantization. One poster noticed this demand and provided the community with comparisons of 4-bit, 5-bit, 6-bit, and 8-bit quantizations of the Llama 3 model. Notably, they found that the 4-bit quantization outperformed the 5-, 6-, and 8-bit quantizations in their testing, but wondered whether such trends hold for larger models that they did not get to test, and whether the test could be formalized into an automatic script. Commenters jumped in to expand the poster's work by writing scripts that not only automated the process of benchmarking different models but also accounted for the statistical significance of the results, providing even more reliable data that the community can use to determine which model best suits their needs.
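The commenters' scripts themselves are not reproduced here, but a minimal version of the idea, comparing per-question correctness of two quantization levels and bootstrapping a confidence interval on the gap, might look like the following. The correctness arrays are hypothetical placeholders, not data from the original posts.

```python
import numpy as np

# Minimal sketch of benchmarking two quantization levels of the same model
# with a check on statistical significance; the result arrays are stand-ins.
rng = np.random.default_rng(0)
q4_correct = rng.random(500) < 0.78   # placeholder: 1 = question answered correctly
q8_correct = rng.random(500) < 0.80

gap = q8_correct.mean() - q4_correct.mean()

# Paired bootstrap over questions to get a confidence interval on the gap.
boot = []
for _ in range(10_000):
    idx = rng.integers(0, len(q4_correct), len(q4_correct))
    boot.append(q8_correct[idx].mean() - q4_correct[idx].mean())
low, high = np.percentile(boot, [2.5, 97.5])

print(f"accuracy gap (8-bit minus 4-bit): {gap:.3f}")
print(f"95% bootstrap CI: [{low:.3f}, {high:.3f}]")  # CI crossing 0 => gap not significant
```

A pairing of this kind is what lets reviewers say whether an apparent advantage of one quantization level over another is more than noise on a few hundred test questions.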
Model reviews go beyond benchmarks on r/LocalLLaMA, as many believe that real-world use can look different from mere numbers on a graph. Some members build a reputation with their large-scale model reviews, testing capabilities on qualitative tasks like roleplay across different model sizes and types. Some commenters express that they "vastly prefer this level of information", as increasing cases of train-test contamination have decreased their confidence in benchmarks. Others simply report on using a specific model for their specific use case, such as building Android applications. In one such post, the poster tested whether a coding-focused local model could help them build an Android application in a programming language they were not familiar with, concluding that the model indeed helped them perform a novel task. While these tests are not generalizable the way benchmarks are, they cater to a specific audience that shares the same interest or need as the poster.
Posts that reported on different prompt formats were also highly useful to the community. For instance, one member compared different prompt templates on the Mixtral 8x7B model and tested their effects on the outputs. They found that the official prompt format provided by the model provider was the "most censored", and that "roleplay-oriented presets tend to give better outputs than strictly (bland) assistant-oriented ones." Some posts take the form of a public service announcement, pointing out that certain models follow a very specific prompt format at training time, which needs to be respected at inference time. Still others experiment with prompt engineering to make the model output consistent labels, finding that splitting the prompt into an instruction, few-shot examples of labels, and hints explaining likely mislabeling reasons resulted in very good performance for them.
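To illustrate what respecting a training-time prompt format means in practice, the sketch below contrasts a bare prompt with the instruction template commonly documented for Mistral-family instruct models. The exact special tokens can differ between releases, so this is an assumption-laden example rather than any poster's actual preset.

```python
# Illustration of prompt formatting, assuming the commonly documented
# Mistral/Mixtral instruct template; exact tokens may differ by release.
def format_instruct(user_message: str) -> str:
    # Wrap the user turn in the markers the instruct model saw during training.
    return f"<s>[INST] {user_message} [/INST]"

def format_plain(user_message: str) -> str:
    # A bare prompt that ignores the training-time template.
    return user_message

question = "Summarize the plot of Hamlet in two sentences."
print(format_instruct(question))   # template-respecting prompt
print(format_plain(question))      # prompt that may degrade output quality
```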
Because they want feedback for their work
We also observed that members post their open stack projects as learning opportunities. For example, one member posted their first attempt at open source by sharing their project with the community and asking for feedback. In the comments, the poster not only found others willing to contribute to their project, but also received recommendations to integrate shallow and deep agents instead of letting one LLM handle everything. In this way, posts can attract experts to leave meaningful feedback for those who need it, helping beginner and intermediate developers alike improve their work.