One key contribution to mention:
RAG
Writing
Retrieval-augmented generation (RAG) is a technique that helps a model generate its output based on data relevant to the user prompt. Since RAG systems are augmentations of a model, they can be developed in isolation from the model, often supporting multiple API endpoints. This makes RAG systems more akin to traditional open source software development, with a relatively lower barrier to contribution. From our sample of posts, 11 were related to RAG.
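To make this retrieval-then-generation flow concrete, the sketch below shows the core loop of a RAG system under simplifying assumptions: a toy keyword-overlap scorer stands in for a real vector store, and the assembled prompt would then be passed to whatever model endpoint a given system targets, which is precisely the composability that keeps these contributions model-agnostic.

```python
# Minimal RAG sketch: retrieve the documents most relevant to the user's prompt,
# then prepend them as context before the question is sent to a model.
# The scoring below is a toy keyword-overlap measure; real systems typically use
# embedding similarity against a vector store, and the assembled prompt would be
# sent to whatever model endpoint (local or hosted) the system targets.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble an augmented prompt: retrieved context followed by the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "GGUF is a model file format used by llama.cpp.",
    "Vicuna is a fine-tune of LLaMA released by LMSYS.",
]
print(build_prompt("What file format does llama.cpp use?", docs))
```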
The first example exemplifies a system-level contribution, where the innovation lies in how RAG gets used within a novel application of a world-building tool. The second example exemplifies a database architecture contribution, where the innovation lies in how information is organized for efficient retrieval. The third example exemplifies a data transformation contribution, where the innovation lies in how unstructured data gets transformed into a form that is interpretable by LLMs. As RAG is an ensemble of these parts and more, it enables a diversity of expertise to become meaningful contributions to a larger system. While we do not have explicit evidence that the poster brought these parts together into one coherent RAG system, having the projects shared in a central repository gives everyone the opportunity to repurpose them for their own needs and innovate further.

Due to this composite nature of RAG systems, some members in another post collectively expressed that while many people are experimenting with RAG, there is no standardized benchmark that aims to compare their performance objectively. This discussion naturally built into ideation about how RAG systems could be evaluated: how the list of Q&A pairs from source documents should be built, how different solutions would be compared to determine whether answers match desired outcomes, and whether the evaluation could be adapted to bespoke RAG applications. However, this discussion also sits in tension with another perspective: that progress can organically emerge from early trials and errors being shared openly, as others learn from them, improve upon them, and the positive loop continues. In one commenter’s view, both should serve as checks and balances to one another: the “default approaches” that the evaluations will benchmark should go hand-in-hand with the “creativity this chaos brings.”
Inspired by the apparent gap in evaluation tools for RAG systems, as well as the difficulty of choosing what to use for each part that makes up a RAG system, one commenter built a tool called AutoRAG and later shared it with the community in a post. Their tool automatically evaluates and finds an optimal RAG pipeline based on the user’s data. Notably, the developer directly asked the community what they would like to make with RAG, as the project was at an early stage and would benefit from testing on real-life scenarios. This example exemplifies how ideation that happens within a comment section can turn into a project, and that project can then be shared back with the community for feedback, establishing a virtuous cycle.
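The kind of Q&A-based evaluation the commenters envisioned can be sketched generically. The snippet below is a minimal, hypothetical harness, not AutoRAG’s actual interface: it assumes a `rag_pipeline` callable and a hand-built list of question/answer pairs drawn from source documents, and it scores answers with a simplified token-overlap F1, which is one common choice rather than the metric any particular tool uses.

```python
# Hedged sketch of a Q&A-style RAG evaluation: run each question through a pipeline
# and score how closely its answer matches the desired outcome.
# `rag_pipeline` is a hypothetical stand-in for any RAG system under test.

def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-overlap F1 between a predicted and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = len(set(pred) & set(ref))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(rag_pipeline, qa_pairs: list[tuple[str, str]]) -> float:
    """Average F1 of a RAG pipeline over (question, reference answer) pairs."""
    scores = [token_f1(rag_pipeline(question), answer) for question, answer in qa_pairs]
    return sum(scores) / len(scores)

# Example with a trivial stand-in pipeline:
qa = [("What format does llama.cpp use?", "llama.cpp uses the GGUF file format")]
print(evaluate(lambda q: "GGUF is the file format llama.cpp uses", qa))
```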
Models and derivative models: Quantized, fine-tuned, translation, and beyond
now includes tags model/finetune, model/quant, and model/translation
- Introduce three observed types of derivative models
- Fine-tunes
- Quantizations
- Extension conversions
- Also mention models built ground-up
- Unlike RAG, training models and model derivatives is not a cross-model contribution; the effort is non-transferable and model-specific.
- Fine-tunes
- Quantizations
- Extension conversions
We observed two broad types of models. The first are models trained from scratch, where the author decides the architecture of the model, the training data, and the training strategy from the ground up; the second are derivative models, which build on top of existing models to add some new feature. The number of observed posts about derivative models (19) far outweighed those about models trained from scratch (3), as the former typically requires less computational power than the latter. Furthermore, the three models trained from scratch are all under 1 billion parameters in size, which indicates that authors tend to train from scratch only models that are feasible on modest compute budgets.
Of the 30 derivative models, we identified three archetypes: fine-tuned models, quantized models, and extension conversions. Fine-tuned models are typically trained on a dataset of input-output pairs that aims either to steer a model to act in a specific way or to instill new knowledge that was not present in its original training data. Quantized models are created by reducing the precision of the floating point numbers that represent the weights inside the model, thereby reducing the amount of memory required to load the model onto a computer, but also reducing performance. Lastly, extension conversion is the process of converting one model format into another. This typically takes the form of converting safetensors, the industry standard for open-weight model releases, into another format such as GGUF, which may be more compatible with the inference tools people widely use to run models. We will explore our findings for each type of derivative model, as well as the three models trained from scratch.
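For readers unfamiliar with the mechanics, the sketch below illustrates the basic idea behind quantization: mapping float weights to 8-bit integers plus a scale factor, trading some fidelity for a roughly fourfold memory reduction. It is a toy, per-tensor symmetric scheme, not how production formats such as GGUF’s quantization types are actually implemented.

```python
# Hedged sketch of weight quantization: store int8 values plus one float scale
# per tensor instead of full-precision floats. Real tools use more elaborate
# block-wise schemes; this only shows the core precision-for-memory trade-off.
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```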
Training derivative models requires
The earliest fine-tuned model that we found in our sample of posts is WizardVicunaLM, which demonstrates an attempt to combine the instruction-tuning techniques of WizardLM (a fine-tune of the first LLaMA model, from Microsoft) and Vicuna (another fine-tune of LLaMA, from LMSYS). The poster detailed their approach to combining these two methods: the WizardLM technique of extending a single problem into multiple derivatives through iterative, model-driven task expansion, and the Vicuna technique of training on multi-turn conversations rather than just single-turn exchanges. They also shared three artifacts: a GitHub repository containing documentation on how the model was trained and an initial set of evaluations, along with the training dataset and the model itself, both hosted in a Hugging Face repository. Due to the reported promising results, people in the comments asked for the same training principles to be applied to larger and smaller models. Notably, while the poster did not have the resources to train all of the models people requested, other members of the community who did have the resources contributed their compute to replicate the training process for 7 billion, 13 billion, and 30 billion parameter models, and released them in different quantizations and model formats. In other words, when one community member innovated on a novel training methodology and documented it well, it enabled others to extend the work, reproduce it at scale across varied model sizes, and return the outputs to the community, creating a cycle of iterative improvement rather than a single isolated contribution.
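The two ideas being combined can be sketched abstractly. The snippet below is a hypothetical illustration, not the WizardVicunaLM training code: `ask_model` is an assumed stand-in for any chat-completion call, the rewrite prompt is invented for illustration, and the conversation structure only gestures at what a multi-turn training example looks like.

```python
# Hedged sketch of the two techniques the poster combined:
# (1) WizardLM-style expansion, where a model iteratively rewrites a seed task into
#     new derivative variants, and
# (2) Vicuna-style data, where a whole multi-turn conversation is one training example.

def expand_tasks(seed: str, ask_model, rounds: int = 3) -> list[str]:
    """Iteratively derive new task variants from a single seed instruction."""
    tasks = [seed]
    for _ in range(rounds):
        prompt = (
            "Rewrite this task so it is more complex but still answerable:\n"
            + tasks[-1]
        )
        tasks.append(ask_model(prompt))
    return tasks

def to_training_example(conversation: list[dict]) -> dict:
    """Keep the whole multi-turn exchange as one training example."""
    return {"conversations": conversation}

# Dummy model call so the sketch runs without any real endpoint.
variants = expand_tasks(
    "Summarize this paragraph.",
    lambda p: p.split("\n")[-1] + " Then add citations.",
)
print(variants)
print(to_training_example([
    {"from": "human", "value": variants[0]},
    {"from": "assistant", "value": "..."},
    {"from": "human", "value": "Now make it shorter."},
    {"from": "assistant", "value": "..."},
]))
```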
Similarly…OpusV1 models for story-writing and role-playing…commenters suggesting improvements…novel training technique…SmallThinker-3B-Preview…
Prompting
Model review
now includes tag review/leaderboard
Inference tooling
now includes tags inference/extension and inference/mobile