Nice! This is from the author of "Uncensor any LLM with abliteration", who is also the post-training lead at LiquidAI.
The repo is a curated list of "good LLM datasets" for fine-tuning, judged on three criteria:
- Accuracy: factual correctness
- Diversity: coverage of many use cases
- Complexity: multi-turn, multilingual, well-written samples
The call to action is to contribute if you find it interesting. The repo currently has 6 contributors and 4k stars.
People recommend adding a column specifying each dataset's license, and are asking how the author evaluates models after fine-tuning.
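
Since Hub-hosted datasets usually declare their license on the dataset card, a license column could in principle be auto-populated rather than maintained by hand. Here is a minimal sketch assuming the datasets live on the Hugging Face Hub; the dataset IDs and the `get_license` helper are illustrative, not part of the repo.

```python
# Sketch: pull the declared license for a few Hub-hosted datasets.
# Dataset IDs below are example placeholders, not an official list.
from huggingface_hub import HfApi

api = HfApi()

def get_license(dataset_id: str) -> str:
    """Return the license tag declared on the Hub, or 'unknown'."""
    info = api.dataset_info(dataset_id)
    # Licenses are typically exposed as "license:<id>" tags on the dataset card.
    for tag in info.tags or []:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return "unknown"

if __name__ == "__main__":
    for ds in ["Open-Orca/OpenOrca", "HuggingFaceH4/ultrachat_200k"]:
        print(f"{ds}: {get_license(ds)}")
```

Datasets without a license tag (or mirrored from outside the Hub) would still need a manual entry, so the column would likely end up partly automated, partly hand-curated.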