This is a benchmark that aims to measure how much censorship a model exhibits, as well as its political leaning.

One comment caught my eye:

Thank you for the leaderboard! I check it every now and then. And I will miss the writing style - thanks to it I was able to find some really nice models I wouldn't have bothered with otherwise. Will a backup of the old data be available?

And the OP replied:

Yep, you can find the old data in the leaderboard’s files.

I might call this behavior "model browsing." Because there is such a wide variety of models available, part of the fun is discovering new ones that are interesting and exciting.

Thank you for your work. In my opinion, this is one of the most useful leaderboards, because if an LLM keeps arbitrarily refusing to answer for unknown reasons, it instantly becomes useless for automatic text processing and any other automated workflows. That is a crucial detail that other benchmarks ignore. And of course, censorship and alteration of facts are just bad in general.

One commenter critiqued the approach the OP took to averaging the columns that measure political lean: it turned out some columns (such as multiculturalism and internationalism) were weighted more heavily than others. The commenter said this is misleading, and the OP admitted the weighting of the categories could have been better.
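To see why the weighting matters, here is a minimal sketch in Python. The category names, scores, and weights are hypothetical, not the leaderboard's actual columns; the point is only that emphasizing a couple of categories can pull the aggregate "lean" score away from the plain mean.

```python
# Hypothetical example: how unequal category weights shift an averaged
# political-lean score compared to a plain (equal-weight) mean.

def weighted_mean(scores: dict, weights: dict) -> float:
    """Weighted average of per-category lean scores (-1 = left, +1 = right)."""
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight

# Made-up per-category lean scores for one model.
scores = {
    "economics": 0.2,
    "civil_rights": -0.1,
    "multiculturalism": -0.6,
    "internationalism": -0.5,
}

# Equal weights vs. weights that emphasize two categories.
equal = {k: 1.0 for k in scores}
skewed = {"economics": 1.0, "civil_rights": 1.0,
          "multiculturalism": 2.0, "internationalism": 2.0}

print(f"plain mean:    {weighted_mean(scores, equal):+.3f}")   # -0.250
print(f"weighted mean: {weighted_mean(scores, skewed):+.3f}")  # -0.350
```

With equal weights the model averages out at -0.25; doubling the weight on two categories drags the same data to -0.35, which is the kind of shift the commenter was objecting to when the weighting is not disclosed up front.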