This post compares 39 models, ranging from 7B to 70B parameters, through a qualitative analysis of long-form role-play conversations involving “complex instructions and scenes, designed to test ethical and intellectual limits”.
The findings include observations such as “little emoting and action descriptions lacked detail”, “switched from character to third-person storyteller and finished the session”, and “repetitive (patients differ, words differ, but structure and contents are always the same).”
The community appreciates these posts because they offer a depth of information about role-play that other benchmarks cannot easily capture. In the OP's words, “And I’m glad when my reviews help others find their favorite models.”