Abstract
Understanding how well large language models (LLMs) can reason deliberatively compared to humans is critical for designing effective human-AI collaboration technologies for deliberation, especially in contexts such as citizen assemblies where participants make policy recommendations to governments. Yet, while existing benchmarks focus on measuring objective accuracy in LLMs’ responses, we know little about how state-of-the-art LLMs compare to human deliberative reasoning. Building on theories for measuring deliberative reasoning from human deliberations, we collected survey data from 54 LLMs and compared them to 526 human responses across 24 deliberation cases to answer the following question: To what extent do LLMs reason deliberatively compared to human participants? Our preliminary findings indicate that humans outperform most LLMs in most deliberation cases, but some LLMs perform on par with humans. These findings suggest that while LLMs are not yet ready to replicate human reasoning in deliberation, their potential as augmentative or representative agents deserves further investigation.
They do not offer a definition of deliberation itself, but they define human deliberative reasoning as the ability to build and organize collective reasons that consistently support people’s preferences. They use this construct to test how well LLMs reason deliberatively compared to human participants.
They compare 54 LLMs to 526 human responses across 24 deliberation cases. Their preliminary findings are that humans outperform most LLMs in most deliberation cases, but some LLMs perform on par with humans.
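The notes above describe a per-case comparison of human and LLM responses. As a purely illustrative sketch of what such an analysis could look like, the snippet below assumes each respondent receives a numeric deliberative-reasoning score and compares the two groups case by case with a nonparametric test. The placeholder data, the 0-1 scoring scale, and the choice of a Mann-Whitney U test are assumptions for illustration only, not the paper's actual measurement instrument or analysis.

```python
# Hypothetical per-case comparison of human vs. LLM deliberative-reasoning scores.
# All data below are randomly generated placeholders, not results from the paper.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Placeholder scores on an assumed 0-1 scale: one array per deliberation case,
# roughly matching the study's counts (24 cases, 54 LLMs, ~526 human responses).
human_scores = {f"case_{i}": rng.uniform(0.4, 0.9, size=22) for i in range(24)}
llm_scores = {f"case_{i}": rng.uniform(0.3, 0.8, size=54) for i in range(24)}

for case in human_scores:
    h, m = human_scores[case], llm_scores[case]
    # Two-sided Mann-Whitney U test: do the human and LLM score
    # distributions differ for this deliberation case?
    stat, p = mannwhitneyu(h, m, alternative="two-sided")
    direction = "humans higher" if h.mean() > m.mean() else "LLMs higher"
    print(f"{case}: U={stat:.1f}, p={p:.3f} ({direction})")
```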