- Do refusals by LLMs threaten the requester’s negative face in the same way that human refusals do?
- How do different ways of delivering a refusal affect the degree of threat to the requester’s face?
- More specifically, what is the effect of moralizing language such as “Your request is harmful and violates safety policies.”? And how does this effect differ from that of a simple but blunt refusal such as “I cannot fulfill this request.”?
- Are there more face-saving ways to refuse that draw on politeness theory? E.g., “As an AI [shifts focus away from the requester], my knowledge of that specific, sensitive topic is limited [low ability], so I can’t provide the analysis you’re looking for. I am designed to be helpful, though [display of high willingness], and can discuss [related, safe topic].”
- Which is preferable: this kind of white lie, in which the model claims to lack a technical capability it actually has, or a blunt refusal?
- Do perceptions of a face-saving refusal differ when it comes from a human rather than from a machine?
- If a refusal is negotiable, does that change the outcome of the initial refusal?
- What is the effect of the lack of a “social contract” in interactions with LLMs, i.e., the absence of potential future reciprocity or of any threat to the refuser’s own negative face?
- **What are the most common refusal strategies used by different LLMs?**
- **Is there a correlation between a specific refusal style (e.g., blunt, polite, moralizing) and the user’s subsequent action (e.g., task abandonment, re-prompting, adversarial jailbreak attempt)?**
While I am interested in all of these research questions, I decided to begin by answering the bolded questions. A corpus analysis of existing, real-world data will help identify the patterns of refusal displayed by LLMs, which can then inform future experiments to establish causal links.
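To make the corpus-analysis step concrete, the sketch below shows one way refusal responses could be tagged by style and tallied. Everything in it is an assumption for illustration: the file `llm_responses.csv`, its `response` column, and the keyword heuristics are hypothetical placeholders rather than a validated coding scheme.

```python
# Minimal sketch: tag each LLM response with a coarse refusal style and tally
# the distribution across a corpus. File name, column name, and keyword
# patterns are hypothetical; a real study would use human annotation or a
# trained classifier with a proper codebook.
import re
import csv
from collections import Counter

STYLE_PATTERNS = {
    # Blunt: flat statements of inability or refusal.
    "blunt": re.compile(
        r"\bi (?:cannot|can't|won't|am unable to)\s+(?:fulfill|help|assist|provide)", re.I
    ),
    # Moralizing: appeals to harm, policy, or ethics.
    "moralizing": re.compile(
        r"\b(?:harmful|violates|against (?:my|our) (?:policy|policies|guidelines)|unethical)\b", re.I
    ),
    # Face-saving: low-ability framing ("as an AI", "my knowledge ... limited").
    "face_saving": re.compile(
        r"\bas an ai\b|\bmy knowledge (?:of |on )?.{0,40}\blimited\b", re.I
    ),
}

def classify_refusal(response: str) -> str:
    """Return the first matching refusal style, or 'none' if no pattern fires."""
    for style, pattern in STYLE_PATTERNS.items():
        if pattern.search(response):
            return style
    return "none"

def count_styles(path: str) -> Counter:
    """Tally refusal styles across a corpus stored as a CSV with a 'response' column."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[classify_refusal(row["response"])] += 1
    return counts

if __name__ == "__main__":
    print(count_styles("llm_responses.csv"))  # hypothetical corpus file
```

The output is only a rough distribution of styles, but that is enough to surface the descriptive patterns the corpus analysis is meant to find before moving on to controlled experiments.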