MDP:

  • State: a list of agreement scores, one per participant. The “win” condition is reached when the aggregate score exceeds a threshold.
  • Action:
    1. Do nothing
    2. Generate facilitation message
    3. Prompt for additional opinions
  • Reward:
    • If the agent’s action leads the discussion to the “win” condition, it gets a large positive reward.
    • If the discussion continues but agreement isn’t reached, it gets a smaller, standard unit reward.
  • Training: they train the agent on 500 LLM-generated discussions and report good results.
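
A minimal sketch of the state, action, and reward definitions above, to make the formulation concrete. Several details are assumptions not given in these notes: the aggregate is taken to be the mean of the scores, the threshold defaults to 0.8, and the reward magnitudes (10.0 and 1.0) are placeholders. The names `State`, `Action`, and `reward` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Tuple


class Action(Enum):
    DO_NOTHING = 1
    FACILITATE = 2        # generate a facilitation message
    PROMPT_OPINIONS = 3   # prompt for additional opinions


@dataclass(frozen=True)
class State:
    agreement_scores: Tuple[float, ...]  # one score per participant

    def is_win(self, threshold: float) -> bool:
        # Assumption: the "aggregate" is the mean agreement score.
        return mean(self.agreement_scores) > threshold


# Assumed magnitudes; the notes only say "large" vs. "smaller, standard unit".
WIN_REWARD = 10.0
STEP_REWARD = 1.0


def reward(next_state: State, threshold: float = 0.8) -> float:
    """Reward for the state reached after the agent's action."""
    return WIN_REWARD if next_state.is_win(threshold) else STEP_REWARD
```

For example, `reward(State((0.9, 0.85, 0.95)))` returns the win reward because the mean score exceeds 0.8, while `reward(State((0.9, 0.2, 0.5)))` returns the unit reward.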