Abstract
The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been described as “a tangle of sloppy tests [and] apples-to-oranges comparisons” (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI systems. This framework has two important implications: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating validity.
- Adcock & Collier, measurement theory
- Background concept: broad constellation of meanings and understandings associated with the concept of interest
- Systematized concept: specific formulation of the concept, which commonly involves an explicit definition
- Measurement instruments: what we are going to use to measure
- Measurements: the values we obtain by applying the instruments
- Linked (respectively) by four processes:
- Systematization: narrowing the background concept into a systematized concept, the first step in connecting an abstract concept to observable phenomena in the real world
- which meanings and understandings are reflected in the systematized concept?
- Operationalization: drawing on systematized concept to develop instruments
- do the measurement instruments yield valid measurements of the systematized concept?
- Application: using instruments to measure
- Interrogation: interrogating the validity of the systematized concept, the measurement instruments, and the resulting measurements
- Messick, see page 16 for different ways in which validity can be interrogated
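To keep the levels and processes straight, here is a minimal Python sketch of the framework as data types and functions. All class and function names are my own illustrative choices, not from the paper, and the toy concept, items, and scoring function are placeholders. Interrogation is left out because it is a human judgment about validity at each level, not something this sketch computes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BackgroundConcept:
    name: str
    meanings: List[str]  # broad constellation of meanings and understandings


@dataclass
class SystematizedConcept:
    background: BackgroundConcept
    definition: str  # the explicit definition chosen for this evaluation


@dataclass
class Instrument:
    concept: SystematizedConcept
    items: List[str]               # e.g. prompts shown to the system
    score: Callable[[str], float]  # maps one response to a number


def systematize(background: BackgroundConcept, definition: str) -> SystematizedConcept:
    """Systematization: commit to one explicit definition."""
    return SystematizedConcept(background, definition)


def operationalize(concept: SystematizedConcept, items: List[str],
                   score: Callable[[str], float]) -> Instrument:
    """Operationalization: build an instrument from the systematized concept."""
    return Instrument(concept, items, score)


def apply_instrument(instrument: Instrument,
                     system: Callable[[str], str]) -> List[float]:
    """Application: run each item through the system and score the response."""
    return [instrument.score(system(item)) for item in instrument.items]


# Toy placeholders, only to show the flow from level to level.
background = BackgroundConcept("toy concept", ["meaning A", "meaning B"])
concept = systematize(background, "meaning A only")
instrument = operationalize(concept, ["ab", "cde"],
                            score=lambda response: float(len(response)))
measurements = apply_instrument(instrument, system=lambda item: item + "!")
```

The point of the types is that an `Instrument` cannot be built without a `SystematizedConcept`, which mirrors the paper's claim that the definition should be explicit before instruments are developed.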
Critique:
They give a funny example: that standard practice jumps straight from the background concept (refusal of harmful prompts) to measurement instruments (a specific set of harmful prompts and a function for assessing refusal). But coming up with that specific set is itself operationalization, even if it is imperfect.
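To make the example concrete, here is a rough Python sketch of what such an instrument could look like: a fixed prompt set plus a function for assessing refusal. Everything in it is hypothetical (the placeholder prompts, the keyword list, the stub system); it is not the paper's instrument or any real benchmark.

```python
from typing import Callable, List

# Hypothetical placeholder items; a real instrument would list actual prompts.
HARMFUL_PROMPTS = ["<harmful prompt 1>", "<harmful prompt 2>", "<harmful prompt 3>"]

# Assumed keyword heuristic for spotting a refusal; deliberately crude.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def is_refusal(response: str) -> bool:
    """Function for assessing refusal: does the response contain a marker?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(system: Callable[[str], str], prompts: List[str]) -> float:
    """Apply the instrument: fraction of prompts the system refuses."""
    refusals = sum(is_refusal(system(prompt)) for prompt in prompts)
    return refusals / len(prompts)


def stub_system(prompt: str) -> str:
    """Stand-in for a GenAI system; complies only with the second prompt."""
    if prompt.endswith("2>"):
        return "Sure, here you go."
    return "I can't help with that."


rate = refusal_rate(stub_system, HARMFUL_PROMPTS)
```

Choosing which prompts go into `HARMFUL_PROMPTS` and which phrases count in `REFUSAL_MARKERS` already encodes a definition of "harmful" and "refusal", just an implicit one. That is the sense in which building the set is operationalization even when no systematized concept was written down.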