Abstract
Current model testing work has mostly focused on creating test cases. Identifying what to test is a step that is largely ignored and poorly supported. We propose WEAVER, an interactive tool that supports requirements elicitation for guiding model testing. WEAVER uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing. WEAVER provides rich external knowledge to testers and encourages them to systematically explore diverse concepts beyond their own biases. In a user study, we show that both NLP experts and non-experts identified more, as well as more diverse, concepts worth testing when using WEAVER. Collectively, they found more than 200 failing test cases for stance detection with zero-shot ChatGPT. Our case studies further show that WEAVER can help practitioners test models in real-world settings, where developers define more nuanced application scenarios (e.g., code understanding and transcript summarization) using LLMs.
1. LLM-Generated Knowledge Base
WEAVER uses Large Language Models (LLMs) to generate a knowledge base of concepts related to the testing task.
- Seed Concept: The process begins with a user-provided “seed concept,” which is a high-level term representing the task (e.g., “online toxicity”).
- Structured Querying: The tool iteratively prompts the LLM to list entities or concepts related to the seed.
- ConceptNet Relations: To ensure the concepts are semantically meaningful, WEAVER uses 25 specific relations from ConceptNet (such as MotivatedBy, LocatedAt, or TypeOf) to structure the prompts. For example, it might ask the LLM, “List some types of online toxicity” (see the sketch after this list).
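Concretely, this relation-guided querying can be approximated with a handful of prompt templates. In the minimal sketch below, `query_llm` is a hypothetical stand-in for any chat-completion client, and the three templates only illustrate how ConceptNet relations might be rendered as prompts; they are not WEAVER's exact wording.

```python
from typing import Dict, List

# A few of the 25 ConceptNet relations, rendered as question templates
# (illustrative wording, not WEAVER's actual prompts).
RELATION_TEMPLATES = {
    "TypeOf": "List some types of {concept}.",
    "MotivatedBy": "List some motivations behind {concept}.",
    "LocatedAt": "List some places where {concept} occurs.",
}

def query_llm(prompt: str) -> List[str]:
    """Hypothetical LLM call that returns a parsed list of short phrases."""
    raise NotImplementedError  # swap in your chat-completion client here

def expand_concept(concept: str) -> Dict[str, List[str]]:
    """Query the LLM once per relation and collect the child concepts."""
    return {
        relation: query_llm(template.format(concept=concept))
        for relation, template in RELATION_TEMPLATES.items()
    }

# One expansion step from the seed might yield, e.g.:
# expand_concept("online toxicity")
#   -> {"TypeOf": ["hate speech", "trolling", ...], "MotivatedBy": [...], ...}
```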
2. Diverse and Relevant Recommendations
To avoid overwhelming the user with too much information, WEAVER uses a graph-based recommendation system to present a manageable subset of concepts.
- Balancing Relevance and Diversity: The system aims to recommend concepts that are both relevant to the user’s query and diverse enough to offer new perspectives.
- Scoring (both scores are sketched after this list):
- Relevance is measured using the perplexity of sentences connecting the concept to the query, calculated via GPT-2.
- Diversity is measured by calculating the cosine distance between concept embeddings using SentenceBERT.
- Selection Algorithm: It treats selection as a graph problem, seeking a subgraph that maximizes a weighted sum of diversity (edge weights) and relevance (node weights), and uses a greedy peeling algorithm to approximate the optimal subgraph in linear time (a sketch of the peeling step follows the scoring sketch below).
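Both scores can be computed with off-the-shelf models. The sketch below assumes the Hugging Face `gpt2` checkpoint and the `all-MiniLM-L6-v2` SentenceBERT checkpoint; the paper specifies GPT-2 and SentenceBERT, but the exact checkpoints and the linking-sentence template here are illustrative guesses.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sentence_transformers import SentenceTransformer, util

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint is an assumption

@torch.no_grad()
def relevance(concept: str, query: str) -> float:
    """Lower perplexity of a sentence linking concept and query => more relevant."""
    sentence = f"{concept} is related to {query}."  # illustrative template
    ids = gpt2_tok(sentence, return_tensors="pt").input_ids
    loss = gpt2(ids, labels=ids).loss  # mean next-token negative log-likelihood
    return torch.exp(loss).item()      # perplexity

def diversity(concept_a: str, concept_b: str) -> float:
    """Cosine distance between SentenceBERT embeddings (higher = more diverse)."""
    emb = sbert.encode([concept_a, concept_b], convert_to_tensor=True)
    return 1.0 - util.cos_sim(emb[0], emb[1]).item()
```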
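The peeling step itself is simple to sketch. The version below is a naive quadratic rendering of the idea (repeatedly drop the node that contributes least to the weighted objective until k nodes remain); a linear-time variant would maintain each node's contribution incrementally. The weighting `lam` and budget `k` are illustrative parameters, not values from the paper.

```python
def greedy_peel(nodes, node_w, edge_w, k, lam=1.0):
    """Keep k concepts maximizing sum of relevance + lam * pairwise diversity.

    nodes: list of concept strings
    node_w: {concept: relevance}, assumed higher = better (e.g., negated perplexity)
    edge_w: {(concept_a, concept_b): diversity score}
    """
    remaining = set(nodes)

    def w(u, v):  # symmetric edge-weight lookup, 0.0 if the pair is missing
        return edge_w.get((u, v), edge_w.get((v, u), 0.0))

    def contribution(v):  # what node v adds to the current objective
        return node_w[v] + lam * sum(w(v, u) for u in remaining if u != v)

    # Peel off the least valuable node until only k remain. This naive loop
    # is quadratic; the linear-time variant updates contributions incrementally.
    while len(remaining) > k:
        remaining.remove(min(remaining, key=contribution))
    return remaining

# Example: greedy_peel(concepts, relevance_scores, diversity_scores, k=10)
```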
3. Interactive User Interface
WEAVER provides a visual interface for users to navigate the generated knowledge base and move toward creating actual tests.
- Tree Structure: The interface visualizes the knowledge base as a tree, starting with the seed concept (a minimal node sketch follows this list).
- Exploration: Users can expand nodes to see child concepts, select specific concepts to test, or manually add their own concepts.
- Test Case Integration: Once a requirement (concept) is identified, WEAVER integrates with tools like AdaTest (which uses LLMs to suggest test cases) to help the user generate specific input-output pairs for testing the model.
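As a rough illustration of the underlying data structure (not WEAVER's actual implementation), the tree can be modeled as nodes that expand lazily when a user opens them, with a flag marking concepts selected as requirements. `expand_concept` is the hypothetical helper from the first sketch, stubbed here to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

def expand_concept(concept: str) -> Dict[str, List[str]]:
    """Placeholder; see the relation-guided expansion sketch above."""
    raise NotImplementedError

@dataclass
class ConceptNode:
    name: str
    relation: Optional[str] = None  # edge label from the parent, if any
    children: List["ConceptNode"] = field(default_factory=list)
    selected: bool = False          # marked by the tester as a requirement

    def expand(self) -> None:
        """Lazily query the LLM for children the first time a node is opened."""
        if not self.children:
            for relation, concepts in expand_concept(self.name).items():
                self.children += [ConceptNode(c, relation) for c in concepts]

    def add_manual(self, name: str) -> "ConceptNode":
        """Let users attach their own concepts alongside generated ones."""
        node = ConceptNode(name, relation="UserAdded")
        self.children.append(node)
        return node

# root = ConceptNode("online toxicity"); root.expand()
```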