Getting models to consistently output a label through prompting alone. The OP says this came after “weeks of trial and error,” highlighting that it is both pe and spc. The tactic is to split the prompt into three sections: an instruction, hints (explaining likely reasons for mislabeling), and few-shot examples. They found that adding too many few-shot examples made the model overfit to them (in their use case). The key point: one should test adding and removing the hints/examples and measure how performance on the labeling task changes (e.g., did increasing the number of few-shot examples actually improve accuracy?).
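A minimal sketch of the three-section prompt layout the OP describes. The section contents, labels, and helper name `build_prompt` are illustrative assumptions, not taken from the original post:

```python
# Hypothetical sketch of the three-section labeling prompt:
# instruction, hints (likely mislabeling reasons), few-shot examples.
# Labels and example messages are invented for illustration.

INSTRUCTION = (
    "Classify the customer message into exactly one label: "
    "BILLING, TECHNICAL, or OTHER. Respond with the label only."
)

# Hints explain likely reasons for mislabeling so the model avoids them.
HINTS = (
    "- Messages about refunds are BILLING, not OTHER.\n"
    "- Mentioning an invoice number does not make a login issue BILLING."
)

# Kept deliberately short: the OP found that too many few-shot
# examples caused the model to overfit to them.
FEW_SHOT = [
    ("I can't log in after the update.", "TECHNICAL"),
    ("Why was I charged twice this month?", "BILLING"),
]

def build_prompt(message: str) -> str:
    examples = "\n".join(f"Message: {m}\nLabel: {l}" for m, l in FEW_SHOT)
    return (
        f"{INSTRUCTION}\n\n"
        f"Hints:\n{HINTS}\n\n"
        f"Examples:\n{examples}\n\n"
        f"Message: {message}\nLabel:"
    )

print(build_prompt("Please cancel my subscription and refund me."))
```

Because each section is a separate variable, testing the OP's advice is cheap: drop `HINTS` or trim `FEW_SHOT`, rerun the labeling evaluation, and compare accuracy between variants.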