autoresearch-validation-set-prevents-score-noise

Without a fixed validation subset, a score change between autoresearch cycles may reflect which items happened to be sampled rather than any real prompt improvement. Designate 3-5 fixed items that appear in every cycle so that scores can be compared like for like. For the remaining items, use coverage-first rotation: prefer untested items, and repeat an item only after every item has been covered.
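
The selection rule can be sketched in Python as follows. This is a minimal illustration, not an existing autoresearch API: the function name `select_cycle_items` and the parameters `fixed_items`, `tested`, `sample_size`, and `rng` are all hypothetical, and it assumes items are hashable identifiers such as strings.

```python
import random


def select_cycle_items(all_items, fixed_items, tested, sample_size, rng):
    """Pick the items for one cycle (hypothetical helper, not a real API).

    Returns (batch, tested), where batch starts with the fixed items and
    tested is the updated set of rotating items covered in the current pass.
    """
    fixed = list(fixed_items)
    # Rotating pool: everything that is not part of the fixed subset.
    pool = [item for item in all_items if item not in fixed]
    n_rotating = min(max(sample_size - len(fixed), 0), len(pool))

    tested = set(tested)
    untested = [item for item in pool if item not in tested]

    chosen = []
    if len(untested) < n_rotating:
        # Not enough untested items left: take them all, which completes
        # full coverage, then start a new pass over the rest of the pool.
        chosen = untested
        tested = set()
        untested = [item for item in pool if item not in chosen]

    # Coverage-first: draw only from items not yet tested in this pass.
    extra = rng.sample(untested, n_rotating - len(chosen))
    tested.update(extra)
    return fixed + chosen + extra, tested


# Example: 10 items, 3 fixed, 5 items per cycle (so 2 rotating slots).
items = [f"item-{i}" for i in range(10)]
fixed = items[:3]
rng = random.Random(0)  # seeded so the rotation is reproducible
tested = set()
for cycle in range(4):
    batch, tested = select_cycle_items(items, fixed, tested, 5, rng)
    print(cycle, batch)
```

When comparing cycles, the score on the fixed subset is the like-for-like number; scores on the rotating items mainly serve to widen coverage. The `tested` set has to be persisted between cycles for the rotation to work.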