item-level-failure-detection-separates-prompt-from-test-item

In autoresearch loops, an item that fails 100% of the time across all prompt variants is a broken test item, not evidence of a bad prompt. Flagging item-level failure separately from prompt-level failure prevents discarding good prompts based on bad test data. Track per-item fail rates alongside per-experiment scores.