autoresearch-item-level-failure-vs-bad-prompt

A 100% failure rate on a specific test item is a broken test item — not evidence of a bad prompt. Discarding the prompt in response to item-level failures is a false signal. autoresearch v2.5.0 adds explicit item-level failure detection at experiment loop step 9 to flag these broken items before any discard decision. Distinguish item failure from prompt failure before triggering plateau logic.