Sakana AI, together with researchers from Oxford, Stanford, and Allen AI, has introduced CUSP, a benchmark of 4,760 real-world scientific events designed to test whether frontier LLMs can predict which research ideas will pay off. The results are mixed: models show a decent instinct for distinguishing promising directions from dead ends, but they struggle to predict whether a line of research will reach a conclusion, and when. Scaling up training data does not close the gap.

The authors recommend using AI as a filter and research assistant (generating hypotheses, weeding out weak ones, and accelerating routine work) while leaving the decision of where to invest time and resources firmly in human hands. The study also serves as a measured counterpoint to the hype around fully autonomous “AI scientists” making breakthrough discoveries without human guidance.

Paper (arXiv) · Project page