Exploring Abstention Objectives with GEPA on AIME
As a starter project, I explored the GEPA + DSPy AIME tutorial with the goal of understanding what kinds of behaviors are realistically reachable through prompt-level optimization alone. I focused on how GEPA responds when the objective function itself is modified. Concretely, I asked: what happens when we optimize not just for correctness, but for properties like conciseness or uncertainty-aware behavior?
This was especially interesting to me because GEPA optimizes through language rather than gradients, which contrasts with the supervised fine-tuning methods we studied in Berkeley's CS189. This post summarizes a few small, targeted explorations and what they suggest about the limits of prompt-only optimization on AIME.
Reasoning length as an objective signal
One initial direction was to penalize reasoning length, measured purely as output verbosity (i.e., the length of the visible explanation, not internal computation). My hypothesis was that GEPA might learn to trade off explanation length against correctness.
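An objective like this can be expressed as a simple scoring function in the style of a DSPy/GEPA metric. This is a minimal sketch: the weight `alpha`, the whitespace token count as a verbosity proxy, and the function name are illustrative assumptions, not details from the actual runs.

```python
# Sketch of a length-penalized objective: correctness minus a small
# penalty per token of visible reasoning. `alpha` and the whitespace
# token proxy are illustrative choices, not values from the experiments.

def length_penalized_metric(gold_answer: str, pred_answer: str,
                            reasoning: str, alpha: float = 0.001) -> float:
    """Score = 1.0 if the answer is correct else 0.0, minus alpha per
    token in the visible explanation (not internal computation)."""
    correct = 1.0 if pred_answer.strip() == gold_answer.strip() else 0.0
    n_tokens = len(reasoning.split())  # crude proxy for output verbosity
    return correct - alpha * n_tokens
```

With a scalar like this, GEPA only sees where a candidate prompt lands on the correctness/verbosity trade-off, which is consistent with it behaving as a style regularizer.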
In practice, this behaved mostly like a style regularizer. GEPA learned to produce longer or shorter explanations, but the problem-solving behavior did not qualitatively change.
On AIME, reasoning length is strongly correlated with correctness: harder problems tend to require longer explanations, and correct solutions are often more verbose. Because of this, optimizing for reasoning length mainly amounts to choosing where to sit on an existing correlation rather than inducing a new behavior. This was useful as a sanity check, but it didn't surface anything that stood out.
Why abstention is subtle on AIME
A more interesting direction was abstention. I experimented with objectives that rewarded correct answers, penalized wrong answers, and did not penalize abstention (e.g. +1 correct, +0 abstain, −1 wrong). Intuitively, this seems like a reasonable way to encourage a model to recognize and act on its uncertainty.
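The +1/0/−1 scheme can be sketched as a scoring function. The `"ABSTAIN"` sentinel for detecting an abstention is a hypothetical convention here, not part of the tutorial.

```python
# Sketch of the abstention-friendly objective described above:
# +1 for a correct answer, 0 for abstaining, -1 for a wrong answer.
# The "ABSTAIN" sentinel string is a hypothetical convention.

def abstention_metric(gold_answer: str, pred_answer: str) -> int:
    """Reward correct answers, stay neutral on abstention, penalize errors."""
    if pred_answer.strip().upper() == "ABSTAIN":
        return 0
    return 1 if pred_answer.strip() == gold_answer.strip() else -1
```

Note that under this scoring, abstaining is strictly better than guessing only when the model's chance of being right is below 50%, which already hints at why the optimum can flip sharply with small changes in the weights.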
In practice, on AIME specifically, abstention turned out to be a surprisingly unstable signal. AIME problems are difficult but well-posed: there is always a single correct numerical answer. Uncertainty comes from problem difficulty rather than ambiguity or underspecification. As a result, there isn't a clean distinction between "this question is ambiguous" and "this question is hard but solvable."
Under abstention-friendly objectives, GEPA often converged to locally reasonable but ultimately degenerate strategies, such as:
- answering only on a small subset of easy cases and abstaining on almost everything else,
- becoming extremely conservative overall,
- or continuing to guess despite penalties for wrong answers.
Across the runs I explored, I did not observe a stable middle ground where the model selectively abstained on exactly the cases it "should." Instead, small changes in weighting tended to flip behavior between near-total abstention and near-total guessing.
| Setup | Baseline Accuracy | Optimized Accuracy |
|---|---|---|
| Tutorial (0/1, always answer) | 46.7% | 56.7% |
| +1/-1 (always answer) | 43.3% (26/60) | ~53% (expected) |
| Abstain enabled (learned) | 43.3% (26/60) | ~45% (never learned to abstain) |
Values marked "expected" or "~" are based on the DSPy AIME tutorial and short exploratory runs, not fully converged re-runs in all settings.
What GEPA seems to learn: style more than uncertainty
Across these experiments, I found that the prompts GEPA discovers become:
- more structured,
- more explicit about step-by-step reasoning,
- more verbose or more concise depending on the objective,
- more confident in tone.
What seemed much harder to induce in this setup was a reliable notion of epistemic uncertainty: not just sounding cautious, but actually knowing when not to answer.
On AIME, confidence, verbosity, and correctness already seem correlated. That correlation gives GEPA an easy optimization path to improve scores by adjusting how the model reasons and presents answers, without ever discovering a distinct "don't answer" mode.
Penalizing wrong answers versus rewarding abstention
I also tested simpler variants that penalized wrong answers without rewarding abstention. These objectives behaved similarly: GEPA optimized reasoning style and presentation rather than learning a different failure behavior.
Scope and next steps
All observations here are specific to this setting: AIME, single-shot evaluation, and prompt-only optimization. A tentative takeaway is that on difficult benchmarks like AIME, abstention does not seem to naturally emerge as a stable behavior without additional structure.
That said, this exploration reinforced how powerful prompt optimization can be for shaping model behavior. Treating prompts as a search space optimized through language rather than gradients is a very different paradigm from supervised fine-tuning.
Next, I want to introduce more concrete uncertainty signals (like agreement across multiple runs) and see how GEPA behaves when abstention is tied to something richer than a single forward pass. I also plan to test on datasets with real ambiguity, where abstention may be a cleaner signal than it is on AIME.
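One version of the agreement signal could look like the following sketch: sample several answers for the same problem and abstain unless a majority agrees. The threshold value and the majority-vote rule are assumptions about the proposed next step, not something implemented in the runs above.

```python
# Sketch of an agreement-based abstention rule: abstain unless a
# sufficiently large fraction of sampled answers agree. The 0.6
# threshold is an illustrative assumption.
from collections import Counter
from typing import Optional

def agree_or_abstain(samples: list[str],
                     min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority answer if its vote share clears the
    threshold; otherwise return None to signal abstention."""
    if not samples:
        return None
    answer, count = Counter(s.strip() for s in samples).most_common(1)[0]
    return answer if count / len(samples) >= min_agreement else None
```

Tying abstention to cross-run agreement like this would give the optimizer a signal grounded in the model's actual behavior, rather than asking a single forward pass to self-report its uncertainty.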
Acknowledgements. Thanks to Jason Lee for suggesting this exploration and for helpful early discussion.