What Makes a Good RLHF Task? Lessons from Training Data Research

Since November 2024, I’ve been at Abundant AI designing training data for reinforcement learning from human feedback. Our datasets power some of the top AI labs in the world.

That context matters less than what I’ve learned doing the work. Designing tasks that actually expose weaknesses in state-of-the-art models, the ones that ace standard benchmarks, is a different kind of problem than I expected going in.

The saturation problem

Here’s the core tension: GPT-4, Claude, Gemini, and their peers all score above 90% on the benchmarks people typically use to evaluate them: MMLU, GSM8K, HumanEval. These are effectively solved.

That’s a data quality crisis for RL training. If a model gets everything right, there’s nothing to learn from. The reward signal is flat. You need tasks that sit right at the boundary of what the model can do, hard enough that it fails often enough to learn, structured enough that the failures carry useful signal.

The sweet spot, from what I’ve seen, is a 30-70% success rate on frontier models. Below 30% and noise overwhelms signal. Above 70% and there isn’t enough failure to train on.
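That band can be operationalized directly. Here is a minimal sketch of how a task pipeline might filter candidates by empirical success rate; the function names and the 32-rollout sample size are illustrative choices, not a description of any production system.

```python
import random

def estimate_success_rate(task, model, n_rollouts=32):
    """Estimate how often the model solves the task by sampling rollouts.

    `model` is any callable that attempts the task and returns True on
    success; in practice this would wrap generation plus a verifier.
    """
    successes = sum(model(task) for _ in range(n_rollouts))
    return successes / n_rollouts

def in_sweet_spot(rate, low=0.30, high=0.70):
    """Keep only tasks whose empirical success rate falls in the useful band."""
    return low <= rate <= high
```

With enough rollouts per task, this gives a cheap first-pass filter: tasks the model always solves or never solves are dropped before they waste training compute.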

[Figure: task difficulty sweet spot for RLHF training, a bell curve of useful signal peaking in the 30-70% success-rate band]

What actually makes a task hard

I’ve iterated on hundreds of tasks, and the patterns that consistently break strong models are more specific than “make it harder.”

Stacking constraints is the most reliable approach. Simple tasks have one or two requirements. The tasks that expose real weaknesses need the model to juggle five or more interacting constraints at once. Design a database schema that normalizes to 3NF, supports specific query patterns, handles temporal versioning, maintains referential integrity across soft deletes, and optimizes for a read-heavy workload with particular index constraints. Models typically nail three or four and violate the rest. That partial failure is exactly the kind of signal that produces useful gradient.
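The reason partial failure is so useful is that each constraint can be checked independently. A sketch of what that grading might look like, with hypothetical checker names standing in for real constraint validators:

```python
def grade_against_constraints(candidate, checkers):
    """Score a candidate solution against independent constraint checks.

    `checkers` maps a constraint name to a predicate on the candidate.
    Returning (score, violations) rather than a single pass/fail means
    a model that satisfies three of five constraints gets credit for
    those three, and the violations pinpoint where learning should happen.
    """
    results = {name: bool(check(candidate)) for name, check in checkers.items()}
    violations = [name for name, ok in results.items() if not ok]
    score = sum(results.values()) / len(results)
    return score, violations
```

The per-constraint breakdown is what turns "the model got it wrong" into a gradient worth following.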

Planting adversarial edge cases in familiar territory is another pattern that works well. Models have seen thousands of sorting algorithm implementations. They haven’t seen the numerical stability issues that emerge near the floating-point boundary. The learning happens at the outliers: concurrency bugs that only manifest under specific timing, privacy leaks in anonymization that looks safe on the surface, statistical fallacies embedded in realistic data analysis.
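One concrete instance of the kind of numerical edge case that punishes pattern-matched code (my example, not one from any specific task): naive floating-point summation accumulates rounding error that a strict-equality verifier catches immediately.

```python
import math

values = [0.1] * 10  # the "true" sum is obviously 1.0

naive = sum(values)        # left-to-right accumulation of rounding error
exact = math.fsum(values)  # correctly rounded floating-point sum

# A pattern-matched "just sum the list" solution passes on integers
# and fails the strict check here; that gap is the training signal.
print(naive == 1.0)  # False
print(exact == 1.0)  # True
```

A task built around this looks like thousands the model has already seen, which is exactly why the failure is informative.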

Combining domain expertise with reasoning depth trips up frontier models in a way that neither factor alone does. A straightforward finance question won’t. A straightforward logic puzzle won’t. But a complex derivative pricing problem under non-standard market conditions that requires multi-hop reasoning? That’s a different story. Same for debugging distributed training failures with subtle parameter interactions, or interpreting contracts across jurisdictions with conflicting clauses.

Precision requirements force the model out of its comfort zone of approximate correctness. Formal verification proofs where a single logical gap invalidates everything. Cryptographic protocol design where a small mistake is catastrophic. Numerical methods with strict error bounds. These tasks demand careful reasoning, not pattern matching.

What makes a task useful vs. just difficult

Difficulty alone isn’t enough. After enough iterations, I noticed a pattern in the tasks that consistently produced useful training signal.

There’s always a realistic context with enough domain grounding that the model can’t game the format. There are explicit constraints, plus implicit ones that follow from domain knowledge. There’s hidden complexity, meaning non-obvious interactions between stated requirements. And there’s objective verification: a clear pass/fail criterion, ideally automatable.

That last part is critical. If you can’t grade a task objectively, the reward signal is noisy. Tasks where human evaluators disagree on what “correct” means add noise to training, not learning signal.
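For code tasks, the automatable criterion can be as simple as running the candidate against hidden test cases. A minimal sketch, assuming the candidate is already a callable (in practice it would first be extracted from the model’s output and sandboxed):

```python
def verify(candidate_fn, test_cases):
    """Objective pass/fail: the candidate passes only if every case matches.

    Binary, automatable, and independent of human judgment, which is
    what keeps the reward signal clean. Any exception counts as failure.
    """
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return False
        except Exception:
            return False
    return True
```

The `try/except` matters: a candidate that crashes on an edge-case input should fail the same way as one that returns the wrong answer.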

One more thing: the difficulty should come from genuine reasoning requirements, not ambiguous specs. If a model fails because the instructions were unclear, that’s a bug in the task design, not an exposed weakness in the model.

This is an AI safety problem

It might sound like narrow training infrastructure work. It isn’t.

RLHF with high-quality adversarial tasks is stress-testing model reasoning at scale. Models trained on tasks requiring careful constraint satisfaction learn that pattern-matching isn’t sufficient. They develop better calibration, because it gets harder to be confidently wrong when your training data punishes overconfidence. They improve at following complex instructions, which is directly relevant to safety-critical deployment.
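The "punishes overconfidence" part has a standard mathematical form. One way to sketch it, assuming the model reports a confidence alongside its answer (a setup I’m using for illustration, not a claim about any lab’s reward function), is a log-score reward, which is a proper scoring rule:

```python
import math

def calibration_reward(correct, stated_confidence, eps=1e-9):
    """Log-score reward: being confidently wrong is punished hardest.

    `stated_confidence` is the model's own probability that its answer
    is correct. Because the log score is a proper scoring rule, the
    reward-maximizing strategy is to report honest confidence.
    """
    p = stated_confidence if correct else 1.0 - stated_confidence
    return math.log(max(p, eps))
```

Under this reward, a wrong answer delivered at 99% confidence scores roughly log(0.01) ≈ -4.6, while the same wrong answer hedged at 50% scores only log(0.5) ≈ -0.69: exactly the asymmetry that makes confident wrongness expensive.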

The alternative, training on easy tasks, produces models that look strong on benchmarks but crumble under distribution shift. Overconfident, brittle, and bad at admitting uncertainty. Exactly the properties you don’t want in production.

Problems I haven’t solved

Scalability is the biggest open challenge. High-quality hard tasks require genuine domain expertise to create. You can’t crowdsource them to people who don’t deeply understand the domain, because the hidden complexity that makes a task valuable comes from expert intuition about where models actually fail.

Verification complexity is a close second. For code, you run tests. For formal proofs, you check validity. But for tasks involving judgment, design trade-offs, or open-ended analysis, grading is expensive and inconsistent.

Curriculum design is still more art than science. How do you sequence tasks from challenging to extremely hard to maximize learning? Too hard and the model doesn’t learn. Too easy and it plateaus. The optimal curriculum probably depends on the model’s current capability, but measuring that during RL training is its own research problem.
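One simple heuristic I can sketch (an assumption of this sketch, not an established result): keep re-estimating per-task success rates for the model being trained, and at each step serve the tasks currently inside its learning band, preferring those nearest 50%.

```python
def next_batch(tasks, success_rates, low=0.30, high=0.70, batch_size=16):
    """Pick the tasks currently sitting in the model's learning band.

    `success_rates` maps task id -> latest estimated success rate for
    this model. As the model improves, rates drift upward, tasks
    graduate out of the band, and harder ones rotate in, giving a
    capability-tracking curriculum without an explicit schedule.
    """
    eligible = [t for t in tasks if low <= success_rates.get(t, 0.0) <= high]
    # Prefer tasks nearest 50%, on the assumption that expected
    # learning signal peaks where outcomes are most uncertain.
    eligible.sort(key=lambda t: abs(success_rates[t] - 0.5))
    return eligible[:batch_size]
```

This dodges the sequencing question rather than answering it, which is roughly where my thinking is: the curriculum emerges from measurement instead of being designed up front, and the hard part becomes keeping the success-rate estimates fresh during training.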

The bigger picture

The models we’ll use next year are being shaped by the training data we design today. As LLMs saturate existing benchmarks, the bottleneck is shifting from raw model capability to the quality of the signal we train them on.

Getting RLHF data right isn’t just about making models smarter. It’s about making them robust, calibrated, and trustworthy in the situations that matter most: edge cases, multi-constraint problems, the scenarios where approximate reasoning isn’t good enough.

That’s the job. It’s harder than I expected, and more consequential than it looks from outside.


Thanks to the Abundant AI team and the broader research community working on RLHF, Constitutional AI, and model robustness.