The GPT-5 launch looked like a middle school morality play where nobody knew their lines. Stiff, soulless, over-rehearsed presentations that started with "what GPT-5 can do" instead of "why anyone should care." Yet between the poorly acted lines, GPT-5 has value—if you start with Why instead of What.
I was watching the GPT-5 announcement on a 7” tablet while huffing and puffing on a gym elliptical, so maybe that explains my perspective. Or maybe it's because I spent years as a professional opera singer before diving into AI and product management. Either way, that demo was a masterclass in how not to launch revolutionary technology.
The tensions were everywhere:
They had real use cases (healthcare, education) but buried them at the end instead of leading with human stories
They went full code-forward when they should have gone human-forward
They yanked previous GPT models from the UI without warning, jarring users instead of weaning them off gradually
They showcased technical capabilities while real work goes untested
But here's the thing: GPT-5 doesn't actually suck.
The launch does. The benchmarks do. The "GPT-5 sentiment sucks" hot takes definitely do.
GPT-5 works brilliantly … IF YOU #$^%&%*&%#^# START WITH WHY.
What's missing? A benchmark that tests whether these models can get real PM ‘stuff’ done. I've been experimenting with exactly that—you can try my PM benchmark prompt here.
Current benchmarks test party tricks. Real PM work requires orchestrated thinking.
The Problem: AI Benchmarks Are Product Management Malpractice
Every AI benchmark I've seen tests the wrong stuff:
Can it solve math problems? (Great, but when did you last calculate derivatives in a sprint planning meeting?)
Can it pass the bar exam? (Cool, but can it prioritize a backlog without legal jargon?)
Can it count letters in fruit names? (Fantastic. Now, can it count why customers churn?)
Meanwhile, zero benchmarks test whether AI can:
Research a market and synthesize actionable insights
Simulate stakeholder conflicts before they derail your roadmap
Scaffold a PRD that doesn't read like enterprise software documentation
Generate prototypes you can actually experience, not just read
The Real Test: Can AI Do The Work That Actually Matters?
I've been running my own PM Benchmark Test across GPT-5, Gemini 2.5 Reasoning, Claude Sonnet 4, and Claude Code.
Not counting strawberry letters.
Not solving logic puzzles.
But getting real product management work done.
Here's my test:
Can the model autonomously research → synthesize → simulate → scaffold a comprehensive PRD, then generate a working prototype—all in 1-2 hours?
The answer? Hell yes.
But only if you know how to prompt it properly.
The Missing Methodology: Show Me, Don't Tell Me
Most PMs still think of AI as a fancy autocomplete for user stories.
Wrong.
The breakthrough happens when you leverage advanced reasoning and research capabilities to create what I call an "Elevated PRD Experience":
Search: AI researches the market, competitors, and user problems
Synthesize: AI connects disparate data into coherent insights
Simulate: AI role-plays different stakeholders and user types
Scaffold: AI builds a comprehensive strategy document
Show: AI generates a working prototype you can touch and feel
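The five phases above chain together naturally: each step's output becomes context for the next. Here's a minimal sketch of that orchestration loop in Python. Everything here is hypothetical scaffolding—`call_llm` is a stub standing in for whatever model API you actually use, and the phase prompts are placeholder one-liners, not my full benchmark prompt.

```python
# Sketch of the Search → Synthesize → Simulate → Scaffold → Show loop.
# `call_llm` is a hypothetical stand-in for a real model API call;
# it is stubbed here so the pipeline logic runs on its own.

PHASES = [
    ("Search", "Research the market, competitors, and user problems for {concept}."),
    ("Synthesize", "Connect the research into coherent insights about {concept}."),
    ("Simulate", "Role-play three stakeholder types reacting to {concept}."),
    ("Scaffold", "Draft a comprehensive PRD for {concept} with success metrics."),
    ("Show", "Outline an interactive HTML5 prototype for {concept}."),
]

def call_llm(prompt: str) -> str:
    """Stub: swap in your real model call (GPT-5, Claude, Gemini, etc.)."""
    return f"[model output for: {prompt}]"

def elevated_prd(concept: str) -> dict:
    """Run each phase in order, feeding earlier outputs forward as context."""
    context = ""
    results = {}
    for name, template in PHASES:
        prompt = template.format(concept=concept)
        if context:
            prompt = f"Context from earlier phases:\n{context}\n\nTask: {prompt}"
        results[name] = call_llm(prompt)
        context += f"\n## {name}\n{results[name]}"
    return results

outputs = elevated_prd("AI-powered sprint retro assistant")
print(list(outputs.keys()))
```

The design point is the context accumulation: Simulate only produces useful stakeholder friction if it has seen the Synthesize output, which is exactly why one-shot "write me a PRD" prompts fall flat.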
This isn't Amazon's "Working Backwards" with a press release.
This is "Looking Around the Corner" with a tangible future you can experience.
The Bake-Off Results: Not All Models Are Created Equal
I've thrown the same mega-prompt at ChatGPT, Gemini, Claude, and even VS Code wired up to Qwen.
Think of it like a bake-off, but with fewer soggy bottoms and more existential dread about whether your roadmap makes sense.
What I learned:
GPT-5 excels at research synthesis but sometimes needs a nudge for prototype generation
Claude Code delivers the most coherent end-to-end workflow
Gemini 2.5 Reasoning surprises with stakeholder simulation accuracy
Most models can scaffold HTML5 learning guides that make PRDs actually engaging
The key insight? It's not a "them" problem—it's an "us" problem: a skills gap.
Why The Launch Failed: Code-Forward When They Should Have Gone Human-Forward
Sam Altman's team made the classic "Start with What" mistake.
Simon Sinek was right: great leaders start with Why.
The GPT-5 team started with What and wondered why nobody cared.
They had killer use cases—healthcare professionals diagnosing faster, teachers personalizing education, researchers breaking through complex problems. But they buried these human stories at the end instead of leading with them.
Better launch strategy:
Start with Why: "Doctors are drowning in diagnostic complexity" (human problem)
Show How: "GPT-5 helps them synthesize patient data in real-time" (solution)
Prove What: "Here's the benchmark showing 40% faster diagnosis" (evidence)
Instead, we got 45 minutes of technical capabilities followed by a rushed "oh, and here's why this matters" at the end.
Plus, the jarring UI changes. Yanking previous GPT models without warning feels like rearranging the furniture while someone's still sitting in the chair. They could have eased the transition instead of creating whiplash.
The Prompt That Actually Works
Want to test this yourself? Here's my mother-of-all kick-start-a-vibe-coded-agent prompt structure:
You are a senior product manager with 10+ years experience.
I need you to research, synthesize, simulate, and scaffold a comprehensive PRD for [PRODUCT CONCEPT].
Phase 1 - RESEARCH:
- Market analysis of [SPACE]
- Competitive landscape assessment
- User problem validation through multiple sources
Phase 2 - SYNTHESIZE:
- Connect disparate research data into coherent insights
- Identify patterns and gaps in user needs
- Distill key findings to inform simulation scenarios
Phase 3 - SIMULATE:
- Role-play 3 different stakeholder types
- Identify potential objections and friction points
- Stress-test the core value proposition
Phase 4 - SCAFFOLD:
- Build comprehensive PRD with clear success metrics
- Include technical feasibility assessment
- Outline go-to-market considerations
Phase 5 - SHOW:
- Generate working HTML5 prototype/learning guide
- Make it interactive and experiential
- Focus on core user journey validation
Use advanced reasoning to log your decision-making process at each phase.
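If you want to reuse that prompt across product ideas rather than hand-editing it each time, a few lines of Python will fill the bracketed placeholders. This is a hypothetical convenience helper, not part of the benchmark itself, and the template string below is abbreviated—paste in the full prompt above.

```python
# Minimal helper to fill the [PRODUCT CONCEPT] and [SPACE] placeholders
# in the mega-prompt before pasting it into your model of choice.
# Template abbreviated here; use the full prompt text in practice.

MEGA_PROMPT = """You are a senior product manager with 10+ years experience.
I need you to research, synthesize, simulate, and scaffold a comprehensive PRD for [PRODUCT CONCEPT].

Phase 1 - RESEARCH:
- Market analysis of [SPACE]
"""

def fill_prompt(template: str, concept: str, space: str) -> str:
    """Substitute the two bracketed placeholders with concrete values."""
    return (template
            .replace("[PRODUCT CONCEPT]", concept)
            .replace("[SPACE]", space))

ready = fill_prompt(MEGA_PROMPT, "async standup bot", "team-collaboration tools")
print(ready)
```

Keeping the placeholders in brackets means the same template works whether you fill it by hand or by script.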
This isn't about replacing PM judgment—it's about augmenting PM speed and depth.
The Bottom Line: Stop Testing Engines With Paper Airplanes
Most GPT-5 criticism sounds like testing a jet engine with a paper airplane, then complaining it doesn't fly well.
The real question isn't whether GPT-5 is "better."
It's whether you know how to use reasoning and research capabilities to get actual work done.
I can scaffold a solid first-draft PRD, validate it with simulated stakeholder feedback, and generate a working prototype in under 2 hours.
That's not party tricks.
That's product velocity.
And if you're still counting letters in fruit names to judge AI capabilities, you're missing the point entirely.
The future of product management isn't about AI replacing PMs.
It's about PMs who know how to leverage AI outpacing those who don't.
Want to test your own PM Benchmark? Try my full prompt on GitHub and see what real PM work looks like when you stop playing with toy problems and start solving actual ones.
Share this if you're tired of AI benchmarks that test everything except what matters for getting real work done.