ChatGPT vs Claude vs Gemini: Which AI Coding Tool is Best for Developers?

By Gabriel Ceicoschi

December 4, 2025

Tags: AI Coding Agents, ChatGPT vs Claude, Claude vs GPT vs Gemini, AI Model Comparison 2026, AI Tools for Developers, Which AI Tool is Best, AI Benchmarks

Testing Opus 4.5, GPT 5.1 Codex Max and Gemini 3 Pro on a Complex ProseMirror Problem

Over the last few weeks we kept hearing people talk about how strong Gemini 3 is for coding. At the same time, Claude Opus 4.5 and GPT 5.1 Codex Max arrived. We finally had time to run a real test: put these models in front of a problem that is not simple at all, a complex API exercise in a complex library where shallow reasoning becomes obvious within seconds.

If you have ever built anything meaningful with ProseMirror, you know the pain. It looks clean on the surface, but the deeper you go into schema logic, node relationships, composition and transforms, the more difficult it becomes. Things get even harder when the editor has to stay consistent with strict rules enforced in the background.

For this benchmark, we focused on two core problems.

Problem One

We needed a reliable way to identify the insertion point in a ProseMirror document that behaves like a branching tree. The document can contain multiple branches that merge later or break out into deep leaf nodes. The local structure around the insertion point matters a lot. One incorrect assumption leads to broken branches or invalid transitions. The model had to read the node tree, track siblings, understand context, and reason about the intended state rather than just raw text.
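To make the shape of Problem One concrete, here is a minimal sketch in TypeScript. It deliberately models the document as a plain node tree rather than using the real ProseMirror `Node` API, and the "all children are leaves" predicate is an invented placeholder for the actual schema rules, which are far stricter.

```typescript
// Simplified model of the branching document (NOT the real
// ProseMirror API): each node knows its type and its children.
interface BranchNode {
  id: string;
  type: "branch" | "leaf" | "merge";
  children: BranchNode[];
}

// Walk the tree depth-first and return the first node whose local
// context permits insertion. Here the predicate is illustrative:
// a branch whose children are all leaves, i.e. a point where a new
// sub-branch cannot break an existing merge. The real rules live
// in the editor's schema and the surrounding services.
function findInsertionPoint(root: BranchNode): BranchNode | null {
  if (
    root.type === "branch" &&
    root.children.length > 0 &&
    root.children.every((c) => c.type === "leaf")
  ) {
    return root;
  }
  for (const child of root.children) {
    const found = findInsertionPoint(child);
    if (found) return found;
  }
  return null;
}
```

The hard part the models had to handle is exactly what this sketch hides: in the real document the predicate depends on siblings, ancestors and rules enforced elsewhere, not just the node's own children.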

Problem Two

We needed an algorithm that could insert new nodes and edges at that exact point. This is essentially merging two trees together inside a live document. The editor had to stay consistent with rules enforced elsewhere. Every new node has consequences for the rest of the document. Everything from branch repair to edge cleanup must run smoothly.

The surface description sounds simple. The real depth is in handling tangled branches, preserving referential integrity, fixing node relationships, and ensuring the final workflow still matches strict rules. The code also interacts with services, types, stores and UI components, so the model must understand how local changes ripple across the system.
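A rough TypeScript sketch of the merge step, under the same simplified model as above: splice a new subtree in at a target node, then run a repair pass so the local structure stays valid. The node shape and the repair rule (an empty branch is demoted to a leaf) are assumptions made for illustration; a real ProseMirror transform works through positions and steps, not plain objects.

```typescript
// Simplified node model (NOT the real ProseMirror API).
interface TreeNode {
  id: string;
  type: "branch" | "leaf" | "merge";
  children: TreeNode[];
}

// Insert `subtree` as a child of the node with `targetId`, returning
// a new tree and leaving the input untouched, mirroring how a
// ProseMirror transform produces a new document rather than
// mutating the old one.
function insertSubtree(
  root: TreeNode,
  targetId: string,
  subtree: TreeNode
): TreeNode {
  if (root.id === targetId) {
    return { ...root, children: [...root.children, subtree] };
  }
  return {
    ...root,
    children: root.children.map((c) => insertSubtree(c, targetId, subtree)),
  };
}

// Repair pass: a branch left with no children is invalid in this toy
// model, so it is demoted to a leaf. This is a minimal stand-in for
// the "branch repair and edge cleanup" the article describes.
function repair(root: TreeNode): TreeNode {
  const children = root.children.map(repair);
  if (root.type === "branch" && children.length === 0) {
    return { ...root, type: "leaf", children };
  }
  return { ...root, children };
}
```

The point of the benchmark was everything this sketch leaves out: the inserted subtree arrives in a live document, the repair rules are enforced by other services, and one wrong assumption about local structure corrupts branches far from the insertion point.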

The Models We Tested

We compared three systems.

  • Gemini 3 Pro Preview
  • Claude Opus 4.5
  • GPT 5.1 Codex Max with Extra High Reasoning

After working through both problems, this was our ranking.

#1 Opus 4.5

#2 GPT 5.1 Codex Max

#3 Gemini 3 Pro

What Stood Out in Practice

For both problems, we told each model to ask for clarification if anything felt unclear. Only Opus 4.5 did that. It slowed down, checked assumptions, and asked questions that actually improved the final plan. This was the first time an AI model stepped back and asked questions that showed it understood the deeper structure.

GPT 5.1 Codex Max was close to the correct solution, but it moved too fast. It jumped straight into code before confirming the rules and produced something that looked correct but lacked full depth. It still wrote precise code and kept things tight.

Gemini 3 Pro felt more like a junior engineer. It produced output but missed the logic under the surface. It did not fail completely, but it did not demonstrate the level of reasoning this task needed. We have heard that Gemini is very strong in creative work, design and visual tasks, so we plan to test it next on pointer animations and UI concepts.

Opus produced the most production-ready code. It wrote full implementations with edge-case handling, try/catch blocks and guard clauses. Some of it was more than we needed, but it worked. GPT felt cleaner and more concise. Gemini still felt shallow for this type of work.

For daily use, GPT 5.1 is still our favourite. It has the best mix of speed, clarity and reliability for most tasks. If we need deeper planning, we now go to Opus first. It creates a stronger plan. Then we return to GPT 5.1 for the actual implementation because it usually writes code that fits our style.

We also asked Claude and Codex to evaluate the solutions. Both picked the plan from Opus 4.5. We think GPT 5.1 Pro would have done well too, but the cost and duration are high for what we needed here. Gemini needs more testing from our side, especially on visual and creative challenges.

Our Rule of Thumb Right Now

  • Claude for deep programming, planning and architecture
  • Codex for daily coding and code review
  • Gemini for creative work, visuals, animations and UI concepts

If Claude is the backend of AI tools, Gemini is the frontend. Codex sits cleanly in the middle.

Frequently Asked Questions

Which AI is better, ChatGPT or Claude?

For coding, GPT 5.1 Codex Max is better for speed and daily use: it writes fast, clean code that fits most workflows. Claude Opus 4.5 is better for complex problems that require deep reasoning, edge-case handling, and architectural planning. In our test, Opus asked clarifying questions that improved the final solution, while GPT moved straight to implementation.

Is Claude better than ChatGPT for programming?

Claude Opus 4.5 produces more production-ready code with comprehensive error handling, try blocks, and guard clauses. However, it can be more verbose than needed. GPT 5.1 Codex Max produces cleaner, more concise code that often matches coding style better. Our workflow: use Claude for planning and architecture, then GPT for implementation.

Which AI tool should I use for my team?

It depends on your use case:

  • For teams new to AI: Start with GPT 5.1 Codex Max (easiest to learn, most versatile)
  • For complex systems: Use Claude Opus 4.5 for architecture decisions
  • For frontend/design work: Try Gemini 3 Pro for creative tasks

Most teams benefit from using multiple AI tools for different purposes. Learn how to set up a multi-agent workflow in our workshops.

What's the difference between Opus and Sonnet?

Opus 4.5 is Claude's most capable model for deep reasoning and complex tasks. Sonnet 4.5 is faster and more cost-effective for everyday use. For the complex ProseMirror problem we tested, Opus significantly outperformed Sonnet by asking clarifying questions and planning more thoroughly.

Should I learn ChatGPT, Claude, or Gemini first?

Start with ChatGPT (GPT 5.1 Codex Max): it has the best balance of speed, capability, and ease of use for most tasks. Once comfortable, add Claude for complex reasoning and Gemini for creative work. Learning to use multiple AI tools together amplifies your effectiveness.

Want to master AI tools for your specific work? Join our hands-on workshops where we teach you to use ChatGPT, Claude, and other AI tools effectively in small groups.

One call. We'll show you exactly what we'd build with your team.

No pitch decks. No generic proposals. Just a conversation about your workflows and what we can automate together.