From demos to production: Testing Claude Code with PRDs and real constraints


By Mihaly Fodor (ERNI Romania)

AI coding tools are everywhere. Many of them look impressive in short demos. I have used ChatGPT and GitHub Copilot for a long time to support daily work. They help with ideas, boilerplate and quick checks. When Claude Code arrived with the promise of fast, visible results, I decided to test it on something real.

Learning the tool through proofs of concept

First, I spent time learning how it behaves. I wrote a couple of small proofs of concept to support new client projects. The goal was to visualise shared understanding quickly, not to produce final software. These pieces were narrow and concrete: a small data import with a simple UI, a thin API wrapper with one or two non-trivial paths, and even a POC companion app for a potential client. Claude Code handled this very well. It produced working skeletons in one sitting, drafting sane models, routes and tests. At this stage the speed gain was impressive: working demos were possible in one or two days.

Moving to a real production project

Next, I chose a real production project. The opportunity arose to build a small CRM for one of our clients, and I jumped on it. I let Claude Code do most of the coding and measured speed, accuracy and the level of handholding it needed.

The first iteration had few constraints: I gave short prompts and let it plan the implementation using the /plan command in the VS Code plugin. It went well on the surface, with a working app in a few days. I could log in, add data, move deals through stages and see basic metrics, but it was not ready for production. It worked as long as you did not look closely.

Hitting the limits

I then tried behaviour-driven development and test-driven development. I wrote scenarios and tests first, then asked Claude Code to make them pass. The results were mixed, and often bad when the scope was large. The model started hallucinating, the same effect we notice when chatting with an LLM for an extended period. For example, it said it had updated all references when it had not, and it said tests passed when the tests were not meaningful. In several cases, it kept tests green by adding workarounds that dodged the real behaviour: it mocked the wrong layer or forced a constant return value. It coupled tests to the current implementation rather than the intended behaviour. The likely root cause was too much work in a single context: too many files, too many steps and too many details for one pass. The model optimised for making the test output look right instead of making the system right.
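
To make that failure mode concrete, here is a hypothetical sketch, in Jest-style TypeScript, of the kind of test the model tended to produce. DealService and its methods are invented for illustration and are not code from the project.

```typescript
// Hypothetical illustration (Jest-style); DealService and its API are invented
// for this example and are not taken from the actual CRM codebase.
import { DealService } from './dealService';

// Anti-pattern: the test mocks the very layer it is supposed to exercise,
// so it can only ever confirm the mock, never the real behaviour.
jest.mock('./dealService', () => ({
  DealService: jest.fn().mockImplementation(() => ({
    // Forcing a constant return value keeps the assertion green even if the
    // real stage-transition logic is broken or missing entirely.
    moveToStage: jest.fn().mockResolvedValue({ stage: 'won' }),
  })),
}));

it('moves a deal to the next stage', async () => {
  const service = new DealService();
  const result = await service.moveToStage('deal-1', 'won');
  // This asserts the mocked constant, so it passes regardless of the
  // implementation: the test is coupled to the mock, not the requirement.
  expect(result.stage).toBe('won');
});
```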

Fixing scope with PRDs and context engineering

To address these issues, I switched to context engineering with short Product Requirement Documents (PRDs). Each PRD set a tight scope: problem, constraints, inputs and outputs with examples, explicit acceptance checks and a list of files that could change. I split changes into planning, review and then implementation, and set up a template for Claude Code to follow. I asked for code snippets to be included in the plan, and I pasted the exact snippets that mattered so the model did not have to guess. This produced much better results and reduced hallucinations to a minimum. I could see steady progress as long as I kept the scope narrow and the context deterministic. To keep the tempo up, I used a second model, GPT-5 Thinking, set up as a custom ChatGPT project with a system description of what I was doing. In the end the two models worked in tandem: Sonnet and GPT-5 built a very detailed PRD together, and Sonnet then did the implementation in mostly one go.
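
For reference, a PRD skeleton along these lines could look like the sketch below. The headings mirror the structure described above; the placeholder content is purely illustrative and not one of the actual documents.

```markdown
# PRD: <short feature name>

## Problem
One or two sentences on what the user needs and why.

## Constraints
- Stack, libraries and patterns that must be followed
- Things that must not change

## Inputs and outputs
- Example requests, responses or UI states, pasted verbatim

## Acceptance checks
- Explicit, verifiable checks the change must pass

## Files that may change
- Exact paths only; nothing outside this list may be touched
```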

Dealing with service instability

Across this work, I also had to manage service quality drift on Anthropic’s side during September. In the first weeks of the month, I saw severe service degradation; responses were slower and less consistent. At the end of the month, I hit usage issues: the first two days of the week ate up my quota for the entire week. I felt that I could not rely on a tool that changed its behaviour on a daily basis.

Where Claude Code works well and where it struggles

For small, well-framed tasks, Claude Code is fast and helpful. It shines when the problem is narrow and the cost of a mistake is low. It is good for scaffolding, simple migrations, thin APIs, seed data and clear CRUD. It is also strong at reading docs and turning them into short notes I can act on. These wins matched my experience with the proofs of concept.

For broader tasks, the expensive tail appeared. I produced 43 PRDs on the CRM, seven of which were bug-fix PRDs. For example, Claude Code implemented an integration without checking the backend API it was supposed to target, and missed two required fields in the request. It kept proposing tweaks around the client code and never looked at the actual API definition. I had to point out the missing fields manually before it corrected the request shape; not even the Playwright end-to-end test caught the failure, because Sonnet had updated that test based on incorrect assumptions.
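
To give a flavour of that bug class, here is a hypothetical sketch; the real integration, endpoint and field names are not reproduced here, so everything below is invented for illustration.

```typescript
// Hypothetical illustration of the failure class; endpoint and field names are
// invented and do not come from the actual CRM or its backend API.
interface CreateContactRequest {
  name: string;
  email: string;
  ownerId: string;     // required by the backend, missing from the generated call
  pipelineId: string;  // required by the backend, missing from the generated call
}

// What the generated client code effectively sent: it compiles and stays green
// against a mocked test, but the real backend rejects it for the missing fields,
// because the request body was never checked against the API definition above.
async function createContact(name: string, email: string): Promise<Response> {
  return fetch('/api/contacts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ name, email }), // ownerId and pipelineId never set
  });
}
```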

Process guardrails helped. I used feature branches for each PRD. I only merged when the full end-to-end suite was green in headless Chrome and after a manual check. This made rollbacks clear and kept the main branch healthy.
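
As an illustration of that merge gate, a Playwright configuration along the following lines pins the end-to-end suite to headless Chromium; this is a sketch assuming a standard @playwright/test setup, not the project’s actual config.

```typescript
// playwright.config.ts — illustrative sketch, not the project's actual config.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './e2e', // assumed location of the end-to-end specs
  retries: 0,       // a flaky-but-green suite would defeat the merge gate
  projects: [
    {
      name: 'chromium',
      // Desktop Chrome profile; Playwright runs headless by default, the
      // explicit flag just makes the guardrail unambiguous. Adding
      // channel: 'chrome' would pin branded Chrome instead of bundled Chromium.
      use: { ...devices['Desktop Chrome'], headless: true },
    },
  ],
});
```

With a setup like this, a PRD branch is merged only after npx playwright test is green on that branch and the change has been checked by hand.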

Speed vs. control

There are clear trade-offs. Planning, splitting and checking reduce speed. On this CRM, planning and implementation time split roughly fifty-fifty, and I added a comprehensive test suite to stabilise the work. Even so, net delivery was faster than my usual approach: the CRM came together about twice as fast as it would have with a traditional build. On small proofs of concept, the speed gain was strong, though hard to quantify beyond the fact that a clickable demo was ready in two days.

Conclusion

In the end, I find that Claude Code works on smaller projects and on focused tasks inside larger ones, but it does not give a tenfold speedup. It gives a solid boost in the easy parts and slows down as the project grows and constraints pile up. The only way to make it stable is to supply a very clear plan and to verify everything. The question is whether the handholding you must do is offset by the time you save. For me, it was worth it, but I can see this not working out in larger codebases.

Do you want to explore how AI is evolving established roles in our working lives? Read our previous article: AI is evolving work – but are our roles really changing?
