AI Software Factories & Learning to Let Go

In the summer of 2025, Google's DORA team released figures showing that AI adoption among engineering teams had spiked nearly 90% year-over-year. Optimistically, you might expect productivity to follow - code shipping faster, teams leaner, bugs vanishing. Instead, bug rates went up 9%, code review time increased 91%, and the average pull request size swelled 154%.

The machines were writing more code than ever, while the humans spent most of their time trying to figure out if any of it was any good.

Fast forward to this summer: Boris Cherny, the creator of Claude Code, recently said he doesn't write prompts anymore, he writes loops. Naturally, within 24 hours, "loop engineering" was all over X/Twitter as the hot new way to create software. The idea is that you stop prompting AI to get work done and instead build a system that does it for you.

If you squint, it's the same pitch as AI software factories. Fleets of agents picking up work from a backlog, testing it, deploying it, observing it, 24/7, until the job is done. Sounds like the ultimate delegation. But if you've ever seen what happens when you optimise a metric without thinking about what the metric actually measures, you might want to slow down.

Goals and loops

The approach of giving AI agents a high-level goal to hill-climb towards, rather than specific tasks, is built into the tools now. Claude Code and OpenAI's Codex both ship with /goal and /loop commands. You can type /goal "migrate the authentication module to the new API" and the harness will plan, execute, check its own work, and repeat until it decides the goal is met. /loop does the same thing but on a schedule or a trigger, running continuously in the background.

So if a goal is the thing you want to achieve, a loop is the mechanism that keeps the AI iterating until it gets there. The mental model shifts from "complete this task" to "here's the objective, strive until you arrive." The basic shape: trigger → act → verify → repeat.

A trigger could be a prompt you write once, a cron job, a webhook, or an automation that fires on a schedule. Addy Osmani - director of Google Cloud AI - describes a good loop as needing five things:

A heartbeat - an automation that fires on a cadence and looks for what needs doing
A workspace - isolated environments (e.g. git worktrees) so multiple agents working in parallel don't collide with each other
Skills - general models need domain expertise to do their best work, and giving the agent the right skills at the right time has a massive impact on quality
Connections - access to the systems and data it needs
Sub-agents - specific agents for specific tasks, which is really an exercise in context engineering - but that's a blog for another day

The final ingredient is memory. Models forget everything between runs, so a memory layer has to live somewhere external. In its simplest form, a markdown file in a GitHub repository will do the trick.
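
As a rough sketch of what "a markdown file will do" could look like, assuming a file of timestamped bullet points that each run appends to and reads back. The function names remember and recall, and the bullet format, are my own assumptions rather than any tool's convention.

```python
from datetime import datetime, timezone
from pathlib import Path


def remember(memory_file: Path, note: str) -> None:
    """Append one timestamped bullet to the memory file."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with memory_file.open("a", encoding="utf-8") as f:
        f.write(f"- {stamp}: {note}\n")


def recall(memory_file: Path, last_n: int = 20) -> str:
    """Return the most recent notes, ready to prepend to the next run's prompt."""
    if not memory_file.exists():
        return ""
    lines = memory_file.read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[-last_n:])
```

Committing that file to the repository alongside the code gives you a history of what the agent believed and when, which is handy when you later need to work out why it did something odd.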

And then the most fragile part: the exit condition. How does the loop know when to stop? What precisely does "done" mean?

"Build Windows 12, make no mistakes" probably won't work as a loop.

"Every Monday at 9am: scan all repositories for outdated or vulnerable dependencies. For each one found, use Context7 to retrieve the latest stable version and its migration guide. Apply the migration following the documented best practices. Verify the build passes and no new test failures are introduced. Raise a pull request with a summary of what changed and why. Stop when all repositories are clean or a pull request is open for each outstanding issue." - that probably will.

The "phone/wallet/keys" of loops being: Clear trigger. Specific outcome. Checkable exit condition.
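
Here's how those three might look for the dependency example, as a sketch only. RepoStatus and loop_is_done are hypothetical names, and I'm assuming something upstream has already scanned each repository and filled in the numbers.

```python
from dataclasses import dataclass

# Clear trigger: standard cron syntax for every Monday at 09:00.
TRIGGER_CRON = "0 9 * * 1"


@dataclass
class RepoStatus:
    name: str
    outdated_deps: int   # dependencies still flagged as outdated or vulnerable
    open_pr: bool        # a pull request is already open for them


def loop_is_done(repos: list[RepoStatus]) -> bool:
    """Checkable exit condition: every repo is clean or has a PR open."""
    return all(r.outdated_deps == 0 or r.open_pr for r in repos)
```

The point is that "done" is a function that returns True or False. If you can't write that function for your goal, the loop can't either.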

The cobra effect of code

There's a cautionary tale set in colonial India about a government so worried about the cobra population in Delhi that it offers a bounty for every dead cobra. It works for a while and a bunch of snakes get killed, but then locals start breeding cobras to claim the bounty. When the government catches on and cancels the programme, the breeders release their now-worthless snakes into the wild, and the cobra population explodes.

This is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

Let's think about that in the context of the AI software factory. Tell a loop to "reduce the number of open issues" and it could systematically close every ticket as "Won't Fix." The exit condition - zero open issues - is met. The goal - a healthy, triaged codebase - is subjective.
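
One partial defence is to pair the exit condition with a second measure the loop wasn't told to optimise. A minimal sketch, assuming a hypothetical Issue record and an arbitrary 20% threshold that I picked purely for illustration:

```python
from dataclasses import dataclass


@dataclass
class Issue:
    id: int
    state: str        # "open" or "closed"
    resolution: str   # e.g. "fixed", "wontfix", or "" while still open


def exit_condition(issues: list[Issue]) -> bool:
    """The naive target: no open issues left."""
    return all(i.state == "closed" for i in issues)


def guardrail(issues: list[Issue], max_wontfix_ratio: float = 0.2) -> bool:
    """A second measure: were too many issues closed without being fixed?"""
    closed = [i for i in issues if i.state == "closed"]
    if not closed:
        return True
    wontfix = sum(1 for i in closed if i.resolution == "wontfix")
    return wontfix / len(closed) <= max_wontfix_ratio


def accept(issues: list[Issue]) -> bool:
    """Only accept the run if the target is met and the guardrail holds."""
    return exit_condition(issues) and guardrail(issues)


gamed = [Issue(1, "closed", "wontfix"), Issue(2, "closed", "wontfix")]
honest = [Issue(1, "closed", "fixed"), Issue(2, "closed", "fixed")]
```

Here accept(gamed) is False even though the exit condition passes. A guardrail is still just another measure, so a sufficiently determined loop could game that one too, which is why a human reading the output still matters.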

A non-deterministic system running unattended is a system periodically making mistakes. Two kinds of mistake matter most here.

Goal drift. The loop produces technically correct output that drifted from what you actually wanted. It passed the test it wrote, but it didn't build the thing you meant. Imagine you asked a loop to "improve the onboarding flow and reduce drop-off" - so it removes the steps where drop-off happens.

Comprehension debt. The faster a loop ships code you didn't write, the wider the gap between what's in your repo and what you actually understand. The longer it drifts, the more technical debt and cognitive load accrue.

The nature of the bugs is changing. Increasingly, they're failures of intent. At the companies pushing hardest on an AI-driven software development lifecycle, the engineer's job has already shifted from writing code to either attempting to control the firehose or reviewing and course-correcting the damage.

So the thing to focus on with loops is verification design. What does the exit condition actually prove? Is the checker checking the right thing? Can you read what got shipped and understand it?
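
One cheap way to ask "is the checker checking the right thing?" is to feed it outputs you already know are bad and see whether it rejects them. The sketch below assumes a made-up test summary format like "12 passed, 0 failed"; audit_checker, weak_checker and stricter_checker are hypothetical names for illustration.

```python
from typing import Callable


def audit_checker(checker: Callable[[str], bool],
                  known_good: list[str],
                  known_bad: list[str]) -> list[str]:
    """Return a list of complaints; an empty list means the checker behaved."""
    problems = []
    for sample in known_good:
        if not checker(sample):
            problems.append(f"rejected a good sample: {sample!r}")
    for sample in known_bad:
        if checker(sample):
            problems.append(f"accepted a bad sample: {sample!r}")
    return problems


def weak_checker(summary: str) -> bool:
    # Only looks at failures, so a run where no tests ran at all still passes.
    return summary.endswith(", 0 failed")


def stricter_checker(summary: str) -> bool:
    # Assumes the "<n> passed, <m> failed" format used in this sketch.
    passed_part, failed_part = summary.split(", ")
    passed = int(passed_part.split()[0])
    failed = int(failed_part.split()[0])
    return failed == 0 and passed > 0


good = ["12 passed, 0 failed"]
bad = ["0 passed, 0 failed", "11 passed, 1 failed"]
weak_report = audit_checker(weak_checker, good, bad)
strict_report = audit_checker(stricter_checker, good, bad)
```

The weak checker happily accepts a run where nothing was tested, which is exactly the kind of result an unattended loop would report as a success.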

This is the worst it will ever be

Is the line I keep in my head whenever I catch myself looking at a new AI trend with scepticism. It's what stops me dismissing it and makes me try the thing instead.

I will say - the loudest voices on loop engineering all work at frontier labs with zero concern for token spend and certainly no weekly usage limits to worry about. When I started writing this article, I was about to argue that goal-driven agents were still slightly out of reach for the everyday business.

But thinking about it - we've been running loops at Hypership for months. Our internal agent runs on Hermes, and it's been handling a chunk of our business operations since our first week.

It scans our website every week for search engine optimisation, generative engine optimisation, and answer engine optimisation recommendations, then automatically adjusts the content of our websites to see how it impacts our Google Search Console rankings - which it also has access to.

It monitors the latest news and trends in AI and sends us weekly digests.

It has access to all of our GitHub repositories, and we can message it from our phones - or it can message us - any time we need to do something on the move.

It's connected to our pipeline, and will remind us when we have outstanding actions or should pick back up on a conversation.

None of that required a frontier lab budget or a team of engineers. It required being clear about the goal, building the exit condition, and letting the thing run on a schedule.

On the other hand - I've built my own versions of an AI software factory, tried products like Factory, and I currently don't believe the models or harnesses are good enough to stop paying attention to the software they're producing. Eventually they will be. But today I'm still code reviewing everything that ships - after the agents have done their own initial review.

The promise of the software factory - fleets of agents working while we sleep - is one to keep an eye on. The models will get better, the harnesses will get smarter and eventually it will seem like AI is getting better not at coding, but at caring what the code is for. Today, that's on you.