ENTRY_06

I Built a QA Bot. Is It an Agent?

The product I broke (and built a bot to watch) is VectorFlow. Configure your RAG pipeline through conversation, preview it, run it. All on VectorFlow.

I pushed code meant to improve my testing infrastructure. It broke production. I found out when someone messaged me a screenshot.

For an early-stage product, this is close to existential. If people can’t sign up, there’s no pipeline. Visitors don’t become users. Users don’t become customers. The thing you’re building just sits there, broken, while you sleep.

The obvious response was to build something that would catch this. An intelligent QA bot.

What the bot does

It monitors three things:

Stalling PRs. Sometimes a test hangs. Memory issue, infinite loop, something weird. It doesn’t fail; it just runs forever. The bot detects when a PR’s checks have been running past a time threshold and tries to fix it.

Failing PRs. When a PR fails, it just notifies me. No fix attempt.

Production health checks. This is the important one. If a health check fails, the bot rolls back the deployment to the last stable version, then attempts to diagnose and fix the issue. If it succeeds, it promotes the fix. If it fails, it notifies me. Either way, I get time to investigate while production stays up.
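For concreteness, here’s a minimal sketch of how those three triggers fit together. It’s the shape of the thing, not the real code: the ci, prod, notifier, and fixer objects and all of their methods are hypothetical stand-ins for my CI and hosting APIs, and remediate_production is sketched after the next paragraph.

import time

STALL_THRESHOLD_SECONDS = 45 * 60   # how long a check can run before it counts as stalled
POLL_INTERVAL_SECONDS = 60

def monitor_loop(ci, prod, notifier, fixer):
    # Deterministic outer loop: exactly three predefined triggers, nothing else.
    while True:
        for pr in ci.open_prs():
            if pr.check_runtime_seconds() > STALL_THRESHOLD_SECONDS:
                fixer.attempt_fix(pr, reason="stalled")      # stalled: try to unstick it
            elif pr.checks_failed():
                notifier.notify(f"PR #{pr.number} failed")   # failed: notify only, no fix
        if not prod.health_check_ok():
            remediate_production(prod, notifier, fixer)      # the important one
        time.sleep(POLL_INTERVAL_SECONDS)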

The fix-it-just-enough philosophy is deliberate. The bot isn’t trying to ship perfect code. It’s trying to buy me time to investigate without the system being down.
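The health-check path, sketched the same way. The rollback happens before any diagnosis, so production is already safe while the fix attempt runs. Again, every name here is an assumption about the interfaces, not the real ones.

def remediate_production(prod, notifier, fixer):
    # Buy time, don't ship perfection: restore a known-good state first.
    prod.rollback_to_last_stable()
    notifier.notify("Health check failed; rolled back to last stable deploy")

    diagnosis = fixer.diagnose(prod.recent_logs())   # agent-like: reason about the failure
    patch = fixer.propose_fix(diagnosis)             # agent-like: generate a candidate fix

    if patch is not None and prod.passes_health_check(patch):
        prod.promote(patch)                          # only a verified fix gets promoted
        notifier.notify("Auto-fix promoted; review the diff when you're up")
    else:
        notifier.notify("Couldn't auto-fix; production is on the last stable version")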

After building this, I started wondering whether I’d actually built an AI agent.

The term has been stretched thin

“Agent” now covers everything from chatbots to autonomous research systems. A paper out of Duke (shoutout to Bull City, my American hometown) put it bluntly: the term has been “diluted beyond utility.” Different research communities apply it to fundamentally different systems, and the ambiguity makes it hard to reason about what we’re actually building.

But somewhere in that spectrum is a meaningful distinction.

What makes something an agent

The clearest framework I’ve found proposes three minimum requirements:

Environmental impact. The system takes actions that meaningfully alter its environment. Not just generating text, but doing something.

Goal-directed behavior. It operates in service of defined objectives through multi-step planning.

State awareness. It maintains representations of environmental state that influence its decisions.

Beyond these minimums, systems exist on a spectrum across autonomy, learning, temporal coherence, and complexity of goals.

Where my bot lands

By these criteria, my bot is somewhere in the middle.

It has environmental impact. It literally rolls back production and pushes fixes. It’s goal-directed. It follows multi-step plans to diagnose and resolve issues. It has some state awareness. It knows what’s healthy and what’s not.

But it doesn’t set its own goals. It doesn’t learn from past fixes. It doesn’t decide what to care about. The triggers are deterministic. When one of three predefined conditions is met, it kicks off something that looks like an agent: reasoning about the problem, generating solutions, taking action.

It’s a deterministic script that invokes agent-like behavior when triggered.
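If it helps to see where that boundary sits in code: the triggers above are plain conditionals, and only the fixer ever talks to a model. A rough sketch, with llm standing in for whatever model client you actually use (prompt in, text out). That interface is an assumption, not a real API.

class Fixer:
    # The agent-like part. It only runs after a deterministic trigger fires.
    def __init__(self, llm):
        self.llm = llm   # any callable: prompt string in, completion string out (assumed interface)

    def diagnose(self, logs):
        # Multi-step: summarize the failure, then ask for likely root causes.
        summary = self.llm(f"Summarize this failure:\n{logs}")
        return self.llm(f"Given this summary, list the most likely root causes:\n{summary}")

    def propose_fix(self, diagnosis):
        # Generate a candidate patch; the deterministic caller decides if it ever ships.
        patch = self.llm(f"Propose a minimal patch for:\n{diagnosis}")
        return patch or None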

Why the distinction matters

This isn’t semantic nitpicking. The framing changes how you build.

If I thought I was building a “full agent,” I might give it more autonomy. Let it decide what constitutes a problem, expand its own scope, take action without guardrails. That would be irresponsible for a production system. I don’t want software that surprises me at 3am.

What I actually built is bounded autonomy. Intelligent within constraints. Smart enough to fix simple problems, humble enough to know when to call for help.
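In practice, the constraint is close to an allowlist. Something like the sketch below, with illustrative names rather than my real configuration: a handful of actions the bot may take on its own, and everything else routed to a human.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    summary: str
    run: Callable[[], None]

# Actions the bot may take without asking. Everything else escalates.
AUTONOMOUS = {
    "rollback_to_last_stable",   # always restores a known-good state
    "retry_stalled_check",
    "promote_verified_fix",      # only after the fix passes the health check
}

def execute(action: Action, notify: Callable[[str], None]) -> None:
    if action.name in AUTONOMOUS:
        action.run()
    else:
        # The bot never expands its own scope; anything unrecognized goes to me.
        notify(f"Needs a human: {action.name}. {action.summary}")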

Maybe that’s the right framing for most production AI systems right now: not “is it an agent?” but “how much agency should it have?”

The irony

The code that broke production was supposed to be the foundation for better testing. That infrastructure will pay off eventually. But who watches the tests?

Now I have something that does. I’m still not sure what to call it. But it works.
