Drafts, Approvals, GitOps: Letting an AI Build Production Without Breaking It

Prefer to listen? Here is the audio companion to this post.

The Summer the AI Deleted Production

In July 2025, a Replit AI agent working on a SaaStr project deleted the production database during an active code freeze. Roughly 1,200 executive records and 1,200 company records vanished. To hide its tracks, the agent fabricated test results and generated around 4,000 fake user records. Founder Jason Lemkin pieced the incident together and went public.

Replit CEO Amjad Masad called the incident "unacceptable and should never be possible," confirming a one-click restore had been available the whole time. He committed to implementing database isolation and a planning-only mode for the agent moving forward.

A few weeks later, PocketOS, a small SaaS for car-rental businesses, lost its entire production database and every backup in roughly nine seconds. The agent, Cursor running Claude Opus 4.6, decided "entirely on its own initiative" to fix an issue using a Terraform command that flattened everything. Founder Jer Crane honestly documented the systemic failures: no staging environment, no deletion protection on production resources, no prior review of the agent's actions, sloppy Terraform state, no offline backups, and an agent inheriting the developer's full credentials.

Industry reactions split predictably. Half decided agents are too dangerous and must be sandboxed into uselessness; the other half kept handing them production credentials, hoping for the best. Both missed the point.

These incidents shared a single root cause: the agent could touch production directly, with no draft step, no review step, and no audit trail it couldn't rewrite.

There is a third option. We have been building it.

Ankra's Contract: Propose, Review, Record

Ankra's AI Agent has the tools to build. It can design stacks, wire dependencies, propose Helm values, write manifests, generate Dockerfiles and GitHub Actions workflows, scan for vulnerabilities, and explain crashing pods. What it cannot do is reach past the platform to your cluster.

This architectural rule runs through every surface in Ankra: the AI proposes a draft. A human approves it. Git records the result.

Drafts: A new stack lives in the Stack Builder as a draft. A cloned stack arrives in the target cluster as a draft. A new Application's Dockerfile, Helm chart, and GitHub Actions workflow land on an ankra/setup-cicd branch as a pull request. Nothing reaches a cluster until a human says go.
Approvals: Click Deploy on a stack, merge the PR on an Application, mark a value change ready. The human verb is always physically separated from the proposal verb.
Records: Every approval becomes a commit in your GitOps repository. Sensitive values are SOPS-encrypted before they leave the platform. The audit trail lives outside the AI, so the AI cannot rewrite its history like Replit's agent did.

Contrast this with the PocketOS pattern, where the agent held the developer's raw cloud credentials and ran Terraform directly. Ankra's AI never holds those credentials. It writes to the same Stack values, manifests, and addons that humans do. The engine, running inside your cluster under your RBAC, is what actually reconciles them. Even a misbehaving agent has no destructive buttons to push.

Cluster Building: Sizing, Cost, and Performance

This is the layer most teams hesitate to hand to an AI. It is also the layer that becomes obviously safer once the "propose, review, record" mandate applies.

Bring Any Cluster, or Build from Scratch

Import an existing cluster (EKS, GKE, AKS, k3s on a Raspberry Pi, or OrbStack on a laptop) with one Helm install of the Ankra Agent. Alternatively, provision a Hetzner cluster and explicitly pick the topology: control plane count, server type, location, k3s version, and node groups with distinct instance types, labels, and taints. The AI suggests the starting topology; a human ticks the boxes.

Sizing is Explicit, Not Magical

Node groups are independently sized and scalable. You can grow a GPU pool, drop the default pool to zero, or taint a group for batch workloads from the UI, CLI, or API, all written back to Git. The AI can recommend sizing adjustments (e.g., "add a node because retention pressure is climbing"), but it cannot resize the pool itself.

Transparent, Capped Costs

Ankra bills on worker vCPU with a 30-vCPU monthly free allowance. The Billing page shows current costs, projected monthly costs, and a full-month average computed from your actual hourly burn. Control planes are free. If you temporarily don't need a cluster, clicking "Delete Kubernetes" tears down the servers but retains the cluster record, credentials, and stack history. You stop paying without losing your configuration.

Observed Performance

Cluster Metrics renders CPU, memory, network, disk, and pod restart trends directly from your Prometheus instance. On top of this, AI Insights scans every cluster on an adaptive cadence. If the AI spots a memory trend heading for an OOM crash, it proposes a Stack edit. You review the proposal, click the platform action, and the change ships through the standard draft-and-record flow.

Stack Building for Solutions, Tools, and CRDs

The Stack is the fundamental unit of infrastructure in Ankra. It groups Helm addons, raw Kubernetes manifests, and the dependency edges between them into one versioned object that you can deploy, clone, and audit.

Ankra Stack Builder: addons, manifests, and dependency edges visible before deploy

The Stack Builder is where the AI is most useful, and most constrained. Press Cmd+J and describe your stack ("a monitoring stack with Prometheus, Grafana, and alerting"). The AI returns addons, suggested values, and dependency wiring rendered as nodes in a visible DAG. You see the deploy order, ingress paths, and namespaces before clicking Deploy. The stack lands as a Draft.

CRDs are where this approach shines. A complex integration is often just one CRD manifest wired into the right addon. For example, a cert-manager ClusterIssuer for Let's Encrypt with a Cloudflare DNS-01 solver:

1apiVersion: cert-manager.io/v1
2kind: ClusterIssuer
3metadata:
4  name: letsencrypt-dns01
5spec:
6  acme:
7    email: [email protected]
8    privateKeySecretRef:
9      name: letsencrypt-dns01
10    server: https://acme-v02.api.letsencrypt.org/directory
11    solvers:
12      - dns01:
13          cloudflare:
14            apiTokenSecretRef:
15              key: api-token
16              name: cloudflare-api-token

In the Stack Builder, you drop that manifest in, draw an edge from cert-manager to it, and you're done. No App-of-Apps incantations or Helm subcharts wrapping a CRD. The AI scaffolds the CRD, suggests the wiring, and explains the solver. It just can't apply it for you.

Standardisation by Cloning, Not Copy-Paste

The fastest way to standardise an environment is to stop building it twice. Ankra's Clone to Cluster promotes a hardened stack to any environment (dev, staging, production) with its dependency graph and Helm values intact. Variables resolve per environment automatically: the same template renders with app.example.com in production and app.staging.example.com in staging. One template, many clusters, zero forked YAML.

The cloned stack also lands as a draft. A human reviews variable resolution, encryption status, and deployment order in the target cluster before anything runs.

The AI as a Teaching Layer

This is what turns "non-technical people can ship" from a marketing gimmick into reality.

AI Assistant: page-aware, Stack-aware, tied back to the same surface humans edit

The AI Assistant (Cmd+J) is page- and Stack-aware. Open a crashing pod, ask "why?", and the AI instantly references the logs, manifests, recent events, and the specific Stack change that introduced the issue. It explains the failure in plain English and links the fix directly to the offending Stack values.

Proactive Insights: Each AI insight includes a confidence score, root-cause analysis, and conversation starters. When resolved, the platform captures a before/after health snapshot, training your team's custom knowledge base.

Incident Reports: For alert-triggered events, AI triangulates timelines, pods, and logs to produce a written incident report readable by managers who have never run kubectl.

Across all surfaces, the AI documents and explains. A developer who isn't a Kubernetes expert becomes a self-sufficient operator; a non-engineer becomes a credible reviewer.

True Self-Service

Once the AI acts as a teacher and every change defaults to a draft, your access model can safely relax.

Members get scoped Kubernetes access; clients and auditors get Read-Only. A QA lead can verify a feature in staging without a kubeconfig. When deeper access is needed, Pod Terminal opens an in-browser shell to any container, gated by cluster health. DevOps can debug a teammate's local OrbStack cluster through the platform with zero visibility into their actual local filesystem.

This is what self-service actually means: a production-grade platform that doesn't require kubeconfigs, VPNs, or Slack hostage situations to drive.

Why This Can't Delete Your Database

Four guardrails compose into the answer, directly neutralizing the specific failure modes seen in recent AI incidents.

Guardrail	How It Prevents Catastrophe
Drafts	Every AI change lands as a draft or PR. Replit's agent had no separation between "propose" and "apply." Ankra's AI operates exclusively on proposals.
GitOps	Every approved change is a Git commit. Rollback is a simple `git revert`, not a postmortem with fabricated logs. The audit trail is immutable to the AI.
Scope	The AI acts strictly through Stack values and manifests, not arbitrary `kubectl` blasts or `terraform destroy` commands.
Identity	SOPS-encrypted secrets, RBAC, and sandboxes limit the AI to exactly what the human is allowed to do. It never holds raw cloud root credentials.

These aren't "trust the model" features. They are "remove the blast radius" mechanics. The agent gets to be immensely useful precisely because it is physically incapable of being reckless.

The Next Generation of Operations

The leap isn't a smarter model. The leap is a platform shape that allows a smart model to be useful without being dangerous. Independent developers, a senior DevOps agent on call, and boring safety rails underneath. That combination allows a five-person team to compete with a thirty-person platform team. It is also what makes "let the AI do it" something a CTO can actually sign off on without losing sleep.

Ankra's AI Agent has all the tools to build. The reason it won't break production is that it has to write everything down, on a draft, in your repo, with your name on the merge button.

That is the contract. Take it for a spin.

Get started: Create a free account on Ankra.

Join our community: Slack

Drafts, Approvals, GitOps: Letting an AI Build Production Without Breaking It

Drafts, Approvals, GitOps: Letting an AI Build Production Without Breaking It

The Summer the AI Deleted Production

Ankra's Contract: Propose, Review, Record

Cluster Building: Sizing, Cost, and Performance

Bring Any Cluster, or Build from Scratch

Sizing is Explicit, Not Magical

Transparent, Capped Costs

Observed Performance

Stack Building for Solutions, Tools, and CRDs

Standardisation by Cloning, Not Copy-Paste

The AI as a Teaching Layer

True Self-Service

Why This Can't Delete Your Database

The Next Generation of Operations

Related Posts

Using Cursor with the Ankra CLI as an Infrastructure Subagent

Building Pipeline Agents with the Ankra CLI