Building Pipeline Agents with the Ankra CLI
Most posts about AI agents stop at the architecture diagram. This one walks through three working agents, built from plain CI jobs and the Ankra CLI, that you can copy into a repo today.
What they do:
- PR review agent. When a pull request touches your cluster definition, it validates the change against the live platform and comments on the PR with what will actually change on the cluster and what looks risky.
- Deploy watcher. On merge it applies the change, waits for the rollout to finish, verifies health, and posts the result to Slack. When the deploy fails, it posts a root cause instead of a red X.
- Scheduled infra watcher. Runs every 30 minutes, independent of CI. It stays silent while things are healthy and posts to Slack only when something degrades.
There is no framework here, no orchestration layer, and no vector database. Each agent is a trigger, a few deterministic CLI calls to gather facts, one LLM call for the judgment step, and a write to somewhere humans actually look: a PR thread or a Slack channel.
You also never wire up a model yourself. ankra chat is a one-shot, scriptable command, and the AI behind it already has server-side access to your cluster context: logs, events, manifests, stack history, operations, and metrics. You ask a question from CI and it answers against live state. The pipeline never holds a model API key or a kubeconfig.
Prerequisites
You need an Ankra-managed cluster, imported or provisioned, with its definition versioned in the repo the pipelines run in. That is the standard GitOps layout: a cluster.yaml plus a stacks/ directory.
Create an API token for the agent. Do this from a machine where you are logged in:
    ankra login
    ankra tokens create ci-agent
Store the returned token as ANKRA_API_TOKEN in your CI secret store (GitHub Actions secrets, GitLab CI/CD variables). Also add SLACK_WEBHOOK_URL, a plain Slack incoming webhook for the channel you want reports in.
The CLI installs in one line on any runner:
    bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)
One gotcha before you script anything. The CLI resolves its token in this order: the --token flag, then a saved login from ankra login, then ANKRA_API_TOKEN. On a throwaway CI runner the environment variable works fine. On a self-hosted runner where someone once ran ankra login, the saved login silently wins over your CI token. If you run on shared machines, pass --token "$ANKRA_API_TOKEN" explicitly.
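One way to make the explicit flag hard to forget is a small wrapper that every CI step calls instead of the bare binary. This is only a sketch: the function name ankra_ci is hypothetical, and appending --token after the subcommand arguments is an assumption about flag placement that you should check against the CLI's help output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: always pass the CI token explicitly so a saved
# login on a shared runner cannot take precedence over it.
ankra_ci() {
  # Fail fast when the secret is missing instead of falling back silently.
  if [ -z "${ANKRA_API_TOKEN:-}" ]; then
    echo "ANKRA_API_TOKEN is not set; refusing to rely on a saved login" >&2
    return 1
  fi
  # Assumption: --token is accepted after the subcommand arguments.
  ankra "$@" --token "$ANKRA_API_TOKEN"
}
```

A step would then call, for example, ankra_ci cluster validate -f cluster.yaml --strict-secrets.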
Agent 1: PR review comments
The job triggers when a pull request touches the cluster definition. It does three things: server-side validation, an impact question against the live cluster, and a comment on the PR.
.github/workflows/infra-review.yml:
    name: infra-review
    on:
      pull_request:
        paths:
          - "cluster.yaml"
          - "stacks/**"

    permissions:
      contents: read
      pull-requests: write

    jobs:
      review:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
            with:
              fetch-depth: 0

          - name: Install the Ankra CLI
            run: bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)

          - name: Validate against the platform
            env:
              ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
            run: |
              # Without pipefail the step reports tee's exit status and hides a failed validation.
              set -o pipefail
              ankra cluster validate -f cluster.yaml --strict-secrets 2>&1 | tee validation.txt

          - name: Ask what the change does to the live cluster
            env:
              ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
            run: |
              # Cap the diff so the prompt stays a manageable size.
              DIFF=$(git diff "origin/${GITHUB_BASE_REF}"...HEAD -- cluster.yaml stacks/ | head -c 12000)
              ankra chat --cluster production "A pull request changes our cluster definition.
              Review the diff below against the current state of this cluster.
              Answer in three short sections:
              1. What changes on the cluster (which stacks, addons, manifests).
              2. Which workloads will restart or be replaced.
              3. Anything risky given live state right now (capacity, dependencies, ordering).
              Be specific. If nothing is risky, say so in one line.

              ${DIFF}" > impact.md

          - name: Comment on the pull request
            env:
              GH_TOKEN: ${{ github.token }}
            run: |
              {
                echo "## Infra review"
                echo
                echo "**Validation** (\`ankra cluster validate --strict-secrets\`)"
                echo '```'
                cat validation.txt
                echo '```'
                echo
                echo "**Impact against live cluster state**"
                echo
                cat impact.md
              } > comment.md
              gh pr comment "${{ github.event.pull_request.number }}" --body-file comment.md
Why each piece is there:
- ankra cluster validate -f runs local structural and dependency checks first, then sends the file to the API for the checks you cannot do offline: whether the referenced charts exist in your connected Helm registries, whether there are plaintext Secrets or unencrypted addon values, and whether parents references resolve against the cluster's deployed resources. --strict-secrets turns a plaintext secret from a warning into a failed job, which is exactly what you want on a PR.
- The ankra chat step is the only LLM call. The diff is in the prompt; the cluster state is on the server side. That combination is what makes the review better than a generic LLM looking at YAML. It knows the addon you are reparenting is mid-rollout, or that the node group you are scaling down is already at 85% memory.
- The comment is plain gh pr comment. No app installation, no webhook receiver. The default GITHUB_TOKEN with pull-requests: write is enough.
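The head -c 12000 cap in the workflow cuts the diff silently, so the model cannot tell a complete diff from a truncated one. A minimal sketch of a helper that makes the cut visible; the function name cap_diff and the marker wording are illustrative and not part of the workflow above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a diff on stdin, print at most $1 characters,
# and append a marker line when anything was dropped.
cap_diff() {
  local limit="${1:-12000}" input
  input="$(cat)"
  if [ "${#input}" -le "$limit" ]; then
    printf '%s\n' "$input"
  else
    # Keep the first $limit characters, then say how much was cut.
    printf '%s\n' "${input:0:limit}"
    printf '[diff truncated: showing %d of %d characters]\n' "$limit" "${#input}"
  fi
}
```

In the review step this would replace the bare head call, for example: git diff "origin/${GITHUB_BASE_REF}"...HEAD -- cluster.yaml stacks/ | cap_diff 12000. The marker gives the model a reason to say "the diff was cut off" instead of reviewing half a change as if it were whole.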
A real comment from this job looks like:
Infra review
Validation (ankra cluster validate --strict-secrets)
    cluster.yaml: OK
    stacks: payments (2 addons, 3 manifests), monitoring (unchanged)
    chart payments-api 1.4.2 found in registry harbor-internal
    no plaintext secrets detected
Impact against live cluster state
- Bumps the payments-api chart from 1.4.1 to 1.4.2 and raises its memory limit from 256Mi to 512Mi.
- The payments-api Deployment (3 replicas, namespace payments) rolls. Nothing else restarts.
- Worker group default has 1.1Gi allocatable headroom; three new 512Mi pods fit, but only just. If you also merge PR #214 (raises checkout limits), schedule them apart.
That last point, the cross-reference against live headroom, is the check a human reviewer skips at 5pm on a Friday.
The same agent on GitLab
The structure is identical. The one piece of friction is that CI_JOB_TOKEN cannot post notes on merge requests. Create a project access token with the api scope and Reporter role, store it as GITLAB_AGENT_TOKEN, and post to the notes API:
    infra-review:
      stage: review
      image: alpine:3.20
      rules:
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"
          changes:
            - cluster.yaml
            - stacks/**/*
      script:
        - apk add --no-cache bash curl git jq
        # The job script may run under Alpine's busybox sh, which lacks process
        # substitution, so hand the installer line to bash explicitly.
        - bash -c 'bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)'
        - ankra cluster validate -f cluster.yaml --strict-secrets 2>&1 | tee validation.txt
        - |
          DIFF=$(git diff "origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME}"...HEAD -- cluster.yaml stacks/ | head -c 12000)
          ankra chat --cluster production "A merge request changes our cluster definition. Review the diff against live state. What changes, what restarts, what is risky right now? ${DIFF}" > impact.md
        - |
          BODY=$(printf '## Infra review\n\n```\n%s\n```\n\n%s\n' "$(cat validation.txt)" "$(cat impact.md)")
          curl -sS --request POST \
            --header "PRIVATE-TOKEN: ${GITLAB_AGENT_TOKEN}" \
            --data-urlencode "body=${BODY}" \
            "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/merge_requests/${CI_MERGE_REQUEST_IID}/notes"
Agent 2: merge, deploy, verify, report to Slack
On merge to main, this agent applies the cluster definition, blocks until the platform finishes, verifies health, and reports. The interesting part is the failure path: instead of "pipeline failed, go look", the Slack message contains the failed operation and a root cause pulled from live state.
.github/workflows/deploy-watch.yml:
    name: deploy-watch
    on:
      push:
        branches: [main]
        paths:
          - "cluster.yaml"
          - "stacks/**"

    permissions:
      contents: read
      pull-requests: write

    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - name: Install the Ankra CLI
            run: bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)

          - name: Select the target cluster
            env:
              ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
            run: ankra cluster select production

          - name: Apply and wait
            id: apply
            env:
              ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
            run: |
              # Do not abort on failure; record the apply's own exit code for later steps.
              set +e
              ankra cluster apply -f cluster.yaml --wait --timeout 15m 2>&1 | tee apply.txt
              echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"

          - name: Verify health after deploy
            if: steps.apply.outputs.exit_code == '0'
            env:
              ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
            run: |
              ankra chat health > report.md
              ankra cluster metrics query \
                'sum(rate(http_requests_total{code=~"5.."}[5m]))' \
                -o json > error-rate.json

          - name: Investigate failure
            if: steps.apply.outputs.exit_code != '0'
            env:
              ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
            run: |
              ankra cluster operations list 2>&1 | head -n 20 > operations.txt
              # Hand both raw facts (recent operations and apply output) to the model.
              ankra chat --cluster production "The apply for commit ${GITHUB_SHA::12} failed or timed out.
              Find the most recent failed operation on this cluster, look at its jobs,
              and tell me: the exact resource that failed, the error, and whether a
              retry is likely to succeed or the change itself is broken.
              Recent operations for reference: $(cat operations.txt)
              Apply output for reference: $(tail -c 4000 apply.txt)" > report.md

          - name: Report to Slack
            if: always()
            env:
              SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
            run: |
              STATUS="DEPLOYED"
              if [ "${{ steps.apply.outputs.exit_code }}" != "0" ]; then STATUS="FAILED"; fi
              jq -n \
                --arg status "$STATUS" \
                --arg commit "${GITHUB_SHA::12}" \
                --arg url "${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}" \
                --rawfile report report.md \
                '{text: "[\($status)] production <\($url)|\($commit)>\n\($report)"}' \
                | curl -sS -X POST -H 'Content-type: application/json' --data @- "$SLACK_WEBHOOK_URL"

          - name: Comment on the merged PR
            if: always()
            env:
              GH_TOKEN: ${{ github.token }}
            run: |
              # Look up the PR that introduced this commit, if there is one.
              PR=$(gh api "repos/${GITHUB_REPOSITORY}/commits/${GITHUB_SHA}/pulls" --jq '.[0].number' || true)
              if [ -n "$PR" ]; then
                gh pr comment "$PR" --body-file report.md
              fi
Details worth knowing:
- ankra cluster apply is asynchronous by default. Without --wait the command exits 0 the moment the platform accepts the change, while the actual rollout continues in the background. For a deploy watcher that exit code is useless on its own. Always pass --wait with a --timeout that matches your slowest stack.
- ankra cluster select production is run once, by name, non-interactively. Bare ankra cluster select opens an interactive picker and will hang a CI job until the timeout kills it. The same applies to every command with an interactive mode: in CI, always pass names.
- The failure path does not parse error strings or grep logs. It collects the apply output and the recent operations as raw facts and hands the diagnosis to ankra chat, which can also see the operation's jobs, pod events, and container logs on the server side. The one thing the prompt insists on is a verdict on "retry or broken change", because that is the decision the on-call human has to make. Make the agent answer it.
- The last step closes the loop: the verdict lands back on the PR that caused it. Whoever merged gets the outcome in the same thread as the review, without watching the pipeline.
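The set +e plus PIPESTATUS pattern in the apply step is easy to get subtly wrong, because a pipeline's exit status is normally that of its last command, which here is tee. A sketch of the same idea as a reusable function; run_and_log is a hypothetical name, not something the CLI provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, mirror its combined output to a log
# file, and return the command's own exit status rather than tee's.
run_and_log() {
  local log="$1"
  shift
  "$@" 2>&1 | tee "$log"
  # PIPESTATUS[0] is the status of the first command in the pipeline above.
  return "${PIPESTATUS[0]}"
}

# Usage inside a step (call it with || so a non-zero status does not abort
# a shell running with -e):
#   code=0
#   run_and_log apply.txt ankra cluster apply -f cluster.yaml --wait --timeout 15m || code=$?
#   echo "exit_code=${code}" >> "$GITHUB_OUTPUT"
```

The log file still receives everything the command printed, so the failure path can quote it in the prompt exactly as the workflow does.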
The Slack message on a bad day reads like this:
[FAILED] production a3f8c91b02d4
Operation op-7c21 failed at job helm-upgrade payments-api. The new pod is in CrashLoopBackOff: the 1.4.2 image reads PAYMENTS_DB_POOL_SIZE at startup and the variable is not set in this environment. Rolled-back replicas from 1.4.1 are still serving; no user impact. A retry will fail the same way because the change itself is broken. Add the variable to the stack, or revert the image bump, before re-merging.
Compare that to what a failed pipeline normally gives you: a red X and twenty minutes of kubectl describe.
Agent 3: a scheduled watcher that only speaks when something is wrong
CI-triggered agents miss everything that breaks between deploys: node pressure, a certificate that expired, an addon that drifted. This one runs on a schedule and posts to Slack only on a degraded verdict.
    name: infra-watch
    on:
      schedule:
        - cron: "*/30 * * * *"

    jobs:
      watch:
        runs-on: ubuntu-latest
        steps:
          - name: Install the Ankra CLI
            run: bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)

          - name: Check cluster health
            env:
              ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
            run: |
              ankra cluster select production
              ankra chat --cluster production "Check this cluster now: failed or stuck operations
              in the last 30 minutes, pods not Running or Succeeded, nodes not Ready,
              and anything alarming in recent events. The FIRST LINE of your answer
              must be exactly OK or DEGRADED. Then the details." > verdict.md

          - name: Post to Slack only when degraded
            env:
              SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
            run: |
              # Anything other than an exact first line of OK counts as degraded.
              if ! head -n 1 verdict.md | grep -qx "OK"; then
                jq -n --rawfile v verdict.md '{text: "[infra-watch] production\n\($v)"}' \
                  | curl -sS -X POST -H 'Content-type: application/json' --data @- "$SLACK_WEBHOOK_URL"
              fi
The verdict-first-line trick is the load-bearing part. You cannot branch a shell script on prose, so force the model to emit a machine-checkable first line and treat anything that is not exactly OK as degraded. If the model ignores the format, the worst case is a spurious Slack message rather than a missed incident. That is the right direction to fail in.
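The same check can be written as a small function that is slightly more forgiving about whitespace while still failing loud. This is a sketch under assumptions: verdict_of is a hypothetical name, and tolerating stray spaces or a carriage return around OK is a design choice of this example, not something the original workflow does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a verdict file by its first line.
# Prints OK only when the first line is "OK" once surrounding whitespace
# (including a trailing carriage return) is removed. Anything else,
# including an empty or missing file, is reported as DEGRADED.
verdict_of() {
  local first=""
  if [ ! -r "$1" ]; then
    echo DEGRADED
    return 0
  fi
  # read returns non-zero when the final newline is missing; the text read
  # so far is still stored in $first, so that status is ignored.
  IFS= read -r first < "$1" || true
  # Trim leading, then trailing, whitespace.
  first="${first#"${first%%[![:space:]]*}"}"
  first="${first%"${first##*[![:space:]]}"}"
  if [ "$first" = "OK" ]; then echo OK; else echo DEGRADED; fi
}
```

The Slack step would then branch on the result, for example: if [ "$(verdict_of verdict.md)" != "OK" ]; then ...post...; fi. A first line such as "Everything looks OK" still counts as DEGRADED, which keeps the failure direction described above.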
If this overlaps with alerting you already have, keep the alerts. The watcher's value is the paragraph under the verdict: which operation, which pod, what the events say. Ankra's built-in alerts and webhooks cover the push-based version of this with AI analysis attached; the scheduled agent is for the checks you want to phrase yourself.
The rules that keep this safe
The agents are read-only by design. Everything above validates, asks, and reports. The only mutation is ankra cluster apply on merge, which is the deploy your GitOps flow was going to run anyway, gated by the PR review. If you want an agent to propose fixes, use ankra cluster draft -f. It stages every stack in the file as reviewable drafts in Ankra instead of deploying, and a human ships them from the stack builder. Never give a pipeline agent a path to mutate the cluster outside Git.
Treat the model output as untrusted text. It goes into a PR comment or a Slack message and stops there. Never pipe ankra chat output into a shell, an apply, or anything that executes. The diff you feed in came from a PR author; the output inherits that trust level.
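For Slack in particular, "stops there" is worth enforcing in code, because Slack gives meaning to sequences such as <!channel> or <https://...|label> inside message text. A minimal sketch that caps the length and escapes the three characters Slack treats as control characters (&, <, >) before the text reaches jq. The function name slack_safe and the 3000-character default are illustrative choices, not Ankra or Slack requirements.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make model output inert before posting it to Slack.
# Reads text on stdin; $1 is the maximum number of characters to keep.
slack_safe() {
  local limit="${1:-3000}" text
  text="$(cat)"
  # Truncate before escaping so an entity is never cut in half.
  if [ "${#text}" -gt "$limit" ]; then
    text="${text:0:limit} [truncated]"
  fi
  # Escape & first so the entities added for < and > are not re-escaped.
  printf '%s\n' "$text" \
    | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}
```

A step could write a Slack-only copy with slack_safe < report.md > report.slack.txt and pass that file to jq, leaving report.md untouched for the PR comment. The link to the commit is still built by the workflow itself, so only the model's text is neutralized.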
Scope the credentials. The CI token is created with ankra tokens create under a real account, so use a dedicated machine user with the minimum org role rather than your own login. The Slack webhook posts to one channel and can do nothing else. The GitHub side runs on the default GITHUB_TOKEN with one extra permission. There is no kubeconfig and no cloud credential anywhere in the pipeline.
Forked PRs do not get secrets. On GitHub, pull_request workflows triggered from forks run without access to repository secrets, so the Ankra steps have no token, fail, and no review is posted. That is the behavior you want, since untrusted diffs should not reach your ankra chat context anyway. Run it on internal branches and have maintainers re-trigger for external contributions.
Control the noise. One comment per PR update, one Slack message per merge, silence from the watcher when healthy. The fastest way to kill an agent like this is to let it chat. If nobody would act on the message, do not send it.
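"One comment per PR update" can be taken a step further by keeping a single comment and overwriting it on every push. A sketch using gh pr comment --edit-last, which edits the most recent comment written by the authenticated identity; the fallback assumes that command exits non-zero when no such comment exists yet. The function name upsert_pr_comment is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep one agent comment per pull request.
# $1 is the PR number, $2 a file holding the comment body.
upsert_pr_comment() {
  local pr="$1" body_file="$2"
  # Try to overwrite this identity's latest comment; if that fails
  # (assumed: because there is none yet), create the first one.
  gh pr comment "$pr" --edit-last --body-file "$body_file" 2>/dev/null \
    || gh pr comment "$pr" --body-file "$body_file"
}
```

One caveat: every workflow that uses the default GITHUB_TOKEN comments as the same bot identity, so if another job also comments on the PR, --edit-last may overwrite that job's comment instead. In that case a dedicated token or a marker-based lookup is the safer route.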
Where this goes next
Everything here flows one way, from pipeline to humans. The natural next step is letting humans answer back: replying in the Slack thread with "scale it down" or "show me the logs" and having an agent act on it. That takes a long-running process rather than a CI job. The OpenClaw integration covers deploying a chat-native agent on your cluster with the Ankra CLI installed as a skill, using the same token model as above.
Start with the pipeline agents, though. They are three YAML files, one token, and an afternoon of work, and they put live-cluster context into the two places your team already looks: the PR thread and the Slack channel.