Building Pipeline Agents with the Ankra CLI

Most posts about AI agents stop at the architecture diagram. This one walks through three working agents, built from plain CI jobs and the Ankra CLI, that you can copy into a repo today.

What they do:

PR review agent. When a pull request touches your cluster definition, it validates the change against the live platform and comments on the PR with what will actually change on the cluster and what looks risky.
Deploy watcher. On merge it applies the change, waits for the rollout to finish, verifies health, and posts the result to Slack. When the deploy fails, it posts a root cause instead of a red X.
Scheduled infra watcher. Runs every 30 minutes, independent of CI. It stays silent while things are healthy and posts to Slack only when something degrades.

There is no framework here, no orchestration layer, and no vector database. Each agent is a trigger, a few deterministic CLI calls to gather facts, one LLM call for the judgment step, and a write to somewhere humans actually look: a PR thread or a Slack channel.

You also never wire up a model yourself. ankra chat is a one-shot, scriptable command, and the AI behind it already has server-side access to your cluster context: logs, events, manifests, stack history, operations, and metrics. You ask a question from CI and it answers against live state. The pipeline never holds a model API key or a kubeconfig.

Prerequisites

You need an Ankra-managed cluster, imported or provisioned, with its definition versioned in the repo the pipelines run in. That is the standard GitOps layout: a cluster.yaml plus a stacks/ directory.

Create an API token for the agent. Do this from a machine where you are logged in:

ankra login
ankra tokens create ci-agent

Store the returned token as ANKRA_API_TOKEN in your CI secret store (GitHub Actions secrets, GitLab CI/CD variables). Also add SLACK_WEBHOOK_URL, a plain Slack incoming webhook for the channel you want reports in.

The CLI installs in one line on any runner:

bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)

One gotcha before you script anything. The CLI resolves its token in this order: the --token flag, then a saved login from ankra login, then ANKRA_API_TOKEN. On a throwaway CI runner the environment variable works fine. On a self-hosted runner where someone once ran ankra login, the saved login silently wins over your CI token. If you run on shared machines, pass --token "$ANKRA_API_TOKEN" explicitly.
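One way to make the explicit token a habit is a small wrapper function. This is only a sketch: ankra_ci is a hypothetical helper name, not part of the CLI, and placing --token ahead of the subcommand is an assumption, so check ankra --help for where the flag is accepted.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: always pass the CI token explicitly so a saved
# `ankra login` on a shared runner can never take precedence.
ankra_ci() {
  # Fail fast with a clear message if the secret was not injected.
  if [ -z "${ANKRA_API_TOKEN:-}" ]; then
    echo "ANKRA_API_TOKEN is not set" >&2
    return 1
  fi
  # Assumption: --token is accepted ahead of the subcommand.
  ankra --token "$ANKRA_API_TOKEN" "$@"
}
```

With that in place, a pipeline step would call ankra_ci cluster select production instead of the bare command.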

Agent 1: PR review comments

The job triggers when a pull request touches the cluster definition. It does three things: server-side validation, an impact question against the live cluster, and a comment on the PR.

.github/workflows/infra-review.yml:

name: infra-review
on:
  pull_request:
    paths:
      - "cluster.yaml"
      - "stacks/**"

permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install the Ankra CLI
        run: bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)

      - name: Validate against the platform
        env:
          ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
        run: |
          # Without pipefail the step reports tee's exit code and a failed validation passes.
          set -o pipefail
          ankra cluster validate -f cluster.yaml --strict-secrets 2>&1 | tee validation.txt

      - name: Ask what the change does to the live cluster
        env:
          ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
        run: |
          DIFF=$(git diff "origin/${GITHUB_BASE_REF}"...HEAD -- cluster.yaml stacks/ | head -c 12000)
          ankra chat --cluster production "A pull request changes our cluster definition.
          Review the diff below against the current state of this cluster.
          Answer in three short sections:
          1. What changes on the cluster (which stacks, addons, manifests).
          2. Which workloads will restart or be replaced.
          3. Anything risky given live state right now (capacity, dependencies, ordering).
          Be specific. If nothing is risky, say so in one line.

          ${DIFF}" > impact.md

      - name: Comment on the pull request
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          {
            echo "## Infra review"
            echo
            echo "**Validation** (\`ankra cluster validate --strict-secrets\`)"
            echo '```'
            cat validation.txt
            echo '```'
            echo
            echo "**Impact against live cluster state**"
            echo
            cat impact.md
          } > comment.md
          gh pr comment "${{ github.event.pull_request.number }}" --body-file comment.md

Why each piece is there:

ankra cluster validate -f runs local structural and dependency checks first, then sends the file to the API for the checks you cannot do offline: whether the referenced charts exist in your connected Helm registries, whether there are plaintext Secrets or unencrypted addon values, and whether parents references resolve against the cluster's deployed resources. --strict-secrets turns a plaintext secret from a warning into a failed job, which is exactly what you want on a PR.
The ankra chat step is the only LLM call. The diff is in the prompt; the cluster state is on the server side. That combination is what makes the review better than a generic LLM looking at YAML. It knows the addon you are reparenting is mid-rollout, or that the node group you are scaling down is already at 85% memory.
The comment is plain gh pr comment. No app installation, no webhook receiver. The default GITHUB_TOKEN with pull-requests: write is enough.
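The head -c 12000 cap in the workflow can cut the diff in the middle of a line, and the model has no way to know it was cut. A minimal sketch of a gentler cap follows. cap_diff is a hypothetical helper, and the wording of the truncation notice is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: cap stdin at MAX bytes, cut on a line boundary,
# and append a notice so the reader knows the diff is incomplete.
cap_diff() {
  local LC_ALL=C            # count and slice bytes, not characters
  local max=${1:-12000} input kept
  input=$(cat)
  if [ "${#input}" -le "$max" ]; then
    printf '%s\n' "$input"
    return 0
  fi
  kept=${input:0:max}
  kept=${kept%$'\n'*}       # drop the final line, which may be cut mid-way
  printf '%s\n[diff truncated: showing the first %d of %d bytes]\n' \
    "$kept" "${#kept}" "${#input}"
}
```

In the workflow it would stand in for head: git diff "origin/${GITHUB_BASE_REF}"...HEAD -- cluster.yaml stacks/ | cap_diff 12000.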

A real comment from this job looks like:

Infra review

Validation (ankra cluster validate --strict-secrets)

cluster.yaml: OK
stacks: payments (2 addons, 3 manifests), monitoring (unchanged)
chart payments-api 1.4.2 found in registry harbor-internal
no plaintext secrets detected

Impact against live cluster state

Bumps the payments-api chart from 1.4.1 to 1.4.2 and raises its memory limit from 256Mi to 512Mi.

The payments-api Deployment (3 replicas, namespace payments) rolls. Nothing else restarts.

Worker group default has 1.1Gi allocatable headroom; three new 512Mi pods fit, but only just. If you also merge PR #214 (raises checkout limits), schedule them apart.

That last point, the cross-reference against live headroom, is the check a human reviewer skips at 5pm on a Friday.

The same agent on GitLab

The structure is identical. The one piece of friction is that CI_JOB_TOKEN cannot post notes on merge requests. Create a project access token with the api scope and Reporter role, store it as GITLAB_AGENT_TOKEN, and post to the notes API:

infra-review:
  stage: review
  image: alpine:3.20
  variables:
    GIT_DEPTH: "0"  # full history, so the three-dot diff can find the merge base
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      changes:
        - cluster.yaml
        - stacks/**/*
  script:
    - apk add --no-cache bash curl git jq
    # The job script runs in Alpine's sh, which has no <(...) process substitution, so pipe the installer into bash.
    - curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh | bash
    - |
      # The subshell keeps pipefail local; without it tee's exit code masks a failed validation.
      (set -o pipefail; ankra cluster validate -f cluster.yaml --strict-secrets 2>&1 | tee validation.txt)
    - |
      # Merge request pipelines do not fetch the target branch by default.
      git fetch origin "+refs/heads/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME}:refs/remotes/origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME}"
      DIFF=$(git diff "origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME}"...HEAD -- cluster.yaml stacks/ | head -c 12000)
      ankra chat --cluster production "A merge request changes our cluster definition. Review the diff against live state. What changes, what restarts, what is risky right now? ${DIFF}" > impact.md
    - |
      BODY=$(printf '## Infra review\n\n```\n%s\n```\n\n%s\n' "$(cat validation.txt)" "$(cat impact.md)")
      curl -sS --request POST \
        --header "PRIVATE-TOKEN: ${GITLAB_AGENT_TOKEN}" \
        --data-urlencode "body=${BODY}" \
        "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/merge_requests/${CI_MERGE_REQUEST_IID}/notes"

Agent 2: merge, deploy, verify, report to Slack

On merge to main, this agent applies the cluster definition, blocks until the platform finishes, verifies health, and reports. The interesting part is the failure path: instead of "pipeline failed, go look", the Slack message contains the failed operation and a root cause pulled from live state.

.github/workflows/deploy-watch.yml:

name: deploy-watch
on:
  push:
    branches: [main]
    paths:
      - "cluster.yaml"
      - "stacks/**"

permissions:
  contents: read
  pull-requests: write

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install the Ankra CLI
        run: bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)

      - name: Select the target cluster
        env:
          ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
        run: ankra cluster select production

      - name: Apply and wait
        id: apply
        env:
          ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
        run: |
          set +e
          ankra cluster apply -f cluster.yaml --wait --timeout 15m 2>&1 | tee apply.txt
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"

      - name: Verify health after deploy
        if: steps.apply.outputs.exit_code == '0'
        env:
          ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
        run: |
          ankra chat health > report.md
          ankra cluster metrics query \
            'sum(rate(http_requests_total{code=~"5.."}[5m]))' \
            -o json > error-rate.json

      - name: Investigate failure
        if: steps.apply.outputs.exit_code != '0'
        env:
          ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
        run: |
          ankra cluster operations list 2>&1 | head -n 20 > operations.txt
          ankra chat --cluster production "The apply for commit ${GITHUB_SHA::12} failed or timed out.
          Find the most recent failed operation on this cluster, look at its jobs,
          and tell me: the exact resource that failed, the error, and whether a
          retry is likely to succeed or the change itself is broken.
          Apply output for reference: $(tail -c 4000 apply.txt)" > report.md

      - name: Report to Slack
        if: always()
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          STATUS="DEPLOYED"
          if [ "${{ steps.apply.outputs.exit_code }}" != "0" ]; then STATUS="FAILED"; fi
          jq -n \
            --arg status "$STATUS" \
            --arg commit "${GITHUB_SHA::12}" \
            --arg url "${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}" \
            --rawfile report report.md \
            '{text: "[\($status)] production <\($url)|\($commit)>\n\($report)"}' \
            | curl -sS -X POST -H 'Content-type: application/json' --data @- "$SLACK_WEBHOOK_URL"

      - name: Comment on the merged PR
        if: always()
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          PR=$(gh api "repos/${GITHUB_REPOSITORY}/commits/${GITHUB_SHA}/pulls" --jq '.[0].number' || true)
          if [ -n "$PR" ]; then
            gh pr comment "$PR" --body-file report.md
          fi

Details worth knowing:

ankra cluster apply is asynchronous by default. Without --wait the command exits 0 the moment the platform accepts the change, while the actual rollout continues in the background. For a deploy watcher that exit code is useless on its own. Always pass --wait with a --timeout that matches your slowest stack.
ankra cluster select production is run once, by name, non-interactively. Bare ankra cluster select opens an interactive picker and will hang a CI job until the timeout kills it. The same applies to every command with an interactive mode: in CI, always pass names.
The failure path does not parse error strings or grep logs. It collects the apply output and the recent operations as raw facts and hands the diagnosis to ankra chat, which can also see the operation's jobs, pod events, and container logs on the server side. The one thing the prompt insists on is a verdict on "retry or broken change", because that is the decision the on-call human has to make. Make the agent answer it.
The last step closes the loop: the verdict lands back on the PR that caused it. Whoever merged gets the outcome in the same thread as the review, without watching the pipeline.
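The workflow above reads the exit code through PIPESTATUS because the apply output is piped into tee, and a pipeline otherwise reports the status of its last command. The same pattern can be packaged as a function. This is a sketch, and run_and_log is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, mirror its output into a log file,
# and return the command's own exit status instead of tee's.
run_and_log() {
  local log=$1
  shift
  "$@" 2>&1 | tee "$log"
  return "${PIPESTATUS[0]}"
}
```

In the apply step it could be used as run_and_log apply.txt ankra cluster apply -f cluster.yaml --wait --timeout 15m, followed by writing $? to "$GITHUB_OUTPUT", with set +e still in effect so the step itself keeps going.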

The Slack message on a bad day reads like this:

[FAILED] production a3f8c91b02d4

Operation op-7c21 failed at job helm-upgrade payments-api. The new pod is in CrashLoopBackOff: the 1.4.2 image reads PAYMENTS_DB_POOL_SIZE at startup and the variable is not set in this environment. Rolled-back replicas from 1.4.1 are still serving; no user impact. A retry will fail the same way because the change itself is broken. Add the variable to the stack, or revert the image bump, before re-merging.

Compare that to what a failed pipeline normally gives you: a red X and twenty minutes of kubectl describe.

Agent 3: a scheduled watcher that only speaks when something is wrong

CI-triggered agents miss everything that breaks between deploys: node pressure, a certificate that expired, an addon that drifted. This one runs on a schedule and posts to Slack only on a degraded verdict.

name: infra-watch
on:
  schedule:
    - cron: "*/30 * * * *"

jobs:
  watch:
    runs-on: ubuntu-latest
    steps:
      - name: Install the Ankra CLI
        run: bash <(curl -sL https://github.com/ankraio/ankra-cli/releases/latest/download/install.sh)

      - name: Check cluster health
        env:
          ANKRA_API_TOKEN: ${{ secrets.ANKRA_API_TOKEN }}
        run: |
          ankra cluster select production
          ankra chat --cluster production "Check this cluster now: failed or stuck operations
          in the last 30 minutes, pods not Running or Succeeded, nodes not Ready,
          and anything alarming in recent events. The FIRST LINE of your answer
          must be exactly OK or DEGRADED. Then the details." > verdict.md

      - name: Post to Slack only when degraded
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          if ! head -n 1 verdict.md | grep -qx "OK"; then
            jq -n --rawfile v verdict.md '{text: "[infra-watch] production\n\($v)"}' \
              | curl -sS -X POST -H 'Content-type: application/json' --data @- "$SLACK_WEBHOOK_URL"
          fi

The verdict-first-line trick is the load-bearing part. You cannot branch a shell script on prose, so force the model to emit a machine-checkable first line and treat anything that is not exactly OK as degraded. If the model ignores the format, the worst case is a spurious Slack message rather than a missed incident. That is the right direction to fail in.
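The same check can be written as a function that spells out the failure cases. This is a sketch: is_healthy is a hypothetical name, and tolerating a trailing carriage return is an assumption about what the model might emit, not something the CLI documents.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when the first line of the verdict
# file is exactly OK. A missing file, an empty file, or any other text
# counts as degraded, so the script fails toward posting.
is_healthy() {
  local first=""
  IFS= read -r first 2>/dev/null < "$1" || true
  first=${first%$'\r'}      # assumption: tolerate a Windows line ending
  [ "$first" = "OK" ]
}
```

The Slack step would then branch on if ! is_healthy verdict.md. A first line such as "Everything looks OK" still counts as degraded, which keeps the failure direction described above.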

If this overlaps with alerting you already have, keep the alerts. The watcher's value is the paragraph under the verdict: which operation, which pod, what the events say. Ankra's built-in alerts and webhooks cover the push-based version of this with AI analysis attached; the scheduled agent is for the checks you want to phrase yourself.

The rules that keep this safe

The agents are read-only by design, with one exception. Everything else above validates, asks, and reports. The only mutation is ankra cluster apply on merge, which is the deploy your GitOps flow was going to run anyway, gated by the PR review. If you want an agent to propose fixes, use ankra cluster draft -f. It stages every stack in the file as reviewable drafts in Ankra instead of deploying, and a human ships them from the stack builder. Never give a pipeline agent a path to mutate the cluster outside Git.

Treat the model output as untrusted text. It goes into a PR comment or a Slack message and stops there. Never pipe ankra chat output into a shell, an apply, or anything that executes. The diff you feed in came from a PR author; the output inherits that trust level.
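Even text that is only displayed can misbehave: terminal escape sequences in a log viewer, or a Slack broadcast mention such as <!channel> that pages a whole channel. A minimal sketch of a filter to run the model output through before posting follows. sanitize_report is a hypothetical helper, and the 3000-byte default cap is an arbitrary assumption, not a documented limit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make model output safer to paste into a message.
# It strips control characters (keeping tab and newline), breaks Slack's
# "<!channel>"-style broadcast syntax, and caps the size in bytes.
sanitize_report() {
  local max=${1:-3000}      # arbitrary default; tune to your channel
  LC_ALL=C tr -d '\000-\010\013-\037\177' \
    | sed 's/<!/< !/g' \
    | head -c "$max"
}
```

A step would use it as sanitize_report < report.md > report.clean.md and post the cleaned file. This narrows what the text can do when rendered. It is not a reason to start executing it.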

Scope the credentials. The CI token is created with ankra tokens create under a real account, so use a dedicated machine user with the minimum org role rather than your own login. The Slack webhook posts to one channel and can do nothing else. The GitHub side runs on the default GITHUB_TOKEN with one extra permission. There is no kubeconfig and no cloud credential anywhere in the pipeline.

Forked PRs do not get secrets. On GitHub, pull_request workflows triggered from forks run without access to repository secrets and with a read-only GITHUB_TOKEN, so the review agent has no Ankra token there and cannot comment. That is the behavior you want, since untrusted diffs should not reach your ankra chat context anyway. Run it on internal branches and have maintainers re-trigger for external contributions.

Control the noise. One comment per PR update, one Slack message per merge, silence from the watcher when healthy. The fastest way to kill an agent like this is to let it chat. If nobody would act on the message, do not send it.

Where this goes next

Everything here flows one way, from pipeline to humans. The natural next step is letting humans answer back: replying in the Slack thread with "scale it down" or "show me the logs" and having an agent act on it. That takes a long-running process rather than a CI job. The OpenClaw integration covers deploying a chat-native agent on your cluster with the Ankra CLI installed as a skill, using the same token model as above.

Start with the pipeline agents, though. They are three YAML files, one token, and an afternoon of work, and they put live-cluster context into the two places your team already looks: the PR thread and the Slack channel.
