AI Coding Governance: Same Thesis, Bigger Surface

Miguel Martinez
Four pillars for governing AI coding agents: adoption visibility, framework-shaped readiness controls, PR-level developer integration, and AI Session Score for the patterns no policy catches.

It’s a Tuesday. You’re in a planning meeting. Your CISO leans back from the laptop and asks the simplest question of the year:

“How much of our code is being written by AI?”

Silence.

You all know the answer is “a lot.” You can feel it. Every PR description these days has the cadence of an LLM somewhere upstream. Half the team is on Claude Code, the other half is on Cursor, and someone keeps promising they’ll “try Codex this sprint.” But the actual number? Nobody knows.

Then the follow-ups come, each worse than the last:

  • Which models are people running? Are the skills and rules approved, or copy-pasted from a thread?
  • How are people running their agents? Sandboxed, or with credentials, network, and untrusted content all in one session — the lethal trifecta?
  • Did any session this week touch .env or a secret? Would we know if something left?
  • Did the agent volunteer to bypass a pre-commit hook?
  • Did someone actually verify that fix in last week’s PR, or did we just trust the green check?
  • Are we even sure the diff in this PR matches what the agent said it did?

You realize the real problem. This isn’t shadow IT in the usual sense. The agents are authorized; they’re just invisible. Your CI/CD telemetry has nothing to say about them, because they ran upstream of CI, on developer laptops, in someone’s terminal at 11pm.

Same governance question we’ve always asked about CI/CD. The pipeline just grew a new front door. And it can barely hold.

The Same Thesis, A Bigger Surface

Chainloop has always been a control plane for software delivery, built for regulated industries, security-critical products, and enterprises where “trust me” isn’t an answer when an auditor or a customer asks what changed. Instrument the tools you already use. Collect tamper-evident evidence. Run policies. Gate releases.

The recent supply chain posts (Trivy, LiteLLM) walked through what happens when the tools inside CI get compromised. Same lesson in both: best practices aren’t a checkbox, they’re continuous verification.

AI agents are the next link in that chain, one step further upstream, and at a different scale. A paradigm shift, not a tooling upgrade. The thesis still holds; the control plane just has more ground to cover, and that ground is where the work actually gets decided: in the agent’s session.

Our previous post on agent governance introduced the two evidence types we use to make that ground visible: the agent’s static configuration, captured at attestation time, and the full session trace, captured on every git push.

Both flow in through chainloop trace, a CLI you install once per repo. It hooks into the agent at session time and into git at push time, so every AI session lands in the platform as signed evidence without anyone changing how they work. The data also gets correlated with your pull requests, so every AI-assisted PR is tied back to the sessions that produced it. From there, Chainloop has something to show, evaluate, and gate on.

This post is about what you actually do with that evidence once it’s flowing.

Four Pillars

AI coding governance in Chainloop has four parts:

  1. Adoption visibility. See who’s using what, where, how much. Org-wide.
  2. Compliance and enforcement. Frameworks, controls, and policies over tools, models, configs, and sessions.
  3. Developer integration. PR-level surfaces, merge gates, continuous attestation.
  4. Chainloop AI Session Score. Deep analysis of agent behavior during the coding session.

One thing to call out before we go any further. These pillars compose; you don’t have to adopt all of them. A team can run the dashboard alone. Or run policies without it. Or wire up the PR check without enforcing org-wide allowlists yet. Compliance is a destination some teams are headed toward, not a precondition for getting started.

1. See Who’s Using AI, And How Much

You can’t govern what you can’t see. So before anything else, let’s just look.

The AI Coding dashboard rolls up every AI session captured across your org into one view: total sessions, active users, AI-assisted PRs, AI-authored line share, top users, model breakdown. One picture spanning your organization, your products, your developers, and the agents they’re running.

That’s enough to answer the questions from the meeting:

  • Is the eng org actually adopting AI? Look at the trend.
  • Which teams? Which users? Top Users panel.
  • Which models? Approved or not? Model Breakdown.
  • What’s the cost trend? Same Model Breakdown card.

No surveys. No “please reply to this thread with what you’re using.” No Slack archeology.

One caveat: the dashboard only shows what’s flowing in. If half the org isn’t sending sessions yet, the picture lies. Which leads us to enforcement.

2. Frameworks, Controls, And Policies

Visibility alone doesn’t tell you whether your team is using AI safely.

Every session and every config sweep lands in Chainloop as just another piece of evidence, like an SBOM, a vulnerability report, or a container image. Every governance mechanism that already works on evidence works on these too.

Auditors and compliance leads don’t read Rego. They read frameworks. Chainloop’s governance model has three layers:

  • Framework. The named posture. SLSA. NIST SSDF. Chainloop Best Practices. And now, AI Readiness.
  • Controls. The named requirements inside a framework. “Approved models.” “No dangerous commands.” “Signed commits on AI-assisted changes.”
  • Policies. The deterministic checks or Rego that actually evaluate evidence and report a verdict for each control.

Same shape we already use for source-code and artifact governance, now applied to AI sessions and configs.

Built-in policies cover most of those controls; custom Rego covers the rules that are yours alone. The policy reference walks through each one.
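
For the custom case, here’s a minimal Rego sketch of an approved-models check. The input shape (input.session.model) and the violations convention are assumptions for illustration, not Chainloop’s actual evidence schema or policy contract; the policy reference documents both.

package ai.approved_models

import rego.v1

# Hypothetical allowlist; swap in whatever your org approves.
approved := {"claude-sonnet-4-5", "gpt-5-codex"}

# Assumed input shape: session evidence exposing a "model" field.
violations contains msg if {
    not approved[input.session.model]
    msg := sprintf("model %q is not on the approved list", [input.session.model])
}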

Evidence is signed and tamper-evident. Controls are framework-shaped, so a compliance lead can read them without learning Rego.

3. Where Governance Meets The Developer

Policies that fire in a backend somewhere are no good if the developer never sees them. The third pillar is about putting governance where the work actually happens, on the PR.

Connect your repositories to Chainloop and run chainloop trace in them, and every AI-assisted PR gets correlated with the sessions that produced it: which sessions contributed, what each one did, files touched, lines changed, and whether any tripped a policy check. The data sits on the PR itself, alongside the diff that reviewers were already going to read.

Two surfaces.

The PR summary comment. Every AI-assisted PR gets one. An aggregate table per session (agent, model, AI Session Score, attribution %, files, lines, tokens, cost, duration), plus a per-session file breakdown listing exactly which files the agent touched and the line ranges.
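
To fix the shape in your head, here’s a hypothetical rendering of that aggregate table, with invented numbers:

Agent        Model              Score  Attribution  Files  Lines     Tokens  Cost   Duration
Claude Code  claude-sonnet-4-5  82     64%          7      +312/-45  148k    $1.86  38m
Cursor       gpt-5              74     21%          2      +40/-8    31k     $0.42  11m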

The Chainloop AI Policies check run. When a session violates an attached policy, Chainloop publishes a GitHub check on the head commit (success, neutral, or failure). AI policy compliance becomes a required merge check.

Enforcement comes in three layers:

  • Push-time. Fail the local git push if evidence can’t be produced.
  • Platform-side. Block the push when a policy fails.
  • Missing-session. Flag commits whose referenced sessions never landed.

For compliance, every merge carries attested evidence of what the agent did. No quarterly surveys. The audit trail is the same artifact reviewers already see, and drift gets caught the next time the build runs.
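
Making the check a required merge gate is standard GitHub branch protection. A sketch using the GitHub CLI, assuming the check context string matches the check run name above (confirm it in your repo’s checks tab before enforcing) and with OWNER/REPO as placeholders; GitHub’s endpoint requires the other protection fields to be sent explicitly.

# Require the Chainloop AI Policies check before merging to main.
cat > protection.json <<'EOF'
{
  "required_status_checks": {
    "strict": true,
    "checks": [{ "context": "Chainloop AI Policies" }]
  },
  "enforce_admins": false,
  "required_pull_request_reviews": null,
  "restrictions": null
}
EOF
gh api -X PUT repos/OWNER/REPO/branches/main/protection --input protection.json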

4. Chainloop AI Session Score

[Screenshot: AI Session Score showing an 82% quality assessment with per-criterion scores and findings]

Policies catch rule violations. Allowlists, budgets, banned commands, signature checks. They do that well.

But most of what goes wrong with AI-generated code isn’t a policy violation. It’s everything else. The patterns nobody writes a rule for, but everyone has lived:

  • Premature done. The agent declared the change finished. It wasn’t.
  • Claim-vs-reality miss. The summary sounds right. The diff is subtly different.
  • Silent error swallow. A try/except quietly maps an unknown error to a known meaning.
  • Volunteered bypass. The pre-commit hook failed, so the agent disabled it.
  • Drive-by fix. The diff has a feature plus a renamed variable in an unrelated file the user never asked about.
  • Plan landed too late. A “plan” appeared in the session, but only after the code was already written. The plan documented work that had happened. It didn’t guide work that was about to happen.

You’ve seen these. We’ve seen these. The reviewer staring at a clean-looking PR, sensing something is off, unable to put a finger on it.

Code review tools see the diff. The diff alone isn’t enough. A change ships with context: the transcript that produced it, the files touched, the lines changed, the tool calls along the way, and any AI bot review comments left on the PR. You can’t catch a “premature done” by reading the diff, because the diff doesn’t say “I’m done” three times before the work was actually done. The transcript does.

This is what AI Session Score looks at.

It’s a per-PR confidence signal across six criteria, each evaluated by its own LLM judge:

  • Context & Planning. Was the AI set up to succeed, or set up to wing it?
  • Alignment. Did the AI stay on the task that was actually asked?
  • Scope Discipline. Did the change stay in scope, or did the agent feature-creep?
  • Solution Quality. Is this a real fix, or a workaround that masks the problem?
  • Verification. Was the change actually validated?
  • User Trust Signal. What does the user’s behavior across the session tell us?

A final aggregator rolls those six verdicts into a summary line, a Red/Yellow/Green flag, a 0-100 score, and an items list. A few examples of items the system surfaces, written the way they actually appear:

  • Verification: Yellow. Tests added but only the happy path. The new error branch was never exercised. Consider hitting the endpoint with the malformed input the new branch handles.
  • Solution Quality: Red. The agent ran into a failing pre-commit hook and disabled it via --no-verify rather than fixing the underlying lint error. The bypass landed in the diff.
  • Scope Discipline: Yellow. The feature change in auth/middleware.go is on-task. The rename in billing/invoice.go is unrelated. Consider splitting.

The items list is what reviewers actually act on. The number is a band; the flag is for triage.
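
If you consume the verdict programmatically, picture the aggregate as something like the following. The field names are invented for illustration, not Chainloop’s actual schema; the score reference linked below has the real one.

{
  "summary": "On-task feature change; verification covers only the happy path.",
  "flag": "yellow",
  "score": 82,
  "items": [
    {
      "criterion": "Verification",
      "flag": "yellow",
      "finding": "Tests added but only the happy path; the new error branch was never exercised."
    },
    {
      "criterion": "Scope Discipline",
      "flag": "yellow",
      "finding": "The rename in billing/invoice.go is unrelated to the task. Consider splitting."
    }
  ]
}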

How To Start

The pillars compose, so the on-ramp does too. Three commitments, in the order they get harder:

  • Just visibility. Run chainloop trace init in your repos, watch the dashboard fill up.
  • Add readiness. Attach built-in policies to your workflow contract.
  • Add gates. Make the merge check required and the trace push mandatory.

The first one looks like this:

chainloop trace init --project my-project
# work as usual, commit, push
git push origin my-branch
# session attested automatically
# PR comment + check run appear when the PR opens

Run it once per repo and commit the config; the rest of the team is onboarded automatically.
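
Step two, attaching built-in policies, happens in the workflow contract. Treat this as a sketch under loud assumptions: the policy refs below are invented placeholders, not the actual built-in names, and the exact schema may differ; the policy reference covers the real ones.

schemaVersion: v1
policies:
  attestation:
    # Hypothetical refs, for illustration only.
    - ref: approved-models
    - ref: no-dangerous-commands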

Full guide: Chainloop Trace. Conceptual overview: AI Coding Sessions. Score reference: AI Session Score.

Closing

Bring the meeting back.

“How much of our code is being written by AI?”

Now you can answer it. And the next ten questions: which models, which tools, which sessions touched secrets, which PRs were verified, which sessions failed a policy, which changes a reviewer should look harder at.

The agents are writing your code today. Tomorrow they’ll be orchestrating each other. Software factories are heading toward a mesh of agents kicking off other agents, with humans somewhere in the loop but not at the keyboard for every change. In that world, “who did this?” becomes the load-bearing question of the pipeline. Identity and provenance become the chain itself.

The good news: the same instrumentation that captures a developer’s session today captures an agent’s session tomorrow. The Trivy and LiteLLM posts argued that supply chain security isn’t a checkbox, it’s continuous verification. AI doesn’t change that. It just moves the boundary upstream, into the transcript today and the orchestrators tomorrow. Same thesis. Bigger surface.

If you want to try it out or have feedback, we’d love to hear from you. Reach out at chainloop.dev.
