Anatomy of a Good Verification Report
What signals help you decide whether to merge: facts vs judgments, risk scoring, blast radius, spec alignment, and structured evidence
The merge decision
Every code change ends with a decision: merge or don't. In a pre-AI world, that decision was backed by the reviewer's mental model - they wrote similar code, they knew the system, they could reason about risk from experience.
With AI-generated code, that mental model often doesn't exist. The reviewer didn't write it. They may not have full context on the area being changed. The code looks clean, the tests pass, and the diff is large. The decision defaults to gut feel, time pressure, or rubber-stamping.
A good verification report replaces gut feel with structured signals. Not a pass/fail score. Not a list of style nits. A layered breakdown that tells you what changed, what might be risky, and what couldn't be verified - so you can make the merge decision with evidence instead of hope.
Separating what you know from what you suspect
The most important property of a verification report is the separation between facts and judgments.
Facts are what changed. A function signature was modified. Error handling was added. An export was removed. A new dependency was imported. These are structural observations from the diff. They're deterministic - the same diff always produces the same facts. They can't be wrong, they can't be gamed, and they don't depend on any model's opinion.
Judgments are what might be risky. The removed export could break three downstream consumers. The new error handler catches exceptions but swallows them silently. The signature change may violate the API contract. These are inferences - they require interpretation, and they can be wrong. A good report makes this distinction explicit. Every judgment should carry evidence (what in the diff triggered it) and a confidence indicator (how certain the analysis is).
Unknowns are what can't be verified. No concurrency handling observed for shared state access. No test coverage for the new branch. No validation on the new input parameter. These aren't findings - they're gaps. The report is saying: "I looked for this and didn't find it. You should check." This category is where the most dangerous issues hide, because they represent absent code, not wrong code.
When a report mixes these three categories - or worse, presents everything as equally certain - the reviewer can't calibrate their attention. They either treat everything as a warning (noise) or nothing as a warning (danger).
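To make the separation concrete, here is a minimal sketch of what such a schema could look like, assuming a Python tooling context. The class and field names are hypothetical, not a prescribed format; the point is that facts, judgments, and unknowns live in separate fields rather than one flat list of "issues".

```python
from dataclasses import dataclass, field

# Hypothetical schema sketch: the separation is the point, not the exact fields.

@dataclass
class Fact:
    """Deterministic structural observation from the diff."""
    file: str
    description: str          # e.g. "export removed", "signature changed"

@dataclass
class Judgment:
    """Inference about risk; it can be wrong, so it carries evidence and confidence."""
    description: str
    evidence: str             # what in the diff triggered it
    confidence: str           # "high" | "medium" | "low"

@dataclass
class Unknown:
    """Something the analysis looked for and did not find."""
    description: str          # e.g. "no test coverage observed for the new branch"

@dataclass
class VerificationReport:
    facts: list[Fact] = field(default_factory=list)
    judgments: list[Judgment] = field(default_factory=list)
    unknowns: list[Unknown] = field(default_factory=list)
```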
Risk signals that actually matter
Not all changes carry the same risk. A formatting change and an authentication change don't deserve the same review attention. A good report scores risk based on what actually matters for production safety.
Severity tells you how bad it could be. A change that touches authentication, payment processing, or data access boundaries is high severity regardless of how small the diff is. A change that affects performance or observability is lower severity. The classification should follow the consequences of getting it wrong, not the size of the change.
Confidence tells you how certain the analysis is. A finding backed by strong evidence in the diff (a removed null check, a changed return type) is high confidence. A finding that requires assumptions about runtime behavior is lower confidence. When the reviewer sees a high-severity, high-confidence finding, they know exactly where to focus. When they see a low-confidence finding, they know to verify it themselves before acting on it.
Categories tell you the domain. Is this a security concern? A data contract change? An error handling modification? A logic correctness issue? Categorization lets different reviewers focus on their expertise. The security engineer looks at auth findings. The data engineer looks at contract changes. Nobody needs to read everything.
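A sketch of how those three dimensions might be encoded, again with illustrative names; the severity buckets and category list below are assumptions, not a fixed taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical scoring dimensions; names and levels are illustrative only.

class Severity(Enum):
    HIGH = "high"      # auth, payments, data-access boundaries
    MEDIUM = "medium"  # logic correctness, error handling
    LOW = "low"        # performance, observability

class Confidence(Enum):
    HIGH = "high"      # backed by direct evidence in the diff
    LOW = "low"        # depends on assumptions about runtime behavior

class Category(Enum):
    SECURITY = "security"
    DATA_CONTRACT = "data_contract"
    ERROR_HANDLING = "error_handling"
    LOGIC = "logic"

@dataclass
class Finding:
    severity: Severity
    confidence: Confidence
    category: Category
    evidence: str      # file, line, and the diff hunk that triggered it

def needs_immediate_attention(finding: Finding) -> bool:
    # High severity plus high confidence tells the reviewer exactly where to focus.
    return finding.severity is Severity.HIGH and finding.confidence is Confidence.HIGH
```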
Blast radius: the signal most tools miss
A three-line change to a utility function can break fifty files. A two-hundred-line change to a leaf component affects nothing else. Size of the diff is a terrible proxy for risk.
Blast radius - how many other files depend on the changed code - is one of the strongest signals for merge decisions. If the changed file is imported by thirty other modules, even a minor change demands careful review. If the changed file is a standalone script with no dependents, the risk is contained.
The most dangerous changes are small diffs to highly connected files. They look harmless in the PR but have outsized impact. Without blast radius information, reviewers have no way to know.
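Blast radius is also cheap to approximate. A rough sketch, assuming a Python codebase and naive text matching; a real tool would resolve the import graph properly (relative imports, re-exports, aliases).

```python
from pathlib import Path

def blast_radius(changed_module: str, repo_root: str = ".") -> int:
    """Count files that appear to import the changed module (rough approximation)."""
    dependents = 0
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        if f"import {changed_module}" in text or f"from {changed_module} import" in text:
            dependents += 1
    return dependents

# A three-line diff in a module with thirty dependents deserves more scrutiny
# than a two-hundred-line diff in a module with none.
```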
Spec alignment: did the code do what it was supposed to?
When an AI agent implements a feature from a spec or a task description, there are two questions: did it write correct code, and did it implement the right thing?
Most review processes only address the first question. They check whether the code is well-structured, handles errors, and doesn't introduce security issues. They don't check whether the implementation actually matches what was requested.
Spec alignment closes that gap. Given the requirements document and the diff, a good report can tell you: which requirements were addressed, which were partially addressed, which were contradicted, and which weren't touched at all. This is particularly valuable with AI-generated code, where the agent may confidently implement something that diverges from the spec in ways that aren't obvious from reading the code alone.
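One possible way to encode that alignment, with a hypothetical status taxonomy and field names chosen for illustration:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical alignment statuses; the exact taxonomy is illustrative.

class SpecStatus(Enum):
    ADDRESSED = "addressed"
    PARTIAL = "partially_addressed"
    CONTRADICTED = "contradicted"
    NOT_TOUCHED = "not_touched"

@dataclass
class SpecAlignment:
    requirement: str   # e.g. "retry failed uploads up to three times"
    status: SpecStatus
    evidence: str      # where in the diff (or nowhere) this was determined

def needs_human_decision(alignments: list[SpecAlignment]) -> list[SpecAlignment]:
    # Contradicted or untouched requirements can't be resolved by reading the code alone.
    return [a for a in alignments
            if a.status in (SpecStatus.CONTRADICTED, SpecStatus.NOT_TOUCHED)]
```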
What a merge-ready signal looks like
A report that supports good merge decisions gives the reviewer a fast path to the right answer:
When there's nothing to worry about: no findings, low blast radius, all specs addressed. The reviewer can merge with confidence. They spend seconds, not minutes. This is what "review compression" actually means - not skipping review, but making it fast to approve safe changes.
When there's something to investigate: one or two findings with evidence and confidence levels. The reviewer knows exactly which files and lines to look at and why. They spend their time on the specific concern instead of scanning the entire diff looking for problems.
When the change isn't ready: high-severity findings, missing error handling, contradicted spec requirements, or high blast radius with insufficient test coverage. The reviewer has evidence to send back with specific, actionable feedback. Not "this doesn't look right" but "this change removes error escalation in a file that thirty other modules depend on."
In all three cases, the report replaces the question "should I merge this?" with "here's the evidence, here's the risk, here's what I couldn't check." The merge decision is still yours. But now it's informed.
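As a sketch, that three-way routing could be derived mechanically from the report's contents. The summary fields and the blast-radius threshold below are assumptions for illustration, not a defined interface.

```python
from dataclasses import dataclass

# Hypothetical, simplified summary of a report.

@dataclass
class ReportSummary:
    high_severity_findings: int = 0
    other_findings: int = 0
    unknowns: int = 0
    contradicted_requirements: int = 0
    blast_radius: int = 0

def triage(r: ReportSummary) -> str:
    """Routing signal only; the merge decision stays with the reviewer."""
    if r.high_severity_findings or r.contradicted_requirements:
        return "not-ready"    # send back with specific, evidence-backed feedback
    if r.other_findings or r.unknowns or r.blast_radius > 20:
        return "investigate"  # point the reviewer at the specific files and lines
    return "fast-path"        # nothing flagged: approve quickly, not blindly
```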
The report as an artifact
For teams with compliance requirements, the report isn't just a decision aid. It's evidence.
When an auditor asks "how do you verify AI-generated code before it ships?", the answer should be demonstrable. A structured report - with facts, findings, confidence levels, spec alignment, and identified gaps - stored alongside the PR is a verifiable record. It shows what was checked, what was flagged, and what the team decided to do about it.
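One possible way to produce that artifact, assuming the report is a dataclass like the sketches above; the directory layout and file naming are assumptions, not a convention any tooling requires.

```python
import json
from dataclasses import asdict, is_dataclass
from pathlib import Path

def store_report(report, pr_number: int, repo_root: str = ".") -> Path:
    """Persist the report as a JSON artifact next to the code it describes."""
    payload = asdict(report) if is_dataclass(report) else report
    out_dir = Path(repo_root) / ".verification"
    out_dir.mkdir(exist_ok=True)
    out_path = out_dir / f"pr-{pr_number}.json"
    out_path.write_text(json.dumps(payload, indent=2, default=str))
    return out_path
```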
This is the difference between "we review all code" (which auditors increasingly don't accept at face value) and "here's the verification artifact for every change that shipped" (which they can inspect).
The report closes the loop on all the problems this section covers: it addresses the review bottleneck (structured triage), the AI-reviewing-AI concern (grounded in evidence, not consensus), the gaps problem (explicit unknowns), the continuous review cycle (report at every checkpoint), and the code privacy question (generated locally, stored in your repo).