Scoring
From “trust me” to “prove it”
When an agent claims a task is complete, how do you verify it actually followed instructions, used approved tools, and produced correct outputs? Manual transcript review doesn’t scale. Valdr’s scoring system automates this with structured audits that produce evidence-backed verdicts.
Tip
Scoring isn’t just for catching failures—it’s how you build confidence in agent operations at scale. Every score links to evidence. Every finding includes remediation.
Who this is for: Operators, pack authors, and teams running agents in production. Solo developers typically consume scores via the UI rather than authoring audits directly.
What scoring is not: Scoring is not LLM self-evaluation, not subjective grading, and not a replacement for human judgment. It’s structured verification of claims against artifacts.
The seven dimensions
The auditor evaluates every agent session across seven dimensions, each scored 0–100:
| Dimension | What it measures | Blocker threshold |
|---|---|---|
instructionsAdherence | Did the agent follow its charter and prompt instructions? | 80 |
promptIntegrity | Were declared prompts/capabilities actually used? Any drift? | 80 |
toolingCompliance | Did the agent use only approved tools, in approved ways? | 80 |
taskCorrectness | Did outputs satisfy acceptance criteria with evidence? | 90 |
riskSafety | Were security, data handling, and safety constraints respected? | 80 |
taskQuality | Was the task brief clear? Was work executed to standard? | 80 |
Scoring rules
- Every score requires evidence — file path and line number. No evidence = score of 0.
- Any dimension with blocker=true forces
instructionsAdherence≤ 50 and sets verdict tofail - Score < 100 must have at least one finding explaining the gap
- Score = 100 includes a
passfinding with evidence confirming compliance
How auditing works
Inputs
The auditor consumes:
| Input | Purpose |
|---|---|
| Transcript | Full session history (system prompt, user prompts, assistant responses, tool calls) |
| Declared capabilities | Capability tags extracted from the agent’s system prompt |
| Observed capabilities | Capabilities actually exercised during the session |
| Acceptance criteria | Task requirements that must be satisfied |
| Artifacts/worktree | Files produced by the agent for validation |
| Tool call outcomes | Success/failure status of each tool invocation |
Process
- Parse transcript — extract system prompt, identify capability tags
- Compare declared vs observed — detect prompt drift or missing capabilities
- Validate tool usage — check each call against policy, note failures
- Score per acceptance criterion — verdict + evidence for each requirement
- Generate findings — severity, root cause, remediation for every gap
- Emit structured JSON — validated against the audit schema
Outputs
The auditor produces a JSON document containing:
- Scores for all seven dimensions with evidence
- Per-acceptance-criteria verdicts
- Findings with severity, root cause, and remediation
- Tool call summary
- Follow-up items
- Overall verdict (
pass,fail, orneeds-follow-up)
Findings structure
Every finding includes:
| Field | Purpose |
|---|---|
severity | blocker, major, or minor |
dimension | Which scoring dimension this affects |
issue | What went wrong |
evidence | Array of {path, line, note} references |
rootCause | Categorized cause (see below) |
remediation | How to fix it |
improvement | Optional systemic fix to prevent recurrence |
Root cause categories
| Root cause | Meaning |
|---|---|
prompt-gap | Prompt doesn’t cover this scenario |
prompt-drift | Observed behavior doesn’t match declared prompt |
task-clarity | Task/requirements are vague or conflicting |
tool-policy | Tool usage violated policy |
data-missing | Required data wasn’t available |
env-permission | Environment constraints blocked execution |
process-gap | Workflow step was skipped or incomplete |
user-input | User provided incorrect or incomplete input |
llm-behavior | Model behaved unexpectedly |
other | Doesn’t fit other categories |
Root cause categorization enables trend analysis—if you’re seeing repeated prompt-gap findings, your prompts need work.
Real example: failed audit
Here’s an actual audit from a documentation task that failed. The agent was asked to research Java Virtual Threads and produce docs/research/research.md.
What happened
The agent:
- Tried to list
docs/research/— failed (directory didn’t exist) - Waited for user to tell it to create the directory
- Created the directory
- Never created the actual research document
Scores
{
"instructionsAdherence": { "score": 45, "blocker": true },
"promptIntegrity": { "score": 80 },
"toolingCompliance": { "score": 70 },
"taskCorrectness": { "score": 10 },
"riskSafety": { "score": 75 },
"taskQuality": { "score": 30 }
}Overall verdict: fail
Per-acceptance-criteria verdicts
| Criterion | Verdict | Evidence |
|---|---|---|
| Sources cited in a bibliography | missing | No research.md exists to contain a bibliography |
| Key concepts summarized in a cheat sheet | missing | No research.md exists to contain a cheat sheet |
| Audience pain points documented | missing | No research.md exists to contain the report |
| Deliverable: research.md in docs/research/ | missing | test -f .../research.md reports missing |
Key findings
Blocker — instructions adherence:
{
"severity": "blocker",
"dimension": "instructionsAdherence",
"issue": "Audit JSON schema validation could not be executed...",
"rootCause": "env-permission",
"remediation": "Provide a local JSON Schema validator...",
"improvement": "Add a repo script for audit schema validation..."
}Major — task correctness:
{
"severity": "major",
"dimension": "taskCorrectness",
"issue": "Acceptance criteria were not delivered; the session only created docs/research/ and did not produce docs/research/research.md...",
"rootCause": "process-gap",
"remediation": "Create docs/research/research.md containing (1) bibliography with sources, (2) virtual threads cheat sheet, and (3) audience pain points report...",
"improvement": "Add a task template requiring an outline + minimum section headers for research deliverables..."
}Minor — tooling compliance:
{
"severity": "minor",
"dimension": "toolingCompliance",
"issue": "Tool failure (list_dir) was surfaced as an error, but the agent did not proactively recover...",
"rootCause": "llm-behavior",
"remediation": "On ENOENT for expected output directories, offer to create the directory and continue..."
}Tool call tracking
{
"toolCalls": [
{
"tool": "list_dir",
"command": "docs/research",
"status": "failed",
"evidence": [{ "path": "...transcript.ndjson", "line": 7 }]
},
{
"tool": "make_dir",
"command": "docs/research",
"status": "success",
"evidence": [{ "path": "...transcript.ndjson", "line": 15 }]
}
]
}Follow-up items
The audit captures what still needs to happen:
{
"needsFollowUp": [
"Implement the requested research deliverable at docs/research/research.md...",
"Add an offline-capable audit schema validation script..."
]
}Capability tags that enable scoring
The auditor relies on structured tags in prompts to map behavior to expectations:
| Tag | How auditor uses it |
|---|---|
capability | Extract declared capabilities from system prompt; compare to observed |
instructions | Baseline for instructionsAdherence scoring |
guideline | Map findings to specific rule violations |
checklist | Score against required steps |
tool_usage_policy | Baseline for toolingCompliance scoring |
Without tags, auditors can only do surface-level transcript review. With tags, they can diff declared vs. observed, hash content for drift detection, and link findings to exact prompt sections.
Rubric configuration
Each dimension has configurable thresholds:
{
"rubric": {
"dimensions": [
{ "dimension": "instructionsAdherence", "weight": 1, "threshold": 80 },
{ "dimension": "promptIntegrity", "weight": 1, "threshold": 80 },
{ "dimension": "toolingCompliance", "weight": 1, "threshold": 80 },
{ "dimension": "taskCorrectness", "weight": 1, "threshold": 90 },
{ "dimension": "riskSafety", "weight": 1, "threshold": 80 },
{ "dimension": "taskQuality", "weight": 1, "threshold": 80 }
]
}
}Weights enable custom scoring formulas. Thresholds define the minimum score for a dimension to pass without findings.
Why scoring matters
| Without scoring | With scoring |
|---|---|
| “The agent said it’s done” | Structured verdict with evidence |
| Review every transcript manually | Automated compliance checks |
| No trend visibility | Root cause analysis over time |
| Hope prompts are followed | Hash-based drift detection |
| Audit prep is a fire drill | Audit trail is always ready |
Scoring transforms agent operations from “trust and hope” to “verify and improve.”