Agent Drilldown

Agent Drilldown is where you go when you need to know exactly what your agents are doing with the resources they consume. Every run is attributed, every token is counted, and every failure is surfaced — giving you the visibility to optimize cost, catch reliability issues early, and prove that your AI workforce is delivering value.

This is the tab that turns “I think agents are working” into “I know exactly what each agent ran, how much it cost, and whether it succeeded.”

Agent Drilldown — run metrics, token attribution by agent, and Token Pressure perspective

KPI cards

Six cards give you the headline numbers for agent execution:

Card	What it tells you
Runs	Total agent sessions completed in the time window, with a failure count. The dual-series sparkline shows successful vs failed runs over time.
Failure Rate	Percentage of runs that ended in failure. Zero is the target. Any non-zero rate deserves investigation — click through to see which agents failed and why.
Tokens	Total tokens consumed across all runs, broken down into input and output tokens, with the associated cost. This is your primary cost signal. The sparkline shows the token usage trend over the window.
Avg Duration	Mean run duration across all sessions. Watch for this creeping upward — longer runs mean higher cost and slower feedback loops.
Score Coverage	Percentage of runs that received a quality score (via review or audit). Low coverage means you’re flying blind on agent output quality.
Execution Ops	Total commands executed and conversation turns across all runs. High op counts with low completion may indicate agents are spinning without progress.

Every card links directly to the Agent Sessions view, filtered to the relevant metric.
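To make the card definitions concrete, here is a minimal sketch of how the headline numbers could be derived from raw run records. The `Run` fields and the `kpi_summary` helper are hypothetical names chosen for illustration; they are not the product's actual data model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    # Hypothetical run record; field names are assumptions for illustration.
    agent: str
    failed: bool
    duration_s: float
    input_tokens: int
    output_tokens: int
    score: Optional[float] = None  # None means the run was never scored


def kpi_summary(runs: list[Run]) -> dict:
    """Compute the headline numbers described in the KPI cards table."""
    total = len(runs)
    if total == 0:
        # Avoid division by zero when the time window has no runs.
        return {"runs": 0, "failures": 0, "failure_rate_pct": 0.0,
                "tokens": 0, "avg_duration_s": 0.0, "score_coverage_pct": 0.0}
    failures = sum(1 for r in runs if r.failed)
    scored = sum(1 for r in runs if r.score is not None)
    return {
        "runs": total,
        "failures": failures,
        "failure_rate_pct": 100.0 * failures / total,
        "tokens": sum(r.input_tokens + r.output_tokens for r in runs),
        "avg_duration_s": sum(r.duration_s for r in runs) / total,
        "score_coverage_pct": 100.0 * scored / total,
    }


runs = [
    Run("planner", False, 40.0, 1000, 200, score=0.9),
    Run("planner", True, 80.0, 3000, 100),
    Run("coder", False, 60.0, 2000, 500, score=0.7),
    Run("coder", False, 20.0, 500, 100),
]
summary = kpi_summary(runs)
print(summary)
```

With this sample data, one of four runs failed and two of four were scored, so the sketch reports a 25% failure rate and 50% score coverage.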

Token usage charts

The center panel is the heart of the drilldown — a multi-view chart that breaks down token consumption across three dimensions. Toggle between views using the tab buttons above the chart.

By Agent

The default view. A stacked bar chart showing total tokens per agent, split into three categories:

Input (uncached) — tokens sent to the model that weren’t served from cache.
Output — tokens generated by the model.
Cached — input tokens served from prompt cache, reducing cost.

This view answers: Which agents are consuming the most resources? Agents with disproportionately high token counts may need prompt optimization, capability tuning, or task scope reduction.
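The three-way split can be sketched as a simple aggregation. This example assumes each record reports a total input count that already includes the cached portion, so uncached input is the difference between the two; the record keys and the `tokens_by_agent` helper are hypothetical.

```python
from collections import defaultdict


def tokens_by_agent(records):
    """Aggregate tokens per agent into the three stacked-bar categories."""
    totals = defaultdict(lambda: {"input_uncached": 0, "output": 0, "cached": 0})
    for rec in records:
        bucket = totals[rec["agent"]]
        cached = rec["cached_input_tokens"]
        # Assumption: input_tokens includes cached tokens, so subtract them.
        bucket["input_uncached"] += rec["input_tokens"] - cached
        bucket["output"] += rec["output_tokens"]
        bucket["cached"] += cached
    return dict(totals)


records = [
    {"agent": "planner", "input_tokens": 1000, "cached_input_tokens": 400, "output_tokens": 200},
    {"agent": "planner", "input_tokens": 500, "cached_input_tokens": 100, "output_tokens": 50},
    {"agent": "coder", "input_tokens": 2000, "cached_input_tokens": 0, "output_tokens": 800},
]
split = tokens_by_agent(records)
print(split)
```

An agent with a large uncached input share and little cached input, like the sample `coder` agent, is a natural first candidate for prompt optimization.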

By Provider

Token usage by provider — compare consumption across AI providers

A bar chart showing total tokens grouped by provider (e.g., Codex, Claude). Use this to understand your provider cost split and make informed decisions about which provider to use for which workloads.

By Model

Token usage by model — see exactly which models are consuming tokens

A stacked bar chart showing tokens per model (e.g., GPT 5.4, Opus), split by input, output, and cached. This is your most granular cost optimization lever — if one model is consuming far more tokens than another for similar work, consider whether a more efficient model could handle that workload.
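One way to compare models doing similar work is to normalize token totals by run count. The sketch below assumes you have exported (model, tokens) pairs for comparable runs; the model names and the `avg_tokens_per_run` helper are placeholders, not product features.

```python
from collections import defaultdict


def avg_tokens_per_run(runs):
    """Return mean tokens per run for each model."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for model, tokens in runs:
        totals[model] += tokens
        counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}


# Hypothetical sample: two models handling comparable tasks.
model_runs = [
    ("model-a", 1000),
    ("model-a", 3000),
    ("model-b", 500),
    ("model-b", 700),
]
averages = avg_tokens_per_run(model_runs)
print(averages)
```

A large gap in the per-run average is only a prompt to investigate. Confirm the runs really cover similar work and check quality scores before reassigning a workload.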

Drilldown Perspectives carousel

Four perspectives provide targeted operational insights:

Token Pressure

Ranks agents by total token consumption with cost attribution. Each row shows the agent handle, total tokens, cost, and run count. The severity badge (HIGH, MEDIUM, LOW) is based on concentration — if a single agent dominates token usage, that’s flagged as high pressure.

Click Open on any agent to jump to its filtered session list.
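The concentration idea behind the severity badge can be illustrated with a share-of-total calculation. The 50% and 25% thresholds below are assumptions chosen for the example; the actual cutoffs used by the product are not documented here.

```python
def token_pressure(tokens_per_agent, high=0.5, medium=0.25):
    """Rank agents by token share and attach a severity label.

    The thresholds are illustrative assumptions, not the product's real values.
    """
    total = sum(tokens_per_agent.values())
    ranked = sorted(tokens_per_agent.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    for agent, tokens in ranked:
        share = tokens / total if total else 0.0
        if share >= high:
            severity = "HIGH"
        elif share >= medium:
            severity = "MEDIUM"
        else:
            severity = "LOW"
        rows.append((agent, tokens, round(share, 3), severity))
    return rows


pressure = token_pressure({"planner": 1500, "coder": 8000, "reviewer": 500})
for row in pressure:
    print(row)
```

In this sample, the `coder` agent accounts for 80% of all tokens and is the only row labeled HIGH.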

Score Coverage

Shows which agents have been scored and their average quality ratings. Low score coverage means you’re running agents without verifying output quality — a risk that compounds over time.

Reliability Watch

Surfaces agents with the highest failure rates. Each row shows the agent handle, failure count, and total runs. Zero failures across all agents is the target state. Any failures here should trigger a session review to understand the root cause.
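A ranking like this can be approximated from a list of (agent, failed) outcomes. The sketch below is an assumption about how such a list might be built: it sorts by failure rate, breaks ties by failure count, and hides agents with no failures. The `reliability_watch` helper is a hypothetical name.

```python
from collections import Counter


def reliability_watch(outcomes):
    """Return (agent, failures, total_runs, failure_rate) rows, worst first."""
    total = Counter()
    failed = Counter()
    for agent, did_fail in outcomes:
        total[agent] += 1
        if did_fail:
            failed[agent] += 1
    rows = [
        (agent, failed[agent], total[agent], failed[agent] / total[agent])
        for agent in total
    ]
    # Highest failure rate first, then most failures, then agent name.
    rows.sort(key=lambda r: (-r[3], -r[1], r[0]))
    # Agents with zero failures are already in the target state.
    return [r for r in rows if r[1] > 0]


outcomes = [
    ("coder", True), ("coder", False),
    ("planner", False), ("planner", False),
    ("reviewer", True), ("reviewer", True),
    ("reviewer", False), ("reviewer", False),
]
watch = reliability_watch(outcomes)
for row in watch:
    print(row)
```

An empty result corresponds to the target state of zero failures across all agents.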

Attribution Confidence

Reports how well agent runs are mapped to known agent identities. Runs tagged as “unknown target” are sessions that couldn’t be attributed to a registered agent, which leaves blind spots in your execution trail.

Confidence level	What it means
High	Run is cleanly attributed to a known agent.
Medium	Run is attributed but with some ambiguity.
Low	Attribution is uncertain — review the session metadata.
Unknown	Run couldn’t be mapped to any registered agent.
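To see how much of your execution trail falls into each level, you can tally the confidence labels across runs. This sketch assumes each run already carries one of the four labels from the table; how the product assigns those labels is not covered here, and the `attribution_breakdown` helper is hypothetical.

```python
from collections import Counter

LEVELS = ("high", "medium", "low", "unknown")


def attribution_breakdown(labels):
    """Return {level: (count, percent_of_runs)} for the four confidence levels."""
    counts = Counter(labels)
    total = sum(counts.values())
    breakdown = {}
    for level in LEVELS:
        percent = 100.0 * counts[level] / total if total else 0.0
        breakdown[level] = (counts[level], percent)
    return breakdown


labels = ["high"] * 6 + ["medium"] * 2 + ["low"] + ["unknown"]
breakdown = attribution_breakdown(labels)
print(breakdown)
```

The combined share of low and unknown runs gives a rough measure of how much of the trail needs a metadata review.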

Putting it to work

Step 1 — Check Failure Rate and Runs

Start here every time. Zero failures with a healthy run count means your agents are reliable. Any failures warrant clicking through to the session details.

Step 2 — Review Token Pressure

Switch to the Token Pressure perspective. If one agent is consuming 80% of your tokens, that’s the one to optimize first — whether through prompt tuning, model selection, or task scoping.

Step 3 — Compare By Model

Toggle to the By Model chart view. If a cheaper model can handle the same workload, you’ve found an easy cost win. Use the provider and model views together to build a cost optimization strategy.

Step 4 — Verify Score Coverage

Low score coverage means you’re trusting agent output without verification. Increase audit frequency or enable automated scoring to close the gap.

Next steps

Open Agent Sessions to inspect individual runs, replay decision trails, and see exactly what each agent did.
Use Agents to adjust agent configurations, prompts, or model assignments based on what you’ve learned.
Return to Workspace Pulse to check whether agent activity is translating into delivery progress.
