For GRC, security, and compliance teams · Local-first · No model grades the model

The audit log for your AI agents.

When an auditor asks what your coding agents did in your codebase, "we have a policy" is no longer the answer. TreeTrace turns each coding and CLI agent session into a local, deterministic, redacted record of what was touched, where a human intervened, and what was refused. No cloud. No model grading the model.

$ npx treetrace --each

View on GitHub · See what the record looks like

This page covers the compliance and audit use case. TreeTrace is one record with three readings. The other two (regression and eval sets, and plain dev efficiency) live on the homepage.

Why now

The deadline already moved one level down.

The oversight conversation is aimed at frontier labs. The record-keeping obligations have already landed on everyone who deploys AI.

EU AI Act, Article 12

Logging and record-keeping

High-risk AI systems must support automatic event logging so their operation stays traceable, with obligations applying from August 2, 2026. TreeTrace produces a per-session record of agent activity you can retain and hand over.

California SB 53

Frontier transparency in law

Transparency expectations are now in state law, and the direction is clear: show your work. A deterministic session record is show-your-work by construction.

SOC 2

Prove the policy was followed

Auditors have moved from accepting a written policy to asking you to prove it was followed, on every change. A redacted, evidence-backed session bundle is the kind of artifact that survives that question.

ISO 42001, A.6.2.8

Attributable and auditable

Agent actions should be attributable and auditable. TreeTrace attaches evidence and node ids to every finding, so each one is attributable to a specific point in the session.

TreeTrace is the evidence artifact that supports these efforts. It is not a certification and does not satisfy any regulation on its own.

The proof

Verifiable, or it is not an audit.

An audit record written by a model and graded by a model is still a black box. Run it twice and the verdict can drift. Hand it to a regulator and there is nothing to re-derive.

LLM-as-judge

A verdict you have to trust.

Run it twice, the confidence score drifts
Nothing for a regulator to re-derive
One model vouching for another model

TreeTrace

A verdict you can check.

Every flag comes from a deterministic heuristic: same session in, same verdict out
Every finding ships with its evidence text and the node id it came from
No large language model renders any verdict, anywhere in the pipeline

Run the same session twice with --deterministic and the output is byte-identical. That is the difference between a finding and an opinion.
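A reviewer can check the byte-identical claim independently rather than take it on trust. The sketch below is not part of TreeTrace; it assumes you ran the same batch twice into two separate output directories (the directory names in the usage note are hypothetical) and compares every file by SHA-256.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_files(run_a: Path, run_b: Path) -> list[str]:
    """Return relative paths that are missing from one run or differ in content."""
    a, b = tree_digest(run_a), tree_digest(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling differing_files(Path("run-a"), Path("run-b")) returns an empty list when the two runs match byte for byte. Any path it does return is a concrete place to start asking questions.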

Threat model

The audit tool should not be a new exposure.

The organizations under the most pressure to audit their AI are often the ones that cannot send session data to a third party. TreeTrace is built for them.

An audit layer that exfiltrates the thing it audits is not a control. TreeTrace stays on your machine.

No account, no upload, no telemetry, and no network in the export path.
A redaction gate that fails closed: outside an interactive terminal, every detected secret is redacted, and a shadow scan refuses to write any artifact if an unresolved secret remains.
Zero runtime dependencies, Node built-ins only. The tool that proves your provenance does not add an unaudited vendor to your pipeline.
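To make "fails closed" concrete, here is a minimal sketch of the pattern for the non-interactive path. It is not TreeTrace's implementation: the detector patterns, the placeholder string, and every function name are assumptions made for illustration. The idea is that a second scan runs over the final output, and anything it still finds blocks the write instead of slipping through.

```python
import re

# Hypothetical detectors, for illustration only (not TreeTrace's rule set).
REDACTORS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                 # AWS-style access key id
    re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]{20,}"),  # bearer token in a header
]
# The shadow scan also looks for things the redactor has no rule for.
SHADOW_ONLY = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]
PLACEHOLDER = "[REDACTED]"


class UnresolvedSecretError(RuntimeError):
    """Raised instead of writing an artifact that may still hold a secret."""


def redact(text: str) -> str:
    """Replace every match of a known detector with the placeholder."""
    for pattern in REDACTORS:
        text = pattern.sub(PLACEHOLDER, text)
    return text


def shadow_scan(text: str) -> list[str]:
    """Re-scan already-redacted text and return anything that still looks secret."""
    return [m.group(0) for p in REDACTORS + SHADOW_ONLY for m in p.finditer(text)]


def export_artifact(text: str) -> str:
    """Return text that is safe to write, or raise so nothing is written at all."""
    cleaned = redact(text)
    leftovers = shadow_scan(cleaned)
    if leftovers:
        raise UnresolvedSecretError(f"{len(leftovers)} unresolved secret(s); refusing to write")
    return cleaned
```

The design choice that matters is the default: when the scan is unsure, the export stops. A gate that logs a warning and writes anyway fails open.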

At scale

Built for a folder of sessions, not one at a time.

GRC does not need a single report; it needs a defensible record for every session. The --each batch mode walks a directory of sessions and writes one standalone, redacted audit bundle per session, plus an index.

~/audit-2026-q2 · npx treetrace --each

$ npx treetrace --each --out-dir audit-2026-q2 --deterministic
 
  ✓ sess-7f3a2b10 · 41 prompts -> audit-2026-q2/sess-7f3a2b10
  ✓ sess-9c1d04ee · 12 prompts -> audit-2026-q2/sess-9c1d04ee
  ✓ sess-3b80f1a2 · 64 prompts -> audit-2026-q2/sess-3b80f1a2
  ...
  ok wrote 12 session reports to audit-2026-q2 (see INDEX.md)

Each bundle contains the human-readable report, the prompt tree, the canonical lineage JSON, and the failure, rejection, and hallucination findings. INDEX.md and index.json summarize the set: prompts, corrections, rejections, and security flags per session, so a reviewer starts from the manifest and drills in.
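Starting from the manifest can be as simple as sorting it. The sketch below assumes the manifest has already been parsed into a list of per-session entries; the key names ("session", "rejections", "security_flags") are hypothetical stand-ins for the per-session counts described above, not the documented index.json schema, so adjust them to the file you actually have.

```python
def review_order(entries: list[dict]) -> list[str]:
    """Return session ids that need a human look, most security flags first.

    Key names are assumptions for illustration, not the real index.json schema.
    Ties fall back to rejections, then session id, so the order is deterministic.
    """
    flagged = [e for e in entries if e["security_flags"] or e["rejections"]]
    ranked = sorted(
        flagged,
        key=lambda e: (-e["security_flags"], -e["rejections"], e["session"]),
    )
    return [e["session"] for e in ranked]
```

Sessions with no security flags and no rejections drop out of the list, which leaves the reviewer with a short, ordered queue instead of a folder to browse.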

TreeTrace reads coding and CLI agent sessions today (Claude Code, Codex, Cursor, Copilot, ChatGPT export, Gemini, Grok). Visibility for any AI session a business runs is the direction we are building toward, not a claim about what ships today.

What the record looks like

Real sessions, real findings.

A small, curated set. Each was generated by one command and is reproducible byte for byte. The dangerous-capability examples use placeholders only; the point is that TreeTrace recorded the refusal as a checkable audit event.

Security regression

An agent touched secrets, a human pulled it back.

A coding agent hardcoded a live API key and called an API with a bearer token. A human told it to load the key from an environment variable and rotate it.

Recorded: two security_or_privacy_risk flags at verified tier with evidence and node ids, a user_rejected_action, and the key redacted from every artifact by the fail-closed gate.

Dangerous capability, cyber

The model refused, and the refusal held.

A user asked for an exploit to break into a host they do not own. The model refused, the user pushed back, and the model held. The user then pivoted to a legitimate defensive question.

Recorded: a model_refusal and a user_text_decline, with safe eval framing. The refused content is never quoted as a requirement to honor.

Dangerous capability, bio and chem

A checkable record of a threshold approached.

Same shape as the cyber example, in the chemical-synthesis domain, ending in a benign pivot to general lab safety.

Recorded: a model_refusal and a user_text_decline, with no refused content quoted anywhere. Identical guarantee to the cyber example.

Steering and efficiency

The same record, read for cost.

A normal build session: the user redirected from scraping to a public API, dropped a feature, then restored part of it.

Recorded: two corrections and an abandoned path in the prompt lineage. The same log that serves an auditor shows a developer where the time went.