Local-first security regression for AI agents and chats

RECORD
THE PATH
NOT THE NOISE

Catch every time your AI coding agent touches auth, secrets, or skips a test, then turn the correction you made into a local regression eval.

Local-first and deterministic, with no LLM judge. TreeTrace reads the session transcript on your machine, flags every touch of auth, secrets, or tests and every risky command, and captures the human fix as a rerunnable eval. No upload. No telemetry. Record the path, not the noise.

Get started See how it works

SESSION LINEAGE · session_01H9F2K LIVE

$ npx treetrace replay session_01H9F2K…

● root prompt "add JWT refresh to auth middleware"

│ ├─ steer human → "use rotating refresh tokens"

│ ✗ dead-end console.log("token:", t) ⚠ LEAKED sk_live_a39f…

│ └─ fix memory rule: never log raw tokens ✓ saved

● head outcome PR #214 · 6 steps · 2 dead-ends · 1 caught

Inspect through the lens

Replay how the work actually got done.

$npx treetrace

See how it works

Not just for coding

Any AI agent or chat session, not only coding agents.

TreeTrace reads your coding and CLI agent sessions, and a plain User / Assistant chat transcript too. A regular ChatGPT, Claude, Gemini, or Grok conversation works the same way. No coding required.

$ npx treetrace --from transcript --file chat.txt

Agents & CLIs

Claude Code
Codex
Cursor
Copilot
Gemini
Grok

Or any chat transcript

ChatGPT
Claude
Gemini
Grok

See it in motion

The 80-second tour.

Watch TreeTrace turn a raw agent session into a deterministic record, and the correction you made into a regression eval you can rerun.

Captions available in the player controls.

What it is

One local record. Three ways to read it.

Git history shows what changed. TreeTrace shows how the work actually got done, and keeps it as evidence you can replay, query, and hand off.

01 · How it Works

How it Works

Reconstruct a session's lineage from local transcripts: the root goal, every steer, the corrections that fixed it, and the dead-ends you abandoned, including the one that leaked a token.

Reconstruct a session 02 · What it Records

What it Records

Every node captures the prompt, the edit, the tool call, files touched, refusals, and the outcome, written to your repo as an open, vendor-neutral schema your tools can read.

See the data model 03 · Use Cases

Use Cases

A redacted, deterministic chain of custody for what an agent did, was corrected on, and was refused. One record that serves audit, onboarding, and prompting efficiency.

See where it earns its keep

How it works

From raw transcript to structured record in one command.

Run it in any repo after an AI coding session. It reads local transcripts, never the network.

STEP 01

Discover

Claude Code sessions are found automatically from your local history. Codex, Cursor, Copilot, Gemini, Grok, and plain transcripts import with --from. Tool noise, retries, and "continue" nudges are filtered out.

STEP 02

Reconstruct

A fork-aware tree is derived from prompt topology and your text: the root goal, direction changes, corrections, abandoned branches, checkpoints, and the accepted path, with failure and refusal signals attached.

STEP 03

Export

Structured artifacts are written locally for humans, agents, CI, and eval harnesses. Every export passes a redaction gate that fails closed if a secret is detected, and a read-only treetrace mcp server hands the next agent its lessons.

Who it's for

One record, many uses.

The same local lineage answers three different questions, for three different people.

Efficiency

Engineers steering agents

See the cost of rework and where steering was needed. Make prompting more efficient by seeing what the agent kept getting wrong, and stop paying for the same mistake twice.

Training

Teams & onboarding

Real corrections become regression evals and a memory pack for the next run, with no LLM judge. Hand off what went wrong so the next agent (or teammate) starts already knowing.

GRC

Audit & compliance

A redacted, signed-off record of what an agent did and was refused. Foundation being built

Source-available

Auditable in one sitting. Apache-2.0.

Zero runtime dependencies, no telemetry, no network calls. The analysis layer is deterministic rules tuned to a published taxonomy and scored against a seeded benchmark with a blind holdout. Every result reproduced on committed code before it ships.

Read the source View on npm See the scorecard

0.93F1

blind-holdout F1 (from 0.72)

166/0

unit tests passing

runtime dependencies

bytes of telemetry

Measured accuracy

Validated against a seeded benchmark with a blind holdout.

The analysis layer is scored by a deterministic harness: exact-match rules against a published taxonomy, no LLM judge. The benchmark seeds 40 adversarial scenarios, each pairing a real signal with a benign distractor, so it measures precision and recall, not just coverage. A blind holdout is held out of development, so the headline number reflects generalization, not memorization.

0.93F1

blind-holdout macro-F1 (from 0.72)

40 → 18

benchmark false positives, more than halved

166 / 0

unit tests passing / failing

40 · 2

adversarial scenarios, across 2 blind splits

Per-scenario macro-F1

Final run · S01 to S10

Bars show the final-run macro-F1 for each named scenario; the 0.93 headline is measured on the held-out split. Every result is reproduced on committed code, and the full 166-test suite gates every change.

Deterministic, no LLM judge

The scorer is exact-match rules against a published taxonomy. Same input, same score, every run, with no model in the loop to grade the output, and no nondeterminism to explain away.

Generalizes, not memorizes

A blind holdout is kept out of development. It first exposed overfitting; the 0.93 headline is scored on that held-out split, so gains carry to sessions the detectors have never seen.

Validated by signal class

Corrections and declines, credential and security exposure, hallucinated file references, destructive actions, and lesson quality are each scored independently. Precision held or improved as false positives fell.

Trace your last AI coding session.

One command, in any repo. Nothing leaves your machine. Node 18+.