DeepSec is an open-source security harness from Vercel Labs that uses coding agents to find vulnerabilities in your codebase. It runs on your own infrastructure (your code never leaves your environment) and produces markdown or JSON findings you can hand to engineers or pipe into Linear, GitHub Issues, or whatever ticketing system you use.

Frequently asked questions

How long does a DeepSec scan take?

It depends on the size of your codebase and how much parallelism you run it with. On the codebase behind our website, the full pipeline (process, a retry of dropped batches, then revalidate) took roughly 32 minutes of AI wall-clock; the regex scan was effectively instant. Scans of larger monorepos can be spread across Vercel Sandboxes for parallel execution.

How much does it cost in tokens?

Our full run came in at roughly $43 across the process and revalidate stages on Codex (gpt-5.5). The process stage consumed about 386,000 tokens; revalidate reports only cost, not a token count. The cost scales with how many files the scan flags as security-sensitive, not with the total size of the repo.

Is it safe to give an agent access to your codebase?

DeepSec runs on infrastructure you control: no cloud upload of source code, no telemetry. The only external call is to whichever model provider you've configured. If you want one token for both Claude and Codex with zero data retention, point it at Vercel AI Gateway; otherwise, if your model provider's data policy is acceptable for your code, DeepSec is too.

Can DeepSec replace a penetration test?

No. DeepSec is excellent at static analysis with context: tracing data flows, spotting unsafe patterns, identifying missing mitigations. It can't replace a human red-teamer probing live systems for business logic flaws or chained exploits. Use both.

Our DeepSec audit and the future of infosec agents

We ran Vercel Labs' new DeepSec against the codebase that powers this site. 176 files, 32 minutes of agent wall-clock, $43 in tokens, 17 confirmed findings worth fixing.

Below: the full run, the install-step gotcha that nearly ended the experiment ten minutes in, and what this workflow suggests about where security tooling goes next.

What is DeepSec?

DeepSec is an open-source security harness, announced by Vercel Labs and powered by coding agents. The pitch is short: instead of running pattern-matching SAST tools that surface a thousand "consider validating this" suggestions, you let an agent read your code, trace data flows, and tell you what's actually wrong.

It runs on your own infrastructure. Your code is never uploaded to a hosted scanner. The only thing that crosses the wire is the agent's conversation with whatever model provider you've configured, and that conversation includes the code it reads as context. For most teams that's not new risk. If Cursor or Claude Code already touches your code, DeepSec is in the same trust bucket.

The core pipeline has five stages and they are worth knowing because each one represents a decision the team made about where agents are good and where they aren't:

Scan: fast regex over the repo, picking out files that touch security-sensitive surfaces. No agent. No cost. Just a list.
Process: an agent reads each flagged file, follows the data, checks for mitigations. This is where most of the token spend goes. (DeepSec's docs call this stage process; we'll use that name from here on.)
Revalidate: a second agent reviews the first agent's findings to kill false positives and re-rate severity. The model arguing with itself, basically.
Enrich: git blame metadata gets attached to each finding, so when you import them downstream you know who wrote the offending lines. Doubles as a starter list for the next round of layoffs.
Export: markdown or JSON output, one file per finding or one big blob. Conversion into Linear tickets or GitHub Issues is on you, but the format is clean enough to script.
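
To make the "on you, but scriptable" part of the export stage concrete, here is a minimal sketch of turning exported findings into GitHub Issues with the gh CLI. The layout is our assumption, not DeepSec's documented output: one markdown file per finding in a directory, with the first line of each file usable as a title. The function name file_findings and the DRY_RUN switch are ours too.

```shell
# file_findings DIR: open one GitHub issue per exported markdown finding.
# Assumed layout (hypothetical): DIR/*.md, first line of each file is a title.
# Set DRY_RUN=1 to print what would be filed instead of calling gh.
file_findings() {
  local dir="$1" file title
  for file in "$dir"/*.md; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$file" ] || continue
    # Strip leading markdown heading markers from the first line.
    title="$(head -n 1 "$file" | sed 's/^#* *//')"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would file: $title ($file)"
    else
      gh issue create --title "$title" --body-file "$file"
    fi
  done
}
```

Run it with DRY_RUN=1 first and check the titles before letting it loose on a real issue tracker.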

There's also an optional triage step that uses a cheaper model (claude-sonnet-4-6 by default) to assign P0/P1/P2 priority before revalidation. We skipped it on this run and revalidated everything directly. Worth knowing it exists if your scan returns hundreds of findings and you want a cheap pre-sort.

The interesting bit is revalidation. The whole reason agentic SAST is plausible now and wasn't two years ago is that models are good enough to grade their own homework. Without revalidation you'd drown in the same noise that tanked every traditional SAST product.

One note on backends before we get into the run: we used DeepSec's default, Codex (gpt-5.5), because we wanted to see how OpenAI's coding agent reasoned about our code under the standard configuration. Claude is a first-class alternative via --agent claude. Pick the one whose data policy your team is most comfortable with, or front the whole thing with Vercel AI Gateway and stop worrying about either of them.

Image: a young Sam Altman in a pink polo layered over a green polo, looking unreadably composed. Caption: You can never trust a man who wears two polos.

Getting it running

We went from npx deepsec init to a finished report in roughly 25 minutes end-to-end. The setup itself took us less than 10 minutes, most of which was an agent reading our CLAUDE.md and a handful of representative files to fill in the project context document DeepSec injects into every batch.

bash

npx deepsec init

That command drops a .deepsec/ directory into your repo. Inside you get a config file (deepsec.config.ts), a default matcher pack, an INFO.md for project context, a SETUP.md walkthrough, and (importantly) its own package.json and node_modules. The .deepsec/ directory is a self-contained pnpm workspace. You have to cd into it and run pnpm install before any of the CLI commands work.

bash

cd .deepsec && pnpm install

We nearly screwed this up. First pass, we ran the scan command from the repo root, got a confusing error about a missing module, and spent ten minutes assuming our Node version was wrong. Re-read the README, spotted the install step, and everything worked. We chalk it up to the trace lead in our Stanley cups finally catching up with us. Anyone who has set up ESLint or Biome will recognise the shape of the config once you're inside: opinionated defaults, escape hatches for when you don't like them. But the workspace-inside-a-workspace bit is a sharp edge for a quick-start.
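
If you wrap DeepSec in a script, that gotcha is cheap to guard against. A minimal sketch, assuming only the layout described above (a .deepsec/ directory with its own package.json and node_modules); the helper name deepsec_ready is ours, not part of the DeepSec CLI:

```shell
# deepsec_ready [REPO_ROOT]: hypothetical pre-flight check.
# Succeeds only if .deepsec/ exists and its dependencies are installed.
deepsec_ready() {
  local repo_root="${1:-.}"
  if [ ! -f "$repo_root/.deepsec/package.json" ]; then
    echo "no .deepsec/ workspace found: run 'npx deepsec init' first" >&2
    return 1
  fi
  if [ ! -d "$repo_root/.deepsec/node_modules" ]; then
    echo "dependencies missing: run 'cd .deepsec && pnpm install'" >&2
    return 1
  fi
}
```

Calling deepsec_ready from the repo root before anything else turns the confusing missing-module error into a message that says which step was skipped.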

The actual run is three commands, not one:

bash

pnpm deepsec scan       # regex pre-filter; free and instant
pnpm deepsec process    # agent investigation; where the token spend lives
pnpm deepsec revalidate # second-agent review of the findings

A handful of process batches errored partway through (the Codex SDK occasionally drops a long-running session) and DeepSec marked those files for retry on the next pass. We ran process a second time, picked up most of the dropped files in another five minutes, and a small remainder is still tagged for the next run. The feedback loop is fast even when something breaks: failed batches are isolated, the rest of the run keeps going, and a re-run picks up exactly where it left off without re-spending tokens on files that already passed.
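
The sequence we ended up running (scan, process, a second process pass, revalidate) can be sketched as one function. It uses only the three commands shown above. DEEPSEC_CMD is our own override hook, not a DeepSec setting, and we have not checked what exit code DeepSec returns when some batches are dropped, so the handling of the first pass is an assumption.

```shell
# run_pipeline: hypothetical wrapper, meant to be run from inside .deepsec/.
run_pipeline() {
  # Left unquoted below on purpose so "pnpm deepsec" splits into two words.
  local cmd="${DEEPSEC_CMD:-pnpm deepsec}"
  $cmd scan || return 1
  # First pass: a non-zero exit is logged rather than treated as fatal,
  # because dropped batches are expected to be picked up by the next pass.
  $cmd process || echo "process pass 1 reported errors; retrying" >&2
  # Second pass: per the text, files that already passed are not re-billed.
  $cmd process || return 1
  $cmd revalidate
}
```

Setting DEEPSEC_CMD=echo prints the four stages in order without spending a token, which is a cheap way to check the wrapper before a real run.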

Then we walked away.

The run, by the numbers

The whole thing took about 32 minutes of AI wall-clock, but it isn't spread evenly across the stages. Most of the time sits in one place. Here's the breakdown:

Figure: Where the 32 minutes went. Wall-clock minutes per stage on the robotostudio.com run. Regex scan omitted: it finished in 1.2 seconds, indistinguishable from zero at this scale.

Read this as "where the agent budget goes":

Process (initial) is the heavyweight stage. The first agent reads every flagged file, traces data flows, and decides whether each regex hit is actually exploitable. 18.5 minutes, ~58% of the total wall-clock.
Process (retry) is the second pass we ran to pick up batches that got dropped mid-session (Codex SDK timeouts, mostly). A 5-minute mop-up on top of the initial run.
Revalidate is the second agent reviewing the first agent's findings to cull false positives and re-rate severity. 8.3 minutes, and the only reason the noise stays manageable.

The shape is the takeaway: nearly three-quarters of the wall-clock is the process stage (about 58% for the initial pass and 16% for the retry), and revalidation adds roughly another third on top of process in time (about a sixth in dollars). If you wanted to halve a future run, the lever is parallelism on process (Vercel Sandboxes, mostly), not skipping revalidate.
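
For anyone who wants to check the proportions, they fall out of the three stage durations listed above. A small sketch using awk for the floating-point arithmetic; the helper name share is ours:

```shell
# Stage durations in minutes, taken from the breakdown above.
initial=18.5; retry=5; revalidate=8.3

# share PART TOTAL: PART as a percentage of TOTAL, one decimal place.
share() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", part / total * 100 }'
}

total=$(awk -v a="$initial" -v b="$retry" -v c="$revalidate" 'BEGIN { print a + b + c }')
process=$(awk -v a="$initial" -v b="$retry" 'BEGIN { print a + b }')

echo "total minutes:            $total"
echo "initial process share:    $(share "$initial" "$total")%"
echo "process including retry:  $(share "$process" "$total")%"
echo "revalidate vs process:    $(share "$revalidate" "$process")%"
```

The initial pass alone is a little under three-fifths of the run; initial plus retry is just under three-quarters.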

Metric	Value
Wall-clock duration	~32 minutes across process + retry + revalidate; regex scan was effectively instant
Files scanned (regex stage)	176
Files processed (agent stage)	171 (5 still pending retry)
Process tokens consumed	~386,000
Approximate cost (USD)	$42.51 ($34.23 process + $2.72 retry + $5.56 revalidate)
Findings (raw)	25
Findings revalidated	19 of 25 (17 confirmed, 1 false positive, 1 uncertain)
Model used	`gpt-5.5` (OpenAI Codex), 150 max turns per batch

The funnel that matters

The numbers above are inputs and outputs, but the bit worth dwelling on is what happens between them. Here is the noise-reduction in one table:

Stage	What it produces	Count
Regex scan	Raw pattern matches	349
Process	Findings worth reading	25
Revalidate	Confirmed true positives	17

That is a 95% reduction from "things to look at" to "things to actually fix". Every traditional SAST tool we have ever used stops at the first row: you get the 349, you scroll, you give up. The agent stages are doing the work that a human would otherwise have to do by hand: opening each match, reading the surrounding code, asking "is this actually exploitable in this application?", and only escalating if the answer is yes.
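
The funnel arithmetic, stage by stage, as a quick sketch (the helper name reduction is ours). One caveat on the middle step: only 19 of the 25 findings had been revalidated, so the 17 is a floor, not a final count.

```shell
# Counts from the funnel table above.
matches=349; findings=25; confirmed=17

# reduction FROM TO: percentage drop going from FROM to TO, one decimal place.
reduction() {
  awk -v from="$1" -v to="$2" 'BEGIN { printf "%.1f\n", (1 - to / from) * 100 }'
}

echo "regex matches -> findings:   $(reduction "$matches" "$findings")%"
echo "findings -> confirmed:       $(reduction "$findings" "$confirmed")%"
echo "regex matches -> confirmed:  $(reduction "$matches" "$confirmed")%"
```

Most of the noise is removed by process; revalidate trims a further third or so of what is left.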

Cost per actual finding

If you divide $42.51 by 17 confirmed findings you get about $2.50 per real bug found, or $8.50 per HIGH-severity vulnerability. We are not going to pretend that maps cleanly to "human security engineer hour" (the comparison is unfair to humans, who can probe live systems and reason about business logic). The equivalent flesh API charges by the hour, ships nothing on weekends, and keeps asking for dental. But as a number to put in front of someone who needs to approve a security budget, it lands. Most engineering teams will not blink at $40 to remove a class of risk from their codebase.
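
The per-finding figures can be reproduced from the cost breakdown in the metrics table. A sketch, with one assumption worth flagging: the per-HIGH number treats all five HIGH findings as confirmed, which the tables do not state explicitly. The helper name per_finding is ours.

```shell
# Stage costs in USD, from the metrics table.
process_cost=34.23; retry_cost=2.72; revalidate_cost=5.56
confirmed=17
high=5   # assumes every HIGH finding was among the confirmed ones

total=$(awk -v a="$process_cost" -v b="$retry_cost" -v c="$revalidate_cost" \
  'BEGIN { printf "%.2f\n", a + b + c }')

# per_finding TOTAL COUNT: cost per finding, two decimal places.
per_finding() {
  awk -v t="$1" -v n="$2" 'BEGIN { printf "%.2f\n", t / n }'
}

echo "total:               \$$total"
echo "per confirmed bug:   \$$(per_finding "$total" "$confirmed")"
echo "per HIGH finding:    \$$(per_finding "$total" "$high")"
```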

A few caveats

The token bill is not nothing. This is not the kind of tool you point at a five-million-line monorepo on a whim. But for a typical SaaS or marketing site (a few hundred files of meaningful application code) the cost lands somewhere between "a coffee" and "a team lunch". That's a price most teams will happily pay quarterly. Or weekly, if it actually catches things.

The wall-clock time matters less than you'd think. 32 minutes across all three AI stages is short enough to fit in a long lunch and long enough to do something useful with the screen first. If you scale across Vercel Sandboxes you can collapse this further (Vercel's own benchmarks claim 1,000+ concurrent sandboxes for large monorepos), but for a repo of this size sequential was fine.

One last thing on cost: if you're paying flat-rate for a coding agent (ChatGPT Pro, Cursor, whatever), rinse it while you can. We're still at the "money printer goes brrr" stage of AI funding, where the per-token rate on your receipt is subsidised by the next round and not by gross margin. Today's $43 scan is tomorrow's $200 invoice. Run it now, often, and on everything you can justify.

What it actually found

This is the part everyone wants to read. We're not going to enumerate the individual findings. A list of issues with file paths next to them isn't a great look on a public blog. What we will share is the shape, both before and after revalidation:

Severity	Count
CRITICAL	0
HIGH	5
MEDIUM	11
BUG (non-security correctness)	9
Total	25

The post-revalidation breakdown (19 of 25 findings re-checked by a fresh agent):

Verdict	Count
Confirmed true positive	17
Demoted to false positive	1
Uncertain / needs human review	1
Not yet revalidated	6

A few notes on the shape:

Zero criticals. Relief, but not surprising. This is a marketing site with a small image-generation API attached, not a fintech backend. The realistic threat surface for a site like this is the usual marketing-site shape: input handling at the edges, abuse-resistance on anything that costs money to invoke, and the standard web hygiene checklist.
The HIGHs were the interesting ones. Several were the kind of finding a generic SAST tool can't produce: places where the agent had read our threat model and could reason about whether the code matched it.
The BUGs were a bonus. DeepSec's matchers caught non-security correctness issues as a side effect: a React correctness issue, a couple of latent footguns in helpers, and a caching choice that needed tightening up. None would have shipped a vulnerability, but all are worth fixing. Four of these started life classified as MEDIUM and got correctly downgraded by the revalidation pass.
One false positive in 19. That's roughly 5%. Vercel's launch blog quotes a "roughly 10–20%" false positive rate, and the current repo FAQ refines that to "~10–29% on HIGH+ after revalidation". Either way our 5% is at the low end, and the one miss was defensible rather than garbage: the agent had identified a pattern that would be a vulnerability if our specific application context were different. That kind of failure mode is actually useful. It surfaces the assumptions you're relying on, even when they're correct.
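
Since the false-positive figure carries a lot of weight here, this is the arithmetic behind it, plus a pessimistic bound that counts the one uncertain finding as a false positive too. Both rates cover only the 19 findings that were revalidated; the remaining 6 are unknown. The helper name fp_rate is ours.

```shell
# Revalidation verdicts from the table above.
revalidated=19; false_positives=1; uncertain=1

# fp_rate FALSE_POSITIVES REVALIDATED: percentage, one decimal place.
fp_rate() {
  awk -v fp="$1" -v n="$2" 'BEGIN { printf "%.1f\n", fp / n * 100 }'
}

echo "observed:     $(fp_rate "$false_positives" "$revalidated")%"
echo "pessimistic:  $(fp_rate "$((false_positives + uncertain))" "$revalidated")%"
```

Even the pessimistic bound sits at the bottom of the range quoted in Vercel's launch post.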

There were findings worth fixing, and we worked through them. I suspect we're going to run this every time I get nervous looking at a spend cap.

The wizard parallel

If you've used Vercel's Speed Insights you've already seen the new shape of this. You click a button. An agent reads your codebase. It figures out the changes needed to wire up the tracking. It opens a PR. You review and merge.

You did not write any code. You did not edit any config. You did not even open the file. The tool decided what needed to happen, did it, and asked for permission to apply.

DeepSec applies the same move to security: an agent pointed at your repo, looking for class of issue X, returning a PR that fixes class of issue X. Today the issue is "missing performance tracking" or "unsafe data flow". Tomorrow it's "missing observability", "unhandled errors", "unindexed database queries", "missing rate limits", "leaking PII in logs".

The key shift is that the user experience of fixing things is no longer "read a report, write code, send a PR." It's "describe the class of problem, approve a PR." The work moves from doing the change to deciding which changes to accept.

This is going to be the dominant pattern for a long, boring tail of engineering tasks that nobody enjoys doing. Security audits. Dependency upgrades. Observability instrumentation. Migration from one library to another. Anything where the rules are well-known but the work is mechanical.

Where this lands in 12–24 months

A few predictions, loosely held:

1. Security scans become CI lint. Right now you run DeepSec when you remember to. In a year, you'll run it on every push, the same way you run your linter and your typechecker. The cost will halve at least once between now and then. The latency will collapse via parallel execution. There will be no reason not to.

2. Vulnerability triage becomes a queue. Like dependabot for security findings. The agent opens a PR with a fix; you review and merge or reject. The bottleneck stops being "did anyone scan?" and becomes "did anyone read the PRs?", which is a much better problem to have.

3. Pen-tests don't go away, they get more expensive. A human red-teamer is now competing with an agent that's already found the easy stuff. Their value goes up because they're working on the problems agents can't touch: chained exploits, business logic flaws, social engineering vectors, infrastructure-level attacks. They charge more per engagement and do fewer of them.

4. The "we'll get to it after launch" phase ends. Teams don't suddenly become disciplined. The marginal cost of doing a security review just approaches zero, and skipping something with marginal cost zero starts to look genuinely negligent.

5. Agentic tools eat the consultancy bottom of the market. This is the uncomfortable one. There is a whole tier of security agencies whose entire value proposition is "we will run a tool against your code and write up the output." That tier is on borrowed time. The tools are now the users of the tools, and they don't bill by the hour.

We're in the awkward early phase where the tools work but the workflow hasn't caught up. The DX still feels like running a build tool. In a year it will feel like opening a tab in your IDE. In two years it will feel like nothing: invisible, ambient, on-by-default.

What to take away

Strip everything else and the numbers we keep coming back to are these: $43 for a full scan of 176 files, 32 minutes wall-clock, 5% false positives against the 10–20% Vercel publishes. Incremental PR scans via --diff cost pennies, so cost stopped being the reason not to scan. Those are nice. They aren't the point.

The point is that INFO.md works. Two paragraphs of plain English describing your threat model and intended posture, and the agent grades the code against them on every run. The agent's findings tracked the concerns we'd written into our threat model, which is the part no pattern matcher can do, no matter how many years of CWE rules it's accumulated.

And the CI integration is a one-liner. npx deepsec process --diff origin/main next to your typecheck step and you have an agentic security gate on every PR. That's the workflow change worth caring about. The next decade of "run a SAST tool, ignore the giant report" is being replaced by "review an agent's PR", and that workflow will eat its way through dependency upgrades, observability instrumentation, accessibility cleanups, and the rest of the long tail of work nobody wants to do by hand.
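
As a CI step, the one-liner might be wrapped like this. It is a sketch: the --diff flag and the origin/main default come from the command quoted above, while the function name ci_security_gate and the DEEPSEC_CMD override are ours. Whether the command exits non-zero when it finds something is an assumption to verify before relying on it as a gate.

```shell
# ci_security_gate [BASE_REF]: hypothetical CI wrapper around the one-liner.
ci_security_gate() {
  local base_ref="${1:-origin/main}"
  # Left unquoted below on purpose so "npx deepsec" splits into two words.
  local cmd="${DEEPSEC_CMD:-npx deepsec}"
  # Scan only what changed relative to the base ref.
  $cmd process --diff "$base_ref"
}
```

Passing a different base (for example a release branch) is just ci_security_gate origin/release.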

What we're doing about it

A few concrete things, since this is a Roboto Studio blog and not a futurism column:

Adding DeepSec to our pre-launch checklist. Every client project now ships with at least one full DeepSec run before launch and a baseline report stored alongside the project handover docs.
Building a custom matcher pack for Sanity + Next.js. Our day-to-day stack has a recognisable shape: Sanity GROQ queries, server actions, validated env via @t3-oss/env, Next.js route handlers. DeepSec's plugin system means we can encode our own conventions as matchers, so the agent doesn't waste cycles re-discovering them on every project.
Sharing the pre-launch report with clients. Same way we share Lighthouse and accessibility audits today. Clients should not have to take "yes, it's secure" on faith.

If you're running a Next.js application of any meaningful size, DeepSec is worth the 32 minutes and the $43. It's not perfect: the false positive rate is real and the setup has a few sharp edges. But it's the first agentic SAST tool we've used that we'd genuinely point at production code without flinching.

The bigger story is the workflow DeepSec represents. An agent read our threat model, opened the relevant files, and graded the code against the stated intent. That has never been possible before. Click a button. Agent reads your code. PR shows up in your queue. Review and merge.

That is going to be the default way you maintain software. We are watching it happen in real time.

A note on the methodology

For full transparency: we ran DeepSec against the main branch of this site in early May 2026. The repo is a pnpm monorepo with our Next.js app and a small image-generation API. 176 source files survived DeepSec's regex pre-filter, and 171 of those completed the process (agent investigation) stage across two passes (5 are still tagged for the next retry). We then ran revalidate to re-check the resulting findings; 19 of 25 were re-evaluated by a fresh agent (17 confirmed, 1 demoted to false positive, 1 marked uncertain). We used the default matcher pack with a hand-written INFO.md describing the project context DeepSec injects into every batch. We did not use parallel sandboxes: this was a single sequential run at default concurrency, locally. The backend was DeepSec's out-of-the-box default, Codex (gpt-5.5 via OpenAI's Codex SDK), with a 150-turn budget per batch (DeepSec's default at the time of writing); Claude is available via --agent claude if you'd rather run the two side by side.

If you want to compare notes on a run of your own, or tell us we got the false-positive maths wrong, we'd love to hear about it.
