Running SecureVibes on SecureVibes - Results & What's Next (Part 3/3)

This is Part 3 of a 3-part series on building SecureVibes, a multi-agent security system for vibecoded applications.
Series Navigation: Part 1 | Part 2 | Part 3
Testing a Security Scanner by Scanning Itself
The best way to test a security scanner? Run it on its own codebase.
I built SecureVibes to find vulnerabilities in vibecoded applications. But SecureVibes itself is vibecoded—I didn't write a single line of code myself. I used AI agents to build an AI agent system.
This meta experiment would answer two questions:
- Does the multi-agent approach actually work?
- How does it compare to traditional tools and single-agent systems?
I figured this was the perfect test case. I know what the system is supposed to do, and even though I vibecoded the entire thing, I'm aware of the design decisions I made. I used AI as a companion and guided it to build this thing, but I have no idea whether it's secure or not. That is exactly the problem I wanted to address in the first place.
The Experiment Design
I ran SecureVibes on itself using three different Claude models:
- Haiku (fast/cheap)
- Sonnet (balanced) - I ran this one twice to see how much the results vary, given the non-deterministic nature of SecureVibes
- Opus (premium)
Then I compared results against:
- Traditional SAST: Semgrep, Bandit
- Single-agent systems: Claude Code, Codex, Droid
- Custom Droid with security focus
All detailed reports are available at github.com/anshumanbh/securevibes/docs/example-reports.
Here's what I found...
Results: Model Comparison
Haiku vs Sonnet vs Opus
Sonnet wins hands down. Not just subjectively, but objectively:
| Model | Vulnerabilities Found | Cost | Value Score |
|---|---|---|---|
| Haiku | 2 | $0.15 | Poor |
| Sonnet | 17 | $3.44 | Best |
| Opus | 12 | $7.64 | Good |
Sonnet found 17 vulnerabilities at $3.44, while Opus found only 12 at $7.64. Haiku's $0.15 price tag is tempting, but catching only 2 issues means you're flying blind.
The sweet spot for security scanning isn't the cheapest or the most expensive model; it's the one that balances depth of analysis with practical cost constraints. Sonnet proves that the middle path can outperform the premium option. As for why Opus underperformed, I'm curious about that too; I don't have a good answer yet.
Multiple Runs of Sonnet
I ran Sonnet twice to see if results were consistent. About 12-13 vulnerabilities appeared in both reports (core issues like API keys, path traversal, JSON validation). But each run found 4-5 unique issues:
Unique to Run 1:
- Race conditions in concurrent scans
- Symlink traversal enabling infinite loops
- Git commit protection warnings
- Report authenticity verification
Unique to Run 2:
- Prompt injection defense gaps
- Model downgrade attacks via env vars
- Hardcoded credentials exposure flow
- Tool parameter validation
The union of both runs found ~21 distinct issue types.
This reveals a powerful insight: running the same scanner multiple times might actually increase coverage. For critical codebases, consider 2-3 runs despite added cost. The probabilistic nature of LLMs means different runs can catch different issues.
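If you do run multiple passes, it helps to union the reports programmatically rather than eyeball them. Here's a minimal sketch, assuming each run produces a VULNERABILITIES.json list of findings with (hypothetical) `file` and `title` fields; the real schema may differ:

```python
import json
from pathlib import Path

def load_findings(report_path: str) -> list[dict]:
    """Load the findings list from a single run's JSON report."""
    return json.loads(Path(report_path).read_text())

def merge_runs(report_paths: list[str]) -> list[dict]:
    """Union findings across runs, de-duplicating on a coarse key."""
    merged: dict[tuple, dict] = {}
    for path in report_paths:
        for finding in load_findings(path):
            # Coarse dedup key: same file + same issue title counts as the same finding.
            key = (finding.get("file"), finding.get("title", "").lower().strip())
            merged.setdefault(key, finding)
    return list(merged.values())

if __name__ == "__main__":
    combined = merge_runs(["run1/VULNERABILITIES.json", "run2/VULNERABILITIES.json"])
    print(f"{len(combined)} distinct findings across runs")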
Results: SecureVibes vs Everything Else
vs Traditional SAST
I ran two popular open-source SAST tools, Semgrep and Bandit, against the SecureVibes codebase. Both came back with zero findings.
Why zero findings? These tools look for syntactic patterns. They can't detect architectural issues like "CLI bypass via symlink attack" or "insufficient permission validation in file operations"—exactly what SecureVibes found.
This is unfortunately the state of current open source code security scanners. They're excellent at finding known patterns but terrible at understanding context.
vs Single-Agent Systems
Prompt - "perform a security review of the current codebase"
I ran the same security review task using coding agents without specialized multi-agent workflows:
| System with Model | Vulnerabilities Found |
|---|---|
| Claude Code with Sonnet 4.5 | 9 |
| Codex with GPT-5-codex | 4 |
| Droid with GLM 4.6 | 7 |
| SecureVibes with Sonnet | 16 |
SecureVibes crushed the coding agents in their default setting:
- 78% more issues than Claude Code (16 vs 9)
- 4x more issues than Codex (16 vs 4)
- 2.3x more issues than Droid (16 vs 7)
Why the difference? Single-agent systems lack structured workflow. They scan linearly. SecureVibes builds context (Phase 1), hypothesizes (Phase 2), then validates (Phase 3). This progressive refinement mirrors how human security teams work.
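To make that contrast concrete, here's a rough sketch of the staged shape in Python. None of this is the actual SecureVibes code: `run_agent` is a placeholder for the real agent invocation, and the intermediate artifact names other than VULNERABILITIES.json are made up.

```python
from pathlib import Path

# Each phase gets a narrow job and hands its output to the next one as a file.
PHASES = [
    {"name": "context",  "prompt": "Map the architecture and trust boundaries.",      "output": "SECURITY_DESIGN.md"},
    {"name": "threats",  "prompt": "Using the design doc, enumerate STRIDE threats.", "output": "THREAT_MODEL.md"},
    {"name": "validate", "prompt": "Confirm each threat with code-level evidence.",   "output": "VULNERABILITIES.json"},
]

def run_agent(prompt: str, inputs: list[str]) -> str:
    """Placeholder for the actual agent invocation."""
    raise NotImplementedError

def run_pipeline(workdir: Path) -> None:
    artifacts: list[str] = []
    for phase in PHASES:
        result = run_agent(phase["prompt"], inputs=artifacts)
        out_path = workdir / phase["output"]
        out_path.write_text(result)
        artifacts.append(str(out_path))  # later phases build on earlier outputs
```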
vs Custom Security Droid
I also set up a custom droid specifically for security audits and ran it with Sonnet 4.5. The report is here.
Prompt - "security-audit: Review entire codebase for vulnerabilities"
Results: 23 vulnerabilities found
- 4 Critical (vs SecureVibes: 2-4)
- 9 High (vs SecureVibes: 6)
- 7 Medium (vs SecureVibes: 6-9)
- 3 Low (vs SecureVibes: 0)
The custom Droid found 35-44% more vulnerabilities than SecureVibes using the same model. This taught me what I call "learning the bitter lesson": with the same model (Sonnet 4.5), the custom Droid's output is actually pretty good compared to the output from SecureVibes.
What this means: All the work I did over the past few days building a custom multi-agent system essentially got matched by a feature Factory released in their coding agent. If you're using Claude Code, I believe the same outcome can be achieved by building your own suite of Claude Code subagents—very much like what I did with SecureVibes, but you'd have to know what you're doing.
The quality difference: The Custom Droid found several unique vulnerabilities SecureVibes missed:
- More granular categorization (Low severity tier)
- Additional timeout and rate limiting issues
- More comprehensive error handling gaps
- Better detection of compliance-related issues (GDPR, SOC 2)
But this isn't defeat—it's validation. The multi-agent approach works so well that platforms are building it in as native features. There are still plenty of opportunities here. This is just the first iteration of SecureVibes, and I believe I can improve the results and get them on par with the custom Droid's:
- Domain expertise matters: Continue improving agents with security-specific knowledge
- Privacy-first options: Build versions that work with local models to preserve IP
- Accessibility: Non-technical users still need a UI, not command-line tools
- SDLC integration: Build custom droids/agents for different security gates (PR review, pre-commit, pre-deploy)
Key Learnings
Filesystem Threat Boundary
Most of the vulnerabilities were in CLI ↔ filesystem interactions. That makes sense: that boundary is the product, so the AI understands the threat model between the CLI program and the host machine's filesystem really well.
Multi-agent > Single-agent
This was the biggest validation. The multi-agent approach consistently outperformed single-agent attempts. The progressive refinement (context → threats → validation) mirrors how human security teams work, and it shows in the quality of results.
The Claude Agent SDK is a game changer for building multi-agent systems. It handles orchestration, so you can focus on designing the workflow and prompts.
File-based Communication is Underrated
Early versions used in-memory state passing between agents. It was a nightmare to debug when something went wrong.
Switching to file-based communication (.md and .json files) made the system so much easier to understand, debug, and extend. I can inspect any phase's output, replay phases, and even manually edit artifacts to test edge cases. Markdown surprisingly works great for both humans and machines.
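For example, you can inspect a phase's artifact directly without the scanner in the loop. A tiny sketch (the report filename is from the post; the `severity` field and location are assumptions about the output layout):

```python
import json
from collections import Counter
from pathlib import Path

# Inspect the validation phase's artifact directly; no scanner process required.
findings = json.loads(Path("VULNERABILITIES.json").read_text())
print(Counter(f.get("severity", "unknown") for f in findings))

# The same property makes edge-case testing easy: hand-edit the upstream
# markdown/JSON artifacts and re-run only the phase that consumes them.
```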
Real-time Progress Streaming is Essential
Initially, SecureVibes used filesystem polling to detect phase completions. During 10-15 minute scans, users would see progress updates only every 30-60 seconds, leading to "is it frozen?" moments.
I rebuilt it using the Claude SDK's hooks system (PreToolUse, PostToolUse, SubagentStop) for event-driven streaming. Now users see exactly what each agent is doing in real-time—which files it's reading, what patterns it's searching for. This dramatically improved UX.
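Conceptually, the streaming layer is just a dispatcher that turns hook events into progress lines. A simplified sketch of that idea; the payload fields below are illustrative placeholders, not the SDK's actual hook signature:

```python
import time

def on_hook_event(event: str, payload: dict) -> None:
    """Turn hook events into real-time progress output.

    `event` is one of the hook names exposed by the Claude Agent SDK
    (PreToolUse, PostToolUse, SubagentStop); the payload fields used
    here are illustrative, not the SDK's real schema.
    """
    stamp = time.strftime("%H:%M:%S")
    if event == "PreToolUse":
        print(f"[{stamp}] -> {payload.get('tool_name')} {payload.get('tool_input', '')}")
    elif event == "PostToolUse":
        print(f"[{stamp}] <- {payload.get('tool_name')} done")
    elif event == "SubagentStop":
        print(f"[{stamp}] agent '{payload.get('agent_name')}' finished")
```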
STRIDE is Still Relevant
I was skeptical about using a traditional threat modeling framework (STRIDE) in an AI-driven system. But it turned out to be perfect.
It gives the Threat Modeling Agent a structured way to think about threats, ensuring comprehensive coverage across all categories. Without STRIDE, the agent would often focus too heavily on one vulnerability class (usually injection attacks) and ignore authorization or audit issues.
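A cheap guardrail on top of that is to check the threat-modeling artifact against the six STRIDE categories after the phase finishes. A rough sketch, where THREAT_MODEL.md is a placeholder artifact name:

```python
from pathlib import Path

STRIDE = [
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information Disclosure",
    "Denial of Service",
    "Elevation of Privilege",
]

def missing_stride_categories(threat_model_path: str) -> list[str]:
    """Return the STRIDE categories the threat model never mentions."""
    text = Path(threat_model_path).read_text().lower()
    return [category for category in STRIDE if category.lower() not in text]

# Example: flag the gap so the agent can be re-prompted to cover it.
gaps = missing_stride_categories("THREAT_MODEL.md")
if gaps:
    print("Threat model is missing coverage for:", ", ".join(gaps))
```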
False Positives are the Enemy
Traditional SAST tools have terrible false positive rates. Because the three-phase approach requires Phase 3 to validate threats with concrete evidence, SecureVibes' false positive rate is dramatically lower.
The agent must provide the exact line number, code snippet, and explanation of exploitability. This forces it to actually confirm the vulnerability exists rather than flagging suspicious-looking patterns.
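In practice you can enforce that as a simple gate over the final report: drop or flag any finding that is missing the required evidence fields. A minimal sketch; the field names are my assumption rather than the exact SecureVibes schema:

```python
import json
from pathlib import Path

REQUIRED_EVIDENCE = ("file", "line", "code_snippet", "exploitability")

def keep_only_evidenced(report_path: str) -> list[dict]:
    """Filter findings down to those carrying concrete, checkable evidence."""
    findings = json.loads(Path(report_path).read_text())
    kept = []
    for finding in findings:
        if all(finding.get(field) for field in REQUIRED_EVIDENCE):
            kept.append(finding)
        else:
            print(f"Rejected (missing evidence): {finding.get('title', 'untitled')}")
    return kept

validated = keep_only_evidenced("VULNERABILITIES.json")
```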
Claude SDK Orchestration is Magical
I initially built a custom orchestrator agent to coordinate the workflow. Then I realized the SDK itself handles orchestration—you just define agents and Claude figures out when to invoke them.
This cut hundreds of lines of coordination code and made the system more reliable. The SDK handles error recovery, retries, and state management automatically.
AI Coding Agents Accelerate Development
I used Factory's Droid and Claude Sonnet 4.5 for this project. I first used Claude Code along with GitHub MCP and Anthropic documentation to create a comprehensive guide on the Claude Agent SDK. You can find that here.
Then I had Droid reference that guide to build features. The combination of context-aware coding agents and good documentation dramatically sped up development.
NOTE: I can't recommend Factory's Droid enough. It is a game changer. There have been multiple instances where Claude Code, Codex and Cursor just failed to deliver and Droid was able to one-shot it. If you want to try it out, here is a referral code worth $40 credits - https://app.factory.ai/r/Z2B374AY. I promise you will not be disappointed!
Iterative refinement is key
Text extraction from different agent outputs (especially markdown), JSON parsing, and prompt engineering all required multiple iterations. The first version of any prompt never works perfectly. I learned to build in instrumentation early (debug modes, verbose logging) to understand what's actually happening.
Build First, Optimize Later
The current system is expensive. If you're on a Claude subscription plan, you don't have to worry about this too much, but if you're paying as you go for API requests, the costs can rack up fast, especially if you run periodic scans on entire codebases. My focus for this first iteration wasn't on building a cost-effective system. Now that I know it works, I'll keep finding ways to make it cheaper to run.
What's Next: Building in Public
This is just the beginning. I'm committed to building this in public and inviting the community to join me on this journey. Here are some items on my wishlist:
1. Dashboard
Right now, SecureVibes outputs results to the terminal and to files in JSON and Markdown formats. I want to build a web dashboard that provides:
- Visual trend analysis (are vulnerabilities increasing or decreasing over time?)
- Vulnerability timeline and history
- Team collaboration features (assign findings, track remediation)
- Integration with issue trackers (Jira, GitHub Issues, Linear)
- Comparison between scans (what changed?)
2. Fixer Sub-Agent
Finding vulnerabilities is great, but fixing them is where the real value is. I want to build a Fixer Agent that:
- Takes a vulnerability from VULNERABILITIES.json
- Reads the vulnerable code in context
- Generates a patch that fixes the issue
- Explains what it changed and why
- Creates a PR with the fix (optional)
This is tricky because the fix needs to actually work (not break functionality), preserve the original intent of the code, and consider the broader codebase context.
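As a starting point, the Fixer Agent could be a loop over validated findings, handing the model each finding plus the surrounding file and asking for a minimal diff. A very rough sketch, with `generate_patch` standing in for the actual model call and the field names assumed from the report schema:

```python
import json
from pathlib import Path

def generate_patch(prompt: str) -> str:
    """Placeholder for the model call that proposes a unified diff."""
    raise NotImplementedError

def propose_fixes(report_path: str) -> dict[str, str]:
    """Return a proposed patch (as diff text) per finding, keyed by title."""
    findings = json.loads(Path(report_path).read_text())
    patches = {}
    for finding in findings:
        source = Path(finding["file"]).read_text()
        prompt = (
            f"Vulnerability: {finding.get('title')}\n"
            f"Evidence: {finding.get('code_snippet')}\n\n"
            f"Full file ({finding['file']}):\n{source}\n\n"
            "Propose a minimal unified diff that fixes the issue without changing behavior."
        )
        patches[finding.get("title", "untitled")] = generate_patch(prompt)
    return patches
```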
3. Evaluation Framework
The hardest problem in AI security tools: how do you know if it's actually working?
I want to build a comprehensive evaluation framework:
- Benchmark datasets - Known vulnerable applications (WebGoat, pygoat, NodeGoat, etc.)
- Ground truth - Manually verified vulnerability sets for each benchmark
- Metrics - Precision, recall, F1 score for each vulnerability class
- Regression testing - Ensure updates don't decrease detection quality
- Comparison - How does SecureVibes compare to Semgrep, Snyk, etc.?
This is crucial for validating that improvements actually improve detection, building trust with users, identifying weak spots in detection, and benchmarking against other tools.
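The metrics piece, at least, is mechanical once ground truth exists. A minimal sketch that treats findings as (file, vulnerability class) pairs:

```python
def precision_recall_f1(predicted: set[tuple], ground_truth: set[tuple]) -> tuple[float, float, float]:
    """Compute detection metrics over (file, vuln_class) pairs."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with toy data:
pred = {("app/views.py", "sqli"), ("app/auth.py", "idor")}
truth = {("app/views.py", "sqli"), ("app/upload.py", "path_traversal")}
print(precision_recall_f1(pred, truth))  # (0.5, 0.5, 0.5)
```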
4. Context Engineering
via MCP
The Claude Agent SDK has MCP support. This is really exciting because it lets the subagents bring in context from other services and systems.
This is essentially how AI-native systems can be made smarter, more efficient, and more accurate. For example, if an app has an existing threat model saved in Jira, we could use the SDK to fetch it and feed it to the subagents. The possibilities are endless!
Compacting / Pre and Post Processing
Right now, all agents get access to the entire repository. But for large codebases (10k+ files), this is inefficient and expensive. I want to build a Context Engineer that:
- Analyzes the repository structure
- Identifies high-risk files (auth, API endpoints, DB queries, file handling)
- Creates a "security-relevant file subset"
- Passes only this subset to downstream agents
This would dramatically reduce token usage and costs for large repositories, while focusing analysis on the code that actually matters from a security perspective.
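A plausible first cut is a plain keyword-and-path heuristic, applied before any model is involved. A rough sketch; the patterns and cap are illustrative, not tuned:

```python
from pathlib import Path

RISKY_HINTS = ("auth", "login", "token", "secret", "api", "db", "query", "upload", "file", "exec")
CODE_SUFFIXES = {".py", ".js", ".ts", ".go", ".rb", ".java"}

def security_relevant_subset(repo_root: str, limit: int = 200) -> list[Path]:
    """Pick a capped subset of files most likely to matter for security review."""
    candidates = []
    for path in Path(repo_root).rglob("*"):
        if not path.is_file() or path.suffix not in CODE_SUFFIXES:
            continue
        name = str(path).lower()
        score = sum(hint in name for hint in RISKY_HINTS)
        if score:
            candidates.append((score, path))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in candidates[:limit]]
```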
5. Make SecureVibes work with other models
Currently, since I'm using the Claude Agent SDK, SecureVibes works with Anthropic's models only. And it's not cheap by any means: a full, comprehensive scan of a medium-sized codebase can cost anywhere between $2 and $5. Being able to use local models to achieve similar results would unlock new opportunities for this system in regulated industries, where sending proprietary code (IP) to frontier model companies is prohibited. Not to mention, it would also help with cost savings.
6. Make SecureVibes into a Web Cyber Reasoning System (Web CRS)
Inspired by the AIxCC CRS (Cyber Reasoning Systems), I'd really like to emulate how those systems are designed with multiple layers of validation. If we can build such a CRS encapsulating the current SAST capabilities, along with DAST capabilities (in particular for web applications), I'd consider that a huge win! Imagine:
finding vulnerabilities via source code analysis -> validating them via dynamic analysis -> proposing a fix -> validating the fix works.
7. Make SecureVibes self-improving
The current scan results are good, but I don't necessarily agree with all the severities. I also don't want to fix a few of these yet, because at the end of the day it's a CLI tool that I run locally on my machine. Still, they might manifest into something bigger in the long term, so I want to triage all of them manually and document my reasoning for each. I'd then like SecureVibes to update its threat model and keep my preferences in mind, so it gets smarter with every piece of feedback I provide.
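One lightweight way to start would be a triage file that sits next to the report and records my verdict per finding, which future scans load and respect. This is a hypothetical format, not an existing SecureVibes feature:

```python
import json
from pathlib import Path

# Hypothetical triage record keyed by finding title; not part of SecureVibes yet.
triage = {
    "Symlink traversal in scan path": {
        "verdict": "accept_risk",
        "adjusted_severity": "low",
        "justification": "Local CLI tool run by the repo owner; no untrusted input path.",
    },
}
Path("TRIAGE.json").write_text(json.dumps(triage, indent=2))

# A future scan could read TRIAGE.json and feed these preferences back into
# the Threat Modeling Agent's prompt so repeated findings inherit my verdicts.
```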
How You Can Contribute
This is an open source project, and contributions are welcome! Here are ways you can help:
🔧 Contribute Code
Areas where help is especially welcome:
- Improving prompts for specific vulnerability classes
- Building the dashboard
- Creating benchmark datasets
- Everything mentioned in the wishlist above
🎤 Spread the Word
If you find SecureVibes useful, share it! Tweet about it, write about it, present it at meetups. The more people use it, the better it gets.
⭐ Star the Repo
GitHub stars help with visibility. If you think this project is interesting, give it a star!
Conclusion
Building SecureVibes has been one of the most rewarding projects I've worked on. It combines my passion for security with the exciting possibilities of AI agents. The multi-agent architecture proved that we can build AI security tools that are not just "smart pattern matchers" but systems that reason about security the way human experts do.
We're at an inflection point with AI and security. LLMs are finally capable enough to handle complex security reasoning, but we're still figuring out the right architectures and workflows. Context is key! I believe multi-agent systems like SecureVibes are the future—not because they're trendy, but because they work.
The vibecoding era has democratized software development—anyone can build an app with AI assistance. But with that democratization comes risk. Many vibecoded applications are built by developers who aren't familiar with security best practices, using unfamiliar tech stacks, and shipping to production quickly. SecureVibes aims to make security accessible to these developers, providing professional-grade vulnerability detection without requiring security expertise.
Try SecureVibes on your codebase today. Open an issue if you find bugs. Submit a PR if you have ideas. Let's build the future of AI-native security together.
Follow Along
I'll post about new features, challenges I'm facing, design decisions, and lessons learned. If you're interested in AI agents, security tooling, or building in public, follow along!
- LinkedIn: @anshumanbhartiya
- GitHub: securevibes repository
- Blog: anshumanbhartiya.com
Series Navigation: Part 1 | Part 2 | Part 3
"I don't know where I am going, but I know how to get there" - Boyd Varty