- #_shellntel Cybersecurity Blog
- Posts
- Introducing CASA (Continuous AI Security Assessment) Application
Introducing CASA (Continuous AI Security Assessment) Application
Design, Early Results, and Lessons Learned

This post is about an internal tool that Claude and I have been building called CASA (Continuous AI Security Assessment). CASA is a work in progress and will remain private for now, with no current plan for release or open-sourcing.
The goal of this post is to document why CASA exists, how it’s structured today, what it’s good at finding, what it’s not good at, and what I learned building it. If you’re interested in testing AI systems from an offensive security perspective, this should give you enough detail to understand the approach and build something similar on your own.
Why CASA Exists
Most AI testing today happens in chat interfaces. That works for demos, but it breaks down quickly for security testing. It’s slow, non-repeatable, and makes it difficult to reason about how a model’s behavior changes over time.
I wanted something closer to how we test traditional applications:
repeatable inputs
observable outputs
the ability to rerun tests and compare behavior
minimal reliance on manual prompt typing
CASA started as a simple CLI tool that sent prompts programmatically and logged responses. Everything else grew out of that core loop.

Running a scan from the command line
High-Level Architecture
At a high level, CASA has three main components:
Request Engine
Responsible for sending structured prompts and payloads to target models.Response Analysis Layer
Identifies security-relevant behavior and inconsistencies in responses.Storage Layer
Keeps test runs separated by model, configuration, or scenario so behavior can be compared across runs.
CASA is CLI-first, but it does have a UI. I use the CLI for most testing and iteration, and the frontend is there for visibility, review, and comparing runs.

High-level CASA architecture. The CLI runs tests, the backend orchestrates requests and analysis, adapters talk to cloud and local models, and results are stored so behavior can be reviewed and compared through the UI.
What CASA Is Good At
CASA works best when prompts are treated like payloads rather than conversations.
It is particularly effective at identifying:
over-disclosure in responses
inconsistent refusal behavior across runs
safety logic that weakens after warm-up prompts
models that become more permissive over time
unexpected tool usage
differences in behavior between models given the same input
I’ve used CASA for internal testing, bug-bounty-style exploration, and working through AI-focused exercises similar to those found in Burp Web Academy. The biggest benefit is speed. I can test many variations of the same idea quickly, rerun payloads, and compare outputs without relying on gut feel.

Bulk scanning options

Scan automation

Running a scan

Scan results
What CASA Is Not Good At
CASA does not replace manual testing.
It struggles with:
nuanced intent
subtle social engineering
long conversational attacks that require context buildup
judging real-world impact without human review
It is also not designed for production monitoring. There is no alerting, dashboarding, or enforcement logic. CASA exists to explore behavior, not to prevent it.
These limitations are intentional. Trying to solve everything at once would have killed the project early.
Early Lessons Learned
A few things became obvious very quickly:
treating prompts as test cases is far more useful than treating them as chats
observability matters more than architecture early on
inconsistent model behavior is often more interesting than consistent failure
automation finds patterns faster, but still needs human interpretation
One unexpected outcome was how often models behaved differently after multiple runs. Subtle changes in permissiveness were much easier to spot when responses were logged and reviewed side by side.
If You Wanted to Build Something Similar
If you’re thinking about building your own version of CASA, start smaller than you think.
You only need:
a way to send prompts programmatically
a way to store responses
a way to compare outputs across runs
Even a basic script that sends the same payload to multiple models and logs the responses will surface interesting behavior. Most of CASA’s complexity came later, after the core loop was already useful.
If I were starting over, I would focus less on framework design and more on making changes in behavior easy to see.
Current State and Future Direction
CASA is still evolving. Multi-tenant support needs work. The storage layer has been rewritten more than once. Some early ideas turned out to be dead ends.
That’s fine.
Right now, CASA does one thing well: it provides a repeatable way to explore how AI systems fail. As the ecosystem matures, I expect tools like this to become more common. For now, CASA remains a private internal tool, but documenting the approach felt worthwhile.

Current to-do list
That’s all for now, keep an eye out for a follow-up blog, and keep on trucking 💪.