Introducing CASA (Continuous AI Security Assessment) Application

Design, Early Results, and Lessons Learned

This post is about an internal tool that Claude and I have been building called CASA (Continuous AI Security Assessment). CASA is a work in progress and will remain private for now, with no current plan for release or open-sourcing.

The goal of this post is to document why CASA exists, how it’s structured today, what it’s good at finding, what it’s not good at, and what I learned building it. If you’re interested in testing AI systems from an offensive security perspective, this should give you enough detail to understand the approach and build something similar on your own.

Why CASA Exists

Most AI testing today happens in chat interfaces. That works for demos, but it breaks down quickly for security testing. It’s slow, non-repeatable, and makes it difficult to reason about how a model’s behavior changes over time.

I wanted something closer to how we test traditional applications:

  • repeatable inputs

  • observable outputs

  • the ability to rerun tests and compare behavior

  • minimal reliance on manual prompt typing

CASA started as a simple CLI tool that sent prompts programmatically and logged responses. Everything else grew out of that core loop.

Running a scan from the command line

High-Level Architecture

At a high level, CASA has three main components:

  • Request Engine
    Responsible for sending structured prompts and payloads to target models.

  • Response Analysis Layer
    Identifies security-relevant behavior and inconsistencies in responses.

  • Storage Layer
    Keeps test runs separated by model, configuration, or scenario so behavior can be compared across runs.

CASA is CLI-first, but it does have a UI. I use the CLI for most testing and iteration, and the frontend is there for visibility, review, and comparing runs.

High-level CASA architecture. The CLI runs tests, the backend orchestrates requests and analysis, adapters talk to cloud and local models, and results are stored so behavior can be reviewed and compared through the UI.

What CASA Is Good At

CASA works best when prompts are treated like payloads rather than conversations.

It is particularly effective at identifying:

  • over-disclosure in responses

  • inconsistent refusal behavior across runs

  • safety logic that weakens after warm-up prompts

  • models that become more permissive over time

  • unexpected tool usage

  • differences in behavior between models given the same input

I’ve used CASA for internal testing, bug-bounty-style exploration, and working through AI-focused exercises similar to those found in Burp Web Academy. The biggest benefit is speed. I can test many variations of the same idea quickly, rerun payloads, and compare outputs without relying on gut feel.

Bulk scanning options

Scan automation

Running a scan

Scan results

What CASA Is Not Good At

CASA does not replace manual testing.

It struggles with:

  • nuanced intent

  • subtle social engineering

  • long conversational attacks that require context buildup

  • judging real-world impact without human review

It is also not designed for production monitoring. There is no alerting, dashboarding, or enforcement logic. CASA exists to explore behavior, not to prevent it.

These limitations are intentional. Trying to solve everything at once would have killed the project early.

Early Lessons Learned

A few things became obvious very quickly:

  • treating prompts as test cases is far more useful than treating them as chats

  • observability matters more than architecture early on

  • inconsistent model behavior is often more interesting than consistent failure

  • automation finds patterns faster, but still needs human interpretation

One unexpected outcome was how often models behaved differently after multiple runs. Subtle changes in permissiveness were much easier to spot when responses were logged and reviewed side by side.

If You Wanted to Build Something Similar

If you’re thinking about building your own version of CASA, start smaller than you think.

You only need:

  • a way to send prompts programmatically

  • a way to store responses

  • a way to compare outputs across runs

Even a basic script that sends the same payload to multiple models and logs the responses will surface interesting behavior. Most of CASA’s complexity came later, after the core loop was already useful.

If I were starting over, I would focus less on framework design and more on making changes in behavior easy to see.

Current State and Future Direction

CASA is still evolving. Multi-tenant support needs work. The storage layer has been rewritten more than once. Some early ideas turned out to be dead ends.

That’s fine.

Right now, CASA does one thing well: it provides a repeatable way to explore how AI systems fail. As the ecosystem matures, I expect tools like this to become more common. For now, CASA remains a private internal tool, but documenting the approach felt worthwhile.

Current to-do list

That’s all for now, keep an eye out for a follow-up blog, and keep on trucking 💪.