As AI systems rapidly move from experimentation into production, engineering teams are running into a familiar problem: compliance is difficult to operationalize in modern AI workflows.

Most teams are very good at optimizing model performance, scaling infrastructure, and managing latency. What’s less clear is how to consistently evaluate whether those same systems meet governance, security, and privacy expectations—especially in regulated environments like healthcare.

In practice, this gap usually shows up late. Systems that look production-ready from a performance standpoint often need to be redesigned once compliance requirements are introduced.

The Core Problem

Compliance is still treated as a downstream activity. It tends to appear after core design and development decisions have already been made.

That creates a mismatch. Engineers are focused on building systems that work, while compliance teams are focused on whether those systems meet regulatory expectations. Without a shared structure, alignment becomes difficult—and often slow.

Exploring a Different Approach

To better understand this problem, we built a prototype system, an “AI compliance copilot,” to explore how compliance evaluation could be integrated earlier into the development lifecycle.

In collaboration with Rajat Rawal, I examined how structured evaluation approaches could make compliance more consistent and repeatable across teams.

The idea was simple: instead of treating compliance as a separate review process, what if we could evaluate AI systems continuously using structured logic?

System Design

One of the first design decisions we made was to keep the architecture modular.

The system is organized into a few core components:

  • An input layer that captures system descriptions
  • A control repository that stores compliance requirements
  • An evaluation engine that assesses each control independently
  • A risk scoring module that aggregates results
  • A reporting layer that produces structured outputs

This structure makes it easier to extend the system to different frameworks without changing the core logic. The system evaluates controls derived from established frameworks such as the NIST AI Risk Management Framework and HIPAA, enabling structured and repeatable compliance assessment across regulated environments.
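To make the extension point concrete, here is a minimal sketch of what a control repository could look like. It is not the prototype's actual code: the `Control` and `ControlRepository` names, their fields, and the `EX-1` / `EX-2` identifiers are all hypothetical placeholders, and the descriptions are not quoted from either framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Control:
    """One compliance requirement. All field names are illustrative."""
    control_id: str
    framework: str
    description: str


@dataclass
class ControlRepository:
    """Stores controls grouped by framework name."""
    _by_framework: dict[str, list[Control]] = field(default_factory=dict)

    def register(self, control: Control) -> None:
        # Adding a framework is just registering controls under a new name.
        self._by_framework.setdefault(control.framework, []).append(control)

    def controls_for(self, frameworks: list[str]) -> list[Control]:
        # Unknown framework names simply contribute no controls.
        return [c for name in frameworks for c in self._by_framework.get(name, [])]


repo = ControlRepository()
repo.register(Control("EX-1", "NIST AI RMF", "Placeholder governance requirement."))
repo.register(Control("EX-2", "HIPAA", "Placeholder safeguard requirement."))

# Downstream components only ever see a flat list of controls.
selected = repo.controls_for(["HIPAA"])
```

Because the evaluation engine would only receive the list returned by `controls_for`, supporting another framework in this sketch means registering more controls, not touching the evaluation or scoring code.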

Why Control-Level Evaluation Matters

Instead of trying to determine whether a system is “compliant” in one step, the system evaluates individual controls.

In practice, this turned out to be important: each result can be traced back to a specific requirement rather than to a single overall verdict.

Each control is assessed independently using a structured prompt that combines:

  • A base evaluation rule set
  • Framework-specific guidance
  • The control definition
  • The system context

This mirrors how real audits work and makes the output much easier to interpret.
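A rough sketch of the per-control loop is shown here, under a few assumptions. The model call is replaced by an injected `judge` function so the example runs on its own; `ControlResult`, `evaluate_controls`, and `keyword_judge` are hypothetical names, and the keyword check is only a toy stand-in for a real evaluation.

```python
from dataclasses import dataclass
from typing import Callable

VALID_VERDICTS = {"Yes", "Partial", "No"}


@dataclass(frozen=True)
class ControlResult:
    control_id: str
    verdict: str      # one of VALID_VERDICTS
    rationale: str


def evaluate_controls(
    controls: dict[str, str],
    system_context: str,
    judge: Callable[[str, str], tuple[str, str]],
) -> list[ControlResult]:
    """Evaluate every control on its own: one call to `judge` per control."""
    results = []
    for control_id, definition in controls.items():
        verdict, rationale = judge(definition, system_context)
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"Unexpected verdict {verdict!r} for {control_id}")
        results.append(ControlResult(control_id, verdict, rationale))
    return results


def keyword_judge(definition: str, context: str) -> tuple[str, str]:
    """Toy stand-in for the model call: looks for the word 'encrypt'."""
    if "encrypt" in context.lower():
        return "Yes", "Context mentions encryption."
    return "No", "No supporting evidence found in the context."


results = evaluate_controls(
    {"EX-1": "Placeholder: data is encrypted at rest."},
    "Records are encrypted with a managed key service.",
    keyword_judge,
)
```

Keeping one result record per control is what lets a reviewer read the output the way they would read audit findings: control by control, each with its own rationale.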

Prompt Design as a System Layer

One thing that became clear early on is that prompt design isn’t just an implementation detail—it’s a core part of the system.

Rather than relying on generic prompts, we separated them into:

  • A base prompt that defines scoring and output structure
  • Framework-specific prompts that capture domain knowledge
  • A builder that assembles the final evaluation input

This approach made the system significantly easier to extend and reason about.
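The layering can be sketched as follows. This is an illustration under assumptions, not the prompts we actually used: `BASE_PROMPT`, `FRAMEWORK_PROMPTS`, and `build_prompt` are hypothetical names, and the guidance strings are short placeholders rather than text taken from NIST or HIPAA.

```python
BASE_PROMPT = (
    "You are evaluating one compliance control.\n"
    "Answer with exactly one of: Yes, Partial, No.\n"
    "Then list the evidence found and what is missing."
)

# Placeholder guidance text, not quoted from either framework.
FRAMEWORK_PROMPTS = {
    "NIST AI RMF": "Focus on how AI risks are governed, mapped, measured, and managed.",
    "HIPAA": "Focus on safeguards for protected health information.",
}


def build_prompt(framework: str, control_definition: str, system_context: str) -> str:
    """Assemble the evaluation input from the four parts, in a fixed order."""
    try:
        guidance = FRAMEWORK_PROMPTS[framework]
    except KeyError:
        raise ValueError(f"No prompt registered for framework {framework!r}") from None
    sections = [
        BASE_PROMPT,
        f"Framework guidance ({framework}):\n{guidance}",
        f"Control:\n{control_definition}",
        f"System description:\n{system_context}",
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    "HIPAA",
    "Placeholder: access to patient records is logged.",
    "A triage assistant that reads clinical notes from a hospital database.",
)
```

In this arrangement the scoring rules live in one place (the base prompt), so supporting a new framework would mean adding one entry to the mapping rather than editing the builder.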

Risk Scoring and Output

Each control is assigned a simple score, where a higher value indicates higher risk:

  • Yes → 0
  • Partial → 0.5
  • No → 1

These scores are aggregated to produce:

  • An overall risk score
  • A list of higher-risk areas
  • Suggested remediation steps

Just as important as the score is the explanation. The system surfaces what evidence was found, what’s missing, and what could be improved.
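As an illustration of the aggregation step, the sketch below maps verdicts to the scores above and rolls them up. Two things here are assumptions rather than details of the prototype: the overall score is computed as an unweighted mean, and a control counts as higher risk when its score is at or above a `threshold` of 0.5. The `summarize` function and the `EX-*` identifiers are hypothetical.

```python
SCORES = {"Yes": 0.0, "Partial": 0.5, "No": 1.0}


def summarize(verdicts: dict[str, str], threshold: float = 0.5) -> dict:
    """Aggregate per-control verdicts into an overall score and a risk list.

    Assumptions: the overall score is an unweighted mean, and a control is
    'higher risk' when its score is at or above `threshold`.
    """
    if not verdicts:
        raise ValueError("At least one control verdict is required")
    scored = {cid: SCORES[v] for cid, v in verdicts.items()}
    overall = sum(scored.values()) / len(scored)
    # Riskiest controls first; ties are broken by control id.
    flagged = sorted(
        (cid for cid, s in scored.items() if s >= threshold),
        key=lambda cid: (-scored[cid], cid),
    )
    return {"overall_risk": overall, "higher_risk_controls": flagged}


report = summarize({"EX-1": "Yes", "EX-2": "Partial", "EX-3": "No", "EX-4": "Yes"})
# overall_risk is (0 + 0.5 + 1 + 0) / 4 = 0.375
# higher_risk_controls is ["EX-3", "EX-2"]
```

A weighted mean, where some controls count more than others, would be a natural variation. The explanations and remediation suggestions are not modeled here, since they would come from the evaluation step rather than from the arithmetic.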

What This Changes in Practice

What we found most interesting is how this shifts the role of compliance in the development process.

Instead of being a late-stage checkpoint, compliance becomes something that can be evaluated continuously. That changes how teams think about design decisions early on.

It also creates a shared reference point across teams. Engineers, compliance specialists, and product stakeholders can all look at the same structured output and have a more productive conversation.

Why This Matters

As AI systems become part of critical workflows, the ability to demonstrate compliance will matter just as much as performance.

From an engineering perspective, embedding compliance into the system itself reduces rework, improves clarity, and makes it easier to scale governance practices across teams.

It’s still early, and this approach will evolve. But even a lightweight implementation shows that compliance doesn’t have to be an external process—it can be built into the system from the start.

Closing Thoughts

For teams working in regulated environments, the question is no longer whether compliance is required—it’s how to integrate it without slowing down development.

Treating compliance as a system capability rather than a separate process is one way to move in that direction.