2026-02-23

AI Governance for DevOps

Intended Team · Founding Team

DevOps teams are among the earliest adopters of AI agents. The use cases are natural: automated deployments, infrastructure scaling, incident detection and response, configuration management, and cost optimization. AI agents fit the DevOps philosophy of automation, speed, and continuous delivery.

But DevOps is also where ungoverned AI agents cause the most spectacular failures. A misconfigured deployment takes down production. An overzealous scaling agent spins up 500 instances and burns through the monthly cloud budget in an hour. An incident response agent rolls back the wrong service. These are not theoretical risks. They are stories that infrastructure teams share.

Governing AI agents in DevOps requires understanding the specific risk patterns of infrastructure operations.

Deployment Gates

Deployments are the highest-frequency, highest-impact action that DevOps agents perform. A deployment to production changes what customers experience. The governance challenge is balancing speed (teams want to deploy frequently) with safety (not every deployment should proceed without review).

Intended's infrastructure domain pack implements deployment governance as a multi-factor evaluation:

Environment tier. Deployments to development auto-approve. Deployments to staging auto-approve with audit. Deployments to production require policy evaluation. The environment tier is the coarsest governance dimension but the most impactful.

Change classification. A CSS change and a database migration are both "deployments" but carry completely different risk. The domain pack classifies changes by type: static assets, application code, configuration, database schema, and infrastructure definitions. Each type has a different risk profile.

Time window. Production deployments during business hours auto-approve (assuming other factors are within bounds). Production deployments outside business hours require explicit approval. Production deployments during a declared change freeze are denied.

Service criticality. Deploying to a tier-1 service (authentication, payment processing, core API) carries more risk than deploying to a tier-3 service (internal dashboard, documentation site). Service criticality is configured per service and factored into the risk score.

Deployment method. Canary deployments score lower risk than full rollouts. Blue-green deployments score lower than in-place updates. The deployment method determines the reversibility dimension of the risk score.

A typical evaluation: agent "deploy-bot" wants to deploy a code change to the production payment service at 3 PM on a Wednesday using canary deployment. Environment tier: production (elevated). Change classification: application code (moderate). Time window: business hours (normal). Service criticality: tier-1 (elevated). Deployment method: canary (reduced). Composite risk: 0.55. Policy: auto-approve with audit for composite risk below 0.6. Decision: allow.

The same deployment at 2 AM on a Saturday: time window shifts to out-of-hours (elevated). Composite risk: 0.75. Policy: escalate for composite risk of 0.6 or above. Decision: escalate to on-call engineer.
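To make the arithmetic concrete, here is a minimal sketch of a weighted composite score in Python. It is not Intended's actual risk model: the factor scores, the weights, and the names `FACTOR_SCORES`, `WEIGHTS`, `Deployment`, `composite_risk`, and `decide` are all assumptions, with the numbers chosen only so that the two evaluations above come out at 0.55 and 0.75.

```python
from dataclasses import dataclass, replace

# Hypothetical per-factor risk scores (0 = lowest, 1 = highest). Illustrative only.
FACTOR_SCORES = {
    "environment": {"development": 0.1, "staging": 0.3, "production": 0.8},
    "change": {"static_assets": 0.1, "application_code": 0.5, "configuration": 0.6,
               "database_schema": 0.9, "infrastructure": 0.9},
    "window": {"business_hours": 0.2, "out_of_hours": 1.0},
    "criticality": {"tier-1": 0.9, "tier-2": 0.5, "tier-3": 0.2},
    "method": {"canary": 0.3, "blue_green": 0.4, "full_rollout": 0.7, "in_place": 0.8},
}
# Hypothetical weights; they sum to 1.0 so the composite stays in [0, 1].
WEIGHTS = {"environment": 0.25, "change": 0.15, "window": 0.25,
           "criticality": 0.20, "method": 0.15}
ESCALATION_THRESHOLD = 0.6  # from the policy described in the text


@dataclass(frozen=True)
class Deployment:
    environment: str
    change: str
    window: str
    criticality: str
    method: str
    change_freeze: bool = False


def composite_risk(d: Deployment) -> float:
    """Weighted sum of the five factor scores, rounded to two decimals."""
    total = sum(WEIGHTS[f] * FACTOR_SCORES[f][getattr(d, f)] for f in WEIGHTS)
    return round(total, 2)


def decide(d: Deployment) -> str:
    if d.environment != "production":
        return "allow"  # development and staging auto-approve
    if d.change_freeze:
        return "deny"  # production deployments during a change freeze are denied
    return "allow" if composite_risk(d) < ESCALATION_THRESHOLD else "escalate"


day = Deployment("production", "application_code", "business_hours", "tier-1", "canary")
night = replace(day, window="out_of_hours")
print(composite_risk(day), decide(day))      # 0.55 allow
print(composite_risk(night), decide(night))  # 0.75 escalate
```

A weighted sum is only one way to combine factors. A real model might let a single factor, such as a change freeze, override the score entirely, which is why the freeze check here sits outside the arithmetic.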

Infrastructure Changes

Infrastructure changes, including modifications to compute resources, networking, storage, and security configurations, are among the most consequential actions an AI agent can take. A security group change can expose internal services to the internet. A scaling change can multiply cloud costs. A DNS change can redirect customer traffic.

The infrastructure domain pack evaluates infrastructure changes across several dimensions specific to this domain:

Blast radius. How many services, users, or customers are affected by this change? Modifying a security group attached to 50 services has a larger blast radius than modifying one attached to a single development server.

Reversibility. Can this change be undone? Adding a firewall rule is easily reversible. Deleting an S3 bucket is not. Resizing a database is partially reversible (you can scale back down, but the process takes time and may cause downtime).

Dependency impact. Does this change affect resources that other services depend on? Modifying a shared VPC, a common IAM role, or a DNS zone can cascade across the infrastructure.

Security posture. Does this change weaken the security posture? Opening ports, broadening CIDR ranges, adding public IP addresses, or loosening encryption settings all weaken it and receive elevated risk scores.

Intended evaluates infrastructure changes against these dimensions in real time. An agent that wants to add an ingress rule for port 443 from a specific IP to a development security group scores low risk. An agent that wants to add an ingress rule for all ports from 0.0.0.0/0 to a production security group scores critical risk and is denied immediately, regardless of other factors.
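The "denied immediately, regardless of other factors" case can be sketched as a hard rule that runs before any scoring. The Python below is an illustration under assumptions: the `IngressRule` shape and the function names are hypothetical, not Intended's schema. It relies only on the standard library's `ipaddress` module to detect a world-open CIDR.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngressRule:  # hypothetical shape for illustration
    cidr: str
    from_port: Optional[int]  # None means "all ports"
    to_port: Optional[int]
    environment: str          # e.g. "development" or "production"


def is_open_to_world(cidr: str) -> bool:
    # A prefix length of 0 (0.0.0.0/0 or ::/0) matches every address.
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def covers_all_ports(rule: IngressRule) -> bool:
    return rule.from_port is None or (rule.from_port == 0 and rule.to_port == 65535)


def evaluate_ingress(rule: IngressRule) -> str:
    """Return "deny" for the hard-deny case, else "score" to hand off to risk scoring."""
    if (rule.environment == "production"
            and is_open_to_world(rule.cidr)
            and covers_all_ports(rule)):
        return "deny"
    return "score"


dev_https = IngressRule("203.0.113.7/32", 443, 443, "development")
prod_open = IngressRule("0.0.0.0/0", None, None, "production")
```

Here `evaluate_ingress(dev_https)` returns "score" and `evaluate_ingress(prod_open)` returns "deny". Keeping the hard rule separate from the score means no combination of favorable factors can average it away.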

Incident Response Automation

AI agents are increasingly used for incident response: detecting anomalies, diagnosing root causes, and executing remediation. The governance challenge is that incident response happens under time pressure. You want the agent to act quickly. You also want it to act correctly.

The infrastructure domain pack uses a tiered governance model for incident response:

Diagnostic actions (read logs, query metrics, check service health) are auto-approved with minimal evaluation. During an incident, the agent needs to gather information quickly. Governance should not slow down diagnosis.

Standard remediation (restart a service, scale up capacity, fail over to a secondary) is auto-approved during an active incident with elevated audit. These are well-understood actions with predictable outcomes. The risk is low relative to the incident itself.

Aggressive remediation (roll back a deployment, disable a feature flag, drain traffic from a region) requires rapid approval. The agent submits the intent with incident context, and a shortened escalation timeout (5 minutes instead of the standard 30) ensures that the approval does not delay the response significantly.

Destructive remediation (terminate instances, delete resources, modify security groups) requires standard approval even during an incident. The time pressure of an incident is precisely when destructive actions are most likely to cause additional damage.

This tiered model lets agents respond to incidents quickly for standard actions while maintaining checkpoints for actions that could make the incident worse.
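The four tiers can be sketched as a lookup from action to tier, then from tier to handling. In this Python sketch the action names, the `Handling` fields, and the behavior outside an active incident (falling back to standard approval) are assumptions. Only the tier definitions and the 5-minute and 30-minute timeouts come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    DIAGNOSTIC = "diagnostic"
    STANDARD = "standard_remediation"
    AGGRESSIVE = "aggressive_remediation"
    DESTRUCTIVE = "destructive_remediation"


@dataclass(frozen=True)
class Handling:  # hypothetical result shape
    needs_approval: bool
    audit: str
    timeout_minutes: Optional[int]  # escalation timeout, if approval is needed


# Hypothetical action names mapped to the tiers described in the text.
ACTION_TIERS = {
    "read_logs": Tier.DIAGNOSTIC,
    "query_metrics": Tier.DIAGNOSTIC,
    "restart_service": Tier.STANDARD,
    "scale_up": Tier.STANDARD,
    "rollback_deployment": Tier.AGGRESSIVE,
    "drain_region": Tier.AGGRESSIVE,
    "terminate_instances": Tier.DESTRUCTIVE,
    "modify_security_group": Tier.DESTRUCTIVE,
}


def handling_for(action: str, incident_active: bool) -> Handling:
    tier = ACTION_TIERS[action]
    if tier is Tier.DIAGNOSTIC:
        return Handling(False, "minimal", None)
    if tier is Tier.STANDARD and incident_active:
        return Handling(False, "elevated", None)
    if tier is Tier.AGGRESSIVE and incident_active:
        return Handling(True, "standard", 5)  # shortened escalation timeout
    # Destructive actions, and (by assumption) anything outside an incident,
    # go through standard approval with the standard 30-minute timeout.
    return Handling(True, "standard", 30)
```

For example, `handling_for("rollback_deployment", incident_active=True)` asks for approval with a 5-minute timeout, while `handling_for("terminate_instances", incident_active=True)` still gets the full 30 minutes.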

Cost Governance

Cloud cost optimization is a growing use case for AI agents. An agent that monitors resource utilization and right-sizes instances, terminates idle resources, and adjusts reserved capacity can save significant money. It can also make expensive mistakes.

The infrastructure domain pack includes cost-specific governance:

Spending authority. Each agent has a configured spending authority: the maximum cost impact it can authorize. An agent that right-sizes an instance from m5.xlarge to m5.large is reducing cost. An agent that scales up from 3 instances to 30 is increasing cost. Both may be valid optimizations, but only the second increases spend, so it is the one that needs a spending authority check.

Cumulative spend tracking. The domain pack tracks the cumulative cost impact of each agent's actions over rolling time windows. An agent that has already increased monthly spend by $10,000 through individually authorized scaling actions receives elevated risk scores on subsequent scaling actions.

Resource protection. Certain resources can be marked as protected from AI agent modification. Production databases, core networking infrastructure, and security resources can be excluded from automated optimization entirely.
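The three cost controls compose naturally into a single check. The Python sketch below is an assumption-laden illustration, not the domain pack's implementation. The `CostGovernor` class, its fields, and the dollar figures are hypothetical, and for simplicity it treats "elevated risk" from cumulative spend as an escalation rather than a score adjustment.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CostGovernor:  # hypothetical per-agent cost policy
    spending_authority: float  # max monthly cost increase per action, USD
    cumulative_limit: float    # ceiling on increases within the rolling window
    window: timedelta = timedelta(days=30)
    protected: frozenset = frozenset()
    history: list = field(default_factory=list)  # (timestamp, monthly_delta)

    def cumulative_increase(self, now: datetime) -> float:
        """Sum of approved cost increases inside the rolling window."""
        cutoff = now - self.window
        return sum(d for ts, d in self.history if ts >= cutoff and d > 0)

    def evaluate(self, resource: str, monthly_delta: float, now: datetime) -> str:
        if resource in self.protected:
            return "deny"  # protected resources are off limits entirely
        if monthly_delta <= 0:
            decision = "allow"  # cost reductions need no spending check
        elif monthly_delta > self.spending_authority:
            decision = "escalate"
        elif self.cumulative_increase(now) + monthly_delta > self.cumulative_limit:
            decision = "escalate"
        else:
            decision = "allow"
        if decision == "allow":
            self.history.append((now, monthly_delta))
        return decision


governor = CostGovernor(spending_authority=5000, cumulative_limit=10000,
                        protected=frozenset({"prod-db"}))
```

With this configuration, two $4,000 increases on the same day are allowed, but a third is escalated because it would push the rolling total past $10,000, even though each action is within the agent's individual authority.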

The Infrastructure Domain Pack

Intended's infrastructure domain pack brings these governance patterns together in a deployable configuration. It includes intent mappings for common infrastructure operations across AWS, GCP, Azure, and Kubernetes; risk models calibrated for infrastructure impact dimensions (blast radius, reversibility, security posture, and cost impact); and baseline profiles for common agent roles (deployment bots, scaling agents, incident responders, and cost optimizers).

The pack integrates with standard DevOps tooling. Terraform plans can be submitted as intents before apply. Kubernetes manifests can be evaluated before kubectl apply. CI/CD pipeline steps can be governed individually.
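As one illustration of the Terraform path, a plan saved with `terraform plan -out=tfplan` can be rendered to JSON with `terraform show -json tfplan`, and each entry in its `resource_changes` array summarized before apply. The Python sketch below does only that summarizing step. The intent dictionary it produces is a hypothetical shape, not Intended's API, and how intents are actually submitted is not shown here.

```python
import json


def intents_from_plan(plan_json: str) -> list:
    """Turn Terraform plan JSON into a list of hypothetical intent records."""
    plan = json.loads(plan_json)
    intents = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing will change, so there is nothing to govern
        intents.append({
            "resource": rc["address"],
            "resource_type": rc["type"],
            "actions": actions,
            # A replace shows up as delete+create, so it counts as destructive.
            "destructive": "delete" in actions,
        })
    return intents


sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["delete"]}},
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["update"]}},
        {"address": "aws_vpc.main", "type": "aws_vpc",
         "change": {"actions": ["no-op"]}},
    ]
})
intents = intents_from_plan(sample_plan)
```

For the sample plan, `intents` holds two records: the bucket deletion, flagged as destructive, and the instance update. The unchanged VPC is skipped.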

DevOps teams move fast. AI agents move faster. Governance that understands the specific risk patterns of infrastructure operations lets both move fast without the incidents that make everyone slow down.