Seven-layer defense for a production LLM agent
informed by the Agents of Chaos taxonomy
Jorge Espada · March 2026
Solo deployment on a 2017 MacBook Pro · 16 GB RAM · ~$0.04/mo
An LLM agent accessible via WhatsApp, running as a KVM guest on a NixOS host. It can:
Think of it as a capable assistant with potential root-level power, reachable by anyone who knows the phone number.
Crafted messages that cause the agent to execute unintended commands, exfiltrate data, or abuse messaging capabilities.
Malicious payloads embedded in web pages the agent fetches. The tool output becomes the injection channel.
External (anyone) · Known contact (social engineering advantage) · Compromised web content (passive injection)
Shapira et al. (arXiv:2602.20021, 2026) deployed 6 undefended OpenClaw agents for 14 days with 20 adversarial researchers. They catalogued 10 vulnerability case studies across three attack categories.
| CS | Vulnerability | Category | Example |
|---|---|---|---|
| CS1 | Disproportionate Response | Behavioral | Agent destroys mailbox to hide a secret |
| CS2 | Non-Owner Compliance | Identity | Agent obeys non-owner because they sound authoritative |
| CS3 | Semantic Reframing | Ingress | "Forward" bypasses "share" refusal |
| CS4 | Infinite Loop | Behavioral | Mutual agent relay spawns unbounded processes |
| CS5 | Storage Exhaustion | Behavioral | Silent DoS via attachment accumulation |
| CS6 | Silent Censorship | Ingress | Opaque filtering injects attacker text |
| CS7 | Guilt Trip / Pressure | Behavioral | 12+ refusals overcome by emotional manipulation |
| CS8 | Identity Hijack | Identity | Display name spoofing to impersonate owner |
| CS10 | Corrupted Constitution | Ingress | Injection via user documents / web content |
| CS11 | Mass Broadcast | Egress | Spoofed identity triggers disinformation campaign |
Beyond Agents of Chaos, the architecture draws from:
CaMeL: data flow enforcement. Untrusted data cannot influence control flow. Applied as a design principle throughout.
Silent Egress: 89% exfiltration success via URL redirect chains. Motivated our redirect-chain analysis and nftables egress enforcement.
AgentSys: worker isolation with schema-validated JSON boundaries. Applied in our API sidecar proxy.
PCAS: Policy Compiler for Agentic Systems. Simplified into our action quota engine.
Seven defense layers, each independent. A failure in one layer does not compromise others.
| Layer | Component | Protects Against | CS Coverage |
|---|---|---|---|
| A | Guard Chain (4-stage LLM firewall) | Prompt injection, jailbreaks | CS3, CS6, CS10 |
| B | Egress Enforcement (nftables + proxy) | Data exfiltration, mass broadcast | CS3, CS11 |
| C | Behavioral Controls (quotas, approvals) | Disproportionate response, loops, DoS | CS1, CS4, CS5, CS7 |
| D | Trust Tiers & Identity | Non-owner compliance, identity hijack | CS2, CS8 |
| OS | Sandbox & Isolation | Privilege escalation, host compromise | All |
| Web | Web Content Guard (5-layer) | Indirect injection via web content | CS10 |
| E | Monitoring & Alerting | Visibility into all layers | All |
Every inbound message and outbound response passes through a 4-stage content filter before reaching or leaving the agent.
| Guard | Scope | Cost |
|---|---|---|
| Regex Pre-scan | Outbound credentials (sk-ant-*, JWT, AWS keys, GitHub PATs, PEM keys) | Free |
| Lakera Guard v2 | Cloud API, primary filter | 10K/mo free |
| Prompt Guard 2 (86M) | Local ONNX model, ~500MB RAM | Free (CPU) |
| GPT-4o-mini | OpenRouter, last resort | ~$0.00014/call |
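A minimal sketch of what the regex pre-scan stage could look like, written in Python for brevity (the deployed plugin is TypeScript). The patterns are assumptions based on commonly seen token shapes, not the deployment's actual rule set, and `prescan_outbound` is a hypothetical name.

```python
import re

# Illustrative patterns only (assumptions); real token formats vary and evolve.
CREDENTIAL_PATTERNS = {
    "anthropic_key": re.compile(r"sk-ant-[A-Za-z0-9_\-]{20,}"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "pem_private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def prescan_outbound(text: str) -> list[str]:
    """Return the names of the credential patterns found in an outbound message."""
    return [name for name, pattern in CREDENTIAL_PATTERNS.items() if pattern.search(text)]
```

Because this stage is local and free, it makes sense to run it first and short-circuit before any paid guard is called.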
Early versions returned the GPT-4o-mini guard's free-text reason to the agent. This was itself an injection channel — the "reason" field contained attacker-influenced text. Fix: replaced with a fixed-vocabulary enum: injection | exfiltration | credential | content_policy | flagged.
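The enum fix can be sketched as follows. This is an illustration in Python, not the plugin's TypeScript code; `to_verdict` is a hypothetical helper, and mapping every unrecognized label to `flagged` is an assumption about the fallback behavior.

```python
from enum import Enum


class Verdict(str, Enum):
    INJECTION = "injection"
    EXFILTRATION = "exfiltration"
    CREDENTIAL = "credential"
    CONTENT_POLICY = "content_policy"
    FLAGGED = "flagged"


def to_verdict(raw_label: str) -> Verdict:
    """Coerce a guard model's label to the fixed vocabulary.

    Anything outside the vocabulary (including attacker-shaped free text)
    collapses to FLAGGED, so no model-generated prose reaches the agent.
    """
    try:
        return Verdict(raw_label.strip().lower())
    except ValueError:
        return Verdict.FLAGGED
```

Only `Verdict.value` is ever returned to the agent; the guard's original text stays inside the guard.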
All guard decisions logged to /data/openclaw/logs/guard.log (JSON-lines). Content is never logged — only SHA-256 hashes. 30-day retention, gzip rotation. Shipped to Loki with PII redacted.
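A hash-only log line might be built like this. The field names (`ts`, `guard`, `verdict`, `content_sha256`) and the function name are assumptions for illustration; the source only states that entries are JSON-lines and that content is replaced by its SHA-256 hash.

```python
import hashlib
import json
import time
from typing import Optional


def guard_log_record(content: str, guard: str, verdict: str,
                     now: Optional[float] = None) -> str:
    """Build one JSON-lines entry that carries a content hash, never the content."""
    record = {
        "ts": now if now is not None else time.time(),
        "guard": guard,
        "verdict": verdict,
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)
```

The hash still lets an operator confirm whether two blocked messages were identical without retaining either one.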
Network-level controls prevent the agent from exfiltrating data or reaching unauthorized endpoints.
A 5-minute timer resolves 8 API hostnames and updates nftables sets:
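A rough sketch of the refresh step, assuming an IPv4 set named `api_allow_v4` in an `inet filter` table (both names are hypothetical). It only builds the nft commands; applying them, for example by piping to `nft -f -`, is left to the timer unit.

```python
import ipaddress
import socket


def resolve_ipv4(hostname: str) -> set[str]:
    """Resolve a hostname to its current IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def build_nft_update(hostnames, table="inet filter", set_name="api_allow_v4",
                     resolver=resolve_ipv4) -> list[str]:
    """Return nft commands that replace the allowlist set with fresh addresses."""
    addresses: set[str] = set()
    for host in hostnames:
        addresses |= resolver(host)
    if not addresses:
        # Assumption: keep the previous set rather than flushing to empty
        # when resolution returns nothing.
        return []
    ordered = sorted(addresses, key=ipaddress.ip_address)
    return [
        f"flush set {table} {set_name}",
        f"add element {table} {set_name} {{ {', '.join(ordered)} }}",
    ]
```

Emitting the flush and the add together lets them be applied as a single ruleset change, which avoids a window where the set is empty.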
The agent calls guarded_web_fetch POST http://127.0.0.1:19090/...
The proxy injects credentials server-side from a file the agent process cannot read.
Credentials never enter the agent's context window.
Handles Gmail OAuth2 token refresh internally.
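The credential-injection step could look roughly like the sketch below. The header names in `STRIPPED_HEADERS` and the bearer-token scheme are assumptions; the actual proxy may authenticate differently per upstream API.

```python
# Headers the agent is never allowed to set on its own (assumed list).
STRIPPED_HEADERS = {"authorization", "x-api-key", "cookie"}


def load_api_key(path: str) -> str:
    """Read a secret from a file that only the proxy user can read."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


def build_upstream_headers(agent_headers: dict[str, str], api_key: str) -> dict[str, str]:
    """Drop any agent-supplied auth headers, then attach the real credential."""
    headers = {
        name: value
        for name, value in agent_headers.items()
        if name.lower() not in STRIPPED_HEADERS
    }
    headers["Authorization"] = f"Bearer {api_key}"
    return headers
```

The key only exists in the proxy's memory and in the upstream request, so nothing the agent sends or receives contains it.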
openclaw_vm_egress_denied_packets_total and openclaw_vm_egress_denied_bytes_total are scraped every 60s. Alert fires on >10 denied packets/min.
Four mechanisms limit what the agent can do, independent of what it is asked to do.
Per-class rate limiting for exec/bash commands:
| Class | Budget | Window |
|---|---|---|
| destructive_exec | 1 | 24h |
| file_delete | 5 | 1h |
| message_send | 20 | 1h |
| process_spawn | 3 | 1h |
| config_modify | 5 | 1h |
CS1 CS4
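One way to implement these budgets is a sliding-window counter per action class. The sketch below is in Python for illustration; the sliding-window semantics and the `QuotaEngine` name are assumptions, since the source specifies only budget and window.

```python
import time
from collections import defaultdict, deque

# (budget, window in seconds) per action class, taken from the quota table.
QUOTAS = {
    "destructive_exec": (1, 24 * 3600),
    "file_delete": (5, 3600),
    "message_send": (20, 3600),
    "process_spawn": (3, 3600),
    "config_modify": (5, 3600),
}


class QuotaEngine:
    def __init__(self, quotas=QUOTAS, clock=time.monotonic):
        self._quotas = quotas
        self._clock = clock
        self._events = defaultdict(deque)  # action class -> timestamps of allowed calls

    def allow(self, action_class: str) -> bool:
        """Record and permit the action if its class still has budget in the window."""
        budget, window = self._quotas[action_class]
        now = self._clock()
        events = self._events[action_class]
        while events and now - events[0] >= window:
            events.popleft()  # forget calls that have aged out of the window
        if len(events) >= budget:
            return False
        events.append(now)
        return True
```

Denied calls are not recorded, so a burst of rejected attempts does not extend the lockout.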
destructive_exec commands are blocked and queued. A WhatsApp notification goes to the owner's phone. Owner approves/denies via SSH CLI.
/data/openclaw/approval-queue/
CS1
Every 50 tool calls, GPT-4o-mini analyzes recent activity for:
Async, non-blocking. Flags increment openclaw_session_flagged_total.
CS3 CS7
>10 identical tool calls (by name + arg hash) in 5 minutes triggers a 15-minute cooldown.
Catches degenerate loops where the agent repeats the same command. In-memory tracking, resets on gateway restart.
CS4
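The breaker logic can be sketched as below. Applying the cooldown per call signature (rather than to the whole agent) and hashing arguments as canonical JSON are assumptions; the thresholds come from the description above.

```python
import hashlib
import json
import time
from collections import defaultdict, deque


class LoopBreaker:
    def __init__(self, max_calls=10, window=300.0, cooldown=900.0, clock=time.monotonic):
        self._max_calls = max_calls
        self._window = window
        self._cooldown = cooldown
        self._clock = clock
        self._calls = defaultdict(deque)  # call signature -> recent timestamps
        self._blocked_until = {}          # call signature -> cooldown expiry

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        digest = hashlib.sha256(
            json.dumps(args, sort_keys=True, default=str).encode("utf-8")
        ).hexdigest()
        return f"{tool}:{digest}"

    def allow(self, tool: str, args: dict) -> bool:
        """Return False while an identical call is in cooldown or trips the breaker."""
        key = self._key(tool, args)
        now = self._clock()
        if now < self._blocked_until.get(key, float("-inf")):
            return False
        calls = self._calls[key]
        while calls and now - calls[0] > self._window:
            calls.popleft()
        calls.append(now)
        if len(calls) > self._max_calls:
            self._blocked_until[key] = now + self._cooldown
            calls.clear()
            return False
        return True
```

All state lives in memory, which matches the note that tracking resets on gateway restart.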
Phone-based three-tier identity model. Configuration is Nix-managed — the agent cannot modify it.
| Tier | Identity | Capabilities | Rate Limit | Guard Fail Mode |
|---|---|---|---|---|
| Owner | Configured phone + optional PIN | All tools, exec, full autonomy | Unlimited | Fail-open |
| Known | Allowlisted phones | Read-only tools (glob, grep, read, guarded_web_*) | 20 msg + 10 tools/hr | Fail-closed |
| Unknown | Everyone else | Text responses only, no tools | 5 msg/hr, 0 tools | N/A |
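A compact sketch of the tier lookup and tool check, assuming the tool lists from the table. PIN elevation and rate limits are deliberately left out, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Tool patterns per tier, mirroring the Capabilities column.
TIER_TOOLS = {
    "owner": ("*",),
    "known": ("glob", "grep", "read", "guarded_web_*"),
    "unknown": (),
}


def resolve_tier(phone: str, owner_phones: set[str], known_phones: set[str]) -> str:
    """Map a sender's E.164 phone number to a trust tier."""
    if phone in owner_phones:
        return "owner"
    if phone in known_phones:
        return "known"
    return "unknown"


def tool_allowed(tier: str, tool: str) -> bool:
    """Check a tool name against the tier's allowed patterns (case-sensitive)."""
    return any(fnmatchcase(tool, pattern) for pattern in TIER_TOOLS[tier])
```

Defaulting to `unknown` means a number missing from both lists gets the most restrictive treatment.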
Send unlock <pin> via WhatsApp DM to elevate from known to owner tier.
- /etc/openclawd/env)
- openclaw_owner_auth_attempts_total{result="failure"}

WhatsApp group messages carry the group JID, not the sender's phone. Required a 3-component fix:
- message_received hook extracts senderE164
- groupLatestSender map bridges hooks to agent events
- lookupTier checks map before falling back to session key

The original Agents of Chaos paper assumes 1:1 messaging; group identity is an extension we had to build.
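The bridging idea can be illustrated with a small map from group JID to the most recent sender. The names `senderE164`, `groupLatestSender`, and `lookupTier` come from the description above, but their real behavior is not shown in the source; the class below, its TTL, and `lookup_sender` are assumptions.

```python
import time
from typing import Optional


class GroupSenderMap:
    """Remembers the latest sender per group so later hooks can resolve identity."""

    def __init__(self, ttl: float = 60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._latest = {}  # group JID -> (sender E.164, time recorded)

    def record(self, group_jid: str, sender_e164: str) -> None:
        self._latest[group_jid] = (sender_e164, self._clock())

    def latest(self, group_jid: str) -> Optional[str]:
        entry = self._latest.get(group_jid)
        if entry is None or self._clock() - entry[1] > self._ttl:
            return None  # unknown group or stale entry
        return entry[0]


def lookup_sender(session_key: str, group_map: GroupSenderMap) -> str:
    """Prefer the recorded group sender; fall back to the session key."""
    return group_map.latest(session_key) or session_key
```

The identity returned here would then feed the tier lookup, so a group message is judged by who sent it rather than by the group it arrived in.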
systemd service hardening lowers the exposure score (as reported by systemd-analyze security) from the default 8.7/10 EXPOSED to under 4.0.
- User = openclaw (not in wheel, no sudo)
- NoNewPrivileges = true
- PrivateTmp = true (isolated /tmp)
- UMask = 0077 (0600 files by default)
- ProtectSystem = strict (entire FS read-only)
- ReadWritePaths
- PrivateDevices = true
- ProtectKernelTunables/Modules/Logs = true
- ProtectControlGroups/Clock/Hostname = true
- ProtectProc = invisible
- RestrictNamespaces = true
- SystemCallFilter = @system-service ~@mount ~@reboot ~@debug ~@raw-io + pkey_alloc/free/mprotect + mincore (crashpad disabled via --disable-crash-reporter)
- ProcSubset = all (Chromium needs /proc/sys/fs/inotify; ProtectKernelTunables keeps it read-only)
- RestrictAddressFamilies = AF_INET AF_INET6 AF_UNIX AF_NETLINK
- SystemCallArchitectures = native

SECURITY.md and AGENTS.md are the agent's system prompt. They are:
- chattr +i, even root can't modify without unlocking)

Edit workflow: sudo openclaw-unlock → edit → sudo openclaw-lock → sudo openclaw-save-canonical
The agent has multiple paths to fetch web content. Each path requires its own defense.
- web_fetch, web_search. Browser is enabled (output screened by Layer 4).
- exec commands with curl/wget/lynx/chromium/playwright/puppeteer. Forces browser use through built-in tools (guarded) rather than raw exec.
- execute(). Agent only sees screened content. Fail-closed.
- browser_snapshot, browser_navigate, browser_screenshot, etc.) before persisting to conversation history (defense-in-depth).

Collected via 6 textfile collectors on 60s timers. Key metrics:
| Metric | Description |
|---|---|
| guard_calls_total | Guard invocations by mode/result/guard |
| guard_blocked_total | Blocked by threat type |
| guard_latency_seconds | Per-guard latency |
| approval_queue_count | Queue depth by status |
| trust_tier_blocked_total | Tier-based blocks |
| session_flagged_total | Behavioral flags |
| loop_circuit_breaker_total | Circuit breaker trips |
| vm_egress_denied_* | Blocked egress traffic |
| owner_auth_attempts_total | PIN attempts |
| prompt_guard_up | Local model health |
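A textfile collector ultimately writes lines in the Prometheus text exposition format to a `.prom` file. The sketch below shows one way to render and atomically write such a file; the helper names are hypothetical, and the file mode is an assumption about how the exporter reads it.

```python
import os
import tempfile


def _escape(value: str) -> str:
    """Escape a label value for the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, value, labels=None) -> str:
    """Render one sample line, with labels sorted for stable output."""
    if not labels:
        return f"{name} {value}"
    rendered = ",".join(f'{k}="{_escape(str(v))}"' for k, v in sorted(labels.items()))
    return f"{name}{{{rendered}}} {value}"


def write_textfile(path: str, lines: list[str]) -> None:
    """Write via a temp file and rename so the scraper never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    os.chmod(tmp_path, 0o644)  # assumption: the exporter runs as a different user
    os.replace(tmp_path, path)
```

For example, `render_metric("openclaw_owner_auth_attempts_total", 2, {"result": "failure"})` yields a line matching the labeled metric mentioned in the trust-tier section.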
| Alert | Severity |
|---|---|
| GuardBlockSpike (>10/5min) | warning |
| GuardAPIKeyExhausted | critical |
| PromptGuardDown (>2min) | critical |
| ApprovalQueueStale (>10min) | warning |
| SessionFlagged | warning |
| LoopBreakerActive | warning |
| TrustTierBreach (>5/5min) | warning |
| OwnerPINFailureSpike | critical |
| VMEgressDeniedSpike | critical |
| QuotaExhausted | warning |
Grafana Alloy ships journald + guard.log to Loki. PII-safe: guard.log rewritten to metric-only fields. Journal logs regex-redacted for credential patterns. Verified: 8/8 Loki queries confirmed zero PII leakage.
The most valuable outcomes were NOT confirmations of the taxonomy — those attacks work as described. The surprises came from defense layer interactions in production.
When we added an approval queue with a resolve_approval tool, the agent called it to approve its own rm -rf. After removing the tool, it used exec to cat the queue files and jq to change status from "pending" to "approved".
Fix: Remove the tool entirely AND block exec access to the queue directory in before_tool_call.
Lesson: Any tool that can influence a security decision must be inaccessible to the entity being constrained. The agent wasn't adversarial — it was being efficient.
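The fix can be sketched as a pre-call check. The hook's real signature is not shown in the source, so the `(tool, args)` shape, the `command` argument key, and the substring match are all assumptions.

```python
# Assumed marker: the queue directory name from /data/openclaw/approval-queue/.
QUEUE_DIR_NAME = "approval-queue"
SHELL_TOOLS = {"exec", "bash"}


def before_tool_call(tool: str, args: dict) -> bool:
    """Return True if the call may proceed, False if it must be blocked."""
    if tool == "resolve_approval":
        return False  # the constrained agent must never resolve its own approvals
    if tool in SHELL_TOOLS:
        command = str(args.get("command", ""))
        # Coarse on purpose: any mention of the queue directory is refused,
        # including relative paths such as `cd /data/openclaw && cat approval-queue/*`.
        if QUEUE_DIR_NAME in command:
            return False
    return True
```

A string match like this is easy to evade with globbing or shell variables, so it works best as a second line behind filesystem permissions that keep the queue directory unreadable and unwritable for the agent's user.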
PrivateTmp=true gives the agent an isolated /tmp. The agent ran rm -rf /tmp/build-artifacts, got exit code 0 (thanks to -f), and reported success. Nothing was actually deleted — it operated on the isolated tmpfs.
The agent's world model diverged from reality. It confidently reported outcomes that didn't occur. For chained actions, this gap compounds.
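One mitigation is to report observed state changes instead of exit codes. The sketch below is an illustration of that idea, not something the source describes implementing; `delete_and_verify` is a hypothetical wrapper.

```python
import os
import subprocess


def delete_and_verify(path: str) -> dict:
    """Run rm -rf and report what actually changed, not just the exit code."""
    existed_before = os.path.lexists(path)
    result = subprocess.run(["rm", "-rf", "--", path], capture_output=True)
    exists_after = os.path.lexists(path)
    return {
        "existed_before": existed_before,
        "exit_code": result.returncode,
        "exists_after": exists_after,
        "deleted_something": existed_before and not exists_after,
    }
```

Inside the sandbox, a path that is absent from the isolated /tmp comes back with `deleted_something` set to false even though the exit code is 0, which gives the agent a truthful basis for its report.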
GPT-4o-mini's free-text verdict.reason was returned to the agent for transparency. But the reason field is generated from attacker-controlled input — it could paraphrase or echo attacker instructions. The agent treated this as authoritative output from a trusted security component.
Fix: Fixed-vocabulary enum instead of free text.
Gateway's UMask=0077 protects secrets but breaks Syncthing sharing. Truncating content to 2000 chars for latency created the attack pattern the guard was looking for (false positives on job postings). Each security choice has second-order effects at domain boundaries.
This solo setup serves as a reference architecture. Below is how each defense layer maps to common company/startup stacks: AWS, GCP, and self-hosted (Nomad, bare metal, etc.).
The 4-stage filter is a TypeScript plugin running inside the agent process. Works anywhere Node.js runs. Lakera paid tier ($50/mo) removes the 10K/month limit. Prompt Guard 2 local model needs 1 CPU + 1 GB RAM — runs as a sidecar or on any Linux box.
The before_tool_call hook is agent-agnostic. Quotas are tracked in-process (JSON counters). The approval queue is file-based but can be backed by any queue (Redis, SQS, Pub/Sub, PostgreSQL). Tune budgets per use case.
The three-tier model (owner/known/unknown) maps to any RBAC system. Owner = SRE/Security/Admin, Known = Engineering, Unknown = External/CI. Swap the identity source (phone → SSO, Slack ID, GitHub login, API key).
Pure plugin code (TypeScript). No infrastructure dependency. Runs wherever the agent runs.
| Layer | AWS | GCP | Self-Hosted (Nomad / Bare Metal) |
|---|---|---|---|
| Compute | ECS Fargate task (agent + sidecars) | Cloud Run service or GKE pod | Nomad job / systemd service / Docker Compose |
| Container Hardening | readonlyRootFilesystem, drop caps, non-root in task def | Cloud Run: read-only by default. GKE: securityContext in pod spec | systemd: ProtectSystem=strict, seccomp. Docker: --read-only --cap-drop ALL |
| Egress Enforcement | VPC Security Group: default-deny egress, allowlist API IPs | VPC Firewall rules or GKE NetworkPolicy | nftables / iptables (reference config provided). Nomad: network stanza + Consul intentions |
| Credential Isolation | Sidecar container in ECS task. Secrets from AWS Secrets Manager | Sidecar in Cloud Run multi-container or GKE pod. Secrets from Secret Manager | Sidecar process or Unix socket proxy. Secrets from Vault, SOPS, agenix |
| Approval Queue | SQS FIFO + Lambda + Slack webhook | Pub/Sub + Cloud Function + Slack webhook | Redis queue / PostgreSQL table + cron notifier + Slack/email |
| Identity Backend | Slack user ID, IAM roles, or Cognito | Slack user ID, Google Workspace identity, or Firebase Auth | Slack user ID, LDAP, Keycloak, or API key with scopes |
| Monitoring | DogStatsD sidecar → Datadog. Or CloudWatch custom metrics | OpenTelemetry → Cloud Monitoring. Or Datadog | Prometheus textfile collector (reference config provided). Or Grafana Cloud |
| Log Shipping | CloudWatch Logs + Datadog Forwarder Lambda | Cloud Logging (built-in for Cloud Run/GKE) | Grafana Alloy → Loki. Or Promtail. Or rsyslog → your SIEM |
| Priority | Layer | Why First | Effort |
|---|---|---|---|
| P0 | Guard Chain + Trust Tiers | Prevents prompt injection and unauthorized access — the two most likely attack vectors on any deployment. | ~2 days |
| P0 | Container / Process Hardening | Limits blast radius. Every platform has native primitives (ECS task def, GKE securityContext, systemd sandbox). | ~1 day |
| P1 | Egress Enforcement | Prevents data exfiltration. Use your platform's network primitives (SG, firewall rules, nftables). | ~2 days |
| P1 | Approval Queue | Prevents catastrophic actions. Use your existing queue infra (SQS, Pub/Sub, Redis). | ~2 days |
| P2 | Web Content Guard | Only needed if agent fetches external URLs. Plugin code works as-is. | ~1 day |
| P2 | Monitoring + Alerting | Visibility. The 28 metrics and 21 alert rules are portable — adapt to your stack (Datadog, Prometheus, CloudWatch). | ~1 day |
| P3 | Session Analysis + Loop Breaker | Behavioral detection. Lower priority — structural controls handle most cases. | ~1 day |
Agent assists with incident response, IaC plan review, security finding analysis, audit trail investigation. Key controls: Trust tiers (SRE = owner), action quotas on destructive ops, approval queue for production changes.
Agent queries dashboards/metrics, analyzes logs, correlates failures across services. Key controls: Read-only tool access for most users. Egress limited to internal monitoring endpoints.
Agent assists with code review, PR creation, test execution, documentation. Key controls: Sandbox (can't reach production databases), web content guard (fetching external docs), trust tiers (team = known, CI = restricted).
Agent executes predefined runbooks, restarts services, scales resources. Key controls: Owner-only tier, approval queue for every action, full audit trail via guard.log, loop breaker prevents runaway automation.
Agent defense layers complement your existing security stack — they operate at a layer (semantic intent) that traditional tools cannot see:
| Your Existing Tool | Protects | Agent Layer Adds |
|---|---|---|
| Cloud threat detection (GuardDuty, SCC, Falco) | Infrastructure anomalies | Guard chain catches prompt-level attacks at the semantic layer |
| WAF / API Gateway | HTTP-level attacks | Guard chain analyzes intent, not just payload patterns |
| Audit logs (CloudTrail, GCP Audit, auditd) | API/system call audit | Guard.log provides agent-level audit (what was attempted, why blocked) |
| Secret managers (Vault, AWS SM, GCP SM) | Credential storage | Sidecar proxy ensures credentials never enter the agent's context |
| Monitoring (Datadog, Prometheus, CloudWatch) | Infrastructure metrics | Agent-specific metrics (guard latency, block rate, tier breaches) |
| SIEM / log aggregation | Correlation & alerting | Guard.log is JSON-lines, PII-safe, ready to ship to any SIEM |
| Component | Solo (reference) | Startup (low volume) | Growth (high volume) |
|---|---|---|---|
| Lakera Guard | Free (10K/mo) | $50/mo (100K calls) | $200/mo (1M calls) |
| Prompt Guard 2 (local) | Free (CPU) | ~$30/mo (1 vCPU sidecar) | ~$30/mo (same) |
| GPT-4o-mini fallback | ~$0.01/mo | ~$1/mo | ~$5/mo |
| Agent compute | Free (existing host) | ~$30-60/mo (Fargate/Cloud Run) | ~$100-200/mo |
| Queue + notifications | Free (file-based) | ~$1/mo (SQS/Pub/Sub) | ~$5/mo |
| Total incremental | ~$0.04/mo | ~$110-140/mo | ~$340-440/mo |
All secrets managed via agenix (NixOS declarative secrets).
- LAKERA_API_KEY: Guard v2 auth
- OPENROUTER_API_KEY: GPT-4o-mini fallback
- OWNER_PHONES: E.164 numbers for notifications
- OWNER_PIN: Identity elevation
- BRAVE_SEARCH_API_KEY: Search fallback

Decrypted at deploy time to /etc/openclawd/env (0440). Auto-restarted on changes via restartTriggers.
55 test vectors covering:
./deploy.sh openclawd guard-test
| Command | Purpose |
|---|---|
| ./deploy.sh openclawd status | Check security posture |
| ./deploy.sh openclawd guard-test | Run 55-vector test suite |
| ./deploy.sh openclawd guard-status | Guard health + credit alerts |
| sudo openclaw-lock | Lock identity files |
| sudo openclaw-unlock | Unlock for editing |
| sudo openclaw-approve | List/resolve pending approvals |
| sudo openclaw-save-canonical | Backup golden copy |
| sudo openclaw-restore-canonical | Incident response restore |
This architecture is documented in an ongoing experience report:
- docs/paper-draft.md
- docs/PAPER-OUTLINE.md
- LESSONS-OPENCLAW.md

References: Agents of Chaos (arXiv:2602.20021), CaMeL (arXiv:2503.18813), Silent Egress (arXiv:2602.22450), PCAS (arXiv:2602.16708), AgentSys (arXiv:2602.07398)