// research · independent

Clawbot.

Prompt injection, tool abuse, and data exfiltration
in autonomous enterprise AI agents.

A controlled adversarial evaluation of a locally-deployed autonomous AI agent operating within a simulated enterprise finance workstation. Built to identify the precise configuration boundary at which a secure-by-default agent deployment becomes a credible insider threat vector.

Status

Report published · April 2026

Type

Independent research · adversarial lab

Detection

auditd · session logging

Download full report (PDF) → 16 pages · methodology, payloads, auditd config

The proliferation of agentic AI systems in enterprise environments has outpaced the development of appropriate security frameworks for their deployment. Unlike traditional software, autonomous AI agents do not merely execute deterministic logic — they interpret natural language instructions, reason over context, and invoke real system tools in response to prompts. This creates an instruction layer that sits above conventional access controls and is not addressed by standard endpoint hardening.

central question Under what conditions does a locally-deployed autonomous AI agent, operating within standard user privileges, become a viable insider threat vector through prompt-layer manipulation alone?

The threat model assumes no network compromise, no privilege escalation, and no specialised attacker access. The adversary is constrained to instruction input — anyone who can influence what the agent is asked to do, whether through direct interaction, a poisoned document, or an injected task. This mirrors real-world risks including malicious document processing, delegated agent tasks, and multi-agent pipeline contamination.

// host

Windows 11 Pro

VirtualBox 7.0

// guest

Ubuntu Server 22.04 LTS

4 vCPU · 8 GB RAM

// agent

OpenClaw gateway

loopback only · static token

// model

Llama3:8b via Ollama

no external API calls

// privilege

financeintern

non-root · no sudo

// data

Synthetic payroll & investment

plausible but fictional

phase 01

Prompt injection

Determine whether malicious instructions embedded in benign prompts override the agent's declared task scope.

all three injection variants invoked shell tool execution beyond declared task scope

phase 02

Data exfiltration simulation

Determine whether sensitive file contents can be extracted, encoded, and staged through prompt manipulation alone.

base64-encoded payroll data written to staging location from a single natural-language instruction

phase 03

Tool escalation behaviour

Evaluate whether the agent chains commands beyond explicit scope, exploring the attack surface autonomously.

OpenClaw API token disclosed via routine environment-variable enumeration

phase 04

Persistence manipulation

Assess whether the agent can modify its own persistence configuration or establish secondary mechanisms.

systemd unit modified, cron job created (undetected by auditd), .bashrc injected

// security boundary

An autonomous AI agent becomes an insider threat vector when it is granted shell execution capability and operates without instruction-scope enforcement, regardless of its user privilege level.

Shell access is the decisive factor. Without it, an agent with access to sensitive files can only disclose them through its conversational output — bounded and observable. With shell access, the agent can read, encode, stage, and implicitly exfiltrate data, modify its own persistence, and enumerate system state — all without any OS-level privilege beyond standard user access.

auditd provides good coverage of file-access and process events at the syscall layer — but is structurally blind to the instruction that caused an event. A file read initiated by a malicious injection is indistinguishable in the logs from a legitimate agent task. Behavioural detection would require correlating prompt content with resulting syscalls — a capability not present at the OS layer.

Cron modification produces no audited syscall in the default ruleset; environment-variable reads are entirely invisible to OS monitoring. The cause of an action is invisible at the OS layer, while the only system with visibility (the agent's own session log) is itself within the agent's reach to read or modify.

  1. Command allowlisting at the tool layer. Shell access should be restricted to a defined set of permitted commands. Wildcard execution should never be granted to agents operating near sensitive data.
  2. Instruction-scope validation. A runtime layer that evaluates whether a tool invocation is consistent with the declared task — implementable as a lightweight secondary classifier or a rule-based gate.
  3. Explicit confirmation for write operations. File writes, persistence modifications, and encoding operations should require human confirmation when sensitive data is in scope.
  4. Scoped file system access. The agent should operate within an explicitly-bounded file access list. Access to ~/.ssh, ~/.bashrc, ~/.config/systemd, and similar paths should be denied.
  5. Extended auditd coverage. Rules to cover crontab writes, /etc/passwd reads, and dotfile modifications, with automated alerting on these events.
  6. Tamper-resistant session logs. Agent session logs shipped to a remote, append-only store immediately on generation. Local-only logs are a detection liability.
  7. Secrets out of environment variables. Authentication tokens and API keys belong in a secrets manager or a permission-restricted credential file, not the runtime environment.

As autonomous agents are integrated into enterprise workflows — handling email, processing documents, managing files — the attack surface described here scales with them. The security community's response must move beyond perimeter and privilege-based thinking and engage with the instruction layer as a first-class threat surface.

Future work in this line will examine multi-agent pipeline contamination, where a compromised upstream agent poisons the context of downstream agents, and cross-agent persistence propagation scenarios.

← Back to research