AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Abstract

LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horizon attacks that exploit multi-turn user–agent–environment interactions to achieve objectives infeasible in single-turn settings. To measure agent vulnerabilities to such risks, we present AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long-horizon attacks.

Currently, AgentLAB supports five novel attack types — intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning — spanning 28 realistic agentic environments and 644 security test cases. Leveraging AgentLAB, we evaluate representative LLM agents and find that they remain highly susceptible to long-horizon attacks; moreover, defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats.

Long-Horizon Attack Families

AgentLAB instantiates five attack families that exploit distinct aspects of extended user–agent–environment interactions.

User Intent Hijacking

Progressively erodes safety guardrails through multi-turn social engineering, crafting personas and contextual framings that gradually deceive the agent into executing a malicious task via tool calls — a form of jailbreak that targets agent actions rather than content generation.

User Tool Chaining

Decomposes a malicious task into a sequence of individually benign tool calls, then guides the agent to execute them step by step. Each call in isolation appears safe; their composition achieves a harmful objective that would be rejected if requested directly.

Env Objective Drifting

Embeds subtle objective-shifting content across multiple environmental observations (e.g., product descriptions) to cumulatively redirect the agent's goal — individually benign injections whose aggregate effect substantially alters behavior.

Env Task Injection

Injects a malicious task alongside a benign one using coordinated multi-turn indirect prompt injections. Intermediate bridging actions connect the benign and malicious tool sequences, making the transition far harder to detect than a direct injection.

Env Memory Poisoning

Persists malicious preferences in an agent's external memory by embedding hidden instructions in routine content (emails, code, products). Poisoned entries are later retrieved to override safety behaviors when the agent encounters a target harmful request.

The unified multi-agent framework coordinating a planner, attacker, verifier, and judge for adaptive long-horizon attacks.

Benchmark Results

We evaluate six LLM agents — spanning both proprietary and open-weight models — using Attack Success Rate (ASR) and average Turns-to-Success (T2S). Results show that even frontier models remain highly vulnerable to long-horizon attacks, with overall ASR exceeding 70% on GPT-5.1. Common defenses designed for single-turn attacks fail to generalize to the long-horizon regime.

Agent	Intent Hijacking	Tool Chaining	Objective Drifting	Task Injection	Memory Poisoning	Overall ASR
Qwen-3	78.1%	96.3%	92.2%	93.1%	48.0%	81.5%
Llama-3.1	53.3%	90.4%	67.4%	86.6%	34.6%	66.5%
GPT-4o	74.0%	94.1%	79.2%	79.9%	63.3%	78.1%
GPT-5.1	59.8%	94.6%	73.7%	21.5%	51.3%	69.9%
Gemini-3	46.2%	95.9%	15.8%	43.1%	67.3%	53.7%
Claude-4.5	27.2%	73.3%	5.3%	0.0%	38.8%	28.9%

Table 1. Attack Success Rate (ASR, %) across five long-horizon attack families. Higher values indicate greater model vulnerability.

ASR (%) vs. average turns-to-success (T2S) across five attack categories. Upper-left indicates higher vulnerability; lower-right indicates greater robustness.

Attack Demonstrations

We illustrate two representative long-horizon attacks. Task Injection uses coordinated multi-turn environment injections to redirect an agent from reading a webpage to executing unauthorized Slack commands. Memory Poisoning silently plants malicious preferences during routine tasks, later causing the agent to leak sensitive data.

Task Injection. Coordinated injections hijack a benign webpage-reading task into unauthorized Slack user management commands.

Memory Poisoning. Hidden injections in emails and code are stored as user preferences, later retrieved to bypass safety and leak sensitive data.

BibTeX

@article{jiang2026agentlab,
  author    = {Tanqiu Jiang and Yuhui Wang and Jiacheng Liang and Ting Wang},
  title     = {AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks},
  journal   = {arXiv preprint arXiv:2602.16901},
  year      = {2026},
}