Guarding tool calls against prompt injection / exfiltration

Hi everyone, I’m collecting real-world prompt injection cases against tool-using agents (LangGraph/LangChain). If you’ve seen an agent get steered into sending data to the wrong place (email/Slack/share links) or making an unintended write, I’d love to hear what the pattern looked like.
In return, I can share a tiny test harness + guardrail approach that blocks those tool calls and captures evidence for debugging.

Hi @skylerxu199

This is a cool idea. I’ve seen cases where indirect prompt injection led agents to make tool calls they didn’t mean to (especially when they passed sensitive information to external APIs). It would be great if you could share the test harness and guardrail method. Please also leave the link to the document here so I can read it.

Also @skylerxu199 please do refer to these docs which might be useful– Guardrails - Docs by LangChain

I put together a short writeup that explains the approach and threat model, plus a way to get access to the guardrail SDK: Agent Time Machine. Feedback welcome, especially on false positives and integration friction.

Thanks for the pointer to LangChain Guardrails docs. High level, TimeMachine acts like middleware around tool execution: it intercepts tool calls, checks policy at execution time (not just in the prompt), and records an evidence trail showing which untrusted output influenced which tool argument (eg recipient/URL/IBAN).

Interesting thread.

I’m researching failure patterns in tool-using agents in production (LangGraph / LangChain).

Beyond prompt injection cases, I’m curious what failures people see most often when tools are involved.

For example:

agent selecting the wrong tool

invalid tool arguments

loops where the same tool is called repeatedly

tool responses being misinterpreted by the agent

When something like this happens in production, how do you usually debug it today?

Do logs/traces usually make it obvious, or does it take time to figure out what actually happened?