Guarding tool calls against prompt injection / exfiltration

Hi everyone, I’m collecting real-world prompt injection cases against tool-using agents (LangGraph/LangChain). If you’ve seen an agent get steered into sending data to the wrong place (email/Slack/share links) or making an unintended write, I’d love to hear what the pattern looked like.
In return, I can share a tiny test harness + guardrail approach that blocks those tool calls and captures evidence for debugging.


Hi @skylerxu199

This is a cool idea. I’ve seen cases where indirect prompt injection led agents to make tool calls they didn’t mean to (especially when they passed sensitive information to external APIs). It would be great if you could share the test harness and guardrail method. Please also leave the link to the document here so I can read it.

Also @skylerxu199, please refer to these docs, which might be useful: Guardrails - Docs by LangChain


I put together a short writeup that explains the approach and threat model, plus a way to get access to the guardrail SDK: Agent Time Machine. Feedback welcome, especially on false positives and integration friction.

Thanks for the pointer to the LangChain Guardrails docs. At a high level, TimeMachine acts like middleware around tool execution: it intercepts tool calls, checks policy at execution time (not just in the prompt), and records an evidence trail showing which untrusted output influenced which tool argument (e.g. recipient/URL/IBAN).
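To make the idea concrete, here is a minimal sketch of that pattern, not the actual TimeMachine SDK. All names (`GuardrailMiddleware`, `Evidence`, the `taint` mapping) are hypothetical; the point is that the policy check and evidence capture happen at call time, after the model has already produced the tool arguments:

```python
# Illustrative sketch of execution-time guardrail middleware.
# NOT the TimeMachine SDK; all names here are hypothetical.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Evidence:
    """Records which untrusted source influenced which tool argument."""
    tool: str
    arg: str
    value: Any
    tainted_by: str  # e.g. "web_page", "email_body"


@dataclass
class GuardrailMiddleware:
    allowed_recipients: set
    evidence: list = field(default_factory=list)

    def check(self, tool: str, args: dict, taint: dict) -> bool:
        """Execution-time policy check. `taint` maps argument names to the
        untrusted source that produced their value, if any."""
        for name, value in args.items():
            source = taint.get(name)
            if source is not None:
                # Capture the evidence trail regardless of the verdict,
                # so blocked calls can be debugged afterwards.
                self.evidence.append(Evidence(tool, name, value, source))
            # Example policy: recipients must be on an allowlist.
            if name == "recipient" and value not in self.allowed_recipients:
                return False
        return True

    def call(self, tool_name: str, fn: Callable, args: dict, taint: dict):
        """Wrap the real tool function; block the call if policy fails."""
        if not self.check(tool_name, args, taint):
            raise PermissionError(f"blocked {tool_name}: policy violation")
        return fn(**args)


# Usage: an injected page steers the agent toward an attacker address;
# the middleware blocks the send and keeps the taint evidence.
mw = GuardrailMiddleware(allowed_recipients={"alice@example.com"})

def send_email(recipient: str, body: str) -> str:
    return "sent"

mw.call("send_email", send_email,
        {"recipient": "alice@example.com", "body": "hi"}, taint={})
try:
    mw.call("send_email", send_email,
            {"recipient": "attacker@evil.test", "body": "hi"},
            taint={"recipient": "web_page"})
except PermissionError:
    pass  # blocked; mw.evidence shows which source tainted the recipient
```

The design choice worth noting is that the check keys off the actual arguments at execution time, so it still works when the injection happens deep in a multi-step chain and never appears in the system prompt.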