Agent Goal Pinning, an Experiment and POC
What if we applied the same theory from mobile application security to an agent ?
Let's start off by saying this is an experiment and just an idea that I've been looking into. This work builds on my previously shared research into non adversarial conversational state corruption in LLMs. It may or may not be an answer, but it sure was fun going down this path. Without any fanfare let's get straight into it. As usual, I like to start at the beginning to ensure that we have a shared understanding to work from.
What is an agent?
Simply put, an AI agent is an intelligent software system designed to perceive its environment, reason about how to achieve specific goals, and take autonomous actions to fulfill those objectives.
There are a number of characteristics that define an agent and these can include:
Autonomy: Agents operate independently to pursue goals, making decisions on how to complete a task rather than just executing a single instruction.
Perception & Action: Agents perceive their environment (e.g., user inputs, database states) and act upon the perceived environment by using tools, calling APIs, or manipulating files in order to change the state of their environment
Reasoning & Planning: Agents use the connected LLM to break down high-level goals into step-by-step plans (often using patterns like “ReAct”—Reason + Act) and adjust those plans based on feedback or errors.
Tool Use: A defining feature is the ability to connect to external functions—such as web browsers, calculators, databases, or other software—to perform tasks beyond the model’s internal knowledge.
Memory: Agents maintain short-term (session) or long-term (persistent) memory to retain context, learn from past interactions, and improve decision-making over time
Now that we have the basics out of the way, let's get to the fun part.
Conversational State Corruption
From the aforementioned definitions, we know that an agent should have a clear goal and reasoning about how to achieve said goal using a number of tools within its environment. This is the part we love as security folks. How do we get the agent to do things that it isn't supposed to do by manipulating its behaviour and architecture against itself?
There are many ways to do this, and that is another writeup. For purposes of this writeup, we build on my previous work of non adversarial state corruption where I use Socratric questioning and template injection at the reasoning layer to create epistemic drift. This aim of this attack vector is to exploit the self critic mechanism to gradually bypass the inherent filters and guardrails that are built on deontological ethics and pattern matching techniques as opposed to context aware dynamic resolution and filtering.
Lest I digress into the full attack chain, let's leave it at that as that's what is needed to understand why this POC exists. You now know the high level details of how we can achieve epistemic drift and why this is a problem. In short, this is how this experiment and POC came to be.
What if we applied the same principles from mobile security to agents?
Agen Goal & Permission Pinning with a dash of Merkle Chains for fun.
The idea is very simple really. We use the same principles of app pinning just in the agentic realm using existing cryptographic hash strategies and an authorisation service for verifying tool calling permission.
We start with the manifest file which defines the name, goals and permitted actions an agent may take and follow this journey:
A developer defines agent goals/permissions in
agent_manifest.jsonThe manifest is signed with the developers GPG key (that uses Elliptic Curve Cryptography) to produce the
agent_certificate.ascwhich we store in a secret manager (CMK).At runtime, the auth service loads the signed certificate and extracts SHA-256 hashes of the original goals and permissions.
When the agent attempts a privileged action, it sends its current goals/permissions to the auth service for cross checking if it is allowed to perform this action.
The service compares hashes — if they don’t match, the action is blocked and a
SECURITY_VIOLATIONalert is raised.Immutable Logging: Every authorized decision is appended to a Merkle Chain (hash chain).
Each log entry is cryptographically linked to the previous one.
The chain is persisted to
agent_decision_log.json.The
/logsendpoint retrieves the chain.The
/verify_logsendpoint re-calculates all hashes to prove integrity.
As previously stated, this is scoped to and limited to memory injection attacks and is an experiment to understand how we can leverage existing mechanisms in a new paradigm.
Below are some of the test results from this experiment
This is as far as I have gotten on exploring this idea.
Alternative Experiments
I also explored using ZKPs for the agent goal binding, albeit a fascinating and fun experiment, the computational costs, speed and overhead of a blockchain solution did not make sense for what I wanted to look into.
This was tested on the Polygon Mumbai testnet, which is a faster DLT tech stack and provides increased cryptographic benefits. The idea works but the experiment proved too much for what I wanted to work through and led me down the rabbit hole of creating an agent name service and more.
As you can see there are pros and cons to this approach compared to the initial simpler approach. I noted that this ZKP solution also created a wider attack surface and introduced too much complexity on the user end. These are some of the reasons that I abandoned this route and stuck to the simpler method.
So, What Now?
I started this entry by stating that this may or may not be a solution. I've created a repo for the basic experiment and POC, and invite anyone to challenge, invalidate or improve upon this idea. It is just an idea, not a production grade solution. I write about such ideas to increase discourse around how we could potentially approach Zero Trust through agent identity and goal pinning by using techniques from other AppSec domains. Breaking the agent tech stack was the easy part. Now the not so easy part, how do we make it easy to add defence in depth and defence in function in the new security paradigm?
As always,
Be Kind. Be Brave. Be You.
Mats.












