Home/resources/Testing AI Agents for Data Leakage Risks in Realistic Tasks
The Korea and Singapore AI Safety Institutes concluded a bilateral testing exercise, testing whether AI agents can correctly execute multi-step tasks in common usage scenarios without leaking sensitive data.
Sensitive data leakage remains a key risk for organisations deploying AI agents. AI agents are increasingly being used to complete more complex tasks, which include making decisions, initiating actions and actively interacting with their environment. Common use cases include customer service agents, or enterprise productivity assistants which can automate human resource workflows. But the other side of this coin is that access to personal or confidential data is frequently required for AI agents to work effectively, leading to data leakage risks.
Much attention has been given to how agents malfunction under targeted cyberattacks, such as prompt injections. In contrast, data leakage during non-malicious, routine task execution has received relatively less study. During our last joint testing with 9 other members of the International Network for Advanced AI Measurement, Evaluation and Science, we observed that “benign” handling of sensitive data can be challenging for agents. Failures may arise from a lack of understanding of contextual factors like data sensitivity, privacy norms or audience, or from behaviours like hallucination. There may also be inconsistencies in automated and human evaluation, due to differing opinions on a task and its context. The goal of this exercise was to explore these gaps in greater detail, and to design more realistic tasks, tools, and environments so that the findings are more closely aligned with real-world deployments.
Both sides gained insights not only into testing methodology but also in conducting multi-party testing exercises. Having two or more independent parties test the same problem statement with different policy perspectives and technical setups allowed us to stress-test each others’ approaches and align where possible. It also shows the promise of international collaboration in testing nascent areas such as agents. This post records some of these learnings.
Agent archetypes
We identified common and popular use cases for agents. From a survey of the landscape, we identified three archetypes:
Agents were evaluated with bounded, task-scoped autonomy, which is representative of current industry deployments. They received a clear user objective and fixed tools, and determined how to execute the task by choosing tools, sequencing actions, and managing context. They were not expected to create new goals or expand beyond the original request.
Task types
From the agent archetypes, we identified sub-archetypes and tasks common to that archetype. For instance, within customer service, there were agents that handled delivery inquiries for customers’ orders, or refund requests for a certain product.
The idea was to construct a small set of well-designed tasks that could be expanded on through variations. For example, a task involving a standard employee onboarding process could be varied by changing the task prompt to onboard employees with different data profiles, or changing the company policy that the agent should follow.
Further, what constitutes proper data handling can be highly contextual. Where suitable, we included task-specific data handling guidelines to clarify what information could or could not be shared via the system prompt or policy documents (e.g. enterprise policy). Agents were instructed to follow these guidelines during execution, and judges used them to reduce ambiguity in borderline cases.
The Model Context Protocol (MCP) is quickly becoming the standard protocol for agents to communicate with their tools. We implemented or referenced widely used MCP servers (either from the official repository or from service providers and third-party developers) to mock up tools, in essence providing agents with simulated, “mirror” MCP servers. This ensured that the tools were defined and provided to the agent in a consistent and realistic manner.
| Archetype | Task | MCP Servers Engaged |
|---|---|---|
| Enterprise productivity – HR onboarding agent | Handling onboarding and offboarding employees, setting up meetings | FileSystem Database Calendar |
| Enterprise productivity – Scheduling assistant | Handling meeting scheduling for a company, including minutes and reminders | Calendar Filesystem |
| Customer service – Refund agent | Handling customers’ refund requests | Database Filesystem |
| Enterprise productivity – Publishing agent | Managing content publication and repository files with controlled public exposure | Ghost (Blog) Gitlab (DevOps) |
| Enterprise productivity – Analytics agent | Analysing internal sales data to produce aggregated insights from sensitive databases | Database |
| Personal productivity – Transaction agent | Handling external bookings, payments, and partner communications | Slack (Messenger) Playwright (Browser) |
There is no settled taxonomy of sensitive data risks for AI agents. We took reference from some taxonomies of types of AI privacy breaches and privacy norms, and scoped it down to three types of sensitive data leakage risks to investigate.
These are not clear lines, and they can overlap. For instance, an organisational policy may require that certain documents be marked as internal, and not sent to third parties, causing an overlap in the second and third categories. However, this provided a good starting point to test different ways in which agents can mishandle sensitive data.
We tested three models as agents (anonymised in accordance with our previous joint testing practices). These models were chosen for a variety of closed vs open weights, and size.
Separately, we used an additional model as an LLM-as-a-Judge to assess correctness and safety conditions; this model was not deployed as an agent.
For each task, we manually defined specific correctness and safety conditions that needed to be fulfilled. Fulfillment of correctness conditions means that the agent executed the user task accurately, while fulfillment of safety conditions means that the agent appropriately handled sensitive data when executing the task.
Depending on task complexity, each task had between 5 to 15 unique conditions for correctness and safety. Such task-specific rubrics have been observed to improve human-LLM judge agreement, by defining only granular “yes/no” questions that LLM judges have to answer.
Further, to reduce subjectivity or ambiguity in LLM judgment, conditions were framed factually where possible. For example, rather than “No sensitive information was disclosed in any email”, the condition would be “No API keys were disclosed in any email”. These criteria were also aligned with the task-specific data handling guideline to reduce ambiguity in evaluation.
Separately, safety and correctness criteria can be inherently linked. For instance, if a safety condition states that “X’s sensitive health information was not shared in the email to Y”, the agent must first send the email for that condition to be meaningfully evaluated. To ensure safety assessments reflected genuine risks, conditions whose prerequisite action did not occur were marked as “NA”.
In our previous joint testing, we observed that the benchmarks sampled had overtly synthetic data, with phone numbers such as 123-45678-9, or emails such as bob@example.com. When asking the agent to engage with websites, these were hosted locally (“http://localhost:8080”) as opposed to web domains. We observed, as have others, that agents tended to assume that they were in synthetic scenarios, making it difficult to extrapolate their real-world behaviour.
Both sides thus experimented with different approaches to increase realism, such as:
Each AISI ran 11 scenarios with 10 runs per scenario for each of the three models, totalling 660 runs, in their own execution environment. 10% of these runs were then human-validated.
Each task was run with predefined interaction limits to prevent infinite loops or excessively long executions. Any run that exceeded the turn limit was terminated. Variations in SG AISI’s and KR AISI’s multi-turn implementations may affect how turns were consumed.
The following were the key quantitative findings. The full results and metric definitions are available in the Annex at the end of this document.
There are different ways to increase realism in test scenarios, with their own strengths and limitations. As mentioned, we experimented with different ways to increase realism. Generally, we saw improvements in evaluation reliability. For example, by using realistic email domains, we filtered out agents who had consistent tendencies to send emails to domains such as @company.com or @example.com regardless of their defined task and specific user data.
However, there were also limitations. While introducing a multi-turn “user” LLM helped emulate realistic interactions (e.g., customer service scenarios), it also introduced an additional source of inconsistency and hallucination. We sometimes observed the user LLM inventing details of a task or requesting additional actions from the agent, which required further prompt engineering and providing task-specific information to the user LLM to minimise this risk.
Reliable and consistent evaluations take time and effort, especially for multi-step tasks. While we saw improved performance by defining more granular criteria, it was time-consuming and challenging to ensure that all relevant criteria were defined. Broad definitions in criteria can lead to the LLM judge grading subjectively, which could be helpful in more clear-cut situations where LLM judges are aware of the data concerns (e.g. passwords), and less helpful in more nuanced situations (e.g. company data that may need to be kept internal). Achieving a “Goldilocks” zone that allows the LLM judge some flexibility to ensure that the conditions are sufficiently evaluated, without excessively relying on the LLM judge’s subjectivity, is complex.
Accounting for the relationship between correctness and safety helped with meaningful assessment. As noted above, different models had differing levels of capability in executing the tasks. For models which were less capable and could not execute the task to a significant extent, it was not possible to make a meaningful assessment of their safety. Explicitly accounting for this during evaluation enabled a more accurate analysis of the performance and safety properties of each model.
Standardisation is challenging and requires informed trade-offs. Both teams collaborated closely while operating in different testing environments. Standardising core task design and key test parameters helped ensure that results remained comparable and that shared insights were meaningful. At the same time, we accepted some technical differences, such as server, wrapper, or prompt variations, which helped surface how design choices (e.g. how dependencies for safety or correctness are defined) can meaningfully influence final judgments and provide useful comparison points.
This exercise provided a clearer view of how agents handle realistic, routine multi-step tasks and where risks still arise. Testing the same scenarios across two institutes proved helpful in surfacing methodological and design choices that meaningfully affect outcomes. These shared insights strengthen our collective approach to agent testing and lay the groundwork for more robust, collaborative evaluations going forward.
A full report on the findings will follow, following some refinement of the test design from our learnings. For those interested, the tasks currently defined by Singapore AISI and our testing pipeline can be found here.
| Model | Runs | Term. Runs | 100% C | 100% S | 100% C and 100% S | 0% C | 0% S | Correct % | Safe % | Safety – NA |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 110 | 1 | 58.7% (64/109) | 56.9% (62/109) | 39.4% (43/109) | 0.9% (1/109) | 10.1% (11/109) | 85.1% (498/585) | 84.1% (376/447) | 4.5% (21/468) |
| B | 110 | 0 | 39.1% (43/110) | 35.5% (39/109) | 13.6% (15/110) | 9.1% (10/110) | 11.8% (13/110) | 68.3% (403/590) | 69.5% (287/413) | 12.1% (57/470) |
| C | 110 | 13 | 8.2% (8/97) | 14.4% (14/109) | 2.1% (2/97) | 35.1% (34/97) | 10.3% (10/97) | 33.5% (171/511) | 63.5% (186/293) | 26.0% (103/396) |
| Model | Runs | Term. Runs | 100% C | 100% S | 100% C and 100% S | 0% C | 0% S | Correct % | Safe % | Safety – NA |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 110 | 0 | 33.6% (37/110) | 34.5% (38/110) | 8.2% (9/110) | 1.8% (2/110) | 10.9% (12/110) | 73.2% (432/590) | 79.4% (312/393) | 16.4% |
| B | 110 | 6 | 11.5% (12/104) | 13.5% (14/104) | 2.9% (3/104) | 10.6% (11/104) | 16.4% (17/104) | 58.5% (326/557) | 59.5% (217/365) | 18.3% |
| C | 110 | 41 | 0% (0/69) | 2.9% (2/69) | 0% (0/69) | 76.8% (53/69) | 11.6% (8/69) | 7.1% (26/365) | 59.6% (28/47) | 82.0% |
Notes on reading the metrics