Home/resources/Testing AI Agents for Data Leakage Risks in Realistic Tasks

Testing AI Agents for Data Leakage Risks in Realistic Tasks

January 19, 2026
18 min read

Introduction

The Korea and Singapore AI Safety Institutes concluded a bilateral testing exercise, testing whether AI agents can correctly execute multi-step tasks in common usage scenarios without leaking sensitive data. 

Sensitive data leakage remains a key risk for organisations deploying AI agents. AI agents are increasingly being used to complete more complex tasks, which include making decisions, initiating actions and actively interacting with their environment. Common use cases include customer service agents, or enterprise productivity assistants which can automate human resource workflows. But the other side of this coin is that access to personal or confidential data is frequently required for AI agents to work effectively, leading to data leakage risks. 

Much attention has been given to how agents malfunction under targeted cyberattacks, such as prompt injections. In contrast, data leakage during non-malicious, routine task execution has received relatively less study. During our last joint testing with 9 other members of the International Network for Advanced AI Measurement, Evaluation and Science, we observed that “benign” handling of sensitive data can be challenging for agents. Failures may arise from a lack of understanding of contextual factors like data sensitivity, privacy norms or audience, or from behaviours like hallucination. There may also be inconsistencies in automated and human evaluation, due to differing opinions on a task and its context. The goal of this exercise was to explore these gaps in greater detail, and to design more realistic tasks, tools, and environments so that the findings are more closely aligned with real-world deployments. 

Both sides gained insights not only into testing methodology but also in conducting multi-party testing exercises. Having two or more independent parties test the same problem statement with different policy perspectives and technical setups allowed us to stress-test each others’ approaches and align where possible. It also shows the promise of international collaboration in testing nascent areas such as agents. This post records some of these learnings. 

How we scoped and designed the tests 

Agent archetypes 

We identified common and popular use cases for agents. From a survey of the landscape, we identified three archetypes: 

  • Customer service agents (addressing customer inquiries) 
  • Enterprise productivity agents (helping employees automate their workflows) 
  • Personal productivity agents (helping users book flights, manage blogs) 

Agents were evaluated with bounded, task-scoped autonomy, which is representative of current industry deployments. They received a clear user objective and fixed tools, and determined how to execute the task by choosing tools, sequencing actions, and managing context. They were not expected to create new goals or expand beyond the original request. 

Task types 

From the agent archetypes, we identified sub-archetypes and tasks common to that archetype. For instance, within customer service, there were agents that handled delivery inquiries for customers’ orders, or refund requests for a certain product. 

The idea was to construct a small set of well-designed tasks that could be expanded on through variations. For example, a task involving a standard employee onboarding process could be varied by changing the task prompt to onboard employees with different data profiles, or changing the company policy that the agent should follow. 

Further, what constitutes proper data handling can be highly contextual. Where suitable, we included task-specific data handling guidelines to clarify what information could or could not be shared via the system prompt or policy documents (e.g. enterprise policy). Agents were instructed to follow these guidelines during execution, and judges used them to reduce ambiguity in borderline cases. 

Tools

The Model Context Protocol (MCP) is quickly becoming the standard protocol for agents to communicate with their tools. We implemented or referenced widely used MCP servers (either from the official repository or from service providers and third-party developers) to mock up tools, in essence providing agents with simulated, “mirror” MCP servers. This ensured that the tools were defined and provided to the agent in a consistent and realistic manner. 

ArchetypeTaskMCP Servers Engaged
Enterprise productivity – HR onboarding agent  Handling onboarding and offboarding employees, setting up meetings FileSystem 
Database 
Email 
Calendar  
Enterprise productivity –  
Scheduling assistant 
Handling meeting scheduling for a company, including minutes and reminders Calendar 
Filesystem 
Email
Customer service – 
Refund agent 
Handling customers’ refund requests Database 
Filesystem 
Enterprise productivity – Publishing agent Managing content publication and repository files with controlled public exposure Ghost (Blog) 
Gitlab (DevOps) 
Enterprise productivity – Analytics agent Analysing internal sales data to produce aggregated insights from sensitive databases Database 
Personal productivity – Transaction agentHandling external bookings, payments, and partner communicationsSlack (Messenger) 
Playwright (Browser) 
Examples of some archetypes and task types. To reflect real-world settings and explore the potential of agents, each task required multiple steps to complete, and half the tasks required engaging with multiple MCP servers. 

 

Types of data risks 

There is no settled taxonomy of sensitive data risks for AI agents. We took reference from some taxonomies of types of AI privacy breaches and privacy norms, and scoped it down to three types of sensitive data leakage risks to investigate. 

  • Lack of data awareness: Where agent leaks data that is sensitive due to qualities of the information itself (e.g. passwords, API keys, medical appointments) 
  • Lack of audience awareness: Where agent leaks data that is sensitive due to the audience it is being sent to (e.g. internal notes sent to external parties) 
  • Lack of policy compliance: Where agent leaks data due to a failure to understand data handling policies which define what’s considered sensitive in the context of the use case (e.g. a company’s data sharing policy) 

These are not clear lines, and they can overlap. For instance, an organisational policy may require that certain documents be marked as internal, and not sent to third parties, causing an overlap in the second and third categories. However, this provided a good starting point to test different ways in which agents can mishandle sensitive data. 

Models  

We tested three models as agents (anonymised in accordance with our previous joint testing practices). These models were chosen for a variety of closed vs open weights, and size. 

  • Model A: Larger, closed weights 
  • Model B: Larger, open weights 
  • Model C: Smaller, open weights 

Separately, we used an additional model as an LLM-as-a-Judge to assess correctness and safety conditions; this model was not deployed as an agent. 

  • Model D: Larger, closed weights (distinct from Model A) 

Evaluation criteria 

For each task, we manually defined specific correctness and safety conditions that needed to be fulfilled. Fulfillment of correctness conditions means that the agent executed the user task accurately, while fulfillment of safety conditions means that the agent appropriately handled sensitive data when executing the task.  

Depending on task complexity, each task had between 5 to 15 unique conditions for correctness and safety. Such task-specific rubrics have been observed to improve human-LLM judge agreement, by defining only granular “yes/no” questions that LLM judges have to answer.  

Further, to reduce subjectivity or ambiguity in LLM judgment, conditions were framed factually where possible. For example, rather than “No sensitive information was disclosed in any email”, the condition would be “No API keys were disclosed in any email”. These criteria were also aligned with the task-specific data handling guideline to reduce ambiguity in evaluation.  

Separately, safety and correctness criteria can be inherently linked. For instance, if a safety condition states that “X’s sensitive health information was not shared in the email to Y”, the agent must first send the email for that condition to be meaningfully evaluated. To ensure safety assessments reflected genuine risks, conditions whose prerequisite action did not occur were marked as “NA”. 

Testing environment and setup 

In our previous joint testing, we observed that the benchmarks sampled had overtly synthetic data, with phone numbers such as 123-45678-9, or emails such as bob@example.com. When asking the agent to engage with websites, these were hosted locally (“http://localhost:8080”) as opposed to web domains. We observed, as have others, that agents tended to assume that they were in synthetic scenarios, making it difficult to extrapolate their real-world behaviour.  

Both sides thus experimented with different approaches to increase realism, such as: 

  • Referencing real MCP server implementations to create mirror MCP servers. 
  • Generating realistic test data by replacing placeholder names, and masking local domains and testing file paths. 
  • Introducing a user interaction component to enable multi-turn interactions. In reality, agents interact with humans throughout a task. Introducing a user component that utilised a different LLM acting as the user provided better context and more realistic interactions, such as an agent having to placate a frustrated user. 
  • Closely emulating the interconnectedness of actual application environments by integrating across applications and their associated MCPs, to better reflect how agents operate within interconnected systems in practice. 

Test results and observations 

Each AISI ran 11 scenarios with 10 runs per scenario for each of the three models, totalling 660 runs, in their own execution environment. 10% of these runs were then human-validated. 

Each task was run with predefined interaction limits to prevent infinite loops or excessively long executions. Any run that exceeded the turn limit was terminated. Variations in SG AISI’s and KR AISI’s multi-turn implementations may affect how turns were consumed.

Findings

Quantitative Findings

The following were the key quantitative findings. The full results and metric definitions are available in the Annex at the end of this document. 

  • While models were able to achieve perfect correctness and safety on some tasks, the variation in overall performance was significant. We measured the proportion of runs that met (i) all correctness criteria, (ii) all safety criteria, and (iii) both in full.  
    • In SG AISI’s runs, the percentage of “fully safe” trajectories for Model A, B and C was 57%, 36% and 14% respectively. KR AISI’s results followed a similar pattern with 35%, 14% and 3%.  
    • In SG AISI’s runs, 40% of Model A’s trajectories were fully correct and fully safe, compared to 14% for Model B and just 2% for Model C. KR AISI’s results followed a directionally similar pattern  8%, 3% and 0%, respectively. However, the difference in the absolute scores was significant. These differences were caused by differences in the exercise set up and agent scaffolding.  
       
  • There were fewer cases where agents failed completely. We measured the proportion of runs that met none of the correctness or safety criteria. In SG AISI’s runs, for example, around 10% of trajectories failed to meet any safety conditions (i.e. 90% of trajectories met at least one condition). In KR AISI’s runs, this proportion was around 13%. Many of such cases arose due to unintuitive tool responses, and gaps in the design of the safety criteria themselves, and can inform future improvements in task definition and environment setup  
  • Most runs were partially correct and safe. We measured the percentage of correctness and safety conditions satisfied to capture the degree of performance, even when runs weren’t fully correct or safe. It was also a way to surface contrasts such as runs that were highly correct but also highly unsafe. For SG AISI, Model A met roughly 85% of correctness criteria and 84% of safety criteria across all scenarios, suggesting generally strong performance even when not fully successful. This number was lower for Model B (68%, 70%) and Model C (34%, 64%). For KR AISI, Model A met 73% of correctness criteria and 79% of safety criteria across all scenarios, showing a similar tendency. Performance was lower for Model B (59%, 60%) and Model C (7%, 60%). Qualitative review showed that failures often resulted from deviations from the original user request, but up to the point of deviation, the agents were able to execute steps correctly.  
  • There was substantial variation in the proportion of “NA” safety conditions, which influenced how each model’s safety score should be interpreted. For SG AISI, the NA rates were 4.5% (Model A), 12.1% (Model B) and 26% (Model C). For KR AISI, the corresponding rates were higher at 16.4%, 18.3% and 82%. The extremely high NA rate for Model C indicates that many safety checks were never triggered because prerequisite correctness steps failed. Hence its correctness percentage results must be read with this in mind. By contrast, Models A and B had far fewer untriggered conditions, meaning correctness had a lower relative impact on their safety scores. 
  • Human-LLM judge agreement. We sampled 10% of all trajectories and conducted manual human validation of the safety and correctness scores. For SG AISI, the human-LLM judge disagreement was 6% for correctness and 18% for safety. For KR AISI, these numbers were 7% and 18% respectively.

Qualitative Observations  

  • Data leakage remains an issue in benign tasks. The correctness and safety results show that agents continue to leak sensitive data, even when provided with clear task-specific data handling guidelines and explicit instructions to follow them. Common reasons include difficulty adhering to instructions, following their own interpretations of what the task requires, or failing to recognise that some types of information were sensitive. Similar tendencies were also observed in the LLM judge’s evaluations. 
  • Consistency in plans and action leads to higher correctness. In this exercise, models with higher correctness generally showed better alignment between planned steps and executed actions, whereas the lowest-performing model sometimes displayed mismatches between its plan and behaviour, which suggests that there might be a relationship between these capabilities. 
  • Agents sometimes skip steps or incorrectly assume that steps were completed. Across multiple scenarios, agents assumed completion of planned steps without verification, creating potential risks when continuing the task instead of safely failing. For instance, an agent designed to share customer-specific data encountered an error when attempting to read the predefined data sharing policy file. Instead of terminating the task or returning to the user, it proceeded to state that it had successfully read the policy file, and even invented a new policy which it then proceeded to follow. 
  • Agents may deviate from the original objective due to their own interpretation or in attempts to be helpful. This led to actions that did not fully align with the user’s intended goal. For instance, in one of the scenarios, a customer service designed to provide delivery updates proceeded to offer cancellation and refund options that it was not in a position to offer. 
  • LLMs simulating users: Hallucinations from the user-LLM sometimes misled the agent, affecting correctness and safety. Additional prompt engineering and task-specific context helped limit this. 
  • LLM judge misjudgments: In some cases, safety conditions were incorrectly judged as failures. A common pattern was that the judge treated access to sensitive information as a violation, even when no external disclosure occurred, or relied on its own interpretation rather than the task-specific data handling guidelines. 

Reflections on testing methodologies and practices 

There are different ways to increase realism in test scenarios, with their own strengths and limitations. As mentioned, we experimented with different ways to increase realism. Generally, we saw improvements in evaluation reliability. For example, by using realistic email domains, we filtered out agents who had consistent tendencies to send emails to domains such as @company.com or @example.com regardless of their defined task and specific user data.  

However, there were also limitations. While introducing a multi-turn “user” LLM helped emulate realistic interactions (e.g., customer service scenarios), it also introduced an additional source of inconsistency and hallucination. We sometimes observed the user LLM inventing details of a task or requesting additional actions from the agent, which required further prompt engineering and providing task-specific information to the user LLM to minimise this risk. 

Reliable and consistent evaluations take time and effort, especially for multi-step tasks. While we saw improved performance by defining more granular criteria, it was time-consuming and challenging to ensure that all relevant criteria were defined. Broad definitions in criteria can lead to the LLM judge grading subjectively, which could be helpful in more clear-cut situations where LLM judges are aware of the data concerns (e.g. passwords), and less helpful in more nuanced situations (e.g. company data that may need to be kept internal). Achieving a “Goldilocks” zone that allows the LLM judge some flexibility to ensure that the conditions are sufficiently evaluated, without excessively relying on the LLM judge’s subjectivity, is complex. 

Accounting for the relationship between correctness and safety helped with meaningful assessment. As noted above, different models had differing levels of capability in executing the tasks. For models which were less capable and could not execute the task to a significant extent, it was not possible to make a meaningful assessment of their safety. Explicitly accounting for this during evaluation enabled a more accurate analysis of the performance and safety properties of each model.  

Standardisation is challenging and requires informed trade-offs. Both teams collaborated closely while operating in different testing environments. Standardising core task design and key test parameters helped ensure that results remained comparable and that shared insights were meaningful. At the same time, we accepted some technical differences, such as server, wrapper, or prompt variations, which helped surface how design choices (e.g. how dependencies for safety or correctness are defined) can meaningfully influence final judgments and provide useful comparison points. 

Conclusion 

This exercise provided a clearer view of how agents handle realistic, routine multi-step tasks and where risks still arise. Testing the same scenarios across two institutes proved helpful in surfacing methodological and design choices that meaningfully affect outcomes. These shared insights strengthen our collective approach to agent testing and lay the groundwork for more robust, collaborative evaluations going forward. 

A full report on the findings will follow, following some refinement of the test design from our learnings. For those interested, the tasks currently defined by Singapore AISI and our testing pipeline can be found here

Annex: Overview of Quantitative Results 

SG AISI’s runs 

ModelRuns Term.
Runs 
100% C 100% S  100% C and
100% S 
0% C 0% S Correct % Safe % Safety – NA 
A 110 58.7%
(64/109) 
56.9%
(62/109) 
39.4%
(43/109) 
0.9%
(1/109) 
10.1%
(11/109) 
85.1%
(498/585) 
84.1%
(376/447)  
4.5%
(21/468)  
B 110 39.1% 
(43/110) 
35.5%
(39/109) 
13.6%
(15/110) 
9.1%
(10/110) 
11.8%
(13/110) 
68.3%
(403/590)  
69.5%
(287/413) 
12.1%
(57/470)  
C 110 13 8.2%
(8/97) 
14.4%
(14/109) 
2.1%
(2/97) 
35.1%
(34/97) 
10.3%
(10/97) 
33.5%
(171/511) 
63.5%
(186/293) 
26.0%
(103/396) 


KR AISI’s runs 

ModelRuns Term.
Runs 
100% C 100% S  100% C and
100% S 
0% C 0% S Correct % Safe % Safety – NA 
A 110 33.6%
(37/110) 
34.5%
(38/110) 
8.2% 
(9/110) 
1.8% 
(2/110) 
10.9%
 (12/110) 
73.2%
(432/590) 
79.4%
(312/393) 
16.4% 
B 110 11.5% 
(12/104)
13.5%
(14/104) 
2.9% 
(3/104) 
10.6%
(11/104) 
16.4%
(17/104) 
58.5%
(326/557) 
59.5%
(217/365) 
18.3% 
C 110 41 0%
(0/69) 
2.9% 
(2/69) 
0%
(0/69) 
76.8%
(53/69) 
11.6% 
(8/69) 
7.1% 
(26/365) 
59.6%
(28/47) 
82.0% 

Notes on reading the metrics 

  • 100% C, 100% S, 100% C and 100% S – % of runs for each scenario that were fully correct and/or fully safe, i.e. met all criteria. Useful to assess whether agents can execute the scenarios completely without issues.   
  • 0% C, 0% S – %of runs for each scenario that were completely incorrect or unsafe, i.e. met none of the criteria. Useful to assess bigger issues in agent capability and/or scenario design. 
  • Correct %, Safe % – % of safety and correctness conditions satisfied. Useful to assess the degree of correctness and safety, also highlights obvious contrasts in safety or correctness (e.g. highly “correct” but also highly “unsafe” trajectories). 
  • Safety – NA – % of safety conditions classified as NA. This can be read in conjunction with Safe % to ensure that agents were not assessed to be “safe” simply because they didn’t even execute the step that could have been unsafe. 
  • Exclusions: 
    • NAs were excluded when calculating safety % and 100% S 
    • Runs that could not be completed within a stipulated step limit were excluded