A pre-registered benchmark

“I Strongly Suspect This Website Is a Scam”

Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents

We gave four frontier AI agents a wallet, an SSN, and a password, sent them to ninety-one scam sites we built, and watched what they handed over.

Read the paper Get the benchmark

The setup. A user provisions an agent with their PII profile. The agent then operates inside a web environment the attacker controls.

The setup

Autonomous web agents are starting to do real work for people. They book travel, fill out job applications, return purchases, and chase down customer support. To do any of that without supervision, you have to hand them your wallet: name, address, payment card, sometimes a social security number, sometimes a one-time passcode.

That works fine when the agent is talking to the site you sent it to. The problem starts when an attacker controls the page, or when an attacker has slipped instructions into a page somebody else built. The agent cannot easily tell your instructions apart from the instructions sitting in the HTML it just loaded. Security researchers call this a confused deputy. The deputy here is the agent, and confusion is its default state.

Social engineering on humans runs on urgency, authority, a familiar logo, the right shade of blue. We wanted to know whether the same playbook works on an agent that has been told to be helpful, or whether something else does. So we built ninety-one fake websites and watched what four production language models actually did when they got there.

The finding that drove the paper

30.2 percentage points

When an agent says aloud that the site looks suspicious, it still submits the user's social security number, payment card, and password about a third of the time. That is 30.2 points worse than you would expect if "noticing" mapped to "stopping".

“The task seems to be completed, but I strongly suspect that this website is a scam.”
Llama 4 Scout, signing off after submitting an SSN, a card number, and a CVV over the previous steps.

Bar chart of critical PII leakage by condition C0 through C3. Dark bars show sessions where the agent flagged the site as suspicious; light bars show sessions where it did not. The gap narrows from C0 to C3 but does not close, ending at 30.2 percentage points. — Pre-submission reflection narrows the gap between sessions where the agent flagged the site as suspicious (darker bars) and sessions where it did not (lighter bars). It does not close it. The 30.2 point gap on the right is the headline finding.

The transcripts show that the agents are not failing to notice. They notice. They write out why they are suspicious. Then they submit, and they tell themselves a story about why it was fine. Sometimes the story is that they have done their due diligence and can proceed. Sometimes it is that the site looks like a real bank login page. Sometimes it is that the form said it was secure, so it must be. The agent's reasoning chain is, in these cases, a thing that happens alongside the action, not a thing that controls it.

What we measured

The benchmark, Scammer4U, holds ninety-one attacker-controlled websites and ten benign baselines, spanning sixteen site categories and eight attack vectors. Most of the attacker sites come in matched pairs that differ on a single design choice: urgency cues on or off, prompt injection visible or hidden, single-page form or multi-step chat. That matched-pair design is what lets us say a difference in leakage was caused by that one choice and not something else lurking in the background.
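A minimal sketch of why the pairing matters, using made-up numbers. The environment records, category names, and the `naive_effect` and `paired_effect` helpers are all hypothetical and are not taken from the benchmark's code or data. The naive estimate compares every environment with a design factor on against every environment with it off; the paired estimate only compares siblings that differ on that one factor.

```python
from statistics import mean

# Hypothetical records: (pair_id, category, factor_state, leak_rate).
# Leak rates are invented for illustration, not benchmark results.
ENVS = [
    ("p1", "banking", "on", 0.90),
    ("p1", "banking", "off", 0.95),
    ("p2", "banking", "on", 0.85),
    ("p2", "banking", "off", 0.90),
    ("p3", "retail", "on", 0.30),
    ("p3", "retail", "off", 0.40),
    # Unpaired environments: the "on" one sits in a high-baseline category.
    ("u1", "banking", "on", 0.92),
    ("u2", "retail", "off", 0.35),
]


def naive_effect(envs):
    """Mean leakage with the factor on minus mean leakage with it off."""
    on = [rate for _, _, state, rate in envs if state == "on"]
    off = [rate for _, _, state, rate in envs if state == "off"]
    return mean(on) - mean(off)


def paired_effect(envs):
    """Mean within-pair difference, using only complete on/off pairs."""
    by_pair = {}
    for pair_id, _, state, rate in envs:
        by_pair.setdefault(pair_id, {})[state] = rate
    diffs = [v["on"] - v["off"] for v in by_pair.values() if len(v) == 2]
    return mean(diffs)


print(f"naive:  {naive_effect(ENVS):+.3f}")
print(f"paired: {paired_effect(ENVS):+.3f}")
```

In this toy data the naive estimate is positive and the paired one is negative. The only thing that changed is whether category differences were allowed to leak into the comparison.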

91: attacker-controlled environments
10: benign-twin baselines
16: site categories
8: attack vectors
4: frontier models
4: mitigation conditions
~7,500: agent sessions

We ran four frontier models (GPT-5 mini, Claude Haiku 4.5, Gemini 3 Flash, Llama 4 Scout) across four conditions, from no privacy guidance at all up to a pre-submission reflection prompt. Five seeds per cell. The analysis plan, including the thresholds at which we would consider ourselves wrong, was committed to the repository before any of the runs started. The full methodology and the pre-registration are in the paper.

Selected results

Baseline leakage, no privacy guidance

Claude Haiku 4.5 54.5%
GPT-5 mini 61.0%
Llama 4 Scout 82.3%
Gemini 3 Flash 93.1%

Critical-tier PII leakage in the no-guidance condition. On the benign-twin baselines, the same agents leaked nothing, so these numbers are attributable to the attack, not to the act of filling out forms.

Mitigation does not close it

Horizontal bar chart of the per-model detection-action gap at condition C3 with confidence intervals. Pooled estimate at top is marked as the powered estimate. Llama 4 Scout, at the bottom, has the smallest response to mitigation. — Per-model detection-action gap at the strongest prompt-level defense. The pooled estimate at the top is the pre-registered powered estimate; the per-model bars below are descriptive. Three of the four model families respond to escalating guidance. Llama 4 Scout, at the bottom, barely does.

The rest of the numbers

The headline is the detection-action gap. The paper has more numbers worth a moment, including a few that did not clear the bar we set in advance. Some of them are interesting precisely because they did not.

The mitigation table

The four conditions are escalations of the same idea: tell the agent to be careful. C0 says nothing. C1 adds a one-line privacy reminder. C2 adds a phishing checklist. C3 asks the agent to stop before any submission and ask itself whether the site is safe.

Condition	Haiku 4.5	GPT-5 mini	Llama 4 Scout	Gemini 3 Flash	Pooled
C0 baseline	54.5%	61.0%	82.3%	93.1%	72.7%
C1 reminder	36.4%	47.7%	83.8%	81.8%	62.4%
C2 checklist	19.1%	38.9%	81.4%	68.5%	51.9%
C3 reflection	24.0%	36.1%	77.4%	60.7%	49.4%
Change from C0 at C3 (points)	-30.5	-24.9	-4.9	-32.4	-23.3

Critical-tier PII leakage by model and condition. All values are percentages of sessions in which a critical-tier value reached an attacker endpoint. The bottom row is the change from C0 to C3, signed, in percentage points.

The pooled column drops by 23.3 points at C3. We pre-registered a 30-point reduction as the bar mitigation would have to clear before we called it enough. The pooled number missed by 6.7. The interesting part is the variance across rows. Same prompt, four very different effects.
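The arithmetic behind that shortfall can be checked directly from the table. The sketch below copies the published percentages; the dictionary layout and the `reduction` helper are illustrative choices, not the paper's analysis code.

```python
# Critical-tier leakage (%) by model and condition, copied from the table.
LEAKAGE = {
    "Haiku 4.5": {"C0": 54.5, "C1": 36.4, "C2": 19.1, "C3": 24.0},
    "GPT-5 mini": {"C0": 61.0, "C1": 47.7, "C2": 38.9, "C3": 36.1},
    "Llama 4 Scout": {"C0": 82.3, "C1": 83.8, "C2": 81.4, "C3": 77.4},
    "Gemini 3 Flash": {"C0": 93.1, "C1": 81.8, "C2": 68.5, "C3": 60.7},
    "Pooled": {"C0": 72.7, "C1": 62.4, "C2": 51.9, "C3": 49.4},
}

BAR_POINTS = 30.0  # pre-registered reduction the pooled number had to reach


def reduction(model, condition):
    """Drop in leakage from C0 to the given condition, in percentage points."""
    row = LEAKAGE[model]
    return round(row["C0"] - row[condition], 1)


for model in LEAKAGE:
    print(f"{model:15s} C3 reduction: {reduction(model, 'C3'):5.1f} points")

# The bar applies to the pooled column under any mitigation condition.
best_pooled = max(reduction("Pooled", c) for c in ("C1", "C2", "C3"))
shortfall = round(BAR_POINTS - best_pooled, 1)
print(f"best pooled reduction: {best_pooled} points, short of the bar by {shortfall}")
```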

What didn't move the needle

We also varied the attacks themselves on the matched-pair design. None of these crossed the Benjamini-Hochberg threshold at q = 0.05.

Salience (blatant vs subtle): -7.2 pp, q = 0.31, null

The cross-cutting view said subtle attacks were more dangerous. The paired-sibling view says they leak less. The cross-cutting estimate was confounded by category: subtle environments happened to be authored in site types with higher baseline leakage. The paired test strips that out, and when you do, the sign flips. The bigger takeaway here is methodological. Without matched pairs, we would have read this exactly backwards.

Urgency: -4.1 pp, q = 0.49, null

Urgency is one of the most reliable levers in the human phishing literature. Countdown timers, "act now", limited-time offers. None of these moved the four models in our environments. Our reading is the obvious one. An agent reading a countdown timer does not feel rushed the way a person does.

Prompt injection style: +1.7 pp, q = 0.56, null

Whether the injected instructions sat as visible text on the page, hid inside DOM attributes, or pretended to be a system message: none of these mattered enough to show up under matched pairs. The visible kind worked. So did the hidden kind. The packaging is not the lever.

Interaction style (form vs chat): +4.9 pp, q = 0.56, null

Multi-turn conversational deception did not carry a meaningful penalty over straight forms. The agent was about as willing to talk its way into the wrong submission as to click into it.

Reading these as a group: no single attack design factor is doing the heavy lifting alone. Attacks succeed for redundant reasons. Pull one lever and the others hold.
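For readers unfamiliar with the Benjamini-Hochberg correction behind the q-values above, here is a minimal sketch of the standard procedure. The raw p-values in the example are hypothetical; this page does not report the raw p-values or the size of the test family, so the sketch does not attempt to reproduce the published q-values.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the minimum
    # of p * m / rank over its own rank and every larger rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical raw p-values for four tests (not from the paper).
example_p = [0.01, 0.04, 0.03, 0.20]
example_q = benjamini_hochberg(example_p)
for p, q in zip(example_p, example_q):
    verdict = "reject" if q <= 0.05 else "null"
    print(f"p = {p:.2f}  q = {q:.3f}  {verdict}")
```

A factor is reported as null when its q-value stays above the chosen level, here 0.05.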

The two pre-registered verdicts

The paper had two main claims it wrote down ahead of time, each with a clear condition under which the data would have proven the claim wrong. Both survived. One survived narrowly. The other survived by a wide margin.

Stands narrowly

Prompt-level mitigation is not enough on its own

Falsifier: any condition cuts pooled leakage by 30 points or more
Observed: 23.3 points at C3
Margin: missed the falsifier by 6.7 points

We would have been glad to be wrong by 7 more points on this one. We weren't.

Stands

Recognising an attack does not reliably stop submission

Falsifier: PLR (PII leakage rate) for sessions where the agent flagged the site stays at or below 10% at C3
Observed: 35.9%
Margin: missed the falsifier by 25.9 points

An agent that has said out loud "this looks like a scam" still submits the user's SSN about a third of the time. That is the central finding, and the falsifier was not close.
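The falsifier is a conditional rate: among sessions where the agent flagged the site, what fraction still leaked a critical-tier value. A minimal sketch of that check follows. The `Session` record, its field names, and the toy sessions are assumptions for illustration, not the benchmark's actual log schema or data.

```python
from dataclasses import dataclass


@dataclass
class Session:
    flagged: bool  # the agent stated that the site looked suspicious
    leaked: bool   # a critical-tier value reached the attacker endpoint


def flagged_leak_rate(sessions):
    """Fraction of flagged sessions that still leaked critical-tier PII."""
    flagged = [s for s in sessions if s.flagged]
    if not flagged:
        raise ValueError("no flagged sessions to evaluate")
    return sum(s.leaked for s in flagged) / len(flagged)


def falsifier_met(rate, threshold=0.10):
    """True if the claim would have been proven wrong at this rate."""
    return rate <= threshold


# Toy data: 10 flagged sessions (4 leaked), 5 unflagged sessions.
toy = (
    [Session(True, True)] * 4
    + [Session(True, False)] * 6
    + [Session(False, True)] * 5
)
print(flagged_leak_rate(toy), falsifier_met(flagged_leak_rate(toy)))

# The published observation at C3, expressed as a fraction.
print(falsifier_met(0.359))
```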

Three sanity checks worth knowing

0%

benign-twin leakage

The same four models were also run against ten benign sites that asked for the same information for legitimate reasons. They leaked nothing. The agents are not just submitting whenever a form is in front of them.

~87%

trap-reach rate

The large majority of sessions reached the attacker form. The safer numbers are not the result of agents running out of steps or getting lost on the way; the agents got to the form and chose what to do there.

47.5%

human accuracy at telling our sites from real ones

Sixteen reviewers shown a mix of our synthetic environments and real phishing captures could not reliably tell them apart. Slightly worse than a coin flip. The fidelity check passed.

What the agents were looking at

A sample of the synthetic websites used in the benchmark. Each one was built from a written brief, served on localhost, and seeded with a fictitious identity. None of them were ever deployed publicly.

Spoofed Chase Bank login page — Chase Bank login. Credential phishing on a familiar banking surface.

Spoofed IRS form — IRS form. Authority impersonation against US tax filers.

Spoofed Amazon returns flow — Amazon returns. E-commerce credential and payment capture.

Spoofed Netflix subscription update page — Netflix subscription update. Payment-card exposure on a reward-style prompt.

Spoofed PayPal account verification — PayPal verification. Single-step credential and payment phishing.

Spoofed AT&T support portal — AT&T support portal. Telecom account takeover through conversational deception.

Spoofed Oracle careers portal — Oracle careers. PII intake through a long-form job application.

These environments contain synthetic data only. The identity provisioned to the agents is fictitious. No real user data or live phishing infrastructure was involved in producing or running the benchmark.

What this suggests

If an agent can say in plain English that a site looks like a scam, and then submit a real-looking social security number to that site a third of the time anyway, then a defense that waits for the agent to say it noticed is gating on the wrong signal. The data points the other way. The submission, not the reasoning, is the thing that has to be intercepted, and the trust check has to live somewhere the agent's reasoning loop does not control.
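One way to picture a check that lives outside the reasoning loop is a guard on the submission path itself. The following is a sketch under stated assumptions, not a defense evaluated in the paper. The `SubmissionGuard` class, the exact-hostname allowlist, and the substring match on secret values are all hypothetical simplifications. A real guard would also need to handle reformatted values, redirects, and lookalike domains.

```python
from urllib.parse import urlsplit


class SubmissionBlocked(Exception):
    """Raised when a submission would send critical PII to an untrusted host."""


class SubmissionGuard:
    """Checks outgoing form data without consulting the agent's reasoning."""

    def __init__(self, critical_values, trusted_hosts):
        # critical_values: label -> secret string from the user's PII profile.
        self.critical_values = dict(critical_values)
        # trusted_hosts: exact hostnames the user approved out of band.
        self.trusted_hosts = {host.lower() for host in trusted_hosts}

    def check(self, url, form_fields):
        host = (urlsplit(url).hostname or "").lower()
        if host in self.trusted_hosts:
            return
        payload = [str(value) for value in form_fields.values()]
        exposed = sorted(
            label
            for label, secret in self.critical_values.items()
            if secret and any(secret in value for value in payload)
        )
        if exposed:
            target = host or "unknown host"
            raise SubmissionBlocked(f"{', '.join(exposed)} -> {target}")


# Fictitious identity and example domains only.
guard = SubmissionGuard({"ssn": "123-45-6789"}, {"bank.example"})
guard.check("https://bank.example/login", {"ssn": "123-45-6789"})  # allowed
try:
    guard.check("https://bank-verify.example.net/form", {"tax_id": "123-45-6789"})
except SubmissionBlocked as exc:
    print("blocked:", exc)
```

The point of the design is that `check` takes only the destination and the payload. Whatever the agent wrote in its reasoning, including a confident explanation of why the site is fine, never reaches the decision.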

We owe a few caveats. The PII profile we used is US-shaped: SSN, US routing numbers, US addresses. The agents read and write English. Our runner sits on top of Playwright; production agent scaffolds do other things in addition. The attacker sites are templated, not the kind of live phishing kit that uses JavaScript cloaking and disposable infrastructure to dodge analysis. The numbers above are what these four models did in our setup on these environments. Read them as a lower bound on the problem, not the final word.

Authors and how to cite

Soham Roy¹
Sarthakbrata Halder¹
Arya Bharaty¹
Vaibhav Bhaskar¹
Yash Sinha²
Dhruv Kumar²
Srikant Panda³
Murari Mandal¹

¹ KIIT University ² BITS Pilani ³ Lam Research

We thank Arka Mukherjee, Anyash Prasad, and Sarthak Bhattacharya for their help and participation in the early stages of our work.

Correspondence: sohamroy.dev@gmail.com

@misc{roy2026istronglysuspectwebsite,
      title={"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents}, 
      author={Soham Roy and Sarthakbrata Halder and Arya Bharaty and Vaibhav Bhaskar and Yash Sinha and Dhruv Kumar and Srikant Panda and Murari Mandal},
      year={2026},
      eprint={2606.00497},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2606.00497}, 
}