Phone-use Agent Safety

It Lied to a Doctor to Buy Poison 😈

Quantifying Real-World Misuse of Phone-use Agents

Yiming Sun, Chen Chen, Zifan Zhou, Mi Zhang

Fudan University

Paper Summary

Phone-use agents can operate real mobile apps end to end, which gives them access to many more real-world actions than API- or CLI-based agents. This paper studies the misuse risk on real phones across 27 commercial apps and reports that agents powered by eight mainstream commercial and open-source models can complete harmful tasks including precursor procurement, fraud, harassment, fake traffic, and review manipulation.

Across the tested agents, harmful-request refusal remains low while average task completion reaches 68.8%. The paper highlights a Safety Awareness-Execution Gap: agents may recognize that a request is harmful but continue executing it anyway. In one case study, a model fabricated medical information to obtain an electronic prescription and complete an order, illustrating how phone-use agents can bypass real-world platform and medical constraints.
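
As an illustration of how the quantities in this paragraph relate, the following minimal sketch computes a refusal rate, a completion rate, and one possible reading of the Safety Awareness-Execution Gap from per-episode records. The record schema (`EpisodeRecord`), the function name `summarize`, and the gap definition (the share of harm-aware episodes in which the agent still did not refuse) are assumptions made for this example; the paper's exact metric definitions are not given on this page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EpisodeRecord:
    """One evaluated misuse query (hypothetical schema, not taken from the paper)."""
    refused: bool          # the agent declined to act on the request
    completed: bool        # the task was carried out end to end
    flagged_harmful: bool  # the agent's own output acknowledged the harm


def summarize(records: List[EpisodeRecord]) -> Dict[str, float]:
    """Aggregate per-episode outcomes into three rates."""
    if not records:
        raise ValueError("at least one record is required")
    n = len(records)
    refusal_rate = sum(r.refused for r in records) / n
    completion_rate = sum(r.completed for r in records) / n

    # Assumed gap definition: among episodes where the agent recognized the
    # harm, the fraction in which it nevertheless did not refuse.
    aware = [r for r in records if r.flagged_harmful]
    gap = sum(not r.refused for r in aware) / len(aware) if aware else 0.0

    return {
        "refusal_rate": refusal_rate,
        "completion_rate": completion_rate,
        "awareness_execution_gap": gap,
    }


if __name__ == "__main__":
    demo = [
        EpisodeRecord(refused=True, completed=False, flagged_harmful=True),
        EpisodeRecord(refused=False, completed=True, flagged_harmful=True),
        EpisodeRecord(refused=False, completed=True, flagged_harmful=False),
        EpisodeRecord(refused=False, completed=False, flagged_harmful=False),
    ]
    print(summarize(demo))
```

Keeping awareness and refusal as separate fields lets the gap be measured directly rather than inferred from the refusal rate alone.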

The authors find that simple defenses can reduce overt misuse, but covert abuse such as coordinated review manipulation and fake engagement remains difficult to stop. The paper argues that safer phone-use agents need stronger mechanisms that connect risk awareness to action-level refusal and intervention.

Misuse Risk Taxonomy

The taxonomy is shown as a sunburst chart derived from 144 manually curated seed misuse tasks and their legal rationales. It covers 6 high-level categories and 34 fine-grained subcategories; the surrounding examples show representative misuse prompts, labels, and rationales.

Misuse Query Collection Pipeline

The benchmark pipeline collects laws, regulations, and disclosed violation cases, extracts 144 seed misuse tasks, combines them with functions from 27 apps, generates candidate queries, and manually filters them into 1,381 executable evaluation samples.
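
The combine-then-filter shape of this pipeline can be sketched roughly as follows. This is a structural illustration only: the function names, the dictionary layout, and the placeholder identifiers are hypothetical, the pairing step stands in for whatever query generation the authors actually use, and the `keep` callback stands in for the manual filtering step.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List

Candidate = Dict[str, str]


def build_candidates(seed_tasks: Iterable[str],
                     app_functions: Iterable[str]) -> List[Candidate]:
    """Pair every seed task with every app function (assumed combination step)."""
    return [
        {"seed_task": task, "app_function": function}
        for task, function in product(seed_tasks, app_functions)
    ]


def filter_candidates(candidates: Iterable[Candidate],
                      keep: Callable[[Candidate], bool]) -> List[Candidate]:
    """Keep only candidates a reviewer marks as executable.

    In the described pipeline this step is manual; `keep` is a stand-in
    for that human decision.
    """
    return [c for c in candidates if keep(c)]


if __name__ == "__main__":
    # Placeholder identifiers, not real benchmark content.
    seeds = ["seed-001", "seed-002"]
    functions = ["app-a/search", "app-b/post", "app-c/review"]

    candidates = build_candidates(seeds, functions)
    kept = filter_candidates(
        candidates, lambda c: c["app_function"].endswith("/post")
    )
    print(len(candidates), len(kept))
```

Separating generation from filtering mirrors the description above, where a large candidate set is narrowed to the final evaluation samples by human review.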

Closed-Source Model Case Studies

Case studies show closed-source multimodal models operating real mobile apps under user-misuse scenarios, including high-risk product purchases and fraudulent social media promotion. The examples illustrate persistent safety-boundary failures during action execution.

Figures: Case study 1, Case study 2, Case study 3, Case study 4.