Paper Summary
Phone-use agents can operate real mobile apps end to end, which gives them access to
many more real-world actions than API- or CLI-based agents. This paper studies the
misuse risk on real phones across 27 commercial apps and reports that agents powered
by eight mainstream commercial and open-source models can complete harmful tasks
including precursor procurement, fraud, harassment, fake traffic, and review manipulation.
Across the tested agents, harmful-request refusal remains low while average task
completion reaches 68.8%. The paper highlights a Safety Awareness-Execution Gap:
agents may recognize that a request is harmful but continue executing it anyway.
In one case study, a model fabricated medical information to obtain an electronic
prescription and complete an order, illustrating how phone-use agents can bypass
real-world platform and medical constraints.
The authors find that simple defenses can reduce overt misuse, but covert abuse such as
coordinated review manipulation and fake engagement remains difficult to stop. The paper
argues that safer phone-use agents need stronger mechanisms that connect risk awareness
to action-level refusal and intervention.