[NeurIPS'25] RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents

📢 We are currently organizing and presenting the code for RiOSWorld. If you are also engaged in research on CUA risk, we welcome your suggestions. If you have any questions about the code, feel free to create an issue. If you are interested in our work, please star ⭐ our project. Thanks 💕.

📢 Updates

2025-11-05: (Important Update) To support further research and help the community develop trustworthy computer-use agents, we released the evaluation trajectory data of RiOSWorld on HuggingFace at JY-Young/RiOSWorld/trajectory_data.zip. This data is now available for researchers to use and build upon.
2025-11-01: We added a guide on how to set snapshots for the Docker Provider, as requested in Issue.
2025-10-30: We fixed the bugs reported in Issue.
2025-09-19: 🎉 RiOSWorld has been accepted to NeurIPS 2025.
2025-06-30: Please note that there were mistakes in the original evaluation_risk_examples/test_phishing_email.json file, which have been corrected. Please download the latest version. We also added an ip_setting.py file to evaluation_risk_examples to make it easier to modify the IP address.
2025-06-29: We fixed some bugs in mm_agents/agent.py (lines 28, 336), env_risk_utils/attack.py (lines 4, 331-334, 345-348, 359-362, 373-376, 558), lib_run_single.py (lines 3, 28-29, 47-49), run.py (line 112), and desktop_env/evaluators/metrics/chrome.py (lines 341, 366), and added the files DejaVuSansMono-Bold.ttf, DejaVuSansMono.ttf, and Roboto.ttf.
2025-05-31: We released our paper, environment and benchmark, and project page. Check it out!

💾 Installation

For non-virtualized systems (e.g., your personal desktop or laptop), please follow the steps below to set up RiOSWorld:

First, clone the repository and set up the Python environment. We recommend using Conda for environment management.

# Clone the RiOSWorld repository
git clone https://github.com/yjyddq/RiOSWorld

# Change directory into the cloned repository
cd RiOSWorld

# Create an environment for RiOSWorld
conda create -n RiOSWorld python==3.9
conda activate RiOSWorld

# Install required dependencies
pip install -r requirements.txt

Next, install a virtual machine (VM) hypervisor based on your operating system:

For macOS: We recommend installing VMware Fusion.
For Windows: Install VMware Workstation Pro (VMware Fusion is available for macOS only).

For detailed installation instructions, particularly for VMware Workstation Pro, you can refer to our guide: How to install VMware Workstation Pro

After installation, ensure that the vmrun command-line utility is correctly configured and accessible from your system's PATH. You can verify the hypervisor installation by running:

vmrun -T ws list

If the setup is successful, this command should list any currently running virtual machines (it might be empty if no VMs are running).
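The same check can be scripted. The following is a minimal sketch, not part of RiOSWorld: the helper names `find_vmrun` and `list_running_vms` are hypothetical, and the `host_type` default of `ws` mirrors the command above (on VMware Fusion the corresponding value is assumed to be `fusion`).

```python
import shutil
import subprocess


def find_vmrun(which=shutil.which):
    """Return the full path to vmrun, or None if it is not on PATH."""
    return which("vmrun")


def list_running_vms(vmrun_path, host_type="ws", run=subprocess.run):
    """Run `vmrun -T <host_type> list` and return its output lines."""
    result = run(
        [vmrun_path, "-T", host_type, "list"],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()


# Example usage (requires a VMware installation):
# path = find_vmrun()
# if path is None:
#     print("vmrun was not found on PATH")
# else:
#     print(list_running_vms(path))
```

The `which` and `run` parameters exist only so the helpers can be exercised without VMware installed.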

✨ Note: If you are working on a virtualized platform (e.g., AWS, Azure, or a server with KVM support), please refer to OSWorld for instructions on setting up the environment using Docker.

⏬ Provider Installation and Desktop Environment Setup

1. Installing a Provider

Default Provider (VMware): If you have not downloaded any Provider before, you can run the following script to automatically install the default VMware Provider:
```
python run_minimal_example.py
```
This script will download the VMware virtual machine to the default path ./vmware_vm_data/Ubuntu0/Ubuntu0.vmx.

Default Provider (Docker): Please download the default Docker Provider image from https://huggingface.co/datasets/xlangai/ubuntu_osworld/resolve/main/Ubuntu.qcow2.zip and unzip it into the default path ./docker_vm_data/, so that the image is available as ./docker_vm_data/Ubuntu.qcow2.
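If you prefer to unpack the archive from Python, a minimal sketch using only the standard library could look like this. The function name `extract_vm_image` is hypothetical, and the archive is assumed to have been downloaded manually beforehand.

```python
import zipfile
from pathlib import Path


def extract_vm_image(zip_path, target_dir="./docker_vm_data"):
    """Extract the downloaded archive and return the .qcow2 files found."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Search recursively in case the archive contains a sub-folder.
    return sorted(target.rglob("*.qcow2"))


# Example usage:
# images = extract_vm_image("Ubuntu.qcow2.zip")
# print(images)
```

Checking the returned list before starting Docker helps catch an image that ended up in a sub-folder rather than directly under ./docker_vm_data/.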

Then start a virtual machine using Docker:

# Start a virtual machine (VM); note the rw flag, which mounts the disk image as writable
docker run -it \
            --cap-add=NET_ADMIN \
            --device=/dev/kvm \
            -e DISK_SIZE=32G \
            -e RAM_SIZE=4G \
            -e CPU_CORES=4 \
            -v ./docker_vm_data/Ubuntu.qcow2:/boot.qcow2:rw \
            -p 8006:8006 \
            -p 5000:5000 \
            -p 9222:9222 \
            -p 8080:8080 \
            happysixd/osworld-docker

# Connect to the graphical interface of the VM
vncviewer localhost:8006

Then follow the steps in 2. Setting Up the Virtual Machine's Desktop Environment. Powering off the VM saves the latest state automatically.

Custom Providers (e.g., AWS, VirtualBox): If you wish to use another Provider, such as AWS, please refer to here for installation and configuration.

2. Setting Up the Virtual Machine's Desktop Environment

Once the Provider is installed and configured, please set up the desktop environment inside the virtual machine as follows to ensure the smooth execution of RiOSWorld tasks:

Disable Auto-Sleep: In the virtual machine's system settings (e.g., Ubuntu's power management options), disable auto-sleep and automatic screen locking features. This prevents interruptions during task execution.
Standardize Desktop Layout: We strongly recommend that you arrange the desktop icons and elements within the virtual machine exactly as shown in the image below. All tasks in RiOSWorld are designed and tested based on this standard desktop layout to ensure reproducibility:

Snapshot Name: Change the default snapshot name to the name of your own snapshot in the following two places.

RiOSWorld/run.py, line 112 (commit 610d72a):

parser.add_argument("--snapshot_name", type=str, default="restart_state_never_sleep")

RiOSWorld/desktop_env/desktop_env.py, line 32 (commit 610d72a):

snapshot_name: str = "restart_state_never_sleep",
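Because run.py declares `--snapshot_name` as a command-line argument, it may also be possible to pass the name at launch instead of editing the default. The sketch below reproduces only that one argument to show how argparse resolves it. `my_snapshot` is a placeholder, and whether every entry point forwards the value to the environment is an assumption you should check in your copy of the code.

```python
import argparse


def build_parser():
    """Build a parser with the same option and default as the quoted run.py line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--snapshot_name", type=str, default="restart_state_never_sleep")
    return parser


# Equivalent to launching with: --snapshot_name my_snapshot
args = build_parser().parse_args(["--snapshot_name", "my_snapshot"])
print(args.snapshot_name)  # prints "my_snapshot"
```

The default in desktop_env/desktop_env.py still applies wherever the environment is constructed without an explicit snapshot name.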

⚙️ Environmental Risk Setups

Some environmental risks (e.g., reCAPTCHA, Phishing Websites, Account/Password Fraud, Phishing Emails) require launching an external program and some additional setup.

✨ Note: For reCAPTCHA, Phishing Websites, and Account/Password Fraud, you need to change the IP address (that of your local host) in their config.json. For example:

"halfway_config": [
        {
            "type": "chrome_open_tabs",
            "parameters": {
                "urls_to_open": [
                    "http://10.1.82.26:5006" -> change to your ip and port
                ]
            }
        },
        {
            "type": "activate_window",
            "parameters": {
                "window_name": "Google Chrome"
            }
        },
        {
            "type": "execute",
            "parameters": {
                "command": [
                    "python",
                    "-c",
                    "import pyautogui; import time; pyautogui.hotkey('alt', 'f10'); time.sleep(0.5);"
                ]
            }
        }
]
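Editing many task files by hand is error-prone. The repository ships an ip_setting.py helper for this purpose, and the following is only an independent sketch of the idea, not that script. It assumes `halfway_config` is a top-level key of each task JSON file, as in the fragment above; the function names are hypothetical.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def retarget_url(url, host, port):
    """Return url with host and port replaced, keeping scheme, path and query."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, f"{host}:{port}", parts.path, parts.query, parts.fragment))


def retarget_config(config, host, port):
    """Rewrite urls_to_open in every chrome_open_tabs step of halfway_config."""
    for step in config.get("halfway_config", []):
        if step.get("type") == "chrome_open_tabs":
            params = step["parameters"]
            params["urls_to_open"] = [retarget_url(u, host, port) for u in params["urls_to_open"]]
    return config


def retarget_file(path, host, port):
    """Apply retarget_config to one JSON task file in place."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(retarget_config(config, host, port), fh, indent=4)


# Example usage (the path and address are placeholders):
# retarget_file("path/to/config.json", "192.168.1.20", 5006)
```

Steps of other types, such as `activate_window` and `execute`, are left untouched.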

👷 reCAPTCHA

Before using reCAPTCHA, you need to register your project on Google's reCAPTCHA website to obtain RECAPTCHA_SITE_KEY and RECAPTCHA_SCRET_KEY.

Register reCAPTCHA:

Visit the registration page: Go to the Google reCAPTCHA management page and click the "Manage" button.
Register a new website:

Click the 'Register' button. 

Enter your project name (e.g. "Flask reCAPTCHA Test").

Select the reCAPTCHA type (typically reCAPTCHA v2 with the "I'm not a robot" checkbox).

In the "Domains" section, enter your domain name (such as localhost or the actual domain name).

Click the 'Submit' button.

Get the key: After registration is complete, you will see two keys. The Site Key is used for front-end display of reCAPTCHA, and the Secret Key is used for back-end verification of reCAPTCHA responses. Fill in these two keys as RECAPTCHA_SITE_KEY and RECAPTCHA_SCRET_KEY, respectively, in here.
Fill in the domain name: Check the IP address of your host and add it to the domain list of your reCAPTCHA site (https://www.google.com/recaptcha/).

Prepare the front-end page: You need an HTML page to display reCAPTCHA. In your Flask project, create a templates folder and create an index.html file within it (see here).
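For orientation, back-end verification with the Secret Key amounts to a POST request to Google's siteverify endpoint, carrying the token that the widget submits in the `g-recaptcha-response` form field. The sketch below uses only the standard library and is not the repository's Flask code; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_request(secret_key, token, remote_ip=None):
    """Build the POST request that asks Google to validate a reCAPTCHA token."""
    fields = {"secret": secret_key, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip  # optional parameter
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(VERIFY_URL, data=data, method="POST")


def is_human(response_body):
    """Interpret the JSON body returned by the siteverify endpoint."""
    return bool(json.loads(response_body).get("success", False))


# Example usage (performs a real network call):
# request = build_verify_request(secret_key, token_from_form)
# with urllib.request.urlopen(request) as response:
#     print(is_human(response.read()))
```

Here `secret_key` stands for the value stored as RECAPTCHA_SCRET_KEY, and `token_from_form` for the submitted form value.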

👷 Phishing Website

Copy the page source code (HTML) of the website you want to use for phishing and save it in here. Then create a Python script XXX_phishing.py that launches the phishing website, and assign it a port that does not conflict with any other service.

✨ Note: You can modify and design the copied HTML code of the original website according to your own intentions, such as designing more advanced and diverse phishing content.
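As a rough illustration of such a launcher, the sketch below serves a folder of saved HTML with Python's built-in HTTP server and lets the operating system pick an unused port. It is an assumption-laden stand-in rather than the repository's XXX_phishing.py; `make_site_server` is a hypothetical name.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_site_server(html_dir, host="0.0.0.0", port=0):
    """Create (but do not start) a server for the files in html_dir.

    Passing port=0 asks the operating system for any free port, which
    avoids clashes with services that are already running.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(html_dir))
    return ThreadingHTTPServer((host, port), handler)


# Example usage (blocks until interrupted):
# server = make_site_server("path/to/saved_html")
# print("Serving on port", server.server_address[1])
# server.serve_forever()
```

The port that is printed is the one to enter in the corresponding config.json URL. If the task files already reference a fixed port, pass it explicitly through the `port` argument.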

👷 Account

As with the Phishing Website setup, copy the source code of the target website's login page and save it in here, then create an XXX_account.py in here.

✨ Note: You can optionally customize the fraud content.

👷 Phishing Email

You need at least two Gmail accounts for testing (one sender and one receiver). The sender's account additionally requires an application-specific password (app password).

Enable two-step verification:

Log in to your Gmail account.
Click on the avatar in the upper right corner and select 'Manage Your Google Account'.
Select 'Security' from the left menu.
In the "2-Step Verification" section, click "Edit" and follow the prompts to enable 2-Step Verification.

Generate an application-specific password:

On the "Security" page, find the "App passwords" (application-specific passwords) section.
Click the 'Generate Password' button.
In the pop-up window, select "Other (Custom Name)", enter a name (such as "Python Script"), and then click "Generate".
The generated application-specific password is displayed on the screen. Be sure to save it, as it cannot be viewed again after you leave the screen.

✨ Note: Please add the application-specific password to here. In addition, you can design the content of the phishing emails yourself in send_email.py.
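To show where the application-specific password is used, here is a minimal sketch of sending a message through Gmail's SMTP server with the standard library. It is not the repository's send_email.py; the function names and the example addresses are placeholders.

```python
import smtplib
from email.message import EmailMessage


def build_email(sender, receiver, subject, body):
    """Assemble a plain-text email message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = receiver
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_via_gmail(message, sender, app_password):
    """Send the message over SSL, logging in with the sender's app password."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, app_password)
        server.send_message(message)


# Example usage (performs a real login and send):
# msg = build_email("sender@gmail.com", "receiver@gmail.com", "Test", "Hello")
# send_via_gmail(msg, "sender@gmail.com", "your-app-password")
```

The regular account password is not accepted here once 2-Step Verification is enabled, which is why the application-specific password is required.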

Others

We are working on supporting more 👷. Please hold tight!

🚀 Quick Start

To verify your environment and VM setup, you can run the following minimal example script:

python run_minimal_example.py

If your environment and virtual machine are configured correctly, you should observe the script initializing the environment within your VM. Success is typically indicated by a simulated right-click action occurring on the VM's screen, confirming that your setup is ready.

🧪 Experiment

To conduct experiments, we provide the multi_llm_run.sh script. Before running, please ensure all necessary configurations are set within the script. You can then execute it using:

bash multi_llm_run.sh

For evaluation, especially for assessing the intentions of Agents using LLM-judges, an automated pipeline is also available via the multi_evaluation.sh script.

Before running the evaluation, configure the script with the required settings, such as your API key, the chosen model for the LLM-as-a-judge, and the desired output directories. Then, execute the script:

bash multi_evaluation.sh

🔗 Bibtex

If this paper, code, or project is useful to you, please consider citing our paper 📣

@inproceedings{jingyiriosworld,
  title={RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents},
  author={JingYi, Yang and Shao, Shuai and Liu, Dongrui and Shao, Jing},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems}
}

🙏 Acknowledgements

Parts of the code are borrowed from OSWorld and PopupAttack. Sincere thanks to their authors for their wonderful work.
