# GUIrilla-See-7B
*Vision–language grounding for graphical user interfaces*
## Summary
GUIrilla-See-7B is a 7-billion-parameter Qwen 2.5-VL model fine-tuned to locate on-screen elements of the macOS GUI. Given a screenshot and a natural-language task, the model returns a single point (x, y) at (or very near) the centre of the referenced region.
## Quick-start
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
import PIL.Image as Image

# Load the fine-tuned grounding model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "GUIrilla/GUIrilla-See-7B",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "GUIrilla/GUIrilla-See-7B",
    trust_remote_code=True,
    use_fast=True,
)

image = Image.open("screenshot.png")
task = "the search field in the top-right corner"

# Single-turn conversation: the screenshot plus the grounding prompt.
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text",
         "text": (
             "Your task is to help the user identify the precise coordinates "
             "(x, y) of a specific area/element/object on the screen based on "
             "a description.\n"
             "- Your response should aim to point to the centre or a representative "
             "point within the described area/element/object as accurately as possible.\n"
             "- If the description is unclear or ambiguous, infer the most relevant area "
             "or element based on its likely context or purpose.\n"
             "- Your answer should be a single string (x, y) corresponding to the point "
             "of interest.\n"
             f"\nDescription: {task}"
             "\nAnswer:"
         )},
    ],
}]

# Render the chat template, then tokenise text and image together.
texts = processor.apply_chat_template(conversation, tokenize=False,
                                      add_generation_prompt=True)
image_inputs = [image]
inputs = processor(text=texts, images=image_inputs,
                   return_tensors="pt", padding=True).to(model.device)

# Generate, then decode only the newly generated tokens.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=16, num_beams=3)
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
answer = processor.batch_decode(generated_ids,
                                skip_special_tokens=True)[0]
print("Predicted click:", answer)  # → "(812, 115)"
```
## Training Data
Trained on the GUIrilla-Task dataset.

- Train data: 25,606 tasks across 881 macOS applications (5% of the applications held out for validation).
- Test data: 1,565 tasks across 227 macOS applications.
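A sketch for loading the dataset with the `datasets` library, assuming it is published on the Hugging Face Hub under the `GUIrilla/GUIrilla-Task` identifier (the identifier and the fields named in the comments are assumptions, not confirmed by this card):

```python
from datasets import load_dataset

# Assumed Hub identifier; adjust if the dataset lives elsewhere.
ds = load_dataset("GUIrilla/GUIrilla-Task")

print(ds)                     # split names and sizes
print(ds["train"][0].keys())  # per-task fields (e.g. screenshot, description, target)
```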
## Training Procedure
- 2 epochs of LoRA fine-tuning on 2 × H100 80 GB GPUs.
- Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), LR = 2e-5 with cosine decay and a warm-up ratio of 0.05 (see the configuration sketch after this list).
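A sketch of how the stated hyperparameters map onto `peft` and `transformers` configuration objects; the LoRA rank, alpha, and target modules below are hypothetical, since the card does not report them:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Hypothetical LoRA settings; rank, alpha, and targets are not given in the card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Optimiser settings as stated above: AdamW with betas (0.9, 0.95),
# LR 2e-5, cosine decay, warm-up ratio 0.05, 2 epochs.
training_args = TrainingArguments(
    output_dir="guirilla-see-7b-lora",
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    adam_beta1=0.9,
    adam_beta2=0.95,
    bf16=True,
)
```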
## Evaluation
| Split | Success Rate (%) |
|---|---|
| Test | 75.59 |
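The card does not spell out the success criterion. A common convention for GUI grounding, assumed in the sketch below, is to count a prediction as successful when the predicted point falls inside the target element's ground-truth bounding box:

```python
def is_success(point: tuple[int, int], bbox: tuple[int, int, int, int]) -> bool:
    """Assumed criterion: the predicted point lies inside the (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

# Toy example with two (predicted point, ground-truth box) pairs.
preds = [((812, 115), (790, 100, 840, 130)),
         ((40, 500), (300, 300, 360, 340))]
hits = [is_success(p, b) for p, b in preds]
print(f"Success rate: {100 * sum(hits) / len(hits):.2f}%")  # → Success rate: 50.00%
```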
Ethical & Safety Notes
- Always sandbox the model or require confirmation steps when connecting it to real GUIs (see the sketch after this list).
- Screenshots may reveal sensitive data; ensure compliance with applicable privacy regulations.
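As an illustration of the confirmation-step advice, a minimal sketch using the third-party `pyautogui` library (the library choice and the `confirmed_click` helper are assumptions, not part of this project):

```python
import pyautogui

def confirmed_click(x: int, y: int) -> None:
    """Ask the operator for explicit approval before clicking at (x, y)."""
    if input(f"Click at ({x}, {y})? [y/N] ").strip().lower() == "y":
        pyautogui.click(x, y)
    else:
        print("Click cancelled.")

confirmed_click(812, 115)
```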
## License
MIT (see `LICENSE`).
## Base Model
Qwen/Qwen2.5-VL-7B-Instruct