GUIrilla-See-7B

Vision–language grounding for graphical user interfaces


Summary

GUIrilla-See-7B is a 7-billion-parameter Qwen2.5-VL model fine-tuned to locate on-screen elements in macOS GUIs. Given a screenshot and a natural-language task, the model returns a single point (x, y) that lies at (or very near) the centre of the referenced region.


Quick-start

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch, PIL.Image as Image

# Load the fine-tuned grounding model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "GUIrilla/GUIrilla-See-7B",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "GUIrilla/GUIrilla-See-7B",
    trust_remote_code=True,
    use_fast=True,
)

image = Image.open("screenshot.png")
task  = "the search field in the top-right corner"

conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text",
         "text": (
             "Your task is to help the user identify the precise coordinates "
             "(x, y) of a specific area/element/object on the screen based on "
             "a description.\n"
             "- Your response should aim to point to the centre or a representative "
             "point within the described area/element/object as accurately as possible.\n"
             "- If the description is unclear or ambiguous, infer the most relevant area "
             "or element based on its likely context or purpose.\n"
             "- Your answer should be a single string (x, y) corresponding to the point "
             "of interest.\n"
             f"\nDescription: {task}"
             "\nAnswer:"
         )},
    ],
}]

texts        = processor.apply_chat_template(conversation, tokenize=False,
                                             add_generation_prompt=True)
image_inputs = [image]
inputs       = processor(text=texts, images=image_inputs,
                         return_tensors="pt", padding=True).to(model.device)

# Generate a short answer of the form "(x, y)"
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=16, num_beams=3)

# Strip the prompt tokens, keeping only the generated answer
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
answer = processor.batch_decode(generated_ids,
                                skip_special_tokens=True)[0]
print("Predicted click:", answer)      # → "(812, 115)"

Training Data

Trained on the GUIrilla-Task dataset.

  • Train split: 25,606 tasks across 881 macOS applications (5% of the applications held out for validation)
  • Test split: 1,565 tasks across 227 macOS applications

Training Procedure

  • 2 epochs of LoRA fine-tuning on 2 × H100 80 GB.
  • Optimiser – AdamW (β₁ = 0.9, β₂ = 0.95), LR = 2e-5 with cosine decay and a warm-up ratio of 0.05 (a configuration sketch follows this list).
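
The card does not include the training script itself; the sketch below only shows how the stated hyperparameters could map onto a standard PEFT + Transformers setup. The LoRA rank, alpha, target modules, and batch size are assumptions, not values from the card.

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings: rank, alpha, and target modules are illustrative assumptions
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Optimisation settings as stated in the card
training_args = TrainingArguments(
    output_dir="guirilla-see-7b-lora",
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    adam_beta1=0.9,
    adam_beta2=0.95,
    bf16=True,
    per_device_train_batch_size=1,   # assumption; not stated in the card
)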

Evaluation

Split | Success Rate (%)
Test  | 75.59
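
The card does not state the exact success criterion. A common convention for GUI grounding is to count a prediction as successful when the predicted point falls inside the target element's bounding box; the sketch below assumes that convention, and all names and example values are hypothetical.

# Hypothetical example: one predicted point and its ground-truth element box
predictions = [(812, 115)]
ground_truth_boxes = [(780, 100, 860, 130)]   # (x1, y1, x2, y2) in pixels

def click_success(pred_xy, box):
    # A prediction counts as a hit when the point lies inside the element's box
    x, y = pred_xy
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

success_rate = 100 * sum(
    click_success(p, b) for p, b in zip(predictions, ground_truth_boxes)
) / len(ground_truth_boxes)
print(f"Success rate: {success_rate:.2f}%")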

Ethical & Safety Notes

  • Always sandbox the model or require user confirmation before it acts on real GUIs (a minimal example follows this list).
  • Screenshots may reveal sensitive data – ensure compliance with privacy regulations.
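
As an illustration of the confirmation-step advice, here is a hypothetical wrapper that asks the user before any click is issued; pyautogui is just one possible automation backend, not something the card prescribes.

import pyautogui  # or any other GUI-automation backend

def confirmed_click(x: int, y: int) -> None:
    # Require explicit user approval before acting on the model's prediction
    reply = input(f"Click at ({x}, {y})? [y/N] ").strip().lower()
    if reply == "y":
        pyautogui.click(x, y)
    else:
        print("Click cancelled.")

confirmed_click(812, 115)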

License

MIT (see LICENSE).
