---
license: apache-2.0
base_model:
- mistralai/Devstral-Small-2507
---

# Devstral-Vision-Small-2507

Created by [Eric Hartford](https://erichartford.com/) at [Quixi AI](https://erichartford.com/)

## Model Description

Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) with the vision understanding of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506).

This model enables vision-augmented software engineering, allowing developers to:

- Analyze screenshots and UI mockups to generate code
- Debug visual rendering issues with actual screenshots
- Convert designs and wireframes directly into implementation
- Understand and modify codebases with visual context

### Model Details

- **Base Architecture**: Mistral Small 3.2 with vision encoder
- **Parameters**: 24B (language model) + vision components
- **Context Window**: 128k tokens
- **License**: Apache 2.0
- **Language Model**: Fine-tuned Devstral weights for superior coding performance
- **Vision Model**: Mistral-Small vision encoder and multimodal projector

## How It Was Created

This model was created by surgically transplanting the language-model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:

1. Started with Mistral-Small-3.2-24B-Instruct-2506 (the complete multimodal model)
2. Replaced only the core language-model weights with Devstral-Small-2507's fine-tuned weights
3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
4. Kept Mistral's tokenizer to maintain proper image-token handling

The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.
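In outline, the transplant looks like the sketch below. This is an illustration only, assuming Hugging Face-format checkpoints and a `language_model.` weight prefix inside the multimodal model (both assumptions; the prefix varies across transformers versions). The full script linked just below is the authoritative version.

```python
# Minimal sketch of the weight transplant (illustrative only; see the
# linked make_devstral_vision.py for the actual procedure). Assumes
# HF-format checkpoints and a "language_model." decoder prefix in the
# multimodal model -- both are assumptions, not verified names.
import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText

# 1. Start with the complete multimodal model (vision tower + projector + LM).
target = AutoModelForImageTextToText.from_pretrained(
    "mistralai/Mistral-Small-3.2-24B-Instruct-2506", torch_dtype=torch.bfloat16
)
# 2. Load the text-only coding model that donates its decoder weights.
donor = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507", torch_dtype=torch.bfloat16
)

target_sd = target.state_dict()
copied = 0
for name, tensor in donor.state_dict().items():
    # 3. Keep Mistral's input/output embeddings so image tokens still work
    #    (the card says token embeddings are preserved; skipping lm_head too
    #    is a conservative choice in this sketch).
    if "embed_tokens" in name or "lm_head" in name:
        continue
    key = f"language_model.{name}"  # assumed prefix; check the real checkpoint
    if key in target_sd and target_sd[key].shape == tensor.shape:
        target_sd[key] = tensor
        copied += 1

target.load_state_dict(target_sd)
print(f"Transplanted {copied} tensors")
# 4. Vision encoder, projector, and tokenizer are untouched; save the hybrid.
target.save_pretrained("Devstral-Vision-Small-2507")
```

Note that this is a pure weight copy: no retraining or distillation is involved.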
The full conversion script is available here: [make_devstral_vision.py](make_devstral_vision.py).

## Intended Use

### Primary Use Cases

- **Visual Software Engineering**: Analyze UI screenshots, mockups, and designs to generate implementation code
- **Code Review with Visual Context**: Review code changes alongside their visual output
- **Debugging Visual Issues**: Debug rendering problems by analyzing screenshots
- **Design-to-Code**: Convert visual designs directly into code
- **Documentation with Visual Examples**: Generate documentation that references visual elements

### Example Applications

- Building UI components from screenshots
- Debugging CSS/styling issues with visual feedback
- Converting Figma/design mockups to code
- Analyzing and reproducing visual bugs
- Creating visual test cases

## Usage

### With OpenHands

The model is optimized for use with [OpenHands](https://github.com/All-Hands-AI/OpenHands) for agentic coding tasks:

```bash
# Serve the model with vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tensor-parallel-size 2

# Configure OpenHands to use the model:
#   Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
#   Base URL:     http://localhost:8000/v1
```

### With Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Create a prompt
prompt = "Analyze this UI screenshot and generate React code to reproduce it."

# Process inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)

# Generate (do_sample=True so the temperature setting takes effect)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7
)

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/GUij-XVX7zaoU9UjG4n19.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/wLHwLZti9Na0O-UOVh-Nh.png)

## Performance Expectations

### Coding Performance

Inherits Devstral's exceptional performance on coding tasks:

- 53.6% on SWE-Bench Verified (when used with OpenHands)
- Superior performance on multi-file editing and codebase exploration
- Excellent tool use and agentic behavior

### Vision Performance

Maintains Mistral-Small's vision capabilities:

- Strong understanding of UI elements and layouts
- Accurate interpretation of charts, diagrams, and visual documentation
- Reliable screenshot analysis for debugging

## Hardware Requirements

- **GPU Memory**: ~48 GB in 16-bit (bf16) precision, ~24 GB with 4-bit quantization
- **Recommended**: 2x RTX 4090 or better for optimal performance
- **Minimum**: Single GPU with 24 GB VRAM using quantization (see the sketch below)
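For the quantized path, something like the following should work. This is a minimal sketch, assuming `bitsandbytes` is installed and that the merged model loads through the same `AutoModelForCausalLM` entry point used in the Transformers example above (an assumption, not a tested configuration):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes; the 24B weights then fit
# in roughly 24 GB of VRAM, per the requirements above.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/Devstral-Vision-Small-2507",
    quantization_config=quant_config,
    device_map="auto",
)
```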
## Limitations

- Vision capabilities are limited to what Mistral-Small-3.2 supports
- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
- Large model size may be prohibitive for some deployment scenarios
- Best performance is achieved with appropriate scaffolding (OpenHands, Cline, etc.)

## Ethical Considerations

This model inherits both the capabilities and limitations of its parent models. Users should:

- Review generated code for security vulnerabilities
- Verify that visual interpretations are accurate
- Be aware of potential biases in code generation
- Use appropriate safety measures in production deployments

## Citation

If you use this model, please cite:

```bibtex
@misc{devstral-vision-2507,
  author = {Hartford, Eric},
  title = {Devstral-Vision-Small-2507},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}
```

## Acknowledgments

This model builds upon the excellent work by:

- [Mistral AI](https://mistral.ai/) for both Mistral-Small and Devstral
- [All Hands AI](https://www.all-hands.dev/) for their collaboration on Devstral
- The open-source community for testing and feedback

## License

Apache 2.0 - same as the base models

---

*Created with dolphin passion 🐬 by Cognitive Computations*