Seeing Isn’t Understanding: The Spatial Reasoning Gap in Vision-Language Models
Motivation
Vision-language models (VLMs) have rapidly improved and now perform impressively at describing images, answering questions about visual scenes, and even generating stories from pictures.
But when it comes to understanding where things are in space, a surprising weakness emerges.
Ask a VLM if the cat is on the bed or under it, or which object is closer to the camera—and you might get an answer that sounds confident, but is completely wrong. This isn’t a fluke. It reflects a deeper issue: these models can often recognize objects, but fail to grasp their spatial relationships.
This blog post explores one of the most persistent challenges in multimodal AI: spatial reasoning. We'll walk through what spatial reasoning actually means, why it's so hard for VLMs, what researchers have tried in order to fix it, and what future directions and benefits spatially-aware VLMs could bring.
Table of Contents
- Defining the Problem: What Is Spatial Reasoning?
- A Comparative Evaluation of Spatial Skills in VLMs
- Why Do VLMs Struggle with Spatial Reasoning?
- How Are Researchers Tackling the Problem?
- Future Directions and the Promise of Spatially-Aware AI
- References
Defining the Problem: What Is Spatial Reasoning?
Spatial reasoning is the ability to understand, manipulate, and infer relationships between objects in space. It’s not just one skill, but a set of interconnected abilities [1]:
Spatial Relations
Spatial relations describe how objects are positioned relative to each other: for example, knowing whether something is next to, inside, above, or farther away from something else.
There are different types of spatial relations:
- Topological: Describes connections or placement, like whether something is inside, next to, or touching something else.
- Projective: Describes direction, such as above/below, in front/behind, or left/right.
- Metric: Involves size or distance, like how big something is or how far apart two objects are.
In fields like digital mapping (e.g., GPS or GIS systems), computers need to understand these relationships to answer questions like “What’s near me?” or “What’s inside this region?”
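To make these three categories a bit more tangible, here is a tiny sketch of how a program can answer such queries explicitly. It uses the shapely geometry library purely as an illustration; the library choice, coordinates, and object names are assumptions for this example, not something taken from the systems discussed in this post.

```python
# Toy illustration of explicit spatial relations with shapely (assumed library).
from shapely.geometry import Point, Polygon

park = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])  # a square region
bench = Point(3, 4)                                    # a point inside the park
cafe = Point(15, 4)                                    # a point outside the park

print(park.contains(bench))   # True  -> topological: "inside this region"
print(park.touches(cafe))     # False -> topological: not even adjacent
print(bench.distance(cafe))   # 12.0  -> metric: "how far apart"
```

A GIS system answers "What's inside this region?" with exactly these kinds of explicit predicates; the difficulty for VLMs is that they have to infer the same relations from pixels and language alone.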
Mental Rotation
Mental rotation is the ability to imagine how an object would look if turned or viewed from a different angle. Humans do this naturally — for example, we can look at a chair from the front and picture what it would look like from the side.
For vision-language models, this is much harder. Even if a model recognizes an object, it may struggle to identify the same object when it’s rotated or viewed from a new perspective.
Mental rotation requires models not just to recognize objects, but to mentally transform how they appear in space. This skill is important for tasks like identifying objects from unusual angles, answering visual questions like "What does the object look like from behind?", and matching rotated objects in images.
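As a concrete (and entirely hypothetical) probe, one could generate rotated views of an object image and then ask a model which candidate shows the same object. The file names below are made up for illustration.

```python
# Hypothetical mental-rotation probe: create rotated candidates of one object image.
from PIL import Image

original = Image.open("chair.png")                 # made-up file name for this example
for angle in (90, 180, 270):
    rotated = original.rotate(angle, expand=True)  # expand keeps the whole object in frame
    rotated.save(f"chair_rot{angle}.png")          # candidates for a matching question
```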
Spatial Visualization
Spatial visualization is the ability to imagine how objects or spaces change when they move, fold, rotate, or transform. It goes beyond just seeing things — it’s about mentally working with shapes and spaces to understand what they would look like after a change.
Examples include imagining how a folded paper will look when opened, figuring out how parts fit together in a machine, etc.
For vision-language models, spatial visualization means handling multi-step transformations of objects or scenes. This is crucial for tasks like understanding how a scene changes over time, following step-by-step instructions involving movement or assembly, and simulating spatial changes in virtual environments.
Spatial Orientation and Navigation
Spatial orientation helps us find our way in complex environments — like knowing where we are in a room, how to get to the door, or what’s behind us. Our brains combine visual, bodily, and auditory information to build a mental map of surroundings.
Two main frames of reference organize this spatial information:
- Egocentric: Based on our own body — e.g., “the chair is on my right.”
- Allocentric: Based on landmarks — e.g., “the chair is next to the window,” regardless of where you stand.
This flexibility lets us move through spaces, remember routes, and adapt when environments change.
For vision-language models, understanding spatial orientation means recognizing where things are — not just in an image, but relative to each other and to the viewer. This is essential for following navigation instructions in real-world settings, interacting with 3D environments or augmented reality, etc.
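The difference between the two reference frames is easiest to see in coordinates. Below is a minimal sketch, my own illustration rather than something from the papers discussed later, that converts an allocentric (room-centered) landmark position into egocentric coordinates given the viewer's position and heading.

```python
# Converting an allocentric (room-frame) position into egocentric coordinates.
import numpy as np

def to_egocentric(landmark_xy, viewer_xy, viewer_heading_rad):
    """Express a room-frame point in the viewer's frame.
    Convention: x_ego > 0 means ahead of the viewer, y_ego > 0 means to the left."""
    dx, dy = np.asarray(landmark_xy, dtype=float) - np.asarray(viewer_xy, dtype=float)
    c, s = np.cos(-viewer_heading_rad), np.sin(-viewer_heading_rad)
    return c * dx - s * dy, s * dx + c * dy

# Allocentric: "the chair is at (2, 3) in the room."
# Egocentric:  for a viewer at the origin facing along +x, the chair is
#              2 units ahead and 3 units to the left.
print(to_egocentric((2, 3), (0, 0), viewer_heading_rad=0.0))  # (2.0, 3.0)
```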
A Comparative Evaluation of Spatial Skills in VLMs
In their paper, Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [1], Stogiannidis et al. evaluate the spatial reasoning abilities of a diverse set of 13 vision-language models across six cognitively inspired tasks: Paper Folding, Mental Rotation (Easy & Hard), Navigation, Orientation, and Spatial Relations.
Their evaluation reveals significant performance differences between models. Notably, performance often hovers near random chance on several tasks, highlighting spatial reasoning as a persistent weakness in current VLMs.
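To give a feel for how such benchmarks are scored, here is a hedged sketch of a multiple-choice evaluation loop compared against random chance. The `query_vlm` function, file names, and questions are placeholders I made up; the actual benchmark uses its own task formats and models.

```python
# Illustrative evaluation loop: accuracy on multiple-choice spatial questions vs. chance.
import random

def query_vlm(image_path, question, options):
    """Placeholder for a real VLM call; here it just guesses randomly."""
    return random.choice(options)

dataset = [
    {"image": "scene1.png", "question": "Is the mug to the left or right of the laptop?",
     "options": ["left", "right"], "answer": "left"},
    {"image": "scene2.png", "question": "Is the cat on or under the bed?",
     "options": ["on", "under"], "answer": "under"},
]

correct = sum(query_vlm(ex["image"], ex["question"], ex["options"]) == ex["answer"]
              for ex in dataset)
chance = 1 / len(dataset[0]["options"])
print(f"accuracy={correct / len(dataset):.2f} vs. chance={chance:.2f}")
```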
Why Do VLMs Struggle with Spatial Reasoning?
Now that we've clarified what spatial reasoning involves, it's time to look at why today's Vision-Language Models (VLMs) struggle with it so consistently. It's not that they don't "see" the image—it’s that they don't look in the right places, or sometimes, they don’t really look at all.
In their paper Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [2], Chen et al. investigate the root causes behind vision-language models' struggles with spatial reasoning, focusing primarily on spatial relations.
Their analysis reveals that despite processing large visual inputs, VLMs often underutilize image information during reasoning, largely due to how they allocate attention. They identify three key challenges:
- Imbalanced attention: Although image tokens constitute over 90% of the input, they receive only about 10% of the model's attention, showing a strong bias toward textual inputs (see the sketch after this list for how this share can be measured).
- Misplaced visual focus: It’s not just about attention quantity but also its location — the model sometimes focuses too little on the right objects or too much on irrelevant ones, causing spatial errors.
- Training data bias and overreliance on unimodal priors: Models like LLaVA tend to be more confident when predicting common relationships like "left" or "right," but struggle with less frequent ones such as "under" or "behind." For the underrepresented relations, they guess based on familiar language patterns rather than actual visual evidence, which results in hallucinated spatial relationships.
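As a rough illustration of the first point, the share of attention that lands on image tokens can be measured directly from a model's attention maps. The sketch below is my own simplification, not the authors' code, and assumes you already have per-layer attention tensors of shape (batch, heads, seq_len, seq_len) plus the index range of the image tokens in the sequence.

```python
# Rough estimate of how much attention mass goes to image tokens.
import torch

def image_attention_share(attentions, image_token_range):
    """attentions: iterable of per-layer tensors shaped (batch, heads, seq, seq).
    image_token_range: (start, end) indices of image tokens in the sequence."""
    start, end = image_token_range
    shares = []
    for layer_attn in attentions:
        # For every query position, sum the attention it pays to image-token keys.
        to_image = layer_attn[..., start:end].sum(dim=-1)   # (batch, heads, seq)
        shares.append(to_image.mean().item())
    return sum(shares) / len(shares)

# e.g. share = image_attention_share(outputs.attentions, (5, 581))
# The paper's finding: this share sits around 10%, even though image tokens
# make up more than 90% of the input sequence.
```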
How Are Researchers Tackling the Problem?
Now that we understand some of the reasons why VLMs struggle with spatial reasoning, the next question is: What can we do about it?
Researchers have started exploring practical solutions, both by rethinking how we use language [3] and by adjusting how models focus on images. From caption prior normalization and prompt reformulations [3] to clever decoding techniques like ADAPTVIS, these efforts aim to help models better understand where things are in an image, not just what they are.
The What’sUp Benchmark
Kamath et al. [3] introduce the What’sUp benchmark, a carefully curated dataset designed to isolate spatial reasoning by varying object positions (e.g., a dog under vs. on top of a table) while keeping object identity fixed. Their evaluations across 18 models show a significant drop in performance on spatial tasks, highlighting that popular VL pretraining corpora like LAION-2B lack sufficient spatial examples and that simple fine-tuning or upweighting strategies aren't enough to bridge the gap.
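To see what "isolating spatial reasoning" looks like in practice, here is a small probe in the spirit of What'sUp: the image stays fixed and only the preposition in the caption changes. I use CLIP through Hugging Face transformers as one illustrative scorer; the image file name is made up, and the benchmark itself evaluates many more models than this.

```python
# What'sUp-style probe: same image, captions that differ only in the spatial preposition.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_under_table.jpg")   # hypothetical test image
captions = ["a dog under a table", "a dog on a table",
            "a dog to the left of a table", "a dog to the right of a table"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # image-to-caption similarity scores
print("model's pick:", captions[logits.argmax().item()])
```

If the model ranks "a dog on a table" above "a dog under a table" for an image of a dog lying under the table, it is matching objects rather than their spatial arrangement, which is exactly the failure mode the benchmark is designed to expose.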
ADAPTVIS: Confidence-Based Attention Intervention
To move beyond purely text-side fixes, Chen et al. propose ADAPTVIS, a simple yet powerful decoding-time method that directly addresses how models allocate visual attention.
The idea is this:
- When the model shows high confidence in its prediction (with attention logits serving as a proxy for the model's self-confidence), its attention is likely focused on the right area, so the method sharpens it to reinforce that focus.
- When the model is less confident, its attention may be misdirected, so the method smooths it, encouraging exploration of alternative regions of the image.
This technique dynamically adjusts the model’s attention to image tokens at inference time, based on the confidence of the last generated token. By applying this across all attention heads and layers, without any retraining, ADAPTVIS significantly improves spatial reasoning performance.
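Here is a bare-bones sketch of that idea, written from the description above rather than from the authors' released code; the threshold and scaling factors are illustrative placeholders, and a real implementation applies this inside every attention layer during decoding.

```python
# Confidence-dependent scaling of attention over image tokens (illustrative only).
import torch
import torch.nn.functional as F

def adjust_image_attention(attn_logits, image_slice, confidence,
                           threshold=0.5, sharpen=1.5, smooth=0.7):
    """attn_logits: pre-softmax attention scores for one query, shape (..., seq_len).
    image_slice: slice covering the image-token positions in the sequence.
    confidence: the model's confidence in its last generated token."""
    alpha = sharpen if confidence >= threshold else smooth
    scaled = attn_logits.clone()
    scaled[..., image_slice] = scaled[..., image_slice] * alpha  # >1 sharpens, <1 smooths
    return F.softmax(scaled, dim=-1)

# High confidence -> alpha > 1: attention over image tokens becomes peakier,
#                    reinforcing the region the model already trusts.
# Low confidence  -> alpha < 1: attention is flattened, letting other image
#                    regions compete for the model's focus.
```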
Future Directions and the Promise of Spatially-Aware AI
As Vision-Language Models become more spatially aware, their real-world potential expands dramatically. In healthcare, better spatial reasoning could enhance the interpretation of medical images, leading to more accurate diagnoses. In augmented reality, it could enable richer, more immersive experiences through precise scene understanding. For assistive technologies, enhanced spatial grounding can empower visually impaired individuals with more accurate, real-time descriptions of their environment—improving independence and quality of life. And in AI and robotics, these principles already underpin systems like SLAM (Simultaneous Localization and Mapping), allowing robots to navigate and map their surroundings more effectively using vision and sensors.
And thaaaat’s a wrap on my very first blog post! 🎉 I had so much fun learning and sharing about spatial reasoning in vision-language models, and I really hope you found it useful and interesting. Here’s to many more posts. Thank you for reading and see you next time! 🤗👋
References
[1] Stogiannidis, I., McDonagh, S., & Tsaftaris, S. A. (2025). Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models. arXiv preprint arXiv:2503.19707.
[2] Chen, S., Zhu, T., Zhou, R., Zhang, J., Gao, S., Niebles, J. C., Geva, M., He, J., Wu, J., & Li, M. (2025). Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas. arXiv preprint arXiv:2503.01773.
[3] Kamath, A., Hessel, J., & Chang, K.-W. (2023). What's "up" with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785.
[4] SISAP 2024 Challenge Datasets. https://sisap-challenges.github.io/2024/datasets/
[5] Visual Spatial Reasoning dataset (Hugging Face). https://huggingface.co/datasets/juletxara/visual-spatial-reasoning