SketchVLM: Vision-Language Models Can Annotate Images to Explain Thoughts and Guide Users

1Auburn University, 2Independent, 3Adobe
SketchVLM demo

Abstract

When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision–language models (VLMs) such as Gemini-3-Pro and GPT-5 typically respond with only text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across six benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 points and sketch quality by up to +48.3% over image-editing and fine-tuned sketching baselines, while also producing sketches that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and sketching quality, and multi-turn generation opens up further opportunities for human-AI collaboration.

SketchVLM teaser

For complex questions, modern chatbots like ChatGPT often return long text responses (a) that are hard for users to understand, verify, and follow. In contrast, SketchVLM guides users (b) step by step by directly annotating the input image and grounding instructions in visual evidence—here, guiding a user through checking their car's oil level.

Existing approaches require task-specific training, modify the source image directly, or lack support for free-form visual annotation. SketchVLM is:

Training-free
Multi-turn
Model-agnostic
General purpose
Non-destructive
Free-form drawing

Method

First, a coordinate grid is appended to the input image so the model can reference precise spatial locations. Second, a system prompt instructs the VLM to output structured annotation primitives (circles, arrows, text labels, lines) as XML alongside its text answer. Third, the XML is automatically converted into an SVG overlay rendered on top of the original image, keeping the source pixels completely untouched. No fine-tuning or external vision tools are required as the entire pipeline runs through the VLM's existing capabilities.
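To make the third step concrete, below is a minimal, hedged sketch of converting the model's XML annotation primitives into an SVG overlay. The tag and attribute names (annotations, circle, line, label, x, y, r) are illustrative assumptions, not the exact schema used by SketchVLM.

import xml.etree.ElementTree as ET

def xml_to_svg_overlay(annotation_xml: str, width: int, height: int) -> str:
    """Convert VLM-emitted annotation primitives into an SVG overlay string.
    The overlay is rendered on top of the original image, so the source
    pixels are never modified."""
    root = ET.fromstring(annotation_xml)
    shapes = []
    for node in root:
        if node.tag == "circle":
            shapes.append(
                f'<circle cx="{node.get("x")}" cy="{node.get("y")}" r="{node.get("r", "25")}" '
                f'fill="none" stroke="red" stroke-width="3"/>'
            )
        elif node.tag == "line":  # arrows could be handled the same way with an arrowhead marker
            shapes.append(
                f'<line x1="{node.get("x1")}" y1="{node.get("y1")}" '
                f'x2="{node.get("x2")}" y2="{node.get("y2")}" stroke="red" stroke-width="3"/>'
            )
        elif node.tag == "label":
            shapes.append(
                f'<text x="{node.get("x")}" y="{node.get("y")}" font-size="24" fill="red">'
                f'{node.text or ""}</text>'
            )
    # A separate SVG layer is returned; compositing it over the raster image is left to the viewer.
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(shapes)
        + "</svg>"
    )

# Example: a hypothetical annotation for the ball-drop task.
example_xml = '<annotations><circle x="120" y="300" r="30"/><label x="160" y="295">ball lands here</label></annotations>'
print(xml_to_svg_overlay(example_xml, 640, 480))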

Tasks

We evaluate on six benchmarks spanning two categories. Drawing tasks test annotation quality: Connect-the-Dots (locate and connect numbered dots in order), Drawing Shapes around Objects (localize COCO objects with rectangles or ovals), and Part Labeling (place text labels at correct part locations on PACO/Pascal-Part images). Visual reasoning tasks test whether sketching helps models think: Maze Navigation (trace a path through a 3×3 grid maze), Physics Ball Drop (predict which container a dropped ball lands in), and Object Counting (count and mark every instance of an object).

We compare SketchVLM (with Gemini-3-Pro and GPT-5) against the state-of-the-art image-editing model Nano Banana Pro, which produces annotations by editing the image directly, and against the fine-tuned sketching models ViLaSR and ThinkMorph, which are trained specifically to produce visual reasoning traces.

Evaluation

We evaluate using three different metrics. Accuracy captures whether the model's text answer is correct. Annotation quality captures whether the visual annotation is clear and interpretable, scored 1-5 by a VLM judge that evaluates plausibility and visual clarity. Annotation-text alignment captures whether the sketch faithfully reflects the model's stated answer: a VLM judge views only the annotated image, infers an answer from the sketch alone, and we check if it matches the model's text output.
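As a rough illustration of the alignment metric, the sketch below assumes a hypothetical judge_vlm(image, prompt) helper that returns the judge's textual answer; the exact judge prompt and answer-matching rule in the paper may differ (e.g., be more permissive than exact match).

def annotation_text_alignment(annotated_image, model_text_answer: str) -> bool:
    """Check whether the sketch alone conveys the same answer as the model's text answer.
    judge_vlm is a hypothetical helper wrapping a judge VLM call on the annotated image only."""
    judge_answer = judge_vlm(
        annotated_image,
        prompt="Looking only at the drawn annotations, what answer do they convey?",
    )
    # Normalize and compare; an assumption, the actual matching rule may be more lenient.
    return judge_answer.strip().lower() == model_text_answer.strip().lower()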

Task Examples

Counting Objects

Counting example

Nano Banana Pro generates a different image and predicts an incorrect count. The fine-tuned sketching baselines directly output a number and severely undercount, while SketchVLM outputs the correct answer and produces visual annotations to explain it.

Drawing Shapes around Objects

Object detection example

When prompted to outline the classes "person" and "sports-ball", Nano Banana Pro replaces the original image with a newly generated one, whereas SketchVLMs preserve the original image and draw shapes that accurately align with object boundaries and locations.

Part Labeling

Part labeling example

Qualitative comparison on the part labeling task. SketchVLM places each part label directly on its corresponding region while preserving the original image, producing more interpretable part annotations than the baselines.

Physics Ball Drop

Ball drop physics example

SketchVLM generates the most accurate ball-drop trajectory visualizations compared to the other baselines.

Results

Key Takeaway: Prompting frontier models (GPT-5 and Gemini-3-Pro) to output sketches with SketchVLM yields superior generalizability, accuracy, and sketch quality compared to specialized fine-tuned sketching models (ViLaSR, ThinkMorph) and image-editing models (Nano Banana Pro).

SketchVLMs produce accurate visual reasoning traces while maintaining competitive or superior task accuracy. Fine-tuned sketching models perform near random chance on visual reasoning tasks, and Nano Banana Pro frequently alters the original image, undermining trustworthiness.

Accuracy results table

Beyond accuracy, SketchVLMs consistently produce higher-quality sketches that more faithfully reflect the model's stated answer. Baselines either generate low-quality annotations or produce sketches that contradict their own text output.

Sketch quality and alignment results table

Real-World Applications

Multi-turn UI guidance

Multi-turn example of SketchVLM guiding a user through removing an image's background. At each turn, the model receives a screenshot and annotates it with labeled arrows and highlighted UI elements to indicate the next step.
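A minimal sketch of that multi-turn loop is below, assuming hypothetical capture_screenshot, sketchvlm_annotate, and show_overlay helpers; the actual interface may differ.

def guide_user(task: str, max_turns: int = 10) -> None:
    """Guide a user step by step by annotating a fresh screenshot each turn."""
    history = []
    for _ in range(max_turns):
        screenshot = capture_screenshot()               # current state of the user's screen
        overlay, done = sketchvlm_annotate(task, screenshot, history)
        show_overlay(screenshot, overlay)               # SVG rendered on top; screenshot pixels untouched
        history.append((screenshot, overlay))
        if done:                                        # model signals the task is complete
            break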

Hot water example

How do I turn off my hot water?

Motherboard example

I've got two sticks of RAM. Where do they go?

Nutmeg example

Where is the ground nutmeg?

SketchVLMs can be used in a variety of real-world use cases.

BibTeX

@misc{collins2026sketchvlmvisionlanguagemodels,
      title={SketchVLM: Vision language models can annotate images to explain thoughts and guide users},
      author={Brandon Collins and Logan Bolton and Hung Huy Nguyen and Mohammad Reza Taesiri and Trung Bui and Anh Totti Nguyen},
      year={2026},
      eprint={2604.22875},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.22875},
}