NVIDIA AI Introduce SpatialClaw: A Training-Free Agent That Treats Code as the Action Interface for Spatial Reasoning
NVIDIA Research has released SpatialClaw, a training-free framework for spatial reasoning.

NVIDIA Research has released SpatialClaw, a training-free framework for spatial reasoning. It targets a persistent weakness in vision-language models (VLMs). These models still struggle to judge where objects are, how they relate, and how they move in 3D.
SpatialClaw does not retrain the model. Instead, it changes the action interface the agent uses to call perception tools. The research team argues the interface is the bottleneck. Their solution is to treat code as the action interface. Across 20 benchmarks, SpatialClaw reaches 59.9% average accuracy. It outperforms the recent spatial agent SpaceTools by 11.2 points.
SpatialClaw is an agent loop wrapped around a stateful Python kernel. The kernel is pre-loaded with input frames and a set of primitives. Perception tools are plain Python callables. Their outputs, including masks, depth maps, camera geometry, and trajectories, are ordinary Python variables.
The kernel exposes six public entry points. InputImages holds the sampled frames. Metadata carries frame rate, duration, and frame indices. tools exposes perception and geometry primitives. show() embeds an image into the agent’s next context. vlm dispatches queries to a separate VLM session. ReturnAnswer() submits the final answer.
Two perception tools are central. tools.Reconstruct wraps Depth Anything 3 and returns per-frame depth, camera intrinsics, extrinsics, and dense point maps. tools.SAM3 wraps SAM 3 and produces image or video masks from text, point, or box prompts. The framework adds lightweight utilities: tools.Geometry , tools.Mask , tools.Time , tools.Graph , and tools.Draw .
It is training-free. The same system prompt, tool set, and hyperparameters run across every benchmark and backbone.
The research team studied three action interfaces on the same question. Consider measuring the closest distance between a heater and a door.
SpatialClaw was tested on 20 benchmarks across five categories. These span single-image, multi-view, general, video and 4D, and general video understanding. It improves over the no-tool baseline on all six backbones tested. Backbones range from 26B to 397B parameters across the Qwen3.5/3.6 and Gemma4 families.
A controlled comparison isolates the interface. All three variants share the same toolset and prompt. Only the action interface differs.
Gemma4-31B backbone, 20-benchmark average.
Against prior spatial agents on the same Gemma4-31B backbone, the gap widens.
Source: MarkTechPost