Scaling Up and Failure Demo

Scaling trend analysis
Figure 5. Scaling trends under NoCoT and CoT settings across Level 1/2/3 in VidNum-1.4K.

InternVL3 generally improves as it scales from 8B to 78B parameters, but the gains are uneven across levels. The largest gains appear in Level 3 (compositional reasoning), especially under zero-shot CoT, while Level 2 remains comparatively stagnant.
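The per-level comparison described above can be sketched as a small helper. This is a minimal illustration, not the authors' evaluation code: the nested-dictionary layout, the key names, and both function names are assumptions made for the example, and no benchmark numbers are included.

```python
from typing import Dict, Tuple

# Hypothetical layout (an assumption for this sketch):
# accuracy[model_size][level] -> accuracy in percent.
Accuracy = Dict[str, Dict[str, float]]


def scaling_gains(accuracy: Accuracy, small: str, large: str) -> Dict[str, float]:
    """Return the per-level gain (large minus small) in percentage points."""
    # Only compare levels reported for both model sizes.
    shared_levels = accuracy[small].keys() & accuracy[large].keys()
    return {
        level: accuracy[large][level] - accuracy[small][level]
        for level in sorted(shared_levels)
    }


def most_and_least_improved(gains: Dict[str, float]) -> Tuple[str, str]:
    """Return the levels with the largest and the smallest gain."""
    best = max(gains, key=gains.get)
    worst = min(gains, key=gains.get)
    return best, worst
```

Calling `scaling_gains` once per prompting setting (NoCoT and CoT) and comparing the resulting dictionaries would make the "uneven by level" pattern explicit; which level ends up first or last depends entirely on the accuracies supplied.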

Failure demo analysis
Failure Demo. Visual-saliency trap: in compositional counting scenarios, smaller models can confuse visual volume with true cardinality (the actual number of distinct instances).

Taken together, the scaling trends and this failure case suggest that scaling helps high-level arithmetic reasoning more than foundational visual grounding (instance tracking and cross-shot re-identification), which remains a core bottleneck.
