Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

Jiaqi Liu1,3, Songning Lai2, Pengze Li3,4, Di Yu3,5, Wenjie Zhou6,9, Yiyang Zhou1, Peng Xia1, Zijun Wang7, Xi Chen4, Shixiang Tang3, Lei Bai3, Wanli Ouyang3,8, Mingyu Ding1, Huaxiu Yao1, Aoran Wang3

1UNC-Chapel Hill, 2The Hong Kong University of Science and Technology (Guangzhou), 3Shanghai Artificial Intelligence Laboratory, 4Fudan University, 5Tsinghua University, 6Nankai University, 7UC Santa Cruz, 8The Chinese University of Hong Kong, 9Shanghai Innovation Institute

Contact: jqliu@cs.unc.edu, wangaoran@pjlab.org.cn

VIPER-R1 Framework Overview

Abstract

Automated discovery of physical laws from observational data is a grand challenge in AI. Current methods, relying on symbolic regression or LLMs, are limited to uni-modal data and overlook the rich, visual phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the inherent spatio-temporal patterns within dynamic phenomena.

To address this gap, we propose VIPER-R1, a multimodal model for Visual Induction for Physics-based Equation Reasoning that discovers fundamental symbolic formulas.

The model is trained via a curriculum of Motion Structure Induction (MSI), using supervised fine-tuning to interpret kinematic phase portraits and construct hypotheses guided by a Causal Chain of Thought (C-CoT), followed by Reward-Guided Symbolic Calibration (RGSC) to purify the formula's structure with reinforcement learning. During inference, the trained VIPER acts as an agent: it first posits a high-confidence symbolic ansatz, then proactively invokes an external symbolic regression tool to perform Symbolic Residual Realignment (SR²). This final step, analogous to a physicist's perturbation analysis, reconciles the theoretical model with empirical data.

To support this research, we introduce PhysSymbol, a new 5,000-instance multimodal corpus. Experiments show that VIPER-R1 consistently outperforms state-of-the-art VLM baselines in accuracy and interpretability, enabling more precise discovery of physical laws.

Key Contributions

  • VIPER-R1, a multimodal framework that simulates the scientific reasoning process by integrating visual perception, trajectory data, and symbolic reasoning
  • A two-stage training pipeline featuring Motion Structure Induction (MSI) and Reward-Guided Symbolic Calibration (RGSC)
  • An agentic refinement stage, Symbolic Residual Realignment (SR²), where the VLM proactively utilizes external tools to harmonize theoretical hypotheses with empirical data
  • PhysSymbol, a new comprehensive benchmark of 5,000 multimodal instances for physics formula discovery

Methodology

VIPER-R1 Framework

Our framework consists of a comprehensive two-stage pipeline designed to emulate the cognitive workflow of physicists in discovering physical laws from visual observations.

1. Motion Structure Induction (MSI)

Step 1: Joint Induction

Joint generation of Causal Chain of Thought (C-CoT) and initial Symbolic Ansatz from visual evidence and trajectory data.

$$\mathcal{L}_{\text{MSI-1}} = -\mathbb{E}_{(E,Y)\sim\mathcal{D}_{\text{phys}}} \sum_{t=1}^{|Y|} \log \pi_\theta(y_t \mid E, y_{<t})$$

Step 2: C-CoT-Guided Formulation

Refines the symbolic formulation by conditioning on the ground-truth reasoning chain, focusing on the syntax and semantics of physical formalisms.

$$\mathcal{L}_{\text{MSI-2}} = -\mathbb{E}_{(E,C,S)\sim\mathcal{D}_{\text{phys}}} \sum_{t=1}^{|S|} \log \pi_\theta(s_t \mid E, C, s_{<t})$$
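Both MSI objectives are standard next-token negative log-likelihoods over the target span, conditioned on the evidence (and, for Step 2, the reasoning chain). A minimal sketch of the per-sequence loss, taking hypothetical per-token log-probabilities as input rather than running a full model forward pass:

```python
def msi_loss(token_logprobs, prompt_len):
    """Negative log-likelihood over the target tokens only.

    token_logprobs: log-probabilities the model assigns to each token of
                    the full sequence [prompt ... target]; the prompt is
                    the evidence E (MSI-1) or E plus the chain C (MSI-2).
    prompt_len:     number of prompt tokens excluded from the loss.
    """
    target = token_logprobs[prompt_len:]
    # L_MSI = -sum_t log pi_theta(y_t | context, y_<t)
    return -sum(target)
```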
2. Reward-Guided Symbolic Calibration (RGSC)

Uses Group Relative Policy Optimization (GRPO) to refine symbolic hypotheses with a composite reward:

Format Reward (Rformat)

Ensures adherence to predefined template structure

Structural Reward (Rstructural)

Parameter-agnostic Jaccard similarity for topological correctness

Accuracy Reward (Raccuracy)

Binary reward for exact symbolic matches

$$R(S_i) = w_f R_{\text{format}}(S_i) + w_s R_{\text{structural}}(S_i, S_{\text{GT}}) + w_a R_{\text{accuracy}}(S_i, S_{\text{GT}})$$
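A sketch of how the composite reward could be computed. The term-extraction normalization and the weights `w_f`, `w_s`, `w_a` below are illustrative stand-ins, not the paper's actual implementation:

```python
import re

def structural_terms(expr: str) -> set:
    """Parameter-agnostic term set: 'a = -k*x - c*v' -> {'k*x', 'c*v'}.
    A crude stand-in for the paper's canonical term extraction."""
    rhs = expr.split("=", 1)[-1]
    cleaned = set()
    # Split on top-level +/-, ignoring signs directly after '(' e.g. exp(-t).
    for term in re.split(r"(?<!\()[+-]", rhs):
        term = term.strip()
        if term:
            # Drop a leading numeric coefficient, e.g. '0.3*cos(w*t)'.
            cleaned.add(re.sub(r"^[\d.]+\*?", "", term))
    return cleaned

def reward(candidate, ground_truth, w_f=0.1, w_s=0.6, w_a=0.3,
           well_formed=True):
    s_c, s_gt = structural_terms(candidate), structural_terms(ground_truth)
    r_format = 1.0 if well_formed else 0.0            # template adherence
    union = s_c | s_gt                                # Jaccard similarity
    r_struct = len(s_c & s_gt) / len(union) if union else 1.0
    r_acc = 1.0 if candidate.replace(" ", "") == \
        ground_truth.replace(" ", "") else 0.0        # exact symbolic match
    return w_f * r_format + w_s * r_struct + w_a * r_acc
```

With these hypothetical weights, an exact match scores 1.0, while a candidate sharing one of two terms with the ground truth earns only the format reward plus a partial structural reward.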
3. Symbolic Residual Realignment (SR²)

Agentic refinement process where VIPER-R1 invokes external symbolic regression tools to correct residual errors:

Step 1: Compute the residual field: $r(t) = a_{\text{GT}}(t) - a_{\text{VLM}}(x, v, t)$
Step 2: Apply symbolic regression: $a_{\text{residual}}(x, v, t) \leftarrow \mathrm{SR}(x, v, t, r(t))$
Step 3: Realign the theory: $a_{\text{final}}(x, v, t) = a_{\text{VLM}}(x, v, t) + a_{\text{residual}}(x, v, t)$
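The three steps above can be sketched end to end. Here a least-squares fit over a small, illustrative library of candidate terms stands in for the external symbolic-regression tool; the function and library names are hypothetical:

```python
import numpy as np

def sr2_realign(t, x, v, a_gt, a_vlm):
    """Symbolic Residual Realignment (SR²) sketch.

    Fits the residual r(t) = a_GT - a_VLM with a fixed library of
    candidate terms via least squares -- a lightweight stand-in for the
    symbolic-regression engine the agent invokes.
    """
    r = a_gt - a_vlm                                  # Step 1: residual field
    # Step 2: regress r on candidate basis terms (library is illustrative)
    library = np.stack([x, v, x**3, x * np.cos(t), np.ones_like(t)], axis=1)
    coef, *_ = np.linalg.lstsq(library, r, rcond=None)
    a_residual = library @ coef
    return a_vlm + a_residual, coef                   # Step 3: realignment
```

For example, if the VLM's ansatz for a damped oscillator omits the damping term, the fit recovers it: the coefficient on the `v` column converges to the missing `-c`, and the realigned prediction matches the observed acceleration. In the paper, SR(·) returns an interpretable symbolic residual term rather than raw coefficients; the fit above only illustrates the data flow.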

Experimental Results

Main Results

VIPER-R1 significantly outperforms state-of-the-art VLMs on the PhysSymbol benchmark:

0.812 Structural Score

56.7% improvement over best baseline (Claude-4-Sonnet)

0.487 Accuracy Score

45.4% improvement over top zero-shot model

0.032 Post-SR² MSE

3× lower error than best baseline (0.091)

Main Performance Results

Performance Comparison

Bubble chart comparing performance across state-of-the-art VLMs

Ablation Study

Each component of our framework contributes significantly to the overall performance:

Ablation Study Results

Contribution of MSI and RGSC stages

Base Model (Qwen-VL-2.5)

Structural: 0.096 | Accuracy: 0.179

+ MSI (SFT only)

Structural: 0.554 | Accuracy: 0.399

+475% structural improvement

+ MSI + RGSC (Full Model)

Structural: 0.812 | Accuracy: 0.487

+746% total improvement

Case Study & Analysis

Additional Case Studies

  • Damped Oscillator: linear system with velocity damping, a = -kx - cv
  • Driven System: periodic external forcing, a = -kx + F₀cos(ωt)
  • Non-linear Potential: cubic restoring force, a = -kx - αx³
  • Multi-scale Dynamics: multiple time scales, a = -k₁x - k₂x cos(ωt)

PhysSymbol Dataset

We introduce PhysSymbol, a comprehensive benchmark containing 5,000 multimodal instances for physics formula discovery. Each instance comprises:

  • Kinematic phase portraits (velocity vs. position)
  • Trajectory plots (position vs. time)
  • Numerical motion data (t, x(t), v(t), a(t))
  • Ground-truth symbolic equations
  • Chain-of-Thought reasoning explanations
5,000 Total Instances
11 Physics Term Types
2 Visualization Types
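The numerical channels of an instance can be produced by integrating a candidate law. A minimal sketch for the damped oscillator a = -kx - cv; the field names are illustrative, not the released schema:

```python
import numpy as np

def make_instance(k=2.0, c=0.5, x0=1.0, v0=0.0, dt=0.01, steps=1000):
    """Integrate a = -k*x - c*v with semi-implicit Euler to produce the
    numerical motion data (t, x(t), v(t), a(t)) of one PhysSymbol-style
    instance, paired with its ground-truth symbolic equation."""
    t = np.arange(steps) * dt
    x, v, a = np.empty(steps), np.empty(steps), np.empty(steps)
    x[0], v[0] = x0, v0
    a[0] = -k * x[0] - c * v[0]
    for i in range(1, steps):
        v[i] = v[i - 1] + a[i - 1] * dt   # update velocity first
        x[i] = x[i - 1] + v[i] * dt       # then position (semi-implicit)
        a[i] = -k * x[i] - c * v[i]
    return {"t": t, "x": x, "v": v, "a": a,
            "equation": "a = -k*x - c*v",  # ground-truth symbolic form
            "params": {"k": k, "c": c}}
```

The phase portrait (v vs. x) and trajectory plot (x vs. t) of such an instance are then rendered from these arrays.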

Sample Data

  • Phase Space Analysis: phase-space trajectory plot
  • Time Series Analysis: time-series trajectory visualization
  • Combined Analysis: combined multimodal view

Download PhysSymbol Dataset

Dataset will be released upon paper acceptance

Citation

@article{liu2025VIPERr1,
  title={Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery},
  author={Liu, Jiaqi and Lai, Songning and Li, Pengze and Yu, Di and Zhou, Wenjie and Zhou, Yiyang and Xia, Peng and Wang, Zijun and Chen, Xi and Tang, Shixiang and Bai, Lei and Ouyang, Wanli and Ding, Mingyu and Yao, Huaxiu and Wang, Aoran},
  journal={arXiv preprint arXiv:2508.17380},
  year={2025}
}