Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

Jiaqi Liu1,3, Songning Lai2, Pengze Li3,4, Di Yu3,5, Wenjie Zhou6,9, Yiyang Zhou1, Peng Xia1, Zijun Wang7, Xi Chen4, Shixiang Tang3, Lei Bai3, Wanli Ouyang3,8, Mingyu Ding1, Huaxiu Yao1, Aoran Wang3

1UNC-Chapel Hill, 2The Hong Kong University of Science and Technology (Guangzhou), 3Shanghai Artificial Intelligence Laboratory, 4Fudan University, 5Tsinghua University, 6Nankai University, 7UC Santa Cruz, 8The Chinese University of Hong Kong, 9Shanghai Innovation Institute

Contact: jqliu@cs.unc.edu, wangaoran@pjlab.org.cn

VIPER-R1 Framework Overview

Abstract

Automated discovery of physical laws from observational data is a grand challenge in AI. Current methods, relying on symbolic regression or LLMs, are limited to uni-modal data and overlook the rich, visual phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the inherent spatio-temporal patterns within dynamic phenomena.

To address this gap, we propose VIPER-R1, a multimodal model for Visual Induction for Physics-based Equation Reasoning that discovers fundamental symbolic formulas.

The model is trained via a curriculum of Motion Structure Induction (MSI), using supervised fine-tuning to interpret kinematic phase portraits and construct hypotheses guided by a Causal Chain of Thought (C-CoT), followed by Reward-Guided Symbolic Calibration (RGSC) to purify the formula's structure with reinforcement learning. During inference, the trained VIPER acts as an agent: it first posits a high-confidence symbolic ansatz, then proactively invokes an external symbolic regression tool to perform Symbolic Residual Realignment (SR²). This final step, analogous to a physicist's perturbation analysis, reconciles the theoretical model with empirical data.

To support this research, we introduce PhysSymbol, a new 5,000-instance multimodal corpus. Experiments show that VIPER-R1 consistently outperforms state-of-the-art VLM baselines in accuracy and interpretability, enabling more precise discovery of physical laws.

Key Contributions

  • VIPER-R1, a multimodal framework that simulates the scientific reasoning process by integrating visual perception, trajectory data, and symbolic reasoning
  • A two-stage training pipeline featuring Motion Structure Induction (MSI) and Reward-Guided Symbolic Calibration (RGSC)
  • An agentic refinement stage, Symbolic Residual Realignment (SR²), where the VLM proactively utilizes external tools to harmonize theoretical hypotheses with empirical data
  • PhysSymbol, a new comprehensive benchmark of 5,000 multimodal instances for physics formula discovery

Methodology

VIPER-R1 Framework

Our framework consists of a comprehensive two-stage pipeline designed to emulate the cognitive workflow of physicists in discovering physical laws from visual observations.

1. Motion Structure Induction (MSI)

Step 1: Joint Induction

Joint generation of Causal Chain of Thought (C-CoT) and initial Symbolic Ansatz from visual evidence and trajectory data.

$$\mathcal{L}_{\text{MSI-1}} = -\mathbb{E}_{(E,Y)\sim\mathcal{D}_{\text{phys}}} \sum_{t=1}^{|Y|} \log \pi_\theta(y_t \mid E, y_{<t})$$

Step 2: C-CoT-Guided Formulation

Refines the symbolic formulation by conditioning on the ground-truth reasoning chain, focusing on the syntax and semantics of physical formalisms.

$$\mathcal{L}_{\text{MSI-2}} = -\mathbb{E}_{(E,C,S)\sim\mathcal{D}_{\text{phys}}} \sum_{t=1}^{|S|} \log \pi_\theta(s_t \mid E, C, s_{<t})$$
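Both MSI objectives are standard next-token negative log-likelihoods over the target span, conditioned on the evidence (and, for Step 2, the reasoning chain). A minimal sketch of the per-sequence loss, taking hypothetical per-token log-probabilities as input rather than running a full model forward pass:

```python
def msi_loss(token_logprobs, prompt_len):
    """Negative log-likelihood over the target tokens only.

    token_logprobs: log-probabilities the model assigns to each token of
                    the full sequence [prompt ... target]; the prompt is
                    the evidence E (MSI-1) or E plus the chain C (MSI-2).
    prompt_len:     number of prompt tokens excluded from the loss.
    """
    target = token_logprobs[prompt_len:]
    # L_MSI = -sum_t log pi_theta(y_t | context, y_<t)
    return -sum(target)
```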
2. Reward-Guided Symbolic Calibration (RGSC)

Uses Group Relative Policy Optimization (GRPO) to refine symbolic hypotheses with a composite reward:

Format Reward (Rformat)

Ensures adherence to predefined template structure

Structural Reward (Rstructural)

Parameter-agnostic Jaccard similarity for topological correctness

Accuracy Reward (Raccuracy)

Binary reward for exact symbolic matches

$$R(S_i) = w_f R_{\text{format}}(S_i) + w_s R_{\text{structural}}(S_i, S_{\text{GT}}) + w_a R_{\text{accuracy}}(S_i, S_{\text{GT}})$$
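A sketch of how the composite reward could be computed. The term-extraction normalization and the weights `w_f`, `w_s`, `w_a` below are illustrative stand-ins, not the paper's actual implementation:

```python
import re

def structural_terms(expr: str) -> set:
    """Parameter-agnostic term set: 'a = -k*x - c*v' -> {'k*x', 'c*v'}.
    A crude stand-in for the paper's canonical term extraction."""
    rhs = expr.split("=", 1)[-1]
    cleaned = set()
    # Split on top-level +/-, ignoring signs directly after '(' e.g. exp(-t).
    for term in re.split(r"(?<!\()[+-]", rhs):
        term = term.strip()
        if term:
            # Drop a leading numeric coefficient, e.g. '0.3*cos(w*t)'.
            cleaned.add(re.sub(r"^[\d.]+\*?", "", term))
    return cleaned

def reward(candidate, ground_truth, w_f=0.1, w_s=0.6, w_a=0.3,
           well_formed=True):
    s_c, s_gt = structural_terms(candidate), structural_terms(ground_truth)
    r_format = 1.0 if well_formed else 0.0            # template adherence
    union = s_c | s_gt                                # Jaccard similarity
    r_struct = len(s_c & s_gt) / len(union) if union else 1.0
    r_acc = 1.0 if candidate.replace(" ", "") == \
        ground_truth.replace(" ", "") else 0.0        # exact symbolic match
    return w_f * r_format + w_s * r_struct + w_a * r_acc
```

With these hypothetical weights, an exact match scores 1.0, while a candidate sharing one of two terms with the ground truth earns only the format reward plus a partial structural reward.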
3. Symbolic Residual Realignment (SR²)

Agentic refinement process where VIPER-R1 invokes external symbolic regression tools to correct residual errors:

Step 1: Compute the residual field: $r(t) = a_{\text{GT}}(t) - a_{\text{VLM}}(x, v, t)$
Step 2: Apply symbolic regression: $a_{\text{residual}}(x, v, t) \leftarrow \mathrm{SR}(x, v, t, r(t))$
Step 3: Realign the theory: $a_{\text{final}}(x, v, t) = a_{\text{VLM}}(x, v, t) + a_{\text{residual}}(x, v, t)$
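The three steps above can be sketched end to end. Here a least-squares fit over a small, illustrative library of candidate terms stands in for the external symbolic-regression tool; the function and library names are hypothetical:

```python
import numpy as np

def sr2_realign(t, x, v, a_gt, a_vlm):
    """Symbolic Residual Realignment (SR²) sketch.

    Fits the residual r(t) = a_GT - a_VLM with a fixed library of
    candidate terms via least squares -- a lightweight stand-in for the
    symbolic-regression engine the agent invokes.
    """
    r = a_gt - a_vlm                                  # Step 1: residual field
    # Step 2: regress r on candidate basis terms (library is illustrative)
    library = np.stack([x, v, x**3, x * np.cos(t), np.ones_like(t)], axis=1)
    coef, *_ = np.linalg.lstsq(library, r, rcond=None)
    a_residual = library @ coef
    return a_vlm + a_residual, coef                   # Step 3: realignment
```

For example, if the VLM's ansatz for a damped oscillator omits the damping term, the fit recovers it: the coefficient on the `v` column converges to the missing `-c`, and the realigned prediction matches the observed acceleration. In the paper, SR(·) returns an interpretable symbolic residual term rather than raw coefficients; the fit above only illustrates the data flow.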

Experimental Results

Main Results

VIPER-R1 significantly outperforms state-of-the-art VLMs on the PhysSymbol benchmark:

0.812 Structural Score

56.7% improvement over best baseline (Claude-4-Sonnet)

0.487 Accuracy Score

45.4% improvement over top zero-shot model

0.032 Post-SR² MSE

3× lower error than best baseline (0.091)

Main Performance Results

Performance Comparison

Bubble chart comparing performance across state-of-the-art VLMs

Ablation Study

Each component of our framework contributes significantly to the overall performance:

Ablation Study Results

Contribution of MSI and RGSC stages

Base Model (Qwen-VL-2.5)

Structural: 0.096 | Accuracy: 0.179

+ MSI (SFT only)

Structural: 0.554 | Accuracy: 0.399

+475% structural improvement

+ MSI + RGSC (Full Model)

Structural: 0.812 | Accuracy: 0.487

+746% total improvement

Case Study & Analysis

Additional Case Studies

  • Damped Oscillator: linear system with velocity damping, a = -kx - cv
  • Driven System: periodic external forcing, a = -kx + F₀cos(ωt)
  • Non-linear Potential: cubic restoring force, a = -kx - αx³
  • Multi-scale Dynamics: multiple time scales, a = -k₁x - k₂x cos(ωt)

PhysSymbol Dataset

We introduce PhysSymbol, a comprehensive benchmark containing 5,000 multimodal instances for physics formula discovery. Each instance comprises:

  • Kinematic phase portraits (velocity vs. position)
  • Trajectory plots (position vs. time)
  • Numerical motion data (t, x(t), v(t), a(t))
  • Ground-truth symbolic equations
  • Chain-of-Thought reasoning explanations
5,000 Total Instances
11 Physics Term Types
2 Visualization Types
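The numerical channels of an instance can be produced by integrating a candidate law. A minimal sketch for the damped oscillator a = -kx - cv; the field names are illustrative, not the released schema:

```python
import numpy as np

def make_instance(k=2.0, c=0.5, x0=1.0, v0=0.0, dt=0.01, steps=1000):
    """Integrate a = -k*x - c*v with semi-implicit Euler to produce the
    numerical motion data (t, x(t), v(t), a(t)) of one PhysSymbol-style
    instance, paired with its ground-truth symbolic equation."""
    t = np.arange(steps) * dt
    x, v, a = np.empty(steps), np.empty(steps), np.empty(steps)
    x[0], v[0] = x0, v0
    a[0] = -k * x[0] - c * v[0]
    for i in range(1, steps):
        v[i] = v[i - 1] + a[i - 1] * dt   # update velocity first
        x[i] = x[i - 1] + v[i] * dt       # then position (semi-implicit)
        a[i] = -k * x[i] - c * v[i]
    return {"t": t, "x": x, "v": v, "a": a,
            "equation": "a = -k*x - c*v",  # ground-truth symbolic form
            "params": {"k": k, "c": c}}
```

The phase portrait (v vs. x) and trajectory plot (x vs. t) of such an instance are then rendered from these arrays.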

Sample Data

  • Phase Space Analysis: phase-space trajectory plot
  • Time Series Analysis: time-series trajectory visualization
  • Combined Analysis: combined multimodal view

Download PhysSymbol Dataset

Dataset will be released upon paper acceptance

Citation

@article{liu2025VIPERr1,
  title={Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery},
  author={Liu, Jiaqi and Lai, Songning and Li, Pengze and Yu, Di and Zhou, Wenjie and Zhou, Yiyang and Xia, Peng and Wang, Zijun and Chen, Xi and Tang, Shixiang and Bai, Lei and Ouyang, Wanli and Ding, Mingyu and Yao, Huaxiu and Wang, Aoran},
  journal={arXiv preprint arXiv:2508.17380},
  year={2025}
}