1. 4DNeX: Feed-Forward 4D Generative Modeling Made Easy
Authors: Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu
Published: 2025-08-18
Source: arXiv
We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) We introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) We propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.
2. IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion
Authors: Wenhao Hu, Zesheng Li, Haonan Zhou, Liu Liu, Xuexiang Wen, Zhizhong Su, Xi Li, Gaoang Wang
Published: 2025-08-18
Source: arXiv
Reconstructing complete and interactive 3D scenes remains a fundamental challenge in computer vision and robotics, particularly due to persistent object occlusions and limited sensor coverage. Multiview observations from a single scene scan often fail to capture the full structural details. Existing approaches typically rely on multi-stage pipelines, such as segmentation, background completion, and inpainting, or require per-object dense scanning, both of which are error-prone and not easily scalable. We propose IGFuse, a novel framework that reconstructs interactive Gaussian scenes by fusing observations from multiple scans, where natural object rearrangement between captures reveals previously occluded regions. Our method constructs segmentation-aware Gaussian fields and enforces bi-directional photometric and semantic consistency across scans. To handle spatial misalignments, we introduce a pseudo-intermediate scene state for unified alignment, alongside collaborative co-pruning strategies to refine geometry. IGFuse enables high-fidelity rendering and object-level scene manipulation without dense observations or complex pipelines. Extensive experiments validate the framework's strong generalization to novel scene configurations, demonstrating its effectiveness for real-world 3D reconstruction and real-to-simulation transfer. Our project page is available online.
3. Driven-Dissipative Interpretation of Measurement-Induced State Transitions Beyond Semiclassical Predictions
Authors: Bo-Syun Pan, Yen-Hsiang Lin, Chiao-Hsuan Wang
Published: 2025-08-18
Source: arXiv
Dispersive readout plays a central role in superconducting quantum computing, enabling quantum nondemolition (QND) measurements of qubits through a coupled microwave resonator. However, under strong readout drives, multi-photon resonances can cause measurement-induced state transition (MIST), resulting in qubit leakage out of the computational subspace and compromising the QND character. We present a driven-dissipative interpretation of MIST using a reduced quantum model that captures the dynamics and entanglement structure underlying the breakdown of QND measurement, a feature inaccessible to previous semiclassical treatments. A super-MIST regime under strong drive is uncovered, characterized by steady-state qubit inversion and slow relaxation beyond the semiclassical Landau-Zener predictions. We further identify a transient readout condition in which the resonator becomes highly populated while the qubit remains near its original state. These results are broadly applicable to superconducting qubits such as fluxonium and transmon, unveil the nonequilibrium dynamics of MIST, and highlight strongly driven regimes that can be leveraged for measurement optimization.
4. MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
Authors: Haoyu He, Katrin Renz, Yong Cao, Andreas Geiger
Published: 2025-08-18
Source: arXiv
Masked diffusion language models (MDLMs), a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, MDLMs progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored in training as tokens are masked at random. Although this discrepancy between training and inference can lead to suboptimal performance, it has been largely overlooked by previous works, and closing the gap between the two stages remains an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose Masked Diffusion Policy Optimization (MDPO), a novel method that exploits the Markov property of the diffusion process and explicitly trains the model under the same progressive refinement schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained within the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot refine tokens flexibly. This simple yet effective training-free strategy, which we refer to as RCR, consistently improves performance and yields additional gains when combined with MDPO. Our findings highlight the potential of further investigating the discrepancy between training and inference in MDLMs. Code: https://github.com/autonomousvision/mdpo. Project Page: https://cli212.github.io/MDPO/.
5. Strain-induced Ettingshausen effect in spin-orbit coupled noncentrosymmetric metals
Authors: Gautham Varma K, Azaz Ahmad, Gargee Sharma
Published: 2025-08-18
Source: arXiv
Elastic deformations couple with electronic degrees of freedom in materials to generate gauge fields that lead to interesting transport properties. It has recently been well established that strain-induced chiral magnetic fields in Weyl semimetals lead to interesting magnetotransport induced by the chiral anomaly (CA). Recent studies have revealed that CA is not necessarily only a Weyl-node property, but is rather a Fermi surface property, and is also present in a more general class of materials, for example, in spin-orbit-coupled noncentrosymmetric metals (SOC-NCMs). The interplay of strain, CA, and charge and thermomagnetic transport in SOC-NCMs, however, remains unexplored. Here we resolve this gap. Using a tight-binding model for SOC-NCMs, we first demonstrate that strain in SOC-NCMs induces anisotropy in the spin-orbit coupling and generates an axial electric field. Then, using the quasi-classical Boltzmann transport formalism with momentum-dependent intraband and interband scattering processes, we show that strain in the presence of an external magnetic field can generate temperature gradients via the Nernst-Ettingshausen effect, whose direction and behavior depend on the interplay of multiple factors: the angle between the applied strain and magnetic field, the presence of the chiral anomaly, the Lorentz force, and the strength of interband scattering. We further reveal that time-reversal symmetry breaking in the presence of an external magnetic field generates the Berry-curvature-driven anomalous Ettingshausen effect, which is qualitatively distinct from the conventional Lorentz-force-driven counterpart. In light of recent and forthcoming theoretical and experimental advances in the field of SOC-NCMs, we find our study to be particularly timely and relevant.
6. Topological invariant for finite systems in the presence of disorder
Authors: Robert Eissele, Binayyak B. Roy, Sumanta Tewari, Tudor D. Stanescu
Published: 2025-08-18
Source: arXiv
Topological invariants, rigorously defined only in the thermodynamic limit, have been generalized to topological indicators applicable to finite-size disordered systems. However, in many experimentally relevant situations, such as semiconductor-superconductor (SM-SC) hybrid nanowires hosting Majorana zero modes, the interplay between strong disorder and finite-size effects renders these indicators (e.g., the so-called topological visibility) biased and ill-defined, significantly limiting their usefulness. In this paper, we propose the topological invariant rigorously defined for an infinite system constructed by periodically repeating the original finite disordered system, as a topological indicator. Using the one-dimensional SM-SC hybrid nanowire as an example, we show that this general and transparent approach yields faithful topological indicators free from the biases affecting commonly used finite-size indicators, capturing the nature (topological or trivial) of the phase at generic points in parameter space, and providing a reliable tool for interpreting experimental results.
7. Aligned Stellar Obliquities for Two Hot Jupiter-hosting M Dwarfs Revealed by MAROON-X: Implications for Hot Jupiter Formation
Authors: Drew Weisserman, Erik Gillis, Ryan Cloutier, Nina Brown, Jacob L. Bean, Andreas Seifahrt, Tanya Das, Madison Brady, Bertram Bitsch, Emily Deibert, Thomas M. Evans-Soma, Noah Fenlon, Laura Kreidberg, Michael Line, Ralph Pudritz, Evgenya L. Shkolnik, Luis Welbanks
Published: 2025-08-18
Source: arXiv
Hot Jupiters (HJs) are $2-3\times$ less common around early M dwarfs than around AFGK stars, suggesting that HJs may form and/or migrate via distinct pathways around different types of stars. One source of insight into HJ formation mechanisms is to trace their dynamical histories through measurements of host stellar obliquities via the Rossiter-McLaughlin (RM) effect. Here we present measurements of the RM effect for the HJs TOI-3714 b and TOI-5293 A b using the Gemini-North/MAROON-X spectrograph. Our measurements are just the second and third detections of the RM effect for hot Jupiters around M dwarfs (HJMD). We find that both systems are well-aligned, with sky-projected obliquities of $\lambda = 21^{+14}_{-11}\,^{\circ}$ and $-12^{+19}_{-14}\,^{\circ}$ and deprojected obliquities of $\psi = 26^{+11}_{-10}\,^{\circ}$ and $24^{+11}_{-10}\,^{\circ}$ for TOI-3714 and TOI-5293 A, respectively. Both stars are in wide binary systems. We refine the stellar parameters by decontaminating their unresolved $K_s$-band photometry and constrain the binary orbits using Gaia DR3 astrometry. We find that the minimum mutual inclination of the planet and binary companion in the TOI-5293 system is sufficiently large to drive Kozai-Lidov (KL) migration, while the result for TOI-3714 is inconclusive. We present a population-level analysis of HJs around AFGK versus early M dwarfs and argue that KL migration is more efficient around the latter, which is expected to produce misaligned stellar obliquities in HJMD systems in the absence of efficient tidal damping. The emerging population of well-aligned HJMD hosts supports the expectation that M dwarfs, with their deep convective envelopes, do efficiently dampen misaligned obliquities.
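For context, the deprojected values quoted above follow from the standard geometric relation between the sky-projected obliquity, the stellar spin inclination, and the orbital inclination (a textbook relation included here for reference, not a formula quoted from the paper):

```latex
% True (deprojected) obliquity psi in terms of the sky-projected obliquity lambda,
% the stellar spin inclination i_*, and the orbital inclination i_p:
\[
  \cos\psi \;=\; \cos i_{*}\,\cos i_{p} \;+\; \sin i_{*}\,\sin i_{p}\,\cos\lambda .
\]
```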
8. Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Authors: David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge
Published: 2025-08-18
Source: arXiv
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and improved scaling law error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. We also find that averaging the output of a model's intermediate checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.
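As a rough illustration of the two metrics, the sketch below computes a benchmark's signal-to-noise ratio from a hypothetical score matrix indexed by (model, checkpoint); the names and exact definitions here are assumptions for illustration and may differ from the paper's formulas.

```python
import numpy as np

def signal_to_noise(scores: np.ndarray) -> float:
    """Illustrative benchmark SNR; scores has shape (n_models, n_checkpoints).

    signal ~ spread of per-model mean scores (ability to separate models),
    noise  ~ typical variability of one model's score across nearby checkpoints.
    """
    per_model_mean = scores.mean(axis=1)                  # average out checkpoint noise
    signal = per_model_mean.max() - per_model_mean.min()  # dispersion across models
    noise = scores.std(axis=1, ddof=1).mean()             # mean within-model std
    return signal / noise

# toy usage: 5 models evaluated at 8 late-training checkpoints
rng = np.random.default_rng(0)
toy = rng.normal(loc=np.linspace(0.4, 0.7, 5)[:, None], scale=0.02, size=(5, 8))
print(f"SNR = {signal_to_noise(toy):.2f}")
```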
9. Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Authors: Ruofan Lu, Yichen Li, Yintong Huo
Published: 2025-08-18
Source: arXiv
Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. Through in-depth failure analysis, we develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self-diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.
10. Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
Authors: Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
Published: 2025-08-18
Source: arXiv
Multi-modal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, which are fundamental capabilities for achieving artificial general intelligence. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models stand on the path toward spatial intelligence. First, we propose a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and discuss the challenges in ensuring fair evaluation. We then evaluate state-of-the-art proprietary and open-source models on eight key benchmarks, at a cost exceeding one billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence, yet (2) still falls short of human performance across a broad spectrum of tasks. Moreover, we (3) identify the spatial intelligence problems that remain most challenging for multi-modal models, and (4) find that proprietary models do not exhibit a decisive advantage on the most difficult problems. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet stump even the most advanced multi-modal models.
11. The ALPINE-CRISTAL-JWST survey: spatially resolved star formation relations at $z\sim5$
Authors: C. Accard, M. Béthermin, M. Boquien, V. Buat, L. Vallini, F. Renaud, K. Kraljic, M. Aravena, P. Cassata, E. da Cunha, P. Dam, I. de Looze, M. Dessauges-Zavadsky, Y. Dubois, A. Faisst, Y. Fudamoto, M. Ginolfi, C. Gruppioni, S. Han, R. Herrera-Camus, H. Inami, A. M. Koekemoer, B. C. Lemaux, J. Li, Y. Li, B. Mobasher, J. Molina, A. Nanni, M. Palla, F. Pozzi, M. Relaño, M. Romano, P. Sawant, J. Spilker, A. Tsujita, E. Veraldi, V. Villanueva, W. Wang, S. K. Yi, G. Zamorani
Published: 2025-08-18
Source: arXiv
Star formation governs galaxy evolution, shaping stellar mass assembly and gas consumption across cosmic time. The Kennicutt-Schmidt (KS) relation, linking star formation rate (SFR) and gas surface densities, is fundamental to understanding star formation regulation, yet remains poorly constrained at $z > 2$ due to observational limitations and uncertainties in locally calibrated gas tracers. The [CII] $158 {\rm \mu m}$ line has recently emerged as a key probe of the cold ISM and star formation in the early Universe. We investigate whether the resolved [CII]-SFR and KS relations established at low redshift remain valid at $4 < z < 6$ by analysing 13 main-sequence galaxies from the ALPINE and CRISTAL surveys, using multi-wavelength data (HST, JWST, ALMA) at $\sim2$ kpc resolution. We perform pixel-by-pixel spectral energy distribution (SED) modelling with CIGALE on resolution-homogenised images. We develop a statistical framework to fit the [CII]-SFR relation that accounts for pixel covariance and compare our results to classical fitting methods. We test two [CII]-to-gas conversion prescriptions to assess their impact on inferred gas surface densities and depletion times. We find a resolved [CII]-SFR relation with a slope of $0.87 \pm 0.15$ and intrinsic scatter of $0.19 \pm 0.03$ dex, which is shallower and tighter than found in previous studies at $z\sim5$. The resolved KS relation is highly sensitive to the [CII]-to-gas conversion factor: using a fixed global $\alpha_{\rm [CII]}$ yields depletion times of $0.5$-$1$ Gyr, while a surface brightness-dependent $W_{\rm [CII]}$ prescription places some galaxies with high gas density in the starburst regime ($<0.1$ Gyr). Future inputs from both simulations and observations are required to better understand how the [CII]-to-gas conversion factor depends on local ISM properties. We need to break this fundamental limit to properly study the KS relation at $z\gtrsim4$.
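For reference, the relations discussed above are usually written in the following power-law form, with the gas surface density obtained from [CII] through a conversion factor (generic notation, not quoted from the paper):

```latex
% Kennicutt-Schmidt relation, gas depletion time, and a [CII]-to-gas conversion:
\[
  \Sigma_{\rm SFR} \;\propto\; \Sigma_{\rm gas}^{\,N}, \qquad
  t_{\rm dep} \;\equiv\; \frac{\Sigma_{\rm gas}}{\Sigma_{\rm SFR}}, \qquad
  \Sigma_{\rm gas} \;=\; \alpha_{\rm [CII]}\,\Sigma_{\rm [CII]} .
\]
```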
12. Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries
Authors: Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar
Published: 2025-08-18
Source: arXiv
Abstractive summarization is a core application in contact centers, where Large Language Models (LLMs) generate millions of summaries of call transcripts daily. Despite their apparent quality, it remains unclear whether LLMs systematically under- or over-attend to specific aspects of the transcript, potentially introducing biases in the generated summary. While prior work has examined social and positional biases, the specific forms of bias pertinent to contact center operations - which we term Operational Bias - have remained unexplored. To address this gap, we introduce BlindSpot, a framework built upon a taxonomy of 15 operational bias dimensions (e.g., disfluency, speaker, topic) for the identification and quantification of these biases. BlindSpot leverages an LLM as a zero-shot classifier to derive categorical distributions for each bias dimension from a transcript and its summary. The bias is then quantified using two metrics: Fidelity Gap (the JS Divergence between distributions) and Coverage (the percentage of source labels omitted). Using BlindSpot, we conducted an empirical study with 2500 real call transcripts and their summaries generated by 20 LLMs of varying scales and families (e.g., GPT, Llama, Claude). Our analysis reveals that biases are systemic and present across all evaluated models, regardless of size or family.
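To make the two metrics concrete, the sketch below computes a Jensen-Shannon-based fidelity gap and a coverage loss for one bias dimension; the function names and inputs are hypothetical, and the paper's exact normalization may differ.

```python
import numpy as np
from collections import Counter

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence between two categorical distributions (base 2)."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fidelity_gap_and_coverage(source_labels, summary_labels, vocab):
    """Illustrative metrics for one bias dimension; labels come from a zero-shot
    LLM classifier applied to the transcript and to its summary (hypothetical)."""
    src, summ = Counter(source_labels), Counter(summary_labels)
    p = np.array([src[v] for v in vocab], dtype=float)
    q = np.array([summ[v] for v in vocab], dtype=float) + 1e-12  # avoid an empty distribution
    gap = js_divergence(p, q)
    present_in_source = [v for v in vocab if src[v] > 0]
    omitted = [v for v in present_in_source if summ[v] == 0]
    coverage_loss = len(omitted) / max(1, len(present_in_source))
    return gap, coverage_loss

print(fidelity_gap_and_coverage(
    ["billing", "billing", "retention"], ["billing"],
    vocab=["billing", "retention", "tech_support"]))
```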
13. Rare event sampling for moving targets: extremes of temperature and daily precipitation in a general circulation model
Authors: Justin Finkel, Paul A. O'Gorman
Published: 2025-08-18
Source: arXiv
Extreme weather events epitomize high cost: to society through their physical impacts, and to the computer servers used to simulate them to provide information to mitigate those impacts. It takes hundreds of simulated years to sample a few once-per-century events with straightforward model integration, but that cost can be much reduced with rare event sampling, which nudges ensembles of simulations to convert moderate events to severe ones, e.g., by steering a cyclone directly through a region of interest. With proper statistical accounting, rare event algorithms can provide quantitative climate risk assessment at reduced cost. But this can only work if ensemble members diverge fast enough. Sudden, transient events characteristic of Earth's midlatitude storm track regions, such as heavy precipitation and heat extremes, pose a particular challenge because they come and go faster than an ensemble can explore the possibilities. Here we extend standard rare event algorithms to handle this challenging case in an idealized atmospheric general circulation model, achieving a 5-10 times speed-up in estimating long return periods, such as 100-150 years, from only 20 years of simulation for extremes of daily precipitation and surface temperature. The algorithm, called TEAMS ("trying-early adaptive multilevel splitting"), was developed previously in Finkel and O'Gorman (2024) using a toy chaotic system, and relies on a key parameter -- the advance split time -- which may be estimated based on simple diagnostics of ensemble dispersion rates. The results are promising for accelerated risk assessment across a wide range of physical hazards using more realistic and complex models with acute computational constraints.
14. Activity in White Dwarf Debris Disks I: Spitzer Legacy Reveals Variability Incompatible with the Canonical Model
Authors: Hiba Tu Noor, Jay Farihi, Scott J. Kenyon, Roman R. Rafikov, Mark C. Wyatt, Kate Y. L. Su, Carl Melis, Andrew Swan, Thomas G. Wilson, Boris T. Gänsicke, Amy Bonsor, Laura K. Rogers, Seth Redfield, Mukremin Kilic
Published: 2025-08-18
Source: arXiv
This study presents all available, multi-epoch 3.6 and 4.5 $\mu$m photometry from Spitzer Space Telescope observations of white dwarf debris disks, including weekly cadence observations of 16 relatively bright systems, and 5 h staring-mode observations for five of these. Significant variability is detected in 85 per cent of disks and across all timescales probed, from minutes to weeks to years, where the largest flux changes correlate with the longest time baselines, and the infrared excesses persist throughout. While each source is idiosyncratic, the overall results indicate that the most variable disks are among the brightest (dustiest) and among those with detected gas, demonstrating that both dust and gas are produced via ongoing collisions. There is a correlation between flux and colour changes, where disks tend to appear redder when dimmer and bluer when brighter, consistent with an excess of small dust grains produced in collisions, followed by a gradual return to equilibrium. The overall results are a drastic departure from the predictions of the canonical - geometrically thin, optically thick - disk in both flux and colour, but are broadly consistent with collisional evolution based on a simple model. The data presented herein constitute a legacy resource that can inform time-series studies of polluted and dusty white dwarfs, and importantly serve as a basis for future disk modelling, beyond the pioneering canonical framework.
15. AutoBnB-RAG: Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation
Authors: Zefang Liu, Arman Anwar
Published: 2025-08-18
Source: arXiv
Incident response (IR) requires fast, coordinated, and well-informed decision-making to contain and mitigate cyber threats. While large language models (LLMs) have shown promise as autonomous agents in simulated IR settings, their reasoning is often limited by a lack of access to external knowledge. In this work, we present AutoBnB-RAG, an extension of the AutoBnB framework that incorporates retrieval-augmented generation (RAG) into multi-agent incident response simulations. Built on the Backdoors & Breaches (B&B) tabletop game environment, AutoBnB-RAG enables agents to issue retrieval queries and incorporate external evidence during collaborative investigations. We introduce two retrieval settings: one grounded in curated technical documentation (RAG-Wiki), and another using narrative-style incident reports (RAG-News). We evaluate performance across eight team structures, including newly introduced argumentative configurations designed to promote critical reasoning. To validate practical utility, we also simulate real-world cyber incidents based on public breach reports, demonstrating AutoBnB-RAG's ability to reconstruct complex multi-stage attacks. Our results show that retrieval augmentation improves decision quality and success rates across diverse organizational models. This work demonstrates the value of integrating retrieval mechanisms into LLM-based multi-agent systems for cybersecurity decision-making.
16. Choosing the Right Engine in the Virtual Reality Landscape
Authors: Santiago Berrezueta-Guzman, Stefan Wagner
Published: 2025-08-18
Source: arXiv
Virtual reality (VR) development relies on game engines to provide real-time rendering, physics simulation, and interaction systems. Among the most widely used game engines, Unreal Engine and Unity dominate the industry, offering distinct advantages in graphics rendering, performance optimization, usability, resource requirements, and scalability. This study presents a comprehensive comparative analysis of both engines, evaluating their capabilities and trade-offs through empirical assessments and real-world case studies of large-scale VR projects. The findings highlight key factors such as rendering fidelity, computational efficiency, cross-platform compatibility, and development workflows. These provide practical insights for selecting the most suitable engine based on project-specific needs. Furthermore, emerging trends in artificial intelligence (AI)-driven enhancements, including Deep Learning Super Sampling (DLSS) and large language models (LLMs), are explored to assess their impact on VR development workflows. By aligning engine capabilities with technical and creative requirements, developers can overcome performance bottlenecks, enhance immersion, and streamline optimization techniques. This study serves as a valuable resource for VR developers, researchers, and industry professionals, offering data-driven recommendations to navigate the evolving landscape of VR technology.
17. Contrastive Representations for Temporal Reasoning
Authors: Alicja Ziarko, Michal Bortkiewicz, Michal Zawalski, Benjamin Eysenbach, Piotr Milos
Published: 2025-08-18
Source: arXiv
In classical AI, perception relies on learning state-based representations, while planning, which can be thought of as temporal reasoning over action sequences, is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Combinatorial Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik's Cube. In particular, for the Rubik's Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS, though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.
18. Causally-Guided Pairwise Transformer -- Towards Foundational Digital Twins in Process Industry
Authors: Michael Mayr, Georgios C. Chasparis
Published: 2025-08-18
Source: arXiv
Foundational modelling of multi-dimensional time-series data in industrial systems presents a central trade-off: channel-dependent (CD) models capture specific cross-variable dynamics but lack robustness and adaptability as model layers are commonly bound to the data dimensionality of the tackled use-case, while channel-independent (CI) models offer generality at the cost of modelling the explicit interactions crucial for system-level predictive regression tasks. To resolve this, we propose the Causally-Guided Pairwise Transformer (CGPT), a novel architecture that integrates a known causal graph as an inductive bias. The core of CGPT is built around a pairwise modeling paradigm, tackling the CD/CI conflict by decomposing the multidimensional data into pairs. The model uses channel-agnostic learnable layers where all parameter dimensions are independent of the number of variables. CGPT enforces a CD information flow at the pair-level and CI-like generalization across pairs. This approach disentangles complex system dynamics and results in a highly flexible architecture that ensures scalability and any-variate adaptability. We validate CGPT on a suite of synthetic and real-world industrial datasets on long-term and one-step forecasting tasks designed to simulate common industrial complexities. Results demonstrate that CGPT significantly outperforms both CI and CD baselines in predictive accuracy and shows competitive performance with end-to-end trained CD models while remaining agnostic to the problem dimensionality.
19. All for law and law for all: Adaptive RAG Pipeline for Legal Research
Authors: Figarri Keisha, Prince Singh, Pallavi, Dion Fernandes, Aravindh Manivannan, Ilham Wicaksono, Faisal Ahmad
Published: 2025-08-18
Source: arXiv
Retrieval-Augmented Generation (RAG) mitigates hallucinations by grounding large language model outputs in cited sources, a capability that is especially critical in the legal domain. We present an end-to-end RAG pipeline that revisits and extends the LegalBenchRAG baseline with three targeted enhancements: (i) a context-aware query translator that disentangles document references from natural-language questions and adapts retrieval depth and response style based on expertise and specificity, (ii) open-source retrieval strategies using SBERT and GTE embeddings that achieve substantial performance gains (improving Recall@K by 30-95% and Precision@K by $\sim$2.5$\times$ for $K>4$) while remaining cost-efficient, and (iii) a comprehensive evaluation and generation framework that combines RAGAS, BERTScore-F1, and ROUGE-Recall to assess semantic alignment and faithfulness across models and prompt designs. Our results show that carefully designed open-source pipelines can rival or outperform proprietary approaches in retrieval quality, while a custom legal-grounded prompt consistently produces more faithful and contextually relevant answers than baseline prompting. Taken together, these contributions demonstrate the potential of task-aware, component-level tuning to deliver legally grounded, reproducible, and cost-effective RAG systems for legal research assistance.
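For readers less familiar with the retrieval metrics mentioned above, the snippet below shows the standard Recall@K and Precision@K computation on a single query (illustrative only; dataset-level numbers are averages over queries).

```python
from typing import List, Set

def recall_precision_at_k(retrieved: List[str], relevant: Set[str], k: int):
    """Standard retrieval metrics: retrieved is a ranked list of passage ids,
    relevant is the set of gold passage ids for the query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    recall = hits / max(1, len(relevant))   # fraction of gold passages recovered
    precision = hits / max(1, len(top_k))   # fraction of retrieved passages that are gold
    return recall, precision

print(recall_precision_at_k(["d3", "d7", "d1", "d9"], {"d1", "d2"}, k=4))  # (0.5, 0.25)
```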
20. Precise Action-to-Video Generation Through Visual Action Prompts
Authors: Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
Published: 2025-08-18
Source: arXiv
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
21. Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy
Authors: Tianyi Zhang, Haonan Duan, Haoran Hao, Yu Qiao, Jifeng Dai, Zhi Hou
Published: 2025-08-18
Source: arXiv
Vision-Language-Action (VLA) models frequently encounter challenges in generalizing to real-world environments due to inherent discrepancies between observation and action spaces. Although training data are collected from diverse camera perspectives, the models typically predict end-effector poses within the robot base coordinate frame, resulting in spatial inconsistencies. To mitigate this limitation, we introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera observation space. Leveraging the camera's extrinsic calibration matrix, OC-VLA transforms end-effector poses from the robot base coordinate system into the camera coordinate system, thereby unifying prediction targets across heterogeneous viewpoints. This lightweight, plug-and-play strategy ensures robust alignment between perception and action, substantially improving model resilience to camera viewpoint variations. The proposed approach is readily compatible with existing VLA architectures, requiring no substantial modifications. Comprehensive evaluations on both simulated and real-world robotic manipulation tasks demonstrate that OC-VLA accelerates convergence, enhances task success rates, and improves cross-view generalization. The code will be publicly available.
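The core coordinate change is a single rigid-body transform; a minimal numpy sketch is shown below, assuming the extrinsic calibration maps the base frame to the camera frame (the paper's exact convention and pose parameterization may differ).

```python
import numpy as np

def base_pose_to_camera_frame(T_base_ee: np.ndarray, T_cam_base: np.ndarray) -> np.ndarray:
    """Re-express a 4x4 homogeneous end-effector pose (base frame) in the camera frame."""
    return T_cam_base @ T_base_ee

# toy example with a hypothetical extrinsic calibration
T_cam_base = np.eye(4)
T_cam_base[:3, 3] = [0.0, -0.2, 0.8]   # camera position relative to base (made up)
T_base_ee = np.eye(4)
T_base_ee[:3, 3] = [0.4, 0.1, 0.3]     # end-effector position in the base frame
print(base_pose_to_camera_frame(T_base_ee, T_cam_base))
```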
22. Real-Time Beach Litter Detection and Counting: A Comparative Analysis of RT-DETR Model Variants
Authors: Miftahul Huda, Arsyiah Azahra, Putri Maulida Chairani, Dimas Rizky Ramadhani, Nabila Azhari, Ade Lailani
Published: 2025-08-18
Source: arXiv
Coastal pollution is a pressing global environmental issue, necessitating scalable and automated solutions for monitoring and management. This study investigates the efficacy of the Real-Time Detection Transformer (RT-DETR), a state-of-the-art, end-to-end object detection model, for the automated detection and counting of beach litter. A rigorous comparative analysis is conducted between two model variants, RT-DETR-Large (RT-DETR-L) and RT-DETR-Extra-Large (RT-DETR-X), trained on a publicly available dataset of coastal debris. The evaluation reveals that the RT-DETR-X model achieves marginally superior accuracy, with a mean Average Precision at 50% IoU (mAP@50) of 0.816 and a mAP@50-95 of 0.612, compared to the RT-DETR-L model's 0.810 and 0.606, respectively. However, this minor performance gain is realized at a significant computational cost; the RT-DETR-L model demonstrates a substantially faster inference time of 20.1 ms versus 34.5 ms for the RT-DETR-X. The findings suggest that the RT-DETR-L model offers a more practical and efficient solution for real-time, in-field deployment due to its superior balance of processing speed and detection accuracy. This research provides valuable insights into the application of advanced Transformer-based detectors for environmental conservation, highlighting the critical trade-offs between model complexity and operational viability.
23. A Perfectly Truthful Calibration Measure
Authors: Jason Hartline, Lunjia Hu, Yifan Wu
Published: 2025-08-18
Source: arXiv
Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. Calibration measures quantify how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Although predicting the true probabilities guarantees perfect calibration, in reality, when calibration is evaluated on a finite sample, predicting the truth is not guaranteed to minimize any known calibration measure. All known calibration measures incentivize predictors to lie in order to appear more calibrated on a finite sample. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a perfectly truthful calibration measure in the batch setting: averaged two-bin calibration error (ATB). In addition to being truthful, ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal). The simplicity in our definition of ATB makes it efficient and straightforward to compute. ATB allows faster estimation algorithms with significantly easier implementations than smCal and distCal, achieving improved running time and simplicity for the calibration testing problem studied by Hu et al. (2024). We also introduce a general recipe for constructing truthful measures, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.
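The abstract does not spell out ATB's formula, so purely for intuition the sketch below computes a generic two-bin calibration error (split predictions at a threshold and compare the mean prediction to the empirical frequency in each bin); ATB presumably averages a quantity of this flavor over bin choices, but the exact, truthful definition is in the paper.

```python
import numpy as np

def two_bin_calibration_error(preds: np.ndarray, labels: np.ndarray, threshold: float) -> float:
    """Generic two-bin calibration error (illustrative, not the paper's ATB)."""
    err = 0.0
    for in_bin in (preds <= threshold, preds > threshold):
        if in_bin.any():
            weight = in_bin.mean()   # fraction of samples in the bin
            err += weight * abs(preds[in_bin].mean() - labels[in_bin].mean())
    return err

rng = np.random.default_rng(1)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)   # outcomes drawn from the predicted probabilities
print(two_bin_calibration_error(p, y, threshold=0.5))  # small for a well-calibrated predictor
```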
24. Outlier Detection of Poisson-Distributed Targets Using a Seabed Sensor Network
Authors: Mingyu Kim, Daniel Stilwell, Jorge Jimenez
Published: 2025-08-18
Source: arXiv
This paper presents a framework for classifying and detecting spatial commission outliers in maritime environments using seabed acoustic sensor networks and log Gaussian Cox processes (LGCPs). By modeling target arrivals as a mixture of normal and outlier processes, we estimate the probability that a newly observed event is an outlier. We propose a second-order approximation of this probability that incorporates both the mean and variance of the normal intensity function, providing improved classification accuracy compared to mean-only approaches. We analytically show that our method yields a tighter bound to the true probability using Jensen's inequality. To enhance detection, we integrate a real-time, near-optimal sensor placement strategy that dynamically adjusts sensor locations based on the evolving outlier intensity. The proposed framework is validated using real ship traffic data near Norfolk, Virginia, where numerical results demonstrate the effectiveness of our approach in improving both classification performance and outlier detection through sensor deployment.
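One plausible reading of the mean-and-variance ("second-order") approximation described above is a Taylor expansion of the outlier posterior around the mean normal intensity; the sketch below illustrates that reading under stated assumptions and is not the paper's exact formula.

```python
def outlier_prob_second_order(lam_out: float, mu_norm: float, var_norm: float) -> float:
    """Second-order Taylor approximation of P(event is an outlier).

    Treats the normal-process intensity lam as random (as under an LGCP) with mean
    mu_norm and variance var_norm, and expands f(lam) = lam_out / (lam + lam_out):
        E[f(lam)] ~ f(mu) + 0.5 * f''(mu) * var.
    Because f is convex, the correction raises the mean-only estimate (cf. Jensen).
    """
    mean_only = lam_out / (mu_norm + lam_out)
    correction = lam_out * var_norm / (mu_norm + lam_out) ** 3   # 0.5 * f''(mu) * var
    return mean_only + correction

print(outlier_prob_second_order(lam_out=0.2, mu_norm=1.5, var_norm=0.6))
```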
25. Denoising diffusion models for inverse design of inflatable structures with programmable deformations
Authors: Sara Karimi, Nikolaos N. Vlassis
Published: 2025-08-18
Source: arXiv
Programmable structures are systems whose undeformed geometries and material property distributions are deliberately designed to achieve prescribed deformed configurations under specific loading conditions. Inflatable structures are a prominent example, using internal pressurization to realize large, nonlinear deformations in applications ranging from soft robotics and deployable aerospace systems to biomedical devices and adaptive architecture. We present a generative design framework based on denoising diffusion probabilistic models (DDPMs) for the inverse design of elastic structures undergoing large, nonlinear deformations under pressure-driven actuation. The method formulates the inverse design as a conditional generation task, using geometric descriptors of target deformed states as inputs and outputting image-based representations of the undeformed configuration. Representing these configurations as simple images is achieved through a pre- and postprocessing pipeline comprising fixed image processing, simulation setup, and descriptor extraction steps. Numerical experiments with scalar and higher-dimensional descriptors show that the framework can quickly produce diverse undeformed configurations that achieve the desired deformations when inflated, enabling parallel exploration of viable design candidates while accommodating complex constraints.
26. Hybrid Deep Reconstruction for Vignetting-Free Upconversion Imaging through Scattering in ENZ Materials
Authors: Hao Zhang, Yang Xu, Wenwen Zhang, Saumya Choudhary, M. Zahirul Alam, Long D. Nguyen, Matthew Klein, Shivashankar Vangala, J. Keith Miller, Eric G. Johnson, Joshua R. Hendrickson, Robert W. Boyd, Sergio Carbajo
Published: 2025-08-18
Source: arXiv
Optical imaging through turbid or heterogeneous environments (collectively referred to as complex media) is fundamentally challenged by scattering, which scrambles structured spatial and phase information. To address this, we propose a hybrid-supervised deep learning framework to reconstruct high-fidelity images from nonlinear scattering measurements acquired with a time-gated epsilon-near-zero (ENZ) imaging system. The system leverages four-wave mixing (FWM) in subwavelength indium tin oxide (ITO) films to temporally isolate ballistic photons, thus rejecting multiply scattered light and enhancing contrast. To recover structured features from these signals, we introduce DeepTimeGate, a U-Net-based supervised model that performs initial reconstruction, followed by a Deep Image Prior (DIP) refinement stage using self-supervised learning. Our approach demonstrates strong performance across different imaging scenarios, including binary resolution patterns and complex vortex-phase masks, under varied scattering conditions. Compared to raw scattering inputs, it boosts average PSNR by 124%, SSIM by 231%, and achieves a 10 times improvement in intersection-over-union (IoU). Beyond enhancing fidelity, our method removes the vignetting effect and expands the effective field-of-view compared to the ENZ-based optical time gate output. These results suggest broad applicability in biomedical imaging, in-solution diagnostics, and other scenarios where conventional optical imaging fails due to scattering.
27. VerilogLAVD: LLM-Aided Rule Generation for Vulnerability Detection in Verilog
Authors: Xiang Long, Yingjie Xia, Xiyuan Chen, Li Kuang
Published: 2025-08-18
Source: arXiv
Timely detection of hardware vulnerabilities during the early design stage is critical for reducing remediation costs. Existing early detection techniques often require specialized security expertise, limiting their usability. Recent efforts have explored the use of large language models (LLMs) for Verilog vulnerability detection. However, LLMs struggle to capture the structure in Verilog code, resulting in inconsistent detection results. To this end, we propose VerilogLAVD, the first LLM-aided graph-traversal rule generation approach for Verilog vulnerability detection. Our approach introduces the Verilog Property Graph (VeriPG), a unified representation of Verilog code. It combines syntactic features extracted from the abstract syntax tree (AST) with semantic information derived from control flow and data dependency graphs. We leverage LLMs to generate VeriPG-based detection rules from Common Weakness Enumeration (CWE) descriptions. These rules guide a rule executor that traverses VeriPG to find potential vulnerabilities. To evaluate VerilogLAVD, we build a dataset collected from open-source repositories and synthesized data. In our empirical evaluation on 77 Verilog designs encompassing 12 CWE types, VerilogLAVD achieves an F1-score of 0.54. Compared to the LLM-only and LLM-with-external-knowledge baselines, VerilogLAVD improves the F1-score by 0.31 and 0.27, respectively.
28. DMS: Diffusion-Based Multi-Baseline Stereo Generation for Improving Self-Supervised Depth Estimation
Authors: Zihua Liu, Yizhou Li, Songyan Zhang, Masatoshi Okutomi
Published: 2025-08-18
Source: arXiv
While supervised stereo matching and monocular depth estimation have advanced significantly with learning-based algorithms, self-supervised methods using stereo images as supervision signals have received relatively less focus and require further investigation. A primary challenge arises from ambiguity introduced during photometric reconstruction, particularly due to missing corresponding pixels in ill-posed regions of the target view, such as occlusions and out-of-frame areas. To address this and establish explicit photometric correspondences, we propose DMS, a model-agnostic approach that utilizes geometric priors from diffusion models to synthesize novel views along the epipolar direction, guided by directional prompts. Specifically, we finetune a Stable Diffusion model to simulate perspectives at key positions: a left-left view shifted from the left camera and a right-right view shifted from the right camera, along with an additional novel view between the left and right cameras. These synthesized views supplement occluded pixels, enabling explicit photometric reconstruction. Our proposed DMS is a cost-free, "plug-and-play" method that seamlessly enhances self-supervised stereo matching and monocular depth estimation, and relies solely on unlabeled stereo image pairs for both training and synthesis. Extensive experiments demonstrate the effectiveness of our approach, with up to 35% outlier reduction and state-of-the-art performance across multiple benchmark datasets.
29. Exploiting Convexity of Neural Networks in Dynamic Operating Envelope Optimization for Distributed Energy Resources
Authors: Hongyi Li, Liming Liu, Yunyi Li, Zhaoyu Wang
Published: 2025-08-18
Source: arXiv
The increasing penetration of distributed energy resources (DERs) brings opportunities and challenges to the operation of distribution systems. To ensure network integrity, dynamic operating envelopes (DOEs) are issued by utilities to DERs as their time-varying export/import power limits. Due to the non-convex nature of power flow equations, the optimization of DOEs faces a dilemma of solution accuracy and computation efficiency. To bridge this gap, in this paper, we facilitate DOE optimization by exploiting the convexity of input convex neural networks (ICNNs). A DOE optimization model is first presented, comprehensively considering multiple operational constraints. We propose a constraint embedding method that allows us to replace the non-convex power flow constraints with trained ICNN models and convexify the problem. To further speed up DOE optimization, we propose a linear relaxation of the ICNN-based DOE optimization problem, for which the tightness is theoretically proven. The effectiveness of the proposed method is validated with numerical case studies. Results show that the proposed ICNN-based method outperforms other benchmark methods in optimizing DOEs in terms of both solution quality and solution time.
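For readers unfamiliar with input convex neural networks, the sketch below shows the standard construction that makes the output convex in the input (non-negative hidden-to-hidden weights plus a convex, non-decreasing activation); it is a generic ICNN in PyTorch, not the paper's exact architecture or constraint-embedding scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Minimal input convex neural network: the output is convex in x because the
    hidden-to-hidden weights are clamped to be non-negative and ReLU is convex
    and non-decreasing. Illustrative only."""
    def __init__(self, dim_in: int, dim_hidden: int = 64, n_layers: int = 3):
        super().__init__()
        self.x_layers = nn.ModuleList(
            [nn.Linear(dim_in, dim_hidden) for _ in range(n_layers)] + [nn.Linear(dim_in, 1)])
        self.z_layers = nn.ModuleList(
            [nn.Linear(dim_hidden, dim_hidden, bias=False) for _ in range(n_layers - 1)]
            + [nn.Linear(dim_hidden, 1, bias=False)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = F.relu(self.x_layers[0](x))
        for x_lin, z_lin in zip(self.x_layers[1:-1], self.z_layers[:-1]):
            z = F.relu(x_lin(x) + F.linear(z, z_lin.weight.clamp(min=0)))  # non-negative weights
        return self.x_layers[-1](x) + F.linear(z, self.z_layers[-1].weight.clamp(min=0))

model = ICNN(dim_in=4)
out = model(torch.randn(8, 4))   # e.g. a convex surrogate for a voltage-violation measure
print(out.shape)                 # torch.Size([8, 1])
```

Because such a surrogate is convex in its inputs, embedding it as a constraint keeps the resulting optimization problem convex, which is the property the abstract describes exploiting.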
30. Seeing the Many: Exploring Parameter Distributions Conditioned on Features in Surrogates
Authors: Xiaohan Wang, Zhimin Li, Joshua A. Levine, Matthew Berger
Published: 2025-08-18
Source: arXiv
Recently, neural surrogate models have emerged as a compelling alternative to traditional simulation workflows. This is accomplished by modeling the underlying function of scientific simulations, removing the need to run expensive simulations. Beyond just mapping from input parameter to output, surrogates have also been shown useful for inverse problems: output to input parameters. Inverse problems can be understood as search, where we aim to find parameters whose surrogate outputs contain a specified feature. Yet finding these parameters can be costly, especially for high-dimensional parameter spaces. Thus, existing surrogate-based solutions primarily focus on finding a small set of matching parameters, in the process overlooking the broader picture of plausible parameters. Our work aims to model and visualize the distribution of possible input parameters that produce a given output feature. To achieve this goal, we aim to address two challenges: (1) the approximation error inherent in the surrogate model and (2) forming the parameter distribution in an interactive manner. We model error via density estimation, reporting high density only if a given parameter configuration is close to training parameters, measured both over the input and output space. Our density estimate is used to form a prior belief on parameters, and when combined with a likelihood on features, gives us an efficient way to sample plausible parameter configurations that generate a target output feature. We demonstrate the usability of our solution through a visualization interface by performing feature-driven parameter analysis over the input parameter space of three simulation datasets. Source code is available at https://github.com/matthewberger/seeing-the-many
31. Congested Clique Counting for Local Gibbs Distributions
Authors: Joshua Z. Sobel
Published: 2025-08-18
Source: arXiv
There are well established reductions between combinatorial sampling and counting problems (Jerrum, Valiant, Vazirani TCS 1986). Building off of a very recent parallel algorithm utilizing this connection (Liu, Yin, Zhang arxiv 2024), we demonstrate the first approximate counting algorithm in the CongestedClique for a wide range of problems. Most interestingly, we present an algorithm for approximating the number of $q$-colorings of a graph within $\epsilon$-multiplicative error, when $q>\alpha\Delta$ for any constant $\alpha>2$, in $\Tilde{O}\big(\frac{n^{1/3}}{\epsilon^2}\big)$ rounds. More generally, we achieve a runtime of $\Tilde{O}\big(\frac{n^{1/3}}{\epsilon^2}\big)$ rounds for approximating the partition function of Gibbs distributions defined over graphs when simple locality and fast mixing conditions hold. Gibbs distributions are widely used in fields such as machine learning and statistical physics. We obtain our result by providing an algorithm to draw $n$ random samples from a distributed Markov chain in parallel, using similar ideas to triangle counting (Dolev, Lenzen, Peled DISC 2012) and semiring matrix multiplication (Censor-Hillel, Kaski, Korhonen, Lenzen, Paz, Suomela PODC 2015). Aside from counting problems, this result may be interesting for other applications requiring a large number of samples. In the special case of estimating the partition function of the hardcore model, also known as counting weighted independent sets, we can do even better and achieve an $\Tilde{O}\big(\frac{1}{\epsilon^2}\big)$ round algorithm, when the fugacity $\lambda \leq \frac{\alpha}{\Delta-1}$, where $\alpha$ is an arbitrary constant less than $1$.
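For context, the standard objects involved are the Gibbs distribution over configurations of a graph, its partition function, and the hardcore model (weighted independent sets) as the special case mentioned above; generic definitions follow, with notation that may differ slightly from the paper.

```latex
% Gibbs distribution mu over configurations sigma with Hamiltonian H, its partition
% function Z, and the hardcore-model partition function at fugacity lambda, where
% I(G) denotes the set of independent sets of the graph G:
\[
  \mu(\sigma) \;=\; \frac{e^{-H(\sigma)}}{Z}, \qquad
  Z \;=\; \sum_{\sigma} e^{-H(\sigma)}, \qquad
  Z_{\mathrm{hc}}(\lambda) \;=\; \sum_{I \in \mathcal{I}(G)} \lambda^{|I|} .
\]
```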
32. Molecular Hydrogen in High-redshift Damped Lyman-α Absorbers
Authors: Alon Gurman, Amiel Sternberg, Shmuel Bialy, Rachel K. Cochrane, Jonathan Stern
Published: 2025-08-18
Source: arXiv
Simulations predict that circumgalactic hydrogen gas surrounding massive ($M_{\rm{halo}}^{z=1}=10^{12}-10^{13}\ M_{\odot}$) galaxies at $z\sim4$ may be predominantly neutral, and could produce damped Ly$\alpha$ absorbers (DLAs) along sight-lines to background quasars (Stern et al. 2021). A circumgalactic medium (CGM) origin for DLAs naturally explains high-redshift HI absorption-selected galaxy detections at physical separations much greater than the likely extents of the galaxy disks (Neeleman et al. 2017, 2019). The observed $z\sim 4$ DLA HI column densities are large and comparable to interstellar (ISM) gas columns at which substantial molecular hydrogen (H$_2$) abundances occur. We therefore investigate the possible molecular content of high-redshift CGM gas, and its potential detectability via (rest-frame) far-ultraviolet (UV) absorption line studies. For this purpose we develop an analytic sub-grid model for HI-to-H$_2$ transitions and incorporate the model into zoom-in FIRE-2 simulations of evolving high-$z$ galaxies. We include dust absorption and scattering computations for the transfer of photodissociating Lyman-Werner (LW) band radiation. We find that the typical extents of detectable H$_2$ sightlines are $\approx 0.1\, R_{\rm vir}$, independent of redshift from $z=2.5$ to 5. We argue that a CGM origin for DLAs naturally explains the low detection rates of H$_2$ in DLA observations, as the low CGM densities and relatively strong far-UV fields lead to molecular fractions much lower than observed in the ISM at comparable HI columns.
33. DocHPLT: A Massively Multilingual Document-Level Translation Dataset
Authors: Dayyán O'Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, Jörg Tiedemann
Published: 2025-08-18
Source: arXiv
Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences, with the further possibility of providing 2,500 bonus language pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.
34. Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Authors: Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, Liqiang Nie
Published: 2025-08-18
Source: arXiv
Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation.
35. Surrogate-based Bayesian calibration methods for climate models: a comparison of traditional and non-traditional approaches
Authors: Maike F. Holthuijzen, Atlanta Chakraborty, Elizabeth Krath, Tommie Catanach
Published: 2025-08-18
Source: arXiv
Parameter calibration is crucial for reducing uncertainty and improving simulation accuracy in physics-based models, yet computational constraints pose significant challenges. Bayesian calibration methods offer a principled framework for combining prior knowledge with data while rigorously quantifying uncertainty. In this work, we compare four emulator-based Bayesian calibration methods: Calibrate-Emulate-Sample (CES), History Matching (HM), Bayesian Optimal Experimental Design (BOED), and a novel Goal-Oriented BOED (GBOED) approach, using the Lorenz '96 multiscale system as a testbed. Our GBOED formulation explicitly targets calibration-relevant quantities and leverages information-theoretic criteria for data selection. We assess each method in terms of calibration accuracy, uncertainty quantification (UQ), computational cost, implementation complexity, and convergence behavior as the number of model evaluations increases. We find CES offers excellent performance but at high computational expense, while GBOED achieves comparable accuracy using fewer model evaluations. Standard BOED underperforms with respect to calibration accuracy, and HM shows moderate effectiveness but can be useful as a precursor. Our results highlight trade-offs among Bayesian strategies and demonstrate the promise of goal-oriented design in calibration workflows.