🤖 AI Research Papers

September 20, 2025

🤖 AI-Generated Research Summary

Comprehensive Summary of 35 Recent Papers on AI, LLMs, Agents, and Workflows


1. Key Research Trends

a. Large Language Models (LLMs) and Foundation Models
- Efficiency and Fairness: Multiple works (e.g., Fair-GPTQ, The Energy-Efficient Hierarchical Neural Network) focus on reducing the computational and memory footprint of LLMs while also addressing fairness and bias.
- Evaluation and Tokenization: Papers like "Mind the Gap" and "CodeFuse-CR-Bench" highlight the importance of robust, context-rich evaluation and the impact of seemingly minor design choices (e.g., tokenization) on LLM performance.

b. Embodied and Interactive Agents
- Vision-Language-Action (VLA) Models: There is a strong trend toward integrating vision, language, and action for embodied agents (e.g., RynnVLA-001, Robot Control Stack, Ask-to-Clarify), enabling more natural and adaptive human-robot interaction.
- Human-in-the-Loop and Collaboration: Several works (e.g., HICS-SLAM, Ask-to-Clarify) emphasize multi-turn dialogue, human feedback, and collaborative workflows for improved agent performance in real-world settings.

c. Generative and Diffusion Models
- 3D/4D Generation and Control: Advances in generative models (e.g., WorldForge, GenKOL) are pushing the boundaries of controllable, high-fidelity content creation for virtual environments and marketing.
- Synthetic Data Generation: Automated dataset creation for underrepresented modalities (e.g., SynParaSpeech for paralinguistic speech, MSDD for agriculture) is a growing trend.

d. Robotics and Simulation
- Scalable Learning and Simulation: Papers like ExT and Parallel Simulation of Contact and Actuation for Soft Growing Robots focus on scalable, multi-task pretraining and realistic simulation for robust robot learning and planning.
- Bridging Simulation and Reality: There is a push to close the sim-to-real gap by leveraging large-scale data and human demonstrations.

e. Explainability, Security, and Trust
- Explainable AI (XAI): Integration of XAI with secure technologies (e.g., blockchain in healthcare) is being explored for trustworthy, transparent AI systems.
- Security in New Modalities: The vulnerabilities of LLM-integrated XR systems (Evil Vizier) are being systematically analyzed.

f. Benchmarking and Evaluation
- Comprehensive Benchmarks: New benchmarks (e.g., CodeFuse-CR-Bench, STEP) are being developed for holistic, real-world evaluation of AI systems, especially in code review and trajectory prediction.



Conclusion

This collection of papers reflects a vibrant, rapidly evolving landscape in AI research, with strong emphasis on scalability, efficiency, fairness, and real-world applicability. The integration of LLMs and generative models into embodied agents, the push for explainable and secure AI, and the development of robust benchmarks and synthetic datasets are shaping the next generation of intelligent systems. Future research will likely focus on human-AI collaboration, energy-efficient architectures, and trustworthy deployment across diverse domains.

📚 arXiv (35 papers)
1. Out-of-Sight Trajectories: Tracking, Fusion, and Prediction
Authors: Haichao Zhang, Yi Xu, Yun Fu β€’ Published: 2025-09-18 β€’ Source: arXiv
Trajectory prediction is a critical task in computer vision and autonomous systems, playing a key role in autonomous driving, robotics, surveillance, and virtual reality. Existing methods often rely on complete and noise-free observational data, overlooking the challenges associated with out-of-sight objects and the inherent noise in sensor data caused by limited camera coverage, obstructions, and the absence of ground truth for denoised trajectories. These limitations pose safety risks and hinder reliable prediction in real-world scenarios. In this extended work, we present advancements in Out-of-Sight Trajectory (OST), a novel task that predicts the noise-free visual trajectories of out-of-sight objects using noisy sensor data. Building on our previous research, we broaden the scope of Out-of-Sight Trajectory Prediction (OOSTraj) to include pedestrians and vehicles, extending its applicability to autonomous driving, robotics, surveillance, and virtual reality. Our enhanced Vision-Positioning Denoising Module leverages camera calibration to establish a vision-positioning mapping, addressing the lack of visual references, while effectively denoising noisy sensor data in an unsupervised manner. Through extensive evaluations on the Vi-Fi and JRDB datasets, our approach achieves state-of-the-art performance in both trajectory denoising and prediction, significantly surpassing previous baselines. Additionally, we introduce comparisons with traditional denoising methods, such as Kalman filtering, and adapt recent trajectory prediction models to our task, providing a comprehensive benchmark. This work represents the first initiative to integrate vision-positioning projection for denoising noisy sensor trajectories of out-of-sight agents, paving the way for future advances. The code and preprocessed datasets are available at github.com/Hai-chao-Zhang/OST
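The abstract compares against classical denoising baselines such as Kalman filtering. As a point of reference for what that baseline looks like, here is a minimal 1-D constant-velocity Kalman filter sketch; the process/measurement noise values and the synthetic track are our own illustrative choices, not the paper's setup.

```python
import numpy as np

def kalman_denoise_1d(zs, q=1e-4, r=0.25):
    """Denoise a 1-D position track with a constant-velocity Kalman filter.
    q (process noise) and r (measurement variance) are illustrative values."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity transition
    H = np.array([[1.0, 0.0]])               # we observe position only
    Q = q * np.eye(2)
    R = np.array([[r]])
    x = np.array([zs[0], 0.0])               # state: [position, velocity]
    P = np.eye(2)
    out = []
    for z in zs:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the noisy measurement z
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)

rng = np.random.default_rng(0)
true = np.linspace(0.0, 10.0, 50)            # object moving at constant speed
noisy = true + rng.normal(0.0, 0.5, size=50)
smooth = kalman_denoise_1d(noisy)
```

Unlike the paper's vision-positioning approach, this baseline needs a hand-specified motion model and noise statistics, which is part of what the learned denoiser avoids.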
2. Evil Vizier: Vulnerabilities of LLM-Integrated XR Systems
Authors: Yicheng Zhang, Zijian Huang, Sophie Chen, Erfan Shayegani, Jiasi Chen, Nael Abu-Ghazaleh β€’ Published: 2025-09-18 β€’ Source: arXiv
Extended reality (XR) applications increasingly integrate Large Language Models (LLMs) to enhance user experience, scene understanding, and even generate executable XR content, and are often called "AI glasses". Despite these potential benefits, the integrated XR-LLM pipeline makes XR applications vulnerable to new forms of attacks. In this paper, we analyze LLM-Integrated XR systems in the literature and in practice and categorize them along different dimensions from a systems perspective. Building on this categorization, we identify a common threat model and demonstrate a series of proof-of-concept attacks on multiple XR platforms that employ various LLM models (Meta Quest 3, Meta Ray-Ban, Android, and Microsoft HoloLens 2 running Llama and GPT models). Although these platforms each implement LLM integration differently, they share vulnerabilities where an attacker can modify the public context surrounding a legitimate LLM query, resulting in erroneous visual or auditory feedback to users, thus compromising their safety or privacy, sowing confusion, or other harmful effects. To defend against these threats, we discuss mitigation strategies and best practices for developers, including an initial defense prototype, and call on the community to develop new protection mechanisms to mitigate these risks.
3. RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Authors: Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li β€’ Published: 2025-09-18 β€’ Source: arXiv
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
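The ActionVAE idea, compressing a chunk of actions into a compact latent, can be sketched with plain linear maps. All sizes below (8 steps of 7-DoF actions into a 4-dim latent) and the random weights are our assumptions purely to show the encode/reparameterize/decode flow; the real model is a learned network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: T=8 seven-DoF actions compressed to a d=4 latent.
T, DOF, D = 8, 7, 4
W_enc = rng.normal(0.0, 0.1, (2 * D, T * DOF))   # outputs [mu, log_var]
W_dec = rng.normal(0.0, 0.1, (T * DOF, D))

def encode(actions):
    h = W_enc @ actions.reshape(-1)
    return h[:D], h[D:]                          # mu, log_var

def reparameterize(mu, log_var):
    # sample z = mu + sigma * eps (the VAE reparameterization trick)
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

def decode(z):
    return (W_dec @ z).reshape(T, DOF)

actions = rng.normal(size=(T, DOF))
mu, log_var = encode(actions)
z = reparameterize(mu, log_var)
recon = decode(z)
# KL divergence to a standard normal prior (always non-negative)
kl = 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
```

The payoff described in the abstract is that the VLA policy only has to predict the 4-dim `z` instead of the full 56-dim action chunk.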
4. Fair-GPTQ: Bias-Aware Quantization for Large Language Models
Authors: Irina Proskurina, Guillaume Metzler, Julien Velcin β€’ Published: 2025-09-18 β€’ Source: arXiv
High memory demands of generative language models have drawn attention to quantization, which reduces computational cost, memory usage, and latency by mapping model weights to lower-precision integers. Approaches such as GPTQ effectively minimize input-weight product errors during quantization; however, recent empirical studies show that they can increase biased outputs and degrade performance on fairness benchmarks, and it remains unclear which specific weights cause this issue. In this work, we draw new links between quantization and model fairness by adding explicit group-fairness constraints to the quantization objective and introduce Fair-GPTQ, the first quantization method explicitly designed to reduce unfairness in large language models. The added constraints guide the learning of the rounding operation toward less-biased text generation for protected groups. Specifically, we focus on stereotype generation involving occupational bias and discriminatory language spanning gender, race, and religion. Fair-GPTQ has minimal impact on performance, preserving at least 90% of baseline accuracy on zero-shot benchmarks, reduces unfairness relative to a half-precision model, and retains the memory and speed benefits of 4-bit quantization. We also compare the performance of Fair-GPTQ with existing debiasing methods and find that it achieves performance on par with the iterative null-space projection debiasing approach on racial-stereotype benchmarks. Overall, the results validate our theoretical solution to the quantization problem with a group-bias term, highlight its applicability for reducing group bias at quantization time in generative models, and demonstrate that our approach can further be used to analyze channel- and weight-level contributions to fairness during quantization.
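The core idea, augmenting the quantization objective with a group-fairness term, can be illustrated with a toy rounding search. This greedy per-weight search and the made-up "group" feature vectors are our stand-ins; Fair-GPTQ adds its fairness constraint to the full Hessian-based GPTQ update, not to a search like this.

```python
import numpy as np

def quantize_fair(w, x_a, x_b, lam=1.0, step=0.1):
    """Toy bias-aware rounding: choose round-down vs round-up per weight to
    minimize reconstruction error plus lam * |output gap| between inputs
    representing two protected groups. Illustration only."""
    lo = np.floor(w / step) * step
    hi = lo + step
    q = lo.copy()
    for i in range(len(w)):
        best_cost, best_cand = None, lo[i]
        for cand in (lo[i], hi[i]):
            q[i] = cand
            # reconstruction error + weighted group-output gap
            cost = np.sum((q - w) ** 2) + lam * abs(q @ x_a - q @ x_b)
            if best_cost is None or cost < best_cost:
                best_cost, best_cand = cost, cand
        q[i] = best_cand
    return q

w = np.array([0.13, -0.27, 0.51])
x_a = np.array([1.0, 0.0, 1.0])   # made-up "group A" feature vector
x_b = np.array([0.0, 1.0, 0.0])   # made-up "group B" feature vector
w_q = quantize_fair(w, x_a, x_b)
```

With `lam=0` this degenerates to plain nearest-grid rounding; the fairness term is what biases the rounding direction toward equal outputs for the two groups.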
5. Circuit-based characterization of finite-temperature quantum phases and self-correcting quantum memory
Authors: Ruochen Ma, Vedika Khemani, Shengqi Sang β€’ Published: 2025-09-18 β€’ Source: arXiv
Quantum phases at zero temperature can be characterized as equivalence classes under local unitary transformations: two ground states within a gapped phase can be transformed into each other via a local unitary circuit. We generalize this circuit-based characterization of phases to systems at finite-temperature thermal equilibrium described by Gibbs states. We construct a channel circuit that approximately transforms one Gibbs state into another provided the two are connected by a path in parameter space along which a certain correlation-decay condition holds. For finite-dimensional systems of linear size $L$ and approximation error $\epsilon$, the locality of the circuit is ${\rm polylog}({\rm poly}(L)/\epsilon)$. The correlation-decay condition, which we specify, is expected to be satisfied in the interior of many noncritical thermal phases, including those displaying discrete symmetry breaking and topological order. As an application, we show that any system in the same thermal phase as a zero-temperature topological code coherently preserves quantum information for a macroscopically long time, establishing self-correction as a universal property of thermal phases. As part of the proof, we provide explicit encoding and decoding channel circuits to encode information into, and decode it from, a system in thermal equilibrium.
6. TITAN: A Trajectory-Informed Technique for Adaptive Parameter Freezing in Large-Scale VQE
Authors: Yifeng Peng, Xinyi Li, Samuel Yen-Chi Chen, Kaining Zhang, Zhiding Liang, Ying Wang, Yuxuan Du β€’ Published: 2025-09-18 β€’ Source: arXiv
The Variational Quantum Eigensolver (VQE) is a leading candidate for harnessing quantum computers to advance quantum chemistry and materials simulations, yet its training efficiency deteriorates rapidly for large Hamiltonians. Two issues underlie this bottleneck: (i) the no-cloning theorem imposes a linear growth in circuit evaluations with the number of parameters per gradient step; and (ii) deeper circuits encounter barren plateaus (BPs), leading to exponentially increasing measurement overheads. To address these challenges, here we propose a deep learning framework, dubbed Titan, which identifies and freezes inactive parameters of a given ansatz at initialization for a specific class of Hamiltonians, reducing the optimization overhead without sacrificing accuracy. The motivation of Titan starts with our empirical findings that a subset of parameters consistently has a negligible influence on training dynamics. Its design combines a theoretically grounded data construction strategy, ensuring each training example is informative and BP-resilient, with an adaptive neural architecture that generalizes across ansätze of varying sizes. Across benchmark transverse-field Ising models, Heisenberg models, and multiple molecule systems up to 30 qubits, Titan achieves up to 3 times faster convergence and 40% to 60% fewer circuit evaluations than state-of-the-art baselines, while matching or surpassing their estimation accuracy. By proactively trimming parameter space, Titan lowers hardware demands and offers a scalable path toward utilizing VQE to advance practical quantum chemistry and materials science.
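The "freeze inactive parameters" step can be mimicked with a simple gradient-magnitude screen. The tolerance, the toy loss landscape, and the thresholding criterion below are our assumptions; Titan itself predicts the freeze set with a trained neural network rather than probing gradients.

```python
import numpy as np

def freeze_mask(grad_samples, tol=1e-3):
    """Flag parameters whose mean gradient magnitude over a few probe
    evaluations stays below tol (a toy stand-in for Titan's learned
    freezing decision)."""
    return np.mean(np.abs(grad_samples), axis=0) < tol

# Toy landscape whose last two parameters barely influence the loss,
# mimicking the "inactive parameters" Titan detects at initialization.
A = np.diag([1.0, 1.0, 1e-6, 1e-6])
grad = lambda x: 2.0 * A @ x

rng = np.random.default_rng(0)
samples = np.stack([grad(rng.normal(size=4)) for _ in range(8)])
mask = freeze_mask(samples)       # True where the parameter can be frozen
```

Every frozen parameter removes one term from the per-step circuit-evaluation budget, which is exactly the linear cost the abstract attributes to the no-cloning constraint.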
7. Lepton models from non-holomorphic $A^{\prime}_{5}$ modular flavor symmetry
Authors: Cai-Chang Li, Gui-Jun Ding β€’ Published: 2025-09-18 β€’ Source: arXiv
In the framework of non-holomorphic modular invariance approach, we have systematically constructed all minimal lepton models based on the non-holomorphic $A^{\prime}_{5}$ modular symmetry from a bottom-up approach. In these models, the Yukawa couplings are described by polyharmonic Maaß forms of integer weights at level $N=5$. Under the assumption of Majorana neutrinos, both the Weinberg operator and the type-I seesaw mechanism are considered for neutrino mass generation. All minimal models are found to be based on generalized CP (gCP) symmetry, and each of them depends on five real dimensionless parameters and two overall scales. Through comprehensive numerical scanning, we obtain 6 (4) phenomenologically viable Weinberg operator models and 94 (76) phenomenologically viable seesaw models for normal (inverted) ordering neutrino masses. For each viable model, we present predictions for key neutrino properties, such as lepton masses, CP violation phases, mixing angles, effective Majorana mass for neutrinoless double beta decay and the kinematical mass in beta decay. Furthermore, we provide detailed numerical analysis for two representative models to illustrate our results.
8. Maize Seedling Detection Dataset (MSDD): A Curated High-Resolution RGB Dataset for Seedling Maize Detection and Benchmarking with YOLOv9, YOLO11, YOLOv12 and Faster-RCNN
Authors: Dewi Endah Kharismawati, Toni Kazic β€’ Published: 2025-09-18 β€’ Source: arXiv
Accurate maize seedling detection is crucial for precision agriculture, yet curated datasets remain scarce. We introduce MSDD, a high-quality aerial image dataset for maize seedling stand counting, with applications in early-season crop monitoring, yield prediction, and in-field management. Stand counting determines how many plants germinated, guiding timely decisions such as replanting or adjusting inputs. Traditional methods are labor-intensive and error-prone, while computer vision enables efficient, accurate detection. MSDD contains three classes (single, double, and triple plants), capturing diverse growth stages, planting setups, soil types, lighting conditions, camera angles, and densities, ensuring robustness for real-world use. Benchmarking shows detection is most reliable during V4-V6 stages and under nadir views. Among tested models, YOLO11 is fastest, while YOLOv9 yields the highest accuracy for single plants. Single plant detection achieves precision up to 0.984 and recall up to 0.873, but detecting doubles and triples remains difficult due to rarity and irregular appearance, often from planting errors. Class imbalance further reduces accuracy in multi-plant detection. Despite these challenges, YOLO11 maintains efficient inference at 35 ms per image, with an additional 120 ms for saving outputs. MSDD establishes a strong foundation for developing models that enhance stand counting, optimize resource allocation, and support real-time decision-making. This dataset marks a step toward automating agricultural monitoring and advancing precision agriculture.
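For readers less familiar with detection benchmarks, the reported precision/recall figures come from raw true-positive/false-positive/false-negative counts. The counts below are invented purely to land near the single-plant figures quoted in the abstract (precision up to 0.984, recall up to 0.873); they are not from the paper.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp)   # fraction of predicted boxes that are correct
    recall = tp / (tp + fn)      # fraction of true plants that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts chosen to roughly match the reported single-plant numbers.
p, r, f1 = detection_metrics(tp=873, fp=14, fn=127)
```

The gap between precision and recall here reflects a detector that rarely hallucinates plants but misses some, which matches the class-imbalance difficulties the abstract describes for doubles and triples.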
9. Parallel Simulation of Contact and Actuation for Soft Growing Robots
Authors: Yitian Gao, Lucas Chen, Priyanka Bhovad, Sicheng Wang, Zachary Kingston, Laura H. Blumenschein β€’ Published: 2025-09-18 β€’ Source: arXiv
Soft growing robots, commonly referred to as vine robots, have demonstrated remarkable ability to interact safely and robustly with unstructured and dynamic environments. It is therefore natural to exploit contact with the environment for planning and design optimization tasks. Previous research has focused on planning under contact for passively deforming robots with pre-formed bends. However, adding active steering to these soft growing robots is necessary for successful navigation in more complex environments. To this end, we develop a unified modeling framework that integrates vine robot growth, bending, actuation, and obstacle contact. We extend the beam moment model to include the effects of actuation on kinematics under growth and then use these models to develop a fast parallel simulation framework. We validate our model and simulator with real robot experiments. To showcase the capabilities of our framework, we apply our model in a design optimization task to find designs for vine robots navigating through cluttered environments, identifying designs that minimize the number of required actuators by exploiting environmental contacts. We show the robustness of the designs to environmental and manufacturing uncertainties. Finally, we fabricate an optimized design and successfully deploy it in an obstacle-rich environment.
10. Influence of the Spectral Energy Distribution of Reionization-Era Sources on the Lyman-$\alpha$ Forest
Authors: Arghyadeep Basu, Benedetta Ciardi, James S. Bolton, Matteo Viel, Enrico Garaldi β€’ Published: 2025-09-18 β€’ Source: arXiv
Interpreting Lyman-$\alpha$ forest properties during the epoch of reionization requires assumptions about the spectral energy distribution (SED) of ionizing sources. These are often simplified to blackbody or power-law spectra, potentially overlooking contributions from high-energy processes. In this work, we investigate how different SED models of reionization-era sources shape the thermal and ionization state of the intergalactic medium (IGM) and imprint on the Ly$\alpha$ forest during the late stages of reionization. We perform 3D radiative transfer simulations with CRASH, post-processed on Sherwood-type hydrodynamical outputs, exploring both physically motivated SEDs (including X-ray binaries, Bremsstrahlung from the shock-heated interstellar medium, and binary stars) and idealized blackbody and power-law spectra. While the large-scale morphology of ionized regions is broadly similar across all models, harder spectral components extend partially ionized zones, produce larger He III regions, and heat the surrounding IGM. By adopting simplified spectra there is the risk of underestimating the contribution of high-energy sources, which subtly alter the effective optical depth, the flux power, and the local transmissivity, potentially biasing constraints on the thermal and ionization history of the IGM. The differences across models are most pronounced in the behavior of the proximity zone and in the power at intermediate scales, offering the most promising diagnostics to disentangle source populations. With upcoming high precision measurements from ELT and DESI, realistic SED modelling will be essential for robustly connecting Ly$\alpha$ forest observations to the sources driving the end of reionization.
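The "effective optical depth" the abstract mentions is a standard forest summary statistic, $\tau_{\rm eff} = -\ln\langle F\rangle$ with transmitted flux $F = e^{-\tau}$. A minimal sketch (the pixel optical depths below are illustrative values, not simulation output):

```python
import numpy as np

def effective_optical_depth(tau):
    """tau_eff = -ln(<exp(-tau)>): the mean-flux-based statistic that SED
    choices subtly shift according to the abstract."""
    return float(-np.log(np.mean(np.exp(-tau))))

tau = np.array([0.5, 1.0, 2.0, 4.0])   # illustrative per-pixel optical depths
tau_eff = effective_optical_depth(tau)
```

Because the average is taken over flux rather than optical depth, `tau_eff` is always below the mean of `tau` (Jensen's inequality), so rare transparent pixels dominate the statistic.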
11. A Race Bias Free Face Aging Model for Reliable Kinship Verification
Authors: Ali Nazari, Bardiya Kariminia, Mohsen Ebrahimi Moghaddam β€’ Published: 2025-09-18 β€’ Source: arXiv
The age gap in kinship verification refers to the time difference between the photos of the parent and the child. Moreover, their same-age photos are often unavailable, and face aging models are racially biased, which impacts the likeness of photos. Therefore, we propose a face aging GAN model, RA-GAN, consisting of two new modules, RACEpSp and a feature mixer, to produce racially unbiased images. The unbiased synthesized photos are used in kinship verification to investigate the results of verifying same-age parent-child images. The experiments demonstrate that our RA-GAN outperforms SAM-GAN by an average of 13.14% across all age groups, and CUSP-GAN in the 60+ age group by 9.1% in terms of racial accuracy. Moreover, RA-GAN can preserve subjects' identities better than SAM-GAAN and CUSP-GAN across all age groups. Additionally, we demonstrate that transforming parent and child images from the KinFaceW-I and KinFaceW-II datasets to the same age can enhance the verification accuracy across all age groups. On KinFaceW-I, the accuracy gains with our RA-GAN for the father-son, father-daughter, mother-son, and mother-daughter relationships are 5.22, 5.12, 1.63, and 0.41, respectively. On KinFaceW-II, the gains for the father-daughter, father-son, and mother-son relationships are 2.9, 0.39, and 1.6, respectively. The code is available at https://github.com/bardiya2254kariminia/An-Age-Transformation-whitout-racial-bias-for-Kinship-verification
12. To CLEAN or not to CLEAN: Data Processing in the ngVLA era
Authors: Hendrik MΓΌller β€’ Published: 2025-09-18 β€’ Source: arXiv
Radio interferometric imaging has long relied on the CLEAN algorithm, valued for its speed, robustness, and integration with calibration pipelines. However, next-generation facilities such as the ngVLA, SKA, and ALMA's Wideband Sensitivity Upgrade will produce data volumes and dynamic ranges that exceed the scalability of traditional methods. CLEAN remains dominant due to its simplicity and accumulated expertise, yet its assumption of modeling the sky as point sources limits its ability to recover extended emission and hampers automation. We review CLEAN's limitations and survey alternatives, including multiscale extensions, compressive sensing, Regularized Maximum Likelihood, Bayesian inference, and AI-driven approaches. Forward-modeling methods enable higher fidelity, flexible priors, and uncertainty quantification, albeit at greater computational cost. Hybrid approaches such as Autocorr-CLEAN, CG-CLEAN, and PolyCLEAN retain CLEAN's workflow while incorporating modern optimization. We argue hybrids are best suited for the near term, while Bayesian and AI-based frameworks represent the long-term future of interferometric imaging.
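The point-source assumption the review critiques is easiest to see in the classic Högbom CLEAN loop: repeatedly find the brightest residual pixel, record a fraction of it as a point-source component, and subtract the correspondingly scaled point-spread function. A 1-D toy version (the PSF, gain, and image are illustrative, not from any real instrument):

```python
import numpy as np

def hogbom_clean(dirty, psf, gain=0.1, n_iter=200, threshold=1e-3):
    """Minimal Hogbom CLEAN on a 1-D 'dirty image'."""
    residual = dirty.copy()
    model = np.zeros_like(dirty)
    half = len(psf) // 2
    for _ in range(n_iter):
        peak = int(np.argmax(np.abs(residual)))
        if abs(residual[peak]) < threshold:
            break
        amp = gain * residual[peak]
        model[peak] += amp
        # subtract the PSF centred on the peak
        for j, p in enumerate(psf):
            k = peak + j - half
            if 0 <= k < len(residual):
                residual[k] -= amp * p
    return model, residual

# one point source at pixel 10, observed through a triangular PSF
psf = np.array([0.25, 0.5, 1.0, 0.5, 0.25])
sky = np.zeros(32)
sky[10] = 1.0
dirty = np.convolve(sky, psf, mode="same")
model, residual = hogbom_clean(dirty, psf)
```

The loop recovers an isolated point source almost exactly, but representing smooth extended emission as a pile of such delta components is precisely the limitation that motivates the multiscale, compressive-sensing, and Bayesian alternatives surveyed here.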
13. Leveraging Geometric Visual Illusions as Perceptual Inductive Biases for Vision Models
Authors: Haobo Yang, Minghao Guo, Dequan Yang, Wenyu Wang β€’ Published: 2025-09-18 β€’ Source: arXiv
Contemporary deep learning models have achieved impressive performance in image classification by primarily leveraging statistical regularities within large datasets, but they rarely incorporate structured insights drawn directly from perceptual psychology. To explore the potential of perceptually motivated inductive biases, we propose integrating classic geometric visual illusions, well-studied phenomena from human perception, into standard image-classification training pipelines. Specifically, we introduce a synthetic, parametric geometric-illusion dataset and evaluate three multi-source learning strategies that combine illusion recognition tasks with ImageNet classification objectives. Our experiments reveal two key conceptual insights: (i) incorporating geometric illusions as auxiliary supervision systematically improves generalization, especially in visually challenging cases involving intricate contours and fine textures; and (ii) perceptually driven inductive biases, even when derived from synthetic stimuli traditionally considered unrelated to natural image recognition, can enhance the structural sensitivity of both CNN and transformer-based architectures. These results demonstrate a novel integration of perceptual science and machine learning and suggest new directions for embedding perceptual priors into vision model design.
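Auxiliary supervision of this kind is usually realized as a weighted sum of task losses. A minimal sketch of such a joint objective; the 0.2 auxiliary weight and the toy logits are our assumptions, not values from the paper.

```python
import numpy as np

def cross_entropy(logits, label):
    # numerically stable -log softmax(logits)[label]
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def multi_source_loss(logits_cls, y_cls, logits_ill, y_ill, weight=0.2):
    """ImageNet classification loss plus a weighted illusion-recognition
    loss, in the spirit of the paper's multi-source strategies."""
    return cross_entropy(logits_cls, y_cls) + weight * cross_entropy(logits_ill, y_ill)

loss = multi_source_loss(np.array([2.0, 0.1, -1.0]), 0,   # natural-image head
                         np.array([0.5, 0.5]), 1)          # illusion head
```

Both heads would share a backbone in practice, so gradients from the illusion term shape the same features used for natural images, which is how the perceptual prior enters the model.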
14. Accelerated Discovery of Topological Conductors for Nanoscale Interconnects
Authors: Alexander C. Tyner, William Rogers, Po-Hsin Shih, Yi-Hsin Tu, Gengchiau Liang, Hsin Lin, Ching-Tzu Chen, James M. Rondinelli β€’ Published: 2025-09-18 β€’ Source: arXiv
The sharp increase in resistivity of copper interconnects at ultra-scaled dimensions threatens the continued miniaturization of integrated circuits. Topological semimetals (TSMs) with gapless surface states (Fermi arcs) provide conduction channels resistant to localization. Here we develop an efficient computational framework to quantify 0K surface-state transmission in nanowires derived from Wannier tight-binding models of topological conductors that faithfully reproduce relativistic density functional theory results. Sparse matrix techniques enable scalable simulations incorporating disorder and surface roughness, allowing systematic materials screening across sizes, chemical potentials, and transport directions. A dataset of 3000 surface transmission values reveals TiS, ZrB$_{2}$, and nitrides AN where A=(Mo, Ta, W) as candidates with conductance matching or exceeding copper and benchmark TSMs NbAs and NbP. This dataset further supports machine learning models for rapid interconnect compound identification. Our results highlight the promise of topological conductors in overcoming copper's scaling limits and provide a roadmap for data-driven discovery of next-generation interconnects.
15. WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang β€’ Published: 2025-09-18 β€’ Source: arXiv
Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.
16. Learning Mechanistic Subtypes of Neurodegeneration with a Physics-Informed Variational Autoencoder Mixture Model
Authors: Sanduni Pinnawala, Annabelle Hartanto, Ivor J. A. Simpson, Peter A. Wijeratne β€’ Published: 2025-09-18 β€’ Source: arXiv
Modelling the underlying mechanisms of neurodegenerative diseases demands methods that capture heterogeneous and spatially varying dynamics from sparse, high-dimensional neuroimaging data. Integrating partial differential equation (PDE) based physics knowledge with machine learning provides enhanced interpretability and utility over classic numerical methods. However, current physics-integrated machine learning methods are limited to considering a single PDE, severely limiting their application to diseases where multiple mechanisms are responsible for different groups (i.e., subtypes) and aggravating problems with model misspecification and degeneracy. Here, we present a deep generative model for learning mixtures of latent dynamic models governed by physics-based PDEs, going beyond traditional approaches that assume a single PDE structure. Our method integrates reaction-diffusion PDEs within a variational autoencoder (VAE) mixture model framework, supporting inference of subtypes of interpretable latent variables (e.g. diffusivity and reaction rates) from neuroimaging data. We evaluate our method on synthetic benchmarks and demonstrate its potential for uncovering mechanistic subtypes of Alzheimer's disease progression from positron emission tomography (PET) data.
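The reaction-diffusion dynamics embedded in the VAE decoder can be sketched with a single explicit Euler step of a 1-D equation. The grid size, time step, and logistic reaction form below are our illustrative choices; the paper works with neuroimaging-scale PDEs inside the generative model.

```python
import numpy as np

def reaction_diffusion_step(u, D, k, dx=1.0, dt=0.1):
    """One explicit Euler step of du/dt = D * u_xx + k * u * (1 - u):
    linear diffusion plus logistic reaction, with D (diffusivity) and
    k (reaction rate) playing the role of interpretable latent variables."""
    lap = (np.roll(u, 1) - 2.0 * u + np.roll(u, -1)) / dx ** 2
    return u + dt * (D * lap + k * u * (1.0 - u))

u = np.zeros(64)
u[32] = 0.5                     # localized seed, e.g. an initial pathology focus
for _ in range(100):
    u = reaction_diffusion_step(u, D=0.5, k=0.8)
```

In the mixture-model setting, each subtype corresponds to its own (D, k)-style parameters, so inferring the latent variables amounts to asking which spreading regime best explains a patient's PET trajectory.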
17. Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers
Authors: Andrei Chertkov, Artem Basharin, Mikhail Saygin, Evgeny Frolov, Stanislav Straupe, Ivan Oseledets β€’ Published: 2025-09-18 β€’ Source: arXiv
The growing demand for energy-efficient, high-performance AI systems has led to increased attention on alternative computing platforms (e.g., photonic, neuromorphic) due to their potential to accelerate learning and inference. However, integrating such physical components into deep learning pipelines remains challenging, as physical devices often offer limited expressiveness, and their non-differentiable nature renders on-device backpropagation difficult or infeasible. This motivates the development of hybrid architectures that combine digital neural networks with reconfigurable physical layers, which effectively behave as black boxes. In this work, we present a framework for the end-to-end training of such hybrid networks. This framework integrates stochastic zeroth-order optimization for updating the physical layer's internal parameters with a dynamic low-rank surrogate model that enables gradient propagation through the physical layer. A key component of our approach is the implicit projector-splitting integrator algorithm, which updates the lightweight surrogate model after each forward pass with minimal hardware queries, thereby avoiding costly full matrix reconstruction. We demonstrate our method across diverse deep learning tasks, including computer vision, audio classification, and language modeling. Notably, across all modalities, the proposed approach achieves near-digital baseline accuracy and consistently enables effective end-to-end training of hybrid models incorporating various non-differentiable physical components (spatial light modulators, microring resonators, and Mach-Zehnder interferometers). This work bridges hardware-aware deep learning and gradient-free optimization, thereby offering a practical pathway for integrating non-differentiable physical components into scalable, end-to-end trainable AI systems.
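Stochastic zeroth-order optimization estimates gradients from function values alone, which is what makes a non-differentiable physical layer trainable. A basic SPSA-style estimator as a generic stand-in (the paper's exact estimator and its surrogate coupling are more elaborate):

```python
import numpy as np

def spsa_gradient(f, x, c=1e-2, n=8, rng=None):
    """Simultaneous-perturbation zeroth-order gradient estimate from 2n
    black-box evaluations of f: perturb all coordinates at once with a
    random +/-1 vector and difference the two resulting losses."""
    if rng is None:
        rng = np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n):
        delta = rng.choice([-1.0, 1.0], size=x.shape)
        g += (f(x + c * delta) - f(x - c * delta)) / (2 * c) * delta
    return g / n

f = lambda x: float(np.sum(x ** 2))   # stand-in "black-box" layer objective
x = np.array([1.0, -2.0, 3.0])
g = spsa_gradient(f, x)               # true gradient would be [2, -4, 6]
```

Each estimate costs only 2n device queries regardless of parameter count, which is the budget constraint the low-rank surrogate in the paper is designed to work within.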
18. The Energy-Efficient Hierarchical Neural Network with Fast FPGA-Based Incremental Learning
Authors: Mohammad Saleh Vahdatpour, Huaiyuan Chu, Yanqing Zhang β€’ Published: 2025-09-18 β€’ Source: arXiv
The rising computational and energy demands of deep learning, particularly in large-scale architectures such as foundation models and large language models (LLMs), pose significant challenges to sustainability. Traditional gradient-based training methods are inefficient, requiring numerous iterative updates and high power consumption. To address these limitations, we propose a hybrid framework that combines hierarchical decomposition with FPGA-based direct equation solving and incremental learning. Our method divides the neural network into two functional tiers: lower layers are optimized via single-step equation solving on FPGAs for efficient and parallelizable feature extraction, while higher layers employ adaptive incremental learning to support continual updates without full retraining. Building upon this foundation, we introduce the Compound LLM framework, which explicitly deploys LLM modules across both hierarchy levels. The lower-level LLM handles reusable representation learning with minimal energy overhead, while the upper-level LLM performs adaptive decision-making through energy-aware updates. This integrated design enhances scalability, reduces redundant computation, and aligns with the principles of sustainable AI. Theoretical analysis and architectural insights demonstrate that our method reduces computational costs significantly while preserving high model performance, making it well-suited for edge deployment and real-time adaptation in energy-constrained environments.
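The "single-step equation solving" idea for the lower tier can be sketched, under the simplifying assumption of a two-input linear layer, as a closed-form ridge solve of the normal equations rather than iterative gradient descent (a toy illustration, not the paper's FPGA pipeline):

```python
def fit_linear_layer(X, y, lam=1e-6):
    """One-shot 'equation solving' for a two-input linear layer: solve
    the 2x2 ridge normal equations (X^T X + lam*I) w = X^T y in closed
    form instead of iterating with gradient descent."""
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    c0 = sum(r[0] * t for r, t in zip(X, y))
    c1 = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    return [(d * c0 - b * c1) / det, (a * c1 - b * c0) / det]

# Noiseless data generated by w = [2, -1] is recovered in a single step.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
y = [2.0, -1.0, 1.0, 1.0]
w = fit_linear_layer(X, y)
print([round(wi, 3) for wi in w])  # → [2.0, -1.0]
```

Because the solve is a fixed sequence of multiply-accumulates, it parallelizes naturally on FPGA fabric, which is the efficiency argument the abstract makes for the lower layers.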
19. Emergent Alignment via Competition
Authors: Natalie Collina, Surbhi Goel, Aaron Roth, Emily Ryu, Mirah Shi β€’ Published: 2025-09-18 β€’ Source: arXiv
Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic setting where a human user interacts with multiple differently misaligned AI agents, none of which are individually well-aligned. Our key insight is that when the user's utility lies approximately within the convex hull of the agents' utilities, a condition that becomes easier to satisfy as model diversity increases, strategic competition can yield outcomes comparable to interacting with a perfectly aligned model. We model this as a multi-leader Stackelberg game, extending Bayesian persuasion to multi-round conversations between differently informed parties, and prove three results: (1) when perfect alignment would allow the user to learn her Bayes-optimal action, she can also do so in all equilibria under the convex hull condition; (2) under weaker assumptions requiring only approximate utility learning, a non-strategic user employing quantal response achieves near-optimal utility in all equilibria; and (3) when the user selects the best single AI after an evaluation period, equilibrium guarantees remain near-optimal without further distributional assumptions. We complement the theory with two sets of experiments.
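Quantal response, the decision rule the second result relies on, is a softmax over estimated utilities; a minimal sketch (the utilities and the temperature `beta` are toy values, not from the paper):

```python
import math

def quantal_response(utilities, beta=1.0):
    """Logit quantal response: choose action a with probability
    proportional to exp(beta * estimated utility); beta interpolates
    between uniform choice (0) and exact best response (infinity)."""
    m = max(utilities)                       # subtract max for stability
    w = [math.exp(beta * (u - m)) for u in utilities]
    z = sum(w)
    return [wi / z for wi in w]

probs = quantal_response([1.0, 2.0, 2.0], beta=2.0)
print([round(p, 3) for p in probs])  # → [0.063, 0.468, 0.468]
```

The noise in the choice rule is what lets a non-strategic user hedge across approximately learned utilities instead of committing to a possibly wrong best response.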
20. Forecasting and Visualizing Air Quality from Sky Images with Vision-Language Models
Authors: Mohammad Saleh Vahdatpour, Maryam Eyvazi, Yanqing Zhang β€’ Published: 2025-09-18 β€’ Source: arXiv
Air pollution remains a critical threat to public health and environmental sustainability, yet conventional monitoring systems are often constrained by limited spatial coverage and accessibility. This paper proposes an AI-driven agent that predicts ambient air pollution levels from sky images and synthesizes realistic visualizations of pollution scenarios using generative modeling. Our approach combines statistical texture analysis with supervised learning for pollution classification, and leverages vision-language model (VLM)-guided image generation to produce interpretable representations of air quality conditions. The generated visuals simulate varying degrees of pollution, offering a foundation for user-facing interfaces that improve transparency and support informed environmental decision-making. These outputs can be seamlessly integrated into intelligent applications aimed at enhancing situational awareness and encouraging behavioral responses based on real-time forecasts. We validate our method using a dataset of urban sky images and demonstrate its effectiveness in both pollution level estimation and semantically consistent visual synthesis. The system design further incorporates human-centered user experience principles to ensure accessibility, clarity, and public engagement in air quality forecasting. To support scalable and energy-efficient deployment, future iterations will incorporate a green CNN architecture enhanced with FPGA-based incremental learning, enabling real-time inference on edge platforms.
21. Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue
Authors: Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang β€’ Published: 2025-09-18 β€’ Source: arXiv
The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a VLM for collaboration and a diffusion model for action generation. We also introduce a connection module that generates conditions for the diffusion based on the output of the VLM. This module adjusts the observation by instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration component. This preserves the interaction abilities while fine-tuning the diffusion to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.
22. Reinforcement Learning Agent for a 2D Shooter Game
Authors: Thomas Ackermann, Moritz Spang, Hamza A. A. Gardi β€’ Published: 2025-09-18 β€’ Source: arXiv
Reinforcement learning agents in complex game environments often suffer from sparse rewards, training instability, and poor sample efficiency. This paper presents a hybrid training approach that combines offline imitation learning with online reinforcement learning for a 2D shooter game agent. We implement a multi-head neural network with separate outputs for behavioral cloning and Q-learning, unified by shared feature extraction layers with attention mechanisms. Initial experiments using pure deep Q-Networks exhibited significant instability, with agents frequently reverting to poor policies despite occasional good performance. To address this, we developed a hybrid methodology that begins with behavioral cloning on demonstration data from rule-based agents, then transitions to reinforcement learning. Our hybrid approach achieves consistently above 70% win rate against rule-based opponents, substantially outperforming pure reinforcement learning methods which showed high variance and frequent performance degradation. The multi-head architecture enables effective knowledge transfer between learning modes while maintaining training stability. Results demonstrate that combining demonstration-based initialization with reinforcement learning optimization provides a robust solution for developing game AI agents in complex multi-agent environments where pure exploration proves insufficient.
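The two-stage schedule can be caricatured in tabular form (a toy chain world standing in for the shooter game; the demonstration bias, the constants, and the environment are all invented for illustration): first bias the value table toward the demonstrated action in each state, then refine it with standard Q-learning.

```python
import random

def hybrid_train(demos, n_states=5, n_actions=2, episodes=200):
    """Hybrid schedule on a chain world where action 1 moves right and
    reaching the last state pays reward 1: behavioral cloning first
    (bias Q toward the demonstrated action), then Q-learning."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for s, a in demos:                       # stage 1: clone demonstrations
        Q[s][a] = 0.5
    alpha, gamma, eps = 0.5, 0.9, 0.1
    for _ in range(episodes):                # stage 2: Q-learning
        s = 0
        while s < n_states - 1:
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

random.seed(1)
Q = hybrid_train(demos=[(s, 1) for s in range(4)])
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy)  # → [1, 1, 1, 1]
```

The cloned initialization steers early exploration toward the demonstrated behavior, which is the stabilizing effect the paper exploits against the instability of pure deep Q-learning.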
23. Detection of kink oscillations in solar coronal loops by a CNN-LSTM neural network
Authors: Sergey A. Belov, Yu Zhong, Dmitrii Y. Kolotkov, Valery M. Nakariakov β€’ Published: 2025-09-18 β€’ Source: arXiv
A hybrid machine learning model combining a shallow convolutional neural network and a long short-term memory network (CNN-LSTM) has been developed to automate the detection of kink oscillations in coronal plasma loops within large volumes of high-cadence sequences of imaging data. The network was trained on a set of 10,000 synthetic data cubes designed to mimic sequences of coronal images, achieving an accuracy greater than 98% on this synthetic dataset. The model was then applied to detect kink oscillations in real data cubes of coronal active regions observed with SDO/AIA in the 171 Å channel. This dataset consisted of 50 samples with visually detected kink oscillations and 128 samples without. Each sample covered an area of 260×260 pixels in the spatial domain and a duration of 30 min with a 12 s cadence in the time domain. Both off-limb and on-disk regions of interest were used. The data were pre-processed by median filtering in the time domain, and by Gaussian smoothing and Contrast Limited Adaptive Histogram Equalization in the spatial domain. On the real dataset, the performance of the model was 83.7%. The model is fully available in open access. We regard the CNN-LSTM model developed as a first step toward creating robust tools for routine solar coronal data mining in the context of coronal oscillation study.
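The temporal median filtering used in pre-processing can be sketched as follows (a pure-Python toy on nested lists, not the authors' code); it removes single-frame spikes such as cosmic-ray hits while preserving slower oscillatory signals:

```python
from statistics import median

def median_filter_time(cube, k=3):
    """Median-filter an image sequence along the time axis only: each
    output pixel is the median of a length-k temporal window centred on
    its frame (windows are clipped at the sequence boundaries)."""
    T, H, W = len(cube), len(cube[0]), len(cube[0][0])
    h = k // 2
    out = []
    for t in range(T):
        lo, hi = max(0, t - h), min(T, t + h + 1)
        out.append([[median(cube[u][i][j] for u in range(lo, hi))
                     for j in range(W)] for i in range(H)])
    return out

# A one-pixel sequence with a single-frame spike at t = 2: the filter
# removes the spike while leaving the background untouched.
cube = [[[1.0]], [[1.0]], [[9.0]], [[1.0]], [[1.0]]]
print([f[0][0] for f in median_filter_time(cube)])  # → [1.0, 1.0, 1.0, 1.0, 1.0]
```

Filtering only along time (not space) is what keeps the spatial structure of the loop intact for the CNN stage.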
24. Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
Authors: Mario Sanz-Guerrero, Minh Duc Bui, Katharina von der Wense β€’ Published: 2025-09-18 β€’ Source: arXiv
When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string "Answer:" to facilitate automated answer extraction via next-token probabilities. However, there is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. In this paper, we uncover accuracy differences of up to 11% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings, raising concerns about the reliability of LLM comparisons in prior work. Surprisingly, we are able to recommend one specific strategy -- tokenizing the space together with the answer letter -- as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration, enhancing the reliability of the model's confidence estimates. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.
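The tokenization choice can be made concrete with a toy extraction routine (the probability tables below are invented purely to show that the two strategies can disagree; real next-token distributions would come from the evaluated LLM):

```python
def pick_answer(next_token_probs, letters, fuse_space=True):
    """Extract an MCQA answer from next-token probabilities for a prompt
    ending in 'Answer:'. With fuse_space=True the space is tokenized
    together with the letter (token ' A'); with fuse_space=False the
    space is appended to the prompt and the bare letter token is read."""
    key = (lambda L: " " + L) if fuse_space else (lambda L: L)
    return max(letters, key=lambda L: next_token_probs.get(key(L), 0.0))

# Invented distributions: after "Answer:" vs. after "Answer: ".
after_colon = {" A": 0.30, " B": 0.25, "A": 0.01, "B": 0.02}
after_colon_space = {"A": 0.20, "B": 0.35}

print(pick_answer(after_colon, "AB", fuse_space=True))         # → A
print(pick_answer(after_colon_space, "AB", fuse_space=False))  # → B
```

Because the two readouts condition on different token boundaries, they can rank the same options differently, which is exactly the kind of evaluation fragility the paper measures.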
25. ExT: Towards Scalable Autonomous Excavation via Large-Scale Multi-Task Pretraining and Fine-Tuning
Authors: Yifan Zhai, Lorenzo Terenzi, Patrick Frey, Diego Garcia Soto, Pascal Egli, Marco Hutter β€’ Published: 2025-09-18 β€’ Source: arXiv
Scaling up the deployment of autonomous excavators is of great economic and societal importance. Yet it remains a challenging problem, as effective systems must robustly handle unseen worksite conditions and new hardware configurations. Current state-of-the-art approaches rely on highly engineered, task-specific controllers, which require extensive manual tuning for each new scenario. In contrast, recent advances in large-scale pretrained models have shown remarkable adaptability across tasks and embodiments in domains such as manipulation and navigation, but their applicability to heavy construction machinery remains largely unexplored. In this work, we introduce ExT, a unified open-source framework for large-scale demonstration collection, pretraining, and fine-tuning of multitask excavation policies. ExT policies are first trained on large-scale demonstrations collected from a mix of experts, then fine-tuned either with supervised fine-tuning (SFT) or reinforcement learning fine-tuning (RLFT) to specialize to new tasks or operating conditions. Through both simulation and real-world experiments, we show that pretrained ExT policies can execute complete excavation cycles with centimeter-level accuracy, successfully transferring from simulation to real machine with performance comparable to specialized single-task controllers. Furthermore, in simulation, we demonstrate that ExT's fine-tuning pipelines allow rapid adaptation to new tasks, out-of-distribution conditions, and machine configurations, while maintaining strong performance on previously learned tasks. These results highlight the potential of ExT to serve as a foundation for scalable and generalizable autonomous excavation.
26. Blockchain-Enabled Explainable AI for Trusted Healthcare Systems
Authors: Md Talha Mohsin β€’ Published: 2025-09-18 β€’ Source: arXiv
This paper introduces a Blockchain-Integrated Explainable AI Framework (BXHF) for healthcare systems to tackle two essential challenges confronting health information networks: safe data exchange and comprehensible AI-driven clinical decision-making. Our architecture incorporates blockchain, ensuring patient records are immutable, auditable, and tamper-proof, alongside Explainable AI (XAI) methodologies that yield transparent and clinically relevant model predictions. By incorporating security assurances and interpretability requirements into a unified optimization pipeline, BXHF ensures both data-level trust (by verified and encrypted record sharing) and decision-level trust (with auditable and clinically aligned explanations). Its hybrid edge-cloud architecture allows for federated computation across different institutions, enabling collaborative analytics while protecting patient privacy. We demonstrate the framework's applicability through use cases such as cross-border clinical research networks, uncommon illness detection and high-risk intervention decision support. By ensuring transparency, auditability, and regulatory compliance, BXHF improves the credibility, uptake, and effectiveness of AI in healthcare, laying the groundwork for safer and more reliable clinical decision-making.
27. Human Interaction for Collaborative Semantic SLAM using Extended Reality
Authors: Laura Ribeiro, Muhammad Shaheer, Miguel Fernandez-Cortizas, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez β€’ Published: 2025-09-18 β€’ Source: arXiv
Semantic SLAM (Simultaneous Localization and Mapping) systems enrich robot maps with structural and semantic information, enabling robots to operate more effectively in complex environments. However, these systems struggle in real-world scenarios with occlusions, incomplete data, or ambiguous geometries, as they cannot fully leverage the higher-level spatial and semantic knowledge humans naturally apply. We introduce HICS-SLAM, a Human-in-the-Loop semantic SLAM framework that uses a shared extended reality environment for real-time collaboration. The system allows human operators to directly interact with and visualize the robot's 3D scene graph, and add high-level semantic concepts (e.g., rooms or structural entities) into the mapping process. We propose a graph-based semantic fusion methodology that integrates these human interventions with robot perception, enabling scalable collaboration for enhanced situational awareness. Experimental evaluations on real-world construction site datasets demonstrate improvements in room detection accuracy, map precision, and semantic completeness compared to automated baselines, confirming both the effectiveness of the approach and its potential for future extensions.
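The human intervention step can be sketched as inserting a high-level node into an adjacency-based scene graph (a deliberately simplified stand-in for the paper's 3D scene graph and fusion method; node names are invented):

```python
def add_room_concept(graph, room_id, member_nodes):
    """Insert a human-provided 'room' node into a scene graph stored as
    an adjacency dict, linking it to the robot-detected geometry nodes
    the operator assigned to it."""
    graph.setdefault(room_id, set())
    for n in member_nodes:
        graph.setdefault(n, set())
        graph[room_id].add(n)
        graph[n].add(room_id)
    return graph

# Robot perception produced wall/door nodes; the operator groups two
# walls into a room via the XR interface.
g = {"wall_1": set(), "wall_2": set(), "door_3": set()}
add_room_concept(g, "room_A", ["wall_1", "wall_2"])
print(sorted(g["room_A"]))  # → ['wall_1', 'wall_2']
```

The point of the fusion is that such operator-supplied nodes constrain later automated inference (e.g., which walls belong together), which raw geometry alone may leave ambiguous.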
28. SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding
Authors: Bingsong Bai, Qihang Lu, Wenbing Yang, Zihan Sun, YueRan Hou, Peilei Jia, Songbai Pu, Ruibo Fu, Yingming Gao, Ya Li, Jun Gao β€’ Published: 2025-09-18 β€’ Source: arXiv
Paralinguistic sounds, like laughter and sighs, are crucial for synthesizing more realistic and engaging speech. However, existing methods typically depend on proprietary datasets, while publicly available resources often suffer from incomplete speech, inaccurate or missing timestamps, and limited real-world relevance. To address these problems, we propose an automated framework for generating large-scale paralinguistic data and apply it to construct the SynParaSpeech dataset. The dataset comprises 6 paralinguistic categories with 118.75 hours of data and precise timestamps, all derived from natural conversational speech. Our contributions lie in introducing the first automated method for constructing large-scale paralinguistic datasets and releasing the SynParaSpeech corpus, which advances speech generation through more natural paralinguistic synthesis and enhances speech understanding by improving paralinguistic event detection. The dataset and audio samples are available at https://github.com/ShawnPi233/SynParaSpeech.
29. Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale
Authors: Tobias JΓΌlg, Pierre Krack, Seongjin Bien, Yannik Blei, Khaled Gamal, Ken Nakahara, Johannes Hechtl, Roberto Calandra, Wolfram Burgard, Florian Walter β€’ Published: 2025-09-18 β€’ Source: arXiv
Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi Zero on multiple robots and shed light on how simulation data can improve real-world policy performance. Our code, datasets, weights, and videos are available at: https://robotcontrolstack.github.io/
30. GenKOL: Modular Generative AI Framework For Scalable Virtual KOL Generation
Authors: Tan-Hiep To, Duy-Khang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le β€’ Published: 2025-09-18 β€’ Source: arXiv
Key Opinion Leaders (KOLs) play a crucial role in modern marketing by shaping consumer perceptions and enhancing brand credibility. However, collaborating with human KOLs often involves high costs and logistical challenges. To address this, we present GenKOL, an interactive system that empowers marketing professionals to efficiently generate high-quality virtual KOL images using generative AI. GenKOL enables users to dynamically compose promotional visuals through an intuitive interface that integrates multiple AI capabilities, including garment generation, makeup transfer, background synthesis, and hair editing. These capabilities are implemented as modular, interchangeable services that can be deployed flexibly on local machines or in the cloud. This modular architecture ensures adaptability across diverse use cases and computational environments. Our system can significantly streamline the production of branded content, lowering costs and accelerating marketing workflows through scalable virtual KOL creation.
31. Leveraging Reinforcement Learning, Genetic Algorithms and Transformers for background determination in particle physics
Authors: Guillermo Hijano Mendizabal, Davide Lancierini, Alex Marshall, Andrea Mauri, Patrick Haworth Owen, Mitesh Patel, Konstantinos Petridis, Shah Rukh Qasim, Nicola Serra, William Sutcliffe, Hanae Tilquin β€’ Published: 2025-09-18 β€’ Source: arXiv
Experimental studies of beauty hadron decays face significant challenges due to a wide range of backgrounds arising from the numerous possible decay channels with similar final states. For a particular signal decay, the process for ascertaining the most relevant background processes necessitates a detailed analysis of final state particles, potential misidentifications, and kinematic overlaps, which, due to computational limitations, is restricted to the simulation of only the most relevant backgrounds. Moreover, this process typically relies on the physicist's intuition and expertise, as no systematic method exists. This paper has two primary goals. First, from a particle physics perspective, we present a novel approach that utilises Reinforcement Learning (RL) to overcome the aforementioned challenges by systematically determining the critical backgrounds affecting beauty hadron decay measurements. While beauty hadron physics serves as the case study in this work, the proposed strategy is broadly adaptable to other types of particle physics measurements. Second, from a Machine Learning perspective, we introduce a novel algorithm which exploits the synergy between RL and Genetic Algorithms (GAs) for environments with highly sparse rewards and a large trajectory space. This strategy leverages GAs to efficiently explore the trajectory space and identify successful trajectories, which are used to guide the RL agent's training. Our method also incorporates a transformer architecture for the RL agent to handle token sequences representing decays.
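The GA-for-sparse-rewards idea can be illustrated on a toy "combination lock": reward 1 is paid only for one exact action sequence, so fitness gives no gradient until a success occurs, and the GA's crossover and mutation serve mainly to keep the candidate pool diverse until successful trajectories appear, which would then seed the RL agent's training (the environment and all constants are invented):

```python
import random

TARGET = [2, 0, 1, 2]                # hidden successful action sequence

def reward(traj):
    """Sparse reward: 1 only on the exact successful trajectory."""
    return 1.0 if traj == TARGET else 0.0

def ga_find_successes(pop_size=60, gens=40, length=4, n_actions=3):
    random.seed(3)
    pop = [[random.randrange(n_actions) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(gens):
        found = [t for t in pop if reward(t) == 1.0]
        if found:
            return found             # these would seed RL training
        parents = random.sample(pop, pop_size // 2)
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)
            child = a[:cut] + b[cut:]
            if random.random() < 0.3:    # mutation keeps diversity
                child[random.randrange(length)] = random.randrange(n_actions)
            children.append(child)
        pop = children
    return []

hits = ga_find_successes()
print(hits[0] if hits else "no success found")
```

In the paper's setting the trajectory space is far larger, which is why guided exploration and the transformer policy matter; this sketch only shows the division of labor between GA discovery and RL training.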
32. CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Authors: Hanyang Guo, Xunjin Zheng, Zihan Liao, Hang Yu, Peng DI, Ziyin Zhang, Hong-Ning Dai β€’ Published: 2025-09-18 β€’ Source: arXiv
Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This fails to reflect the holistic context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent yet practical CR assistants.
33. RulER: Automated Rule-Based Semantic Error Localization and Repair for Code Translation
Authors: Shuo Jin, Songqiang Chen, Xiaoyuan Xie, Shing-Chi Cheung β€’ Published: 2025-09-18 β€’ Source: arXiv
Automated code translation aims to convert programs between different programming languages while maintaining their functionality. Due to the imperfections of code translation models, the generated translations may contain errors that compromise their reliability. Existing automated debugging methods for code translation rely on code alignments and repair patch templates to locate and fix erroneous translations. However, existing methods lack reliable references to construct code alignments and design repair patch templates, which significantly impacts their localization accuracy and repair effectiveness. To address these limitations, we reintroduce code translation rules and propose a rule-based debugging method for code translation, called RulER. RulER automatically derives code translation rules from correct translations generated by LLMs, enabling the efficient collection of diverse translation rules. In addition, RulER dynamically combines the existing rules on expandable nodes like expressions and tokens to further adaptively align more statements. These rules capture clear and detailed structural correspondences between source and target programming languages. Therefore, they can serve as reliable and reusable references for code alignment and repair template design, enabling RulER to locate and fix translation errors effectively. Our evaluation of RulER on Java-to-C++ and Python-to-C++ translations produced by four code translation models demonstrates that RulER outperforms state-of-the-art methods, BatFix and TransMap. Our experimental results show that RulER outperformed the best baseline by 20% and 272% in terms of error localization rates and repair success rates, respectively. RulER exhibits superior repair performance compared to directly prompting LLMs for patch generation, demonstrating a promising methodology for extracting and leveraging coding knowledge from LLMs.
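A translation rule in the spirit of RulER can be sketched as a (source pattern, target template) pair with capture groups as placeholders (these two rules and the `apply_rules` helper are illustrative guesses, not the paper's learned rules):

```python
import re

# Illustrative rules: a regex over the Java side and a C++ template,
# with capture groups carrying sub-expressions across the rewrite.
RULES = [
    (r"System\.out\.println\((.*)\);", r"std::cout << \1 << std::endl;"),
    (r"(\w+)\.length\(\)", r"\1.size()"),
]

def apply_rules(java_stmt):
    """Rewrite one Java statement into its expected C++ form; diffing
    this against a model's translation would localize the faulty
    statement and suggest a repair template."""
    out = java_stmt
    for pattern, template in RULES:
        out = re.sub(pattern, template, out)
    return out

print(apply_rules("System.out.println(name.length());"))
# → std::cout << name.size() << std::endl;
```

Note how the rules compose on nested expressions, which is the "dynamically combines the existing rules on expandable nodes" idea in miniature.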
34. Statistics makes a difference: Machine learning adsorption dynamics of functionalized cyclooctyne on Si(001) at DFT accuracy
Authors: Hendrik Weiske, Rhyan Barrett, Ralf Tonner-Zech, Patrick Melix, Julia Westermayr β€’ Published: 2025-09-18 β€’ Source: arXiv
The interpretation of experiments on reactive semiconductor surfaces requires statistically significant sampling of molecular dynamics, but conventional ab initio methods are limited due to prohibitive computational costs. Machine-learning interatomic potentials provide a promising solution, bridging the gap between the chemical accuracy of short ab initio molecular dynamics (AIMD) and the extensive sampling required to simulate experiment. Using ethinyl-functionalized cyclooctyne adsorption on Si(001) as a model system, we demonstrate that conventional AIMD undersamples the configurational space, resulting in discrepancies with scanning tunnelling microscopy and X-ray photoelectron spectroscopy data. To resolve these inconsistencies, we employ pre-trained equivariant message-passing neural networks, fine-tuned on only a few thousand AIMD snapshots, and integrate them into a "molecular-gun" workflow. This approach generates 10,000 independent trajectories more than 1,000 times faster than AIMD. These simulations recover rare intermediates, clarify the competition between adsorption motifs, and reproduce the experimentally dominant on-top [2+2] cycloaddition geometry. Our results show that fine-tuning of pre-trained foundational models enables statistically converged, chemically accurate simulations of bond-forming and bond-breaking events on complex surfaces, providing a scalable route to reconcile atomistic theory with experimental ensemble measurements in semiconductor functionalization.
35. STEP: Structured Training and Evaluation Platform for benchmarking trajectory prediction models
Authors: Julian F. Schumann, Anna MΓ©szΓ‘ros, Jens Kober, Arkady Zgonnikov β€’ Published: 2025-09-18 β€’ Source: arXiv
While trajectory prediction plays a critical role in enabling safe and effective path-planning in automated vehicles, standardized practices for evaluating such models remain underdeveloped. Recent efforts have aimed to unify dataset formats and model interfaces for easier comparisons, yet existing frameworks often fall short in supporting heterogeneous traffic scenarios, joint prediction models, or user documentation. In this work, we introduce STEP -- a new benchmarking framework that addresses these limitations by providing a unified interface for multiple datasets, enforcing consistent training and evaluation conditions, and supporting a wide range of prediction models. We demonstrate the capabilities of STEP in a number of experiments which reveal 1) the limitations of widely-used testing procedures, 2) the importance of joint modeling of agents for better predictions of interactions, and 3) the vulnerability of current state-of-the-art models against both distribution shifts and targeted attacks by adversarial agents. With STEP, we aim to shift the focus from the "leaderboard" approach to deeper insights about model behavior and generalization in complex multi-agent settings.
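A representative metric such a benchmark must standardize is minimum Average Displacement Error (minADE) for multi-modal predictors; a minimal sketch with hypothetical trajectories:

```python
import math

def min_ade(predictions, ground_truth):
    """Minimum Average Displacement Error over K predicted modes: the
    ADE (mean Euclidean error per timestep) of the best prediction."""
    def ade(traj):
        return sum(math.dist(p, g) for p, g in zip(traj, ground_truth)) \
               / len(ground_truth)
    return min(ade(t) for t in predictions)

gt = [(0, 0), (1, 0), (2, 0)]
preds = [
    [(0, 1), (1, 1), (2, 1)],  # offset by 1 at every step: ADE = 1.0
    [(0, 0), (1, 0), (2, 1)],  # wrong only at the last step: ADE = 1/3
]
print(round(min_ade(preds, gt), 3))  # → 0.333
```

Seemingly small choices here (best-of-K versus averaging over modes, per-agent versus joint scene error) change model rankings, which is one reason the paper argues for enforced, consistent evaluation conditions.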