1. The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Authors: Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping • Published: 2025-09-11 • Source: arXiv
Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, whereby models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not mitigated by simply scaling model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and to highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
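The compounding effect described above is easy to verify numerically: if each step succeeds independently with per-step accuracy p, the longest task completed with at least 50% reliability scales as log(0.5)/log(p). A minimal sketch (illustrative only, not the paper's code):

```python
import math

def horizon(step_acc: float, target: float = 0.5) -> float:
    """Longest task length completed with probability >= target,
    assuming independent per-step success with probability step_acc."""
    return math.log(target) / math.log(step_acc)

# Marginal single-step gains yield large jumps in achievable length:
for p in (0.99, 0.995, 0.999):
    print(f"step accuracy {p:.3f} -> ~{horizon(p):.0f} steps at 50% task success")
# 0.990 -> ~69 steps; 0.995 -> ~138 steps; 0.999 -> ~693 steps
```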
2. CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
Authors: Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu • Published: 2025-09-11 • Source: arXiv
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
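A minimal PyTorch-style sketch of the two curiosity signals as the abstract describes them; the bonus weights alpha and beta and the additive shaping are assumptions for illustration, not the paper's implementation:

```python
import torch

def actor_curiosity(token_logprobs: torch.Tensor) -> torch.Tensor:
    """Perplexity of the actor's own sampled response:
    exp of the mean negative token log-probability."""
    return torch.exp(-token_logprobs.mean())

def critic_curiosity(value_heads: torch.Tensor) -> torch.Tensor:
    """Disagreement (variance) across a multi-head value ensemble,
    shape (n_heads,) -> scalar."""
    return value_heads.var()

def shaped_reward(reward, token_logprobs, value_heads, alpha=0.1, beta=0.1):
    # alpha and beta are hypothetical bonus weights.
    return reward + alpha * actor_curiosity(token_logprobs) \
                  + beta * critic_curiosity(value_heads)
```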
3. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Authors: Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding • Published: 2025-09-11 • Source: arXiv
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of the large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: can RL similarly improve the long-horizon, step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers patterns unseen in prior training. Github: https://github.com/PRIME-RL/SimpleVLA-RL
4. Geometric Neural Distance Fields for Learning Human Motion Priors
Authors: Zhengdi Yu, Simone Foti, Linguang Zhang, Amy Zhao, Cem Keskin, Stefanos Zafeiriou, Tolga Birdal • Published: 2025-09-11 • Source: arXiv
We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to "roll out" realistic motion trajectories during test-time-optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.
5. 1.8 per cent measurement of $H_0$ from Cepheids alone
Authors: Richard Stiskalek, Harry Desmond, Eleni Tsaprazi, Alan Heavens, Guilhem Lavaux, Stuart McAlpine, Jens Jasche • Published: 2025-09-11 • Source: arXiv
One of the most pressing problems in current cosmology is the cause of the Hubble tension. We revisit a two-rung distance ladder composed only of Cepheid periods and magnitudes, anchor distances in the Milky Way, the Large Magellanic Cloud, and NGC 4258, and host-galaxy redshifts. We adopt the SH0ES data for the most up-to-date and carefully vetted measurements, where the Cepheid hosts were selected to also harbour Type Ia supernovae. We introduce two important improvements: rigorous selection modelling and a state-of-the-art density and peculiar-velocity model using Manticore-Local, based on the Bayesian Origin Reconstruction from Galaxies (BORG) algorithm. We infer $H_0 = 71.7 \pm 1.3\,\mathrm{km}\,\mathrm{s}^{-1}\,\mathrm{Mpc}^{-1}$, assuming the Cepheid host sample was selected by estimated supernova magnitudes. Less plausible selection criteria shift $H_0$ by about one standard deviation. The posterior has a lower central value and a 45 per cent smaller error than a previous study using the same data. The result is also slightly lower than the supernova-based SH0ES value of $H_0 = 73.2 \pm 0.9\,\mathrm{km}\,\mathrm{s}^{-1}\,\mathrm{Mpc}^{-1}$, and is in $3.3\sigma$ tension with the latest standard cosmological model microwave background results. These results demonstrate that a measurement of $H_0$ of sufficient precision to weigh in on the Hubble tension is achievable using second-rung data alone, underscoring the importance of robust and accurate statistical modelling.
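For orientation, the textbook relations behind a two-rung Cepheid ladder, stated schematically here (the paper's actual analysis is a joint Bayesian model with selection effects and the Manticore-Local velocity field):

$$
M = \alpha + \beta\,\log_{10} P \quad \text{(Leavitt law, calibrated in the anchors)},
\qquad
\mu = m - M = 5\log_{10}\!\frac{d}{10\,\mathrm{pc}},
\qquad
H_0 \simeq \frac{c\,z_{\mathrm{cos}}}{d}\ \ (z \ll 1),
$$

with $z_{\mathrm{cos}}$ the host redshift corrected for peculiar velocities, which is where the density and velocity model enters.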
6. Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations
Authors: Zakaria El Kassimi, Fares Fourati, Mohamed-Slim Alouini • Published: 2025-09-11 • Source: arXiv
We study question answering in the domain of radio regulations, a legally sensitive and high-stakes area. We propose a telecom-specific Retrieval-Augmented Generation (RAG) pipeline and introduce, to our knowledge, the first multiple-choice evaluation set for this domain, constructed from authoritative sources using automated filtering and human validation. To assess retrieval quality, we define a domain-specific retrieval metric, under which our retriever achieves approximately 97% accuracy. Beyond retrieval, our approach consistently improves generation accuracy across all tested models. In particular, while naively inserting documents without structured retrieval yields only marginal gains for GPT-4o (less than 1%), applying our pipeline results in nearly a 12% relative improvement. These findings demonstrate that carefully targeted grounding provides a simple yet strong baseline and an effective domain-specific solution for regulatory question answering. All code and evaluation scripts, along with our derived question-answer dataset, are available at https://github.com/Zakaria010/Radio-RAG.
7. All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens
Authors: Siddarth Mamidanna, Daking Rai, Ziyu Yao, Yilun Zhou • Published: 2025-09-11 • Source: arXiv
Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information from other tokens in a few specific middle layers. Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations of different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.
8. Reconstructing the origin of black hole mergers using sparse astrophysical models
Authors: V. Gayathri, Giuliano Iorio, Hiromichi Tagawa, Daniel Wysocki, Jeremiah Anglin, Imre Bartos, Shubhagata Bhaumik, Zoltán Haiman, Michela Mapelli, R. O'Shaughnessy, LingQin Xue • Published: 2025-09-11 • Source: arXiv
The astrophysical origin of binary black hole mergers discovered by LIGO and Virgo remains uncertain. Efforts to reconstruct the processes that lead to mergers typically rely either on astrophysical models with fixed parameters, or on continuous analytical models that can be fit to observations. Given the complexity of astrophysical formation mechanisms, these methods typically cannot fully take into account model uncertainties, nor can they fully capture the underlying processes. Here, we present a merger population analysis that can take a discrete set of simulated model distributions as its input to interpret observations. The analysis can take into account multiple formation scenarios as fractional contributors to the total set of observations, and can naturally account for model uncertainties. We apply this technique to investigate the origin of black hole mergers observed by LIGO-Virgo. Specifically, we consider a model of AGN-assisted black hole merger distributions, exploring a range of AGN parameters along with several SEVN population synthesis models that vary in common-envelope efficiency parameter ($\alpha$) and metallicity ($Z$). We estimate the posterior distributions for AGN+SEVN models using 87 BBH detections from the O1-O3 observing runs. The inferred total merger rate is $46.2\,\mathrm{Gpc}^{-3}\,\mathrm{yr}^{-1}$, with the AGN sub-population contributing $21.2\,\mathrm{Gpc}^{-3}\,\mathrm{yr}^{-1}$ and the SEVN sub-population contributing $25.0\,\mathrm{Gpc}^{-3}\,\mathrm{yr}^{-1}$.
9. Work statistics of sudden quantum quenches: A random matrix theory perspective on Gaussianity and its deviations
Authors: Miguel Tierz • Published: 2025-09-11 • Source: arXiv
We show that, for sudden quenches, the work distribution reduces to the statistics of traces of powers of Haar unitaries, which are random unitary matrices drawn uniformly from the unitary group. For translation-invariant quadratic fermionic chains with interactions extending to $m$ neighbors and periodic boundary conditions, the Loschmidt amplitude admits a unitary matrix-model / Toeplitz representation, which yields a work variable of the form $W=\sum_{r\le m} a_r\,\mathrm{Re}\,\mathrm{Tr}\,U^r$ (in models with superconducting pairing terms, additional $b_r\,\mathrm{Im}\,\mathrm{Tr}\,U^r$ terms appear). By invoking multivariate central limit theorems for vectors of traces of unitaries, we obtain a Gaussian distribution for $P(W)$ with variance $\mathrm{Var}(W)=\frac{1}{2}\sum_r r\,(a_r^2+b_r^2)$ and asymptotic independence across different powers. We also characterise the conditions under which non-Gaussian tails arise, for example from many interaction terms or their slow decay, as well as the appearance of Fisher-Hartwig singularities. We illustrate these mechanisms in the XY chain. Various numerical diagnostics support the analytical results.
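The variance formula can be checked directly by sampling Haar unitaries, e.g. with scipy; the couplings a_r below are illustrative and not tied to any particular chain:

```python
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(0)
a = {1: 0.7, 2: 0.4, 3: 0.2}      # illustrative couplings a_r
N, n_samples = 64, 2000           # matrix size, Monte Carlo draws

samples = []
for _ in range(n_samples):
    U = unitary_group.rvs(N, random_state=rng)
    W = sum(a_r * np.trace(np.linalg.matrix_power(U, r)).real
            for r, a_r in a.items())
    samples.append(W)

# Predicted Var(W) = (1/2) * sum_r r * a_r^2 (no pairing terms here)
var_pred = 0.5 * sum(r * a_r**2 for r, a_r in a.items())
print(f"sample variance {np.var(samples):.3f} vs predicted {var_pred:.3f}")
```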
10. CryptoGuard: An AI-Based Cryptojacking Detection Dashboard Prototype
Authors: Amitabh Chakravorty, Jess Kropczynski, Nelly Elsayed • Published: 2025-09-11 • Source: arXiv
With the widespread adoption of cryptocurrencies, cryptojacking has become a significant security threat to crypto wallet users. This paper presents a front-end prototype of an AI-powered security dashboard, namely, CryptoGuard. Developed through a user-centered design process, the prototype was constructed as a high-fidelity, click-through model from Figma mockups to simulate key user interactions. It is designed to assist users in monitoring their login and transaction activity, identifying any suspicious behavior, and enabling them to take action directly within the wallet interface. The dashboard is designed for a general audience, prioritizing an intuitive user experience for non-technical individuals. Although its AI functionality is conceptual, the prototype demonstrates features like visual alerts and reporting. This work is positioned explicitly as a design concept, bridging cryptojacking detection research with human-centered interface design. This paper also demonstrates how usability heuristics can directly inform a tool's ability to support rapid and confident decision-making under real-world threats. This paper argues that practical security tools require not only robust backend functionality but also a user-centric design that communicates risk and empowers users to take meaningful action.
11. A neural drift-plus-penalty algorithm for network power allocation and routing
Authors: Ahmed Rashwan, Keith Briggs, Chris Budd • Published: 2025-09-11 • Source: arXiv
The drift-plus-penalty method is a Lyapunov optimisation technique commonly applied to network routing problems. It reduces the original stochastic planning task to a sequence of greedy optimizations, enabling the design of distributed routing algorithms which stabilize data queues while simultaneously optimizing a specified penalty function. While drift-plus-penalty methods have desirable asymptotic properties, they tend to incur higher network delay than alternative control methods, especially under light network load. In this work, we propose a learned variant of the drift-plus-penalty method that can preserve its theoretical guarantees, while being flexible enough to learn routing strategies directly from a model of the problem. Our approach introduces a novel mechanism for learning routing decisions and employs an optimal transport-based method for link scheduling. Applied to the joint task of transmit-power allocation and data routing, the method achieves consistent improvements over common baselines under a broad set of scenarios.
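For context, the classical drift-plus-penalty rule that the learned variant builds on greedily minimises, at each time slot, a bound on the Lyapunov drift plus a weighted penalty (the standard Lyapunov-optimisation form, not the paper's exact formulation):

$$
\min_{\text{action}}\; \Delta(t) + V\,\mathbb{E}\!\left[p(t)\,\middle|\,\mathbf{Q}(t)\right],
\qquad
\Delta(t) = \mathbb{E}\!\left[L(t{+}1) - L(t)\,\middle|\,\mathbf{Q}(t)\right],
\qquad
L(t) = \tfrac{1}{2}\sum_i Q_i(t)^2,
$$

where $Q_i(t)$ are queue backlogs, $p(t)$ is the penalty (e.g., transmit power), and the weight $V$ trades queue stability against penalty; the greedy, backlog-driven nature of this rule is one source of the delay the learned variant aims to reduce.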
12. I Know Who Clones Your Code: Interpretable Smart Contract Similarity Detection
Authors: Zhenguang Liu, Lixun Ma, Zhongzheng Mu, Chengkun Wei, Xiaojun Xu, Yingying Jiao, Kui Ren • Published: 2025-09-11 • Source: arXiv
Widespread reuse of open-source code in smart contract development boosts programming efficiency but significantly amplifies bug propagation across contracts, while dedicated methods for detecting similar smart contract functions remain very limited. Conventional abstract-syntax-tree (AST) based methods for smart contract similarity detection face challenges in handling intricate tree structures, which impedes detailed semantic comparison of code. Recent deep-learning based approaches tend to overlook code syntax and detection interpretability, resulting in suboptimal performance. To fill this research gap, we introduce SmartDetector, a novel approach for computing similarity between smart contract functions, explainable at the fine-grained statement level. Technically, SmartDetector decomposes the AST of a smart contract function into a series of smaller statement trees, each reflecting a structural element of the source code. Then, SmartDetector uses a classifier to compute the similarity score of two functions by comparing each pair of their statement trees. To address the infinite hyperparameter space of the classifier, we mathematically derive a cosine-wise diffusion process to efficiently search optimal hyperparameters. Extensive experiments conducted on three large real-world datasets demonstrate that SmartDetector outperforms current state-of-the-art methods by an average improvement of 14.01% in F1-score, achieving an overall average F1-score of 95.88%.
13. Detection of a Deeply Embedded Protocluster Candidate in NGC 602 with JWST
Authors: Beena Meena, Peter Zeidler, Elena Sabbi, Antonella Nota, Camilla Pacifici, Olivia C. Jones • Published: 2025-09-11 • Source: arXiv
JWST NIRCam and MIRI photometry of NGC 602, a low-metallicity young star cluster in the Small Magellanic Cloud, reveals an extended mid-infrared bright emission feature designated as MZS-1. This feature is prominent between 10 and 25.5 microns, but is extremely faint at 7.7 microns and entirely undetected at shorter wavelengths. MZS-1 exhibits an elliptical morphology with a major axis of approximately 8 arcseconds and a minor axis of about 4 arcseconds. Its elongated shape and multiple emission peaks in the two-dimensional flux map suggest a group of deeply embedded sources with blackbody-like temperatures ranging from 100 K to 140 K. SED fitting using the Robitaille 2017 model grids identifies these sources as Stage I young stellar objects (YSOs) with masses below approximately 3 solar masses and a total stellar mass of the protocluster of about 300 solar masses (based on a Salpeter IMF). The low YSO masses are consistent with their absence in Spitzer-based catalogs due to sensitivity limits. By revealing a deeply embedded, low-mass protocluster invisible in previous surveys, this work highlights JWST's unparalleled resolution and sensitivity in uncovering the earliest stages of low-mass cluster formation in the metal-poor regime.
14. Unsteady gas dynamics modeling for leakage detection in parallel pipelines
Authors: Ilgar G. Aliyev, Konul Gafarbayli, Ahad Mammadov, Firangiz Mammadrazayeva • Published: 2025-09-11 • Source: arXiv
This study presents a novel analytical framework for modeling unsteady gas dynamics in parallel pipeline systems under leakage conditions. The proposed method introduces a time-dependent leakage mass flow rate function, which dynamically captures the temporal decay of leakage based on real-time inlet pressure measurements. This functional form allows for a more physically consistent and mathematically tractable representation of gas loss compared to conventional constant-rate or stepwise models. The pipeline system is partitioned into three regions relative to the leakage point, and closed-form pressure solutions are derived using Laplace transform techniques. These expressions enable direct estimation of the leakage location through inverse pressure profiles, eliminating the need for computationally intensive iterative schemes. The analytical model is further validated against representative benchmark scenarios, demonstrating good agreement with literature-based results. A comparative analysis underscores the model's ability to localize leakage using minimal sensor data while preserving interpretability, an essential feature for deployment in industrial environments. The approach provides a lightweight yet robust alternative to purely numerical or machine learning-based solutions and offers potential integration into real-time monitoring systems. This work contributes to the field by unifying gas dynamic principles, sensor-assisted modeling, and analytical solution strategies to enhance the reliability and speed of leak detection in modern gas transport infrastructures.
15. Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth
Authors: Daria Laslo, Efthymios Georgiou, Marius George Linguraru, Andreas Rauschecker, Sabine Muller, Catherine R. Jutzeler, Sarah Bruningk • Published: 2025-09-11 • Source: arXiv
Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. The mechanistic model, formulated as a system of ordinary differential equations, captures temporal tumor dynamics including radiotherapy effects and estimates future tumor burden. These estimates condition a gradient-guided DDIM, enabling image synthesis that aligns with both predicted growth and patient anatomy. We train our model on the BraTS adult and pediatric glioma datasets and evaluate on 60 axial slices of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our framework generates realistic follow-up scans based on spatial similarity metrics. It also introduces tumor growth probability maps, which capture both clinically relevant extent and directionality of tumor growth as shown by 95th percentile Hausdorff Distance. The method enables biologically informed image generation in data-limited scenarios, offering generative-space-time predictions that account for mechanistic priors.
16. Programmable 200 GOPS Hopfield-inspired photonic Ising machine
Authors: Nayem AL-Kayed, Charles St-Arnault, Hugh Morison, A. Aadhi, Chaoran Huang, Alexander N. Tait, David V. Plant, Bhavin J. Shastri • Published: 2025-09-11 • Source: arXiv
Ising machines offer a compelling approach to addressing NP-hard problems, but physical realizations that are simultaneously scalable, reconfigurable, fast, and stable remain elusive. Quantum annealers, like D-Wave's cryogenic hardware, target combinatorial optimization tasks, but quadratic scaling of qubit requirements with problem size limits their scalability on dense graphs. Here, we introduce a programmable, stable, room-temperature optoelectronic oscillator (OEO)-based Ising machine with linear scaling in spin representation. Inspired by Hopfield networks, our architecture solves fully-connected problems with up to 256 spins (65,536 couplings), and >41,000 spins (205,000+ couplings) if sparse. Our system leverages cascaded thin-film lithium niobate modulators, a semiconductor optical amplifier, and a digital signal processing (DSP) engine in a recurrent time-encoded loop, demonstrating potential >200 giga-operations per second for spin coupling and nonlinearity. This platform achieves the largest spin configuration in an OEO-based photonic Ising machine, enabled by high intrinsic speed. We experimentally demonstrate best-in-class solution quality for Max-Cut problems of arbitrary graph topologies (2,000 and 20,000 spins) among photonic Ising machines and obtain ground-state solutions for number partitioning and lattice protein folding, benchmarks previously unaddressed by photonic systems. Our system leverages inherent noise from high baud rates to escape local minima and accelerate convergence. Finally, we show that embedding DSP, traditionally used in optical communications, within optical computation enhances convergence and solution quality, opening new frontiers in scalable, ultrafast computing for optimization, neuromorphic processing, and analog AI.
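As a purely digital caricature of the dynamics such a machine implements, a noisy asynchronous Hopfield descent on the Ising energy looks as follows; this sketches the algorithmic idea only, not the OEO hardware:

```python
import numpy as np

def ising_energy(J, s):
    """H(s) = -1/2 * s^T J s for spins s in {-1, +1} (zero-diagonal J)."""
    return -0.5 * s @ J @ s

def noisy_hopfield(J, steps=20000, noise=0.1, seed=0):
    """Asynchronous Hopfield updates with injected noise, a digital
    caricature of the OEO loop's dynamics (illustrative only)."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1, 1], size=J.shape[0])
    for _ in range(steps):
        i = rng.integers(J.shape[0])
        s[i] = 1 if J[i] @ s + noise * rng.standard_normal() >= 0 else -1
    return s, ising_energy(J, s)

# For Max-Cut on a weight matrix w (zero diagonal), set J = -w:
# minimizing H then maximizes the total weight of edges cut by the partition.
```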
17. Human-in-the-loop Learning Through Decentralized Communication Mechanisms
Authors: Yiting Hu, Lingjie Duan • Published: 2025-09-11 • Source: arXiv
Information sharing platforms like TripAdvisor and Waze involve human agents as both information producers and consumers. All these platforms operate in a centralized way, collecting agents' latest observations of new options (e.g., restaurants, hotels, travel routes) and sharing such information with all agents in real time. However, after hearing the central platforms' live updates, many human agents prove selfish and unwilling to further explore unknown options for the benefit of others in the long run. To regulate the human-in-the-loop learning (HILL) game against selfish agents' free-riding, this paper proposes a paradigm shift from a centralized to a decentralized way of operation that forces agents' local exploration by restricting information sharing. Where game theory meets distributed learning, we formulate the design of our decentralized communication mechanism as a new multi-agent Markov decision process (MA-MDP) and derive an analytical condition under which it outperforms today's centralized operation. As the optimal decentralized communication mechanism in the MA-MDP is NP-hard to solve, we present an asymptotically optimal algorithm with linear complexity to determine the mechanism's timing of intermittent information sharing. We then turn to non-myopic agents, who may instead over-explore, and adapt our mechanism design accordingly. Simulation experiments using a real-world dataset demonstrate the effectiveness of our decentralized mechanisms across various scenarios.
18. Explainable AI for Accelerated Microstructure Imaging: A SHAP-Guided Protocol on the Connectome 2.0 scanner
Authors: Quentin Uhl, Tommaso Pavan, Julianna Gerold, Kwok-Shing Chan, Yohan Jun, Shohei Fujita, Aneri Bhatt, Yixin Ma, Qiaochu Wang, Hong-Hsi Lee, Susie Y. Huang, Berkin Bilgic, Ileana Jelescu • Published: 2025-09-11 • Source: arXiv
The diffusion MRI Neurite Exchange Imaging model offers a promising framework for probing gray matter microstructure by estimating parameters such as compartment sizes, diffusivities, and inter-compartmental water exchange time. However, existing protocols require long scan times. This study proposes a reduced acquisition scheme for the Connectome 2.0 scanner that preserves model accuracy while substantially shortening scan duration. We developed a data-driven framework using explainable artificial intelligence with a guided recursive feature elimination strategy to identify an optimal 8-feature subset from a 15-feature protocol. The performance of this optimized protocol was validated in vivo and benchmarked against the full acquisition and alternative reduction strategies. Parameter accuracy, preservation of anatomical contrast, and test-retest reproducibility were assessed. The reduced protocol yielded parameter estimates and cortical maps comparable to the full protocol, with low estimation errors in synthetic data and minimal impact on test-retest variability. Compared to theory-driven and heuristic reduction schemes, the optimized protocol demonstrated superior robustness, reducing the deviation in water exchange time estimates by over two-fold. In conclusion, this hybrid optimization framework enables viable imaging of neurite exchange in 14 minutes without loss of parameter fidelity. This approach supports the broader application of exchange-sensitive diffusion magnetic resonance imaging in neuroscience and clinical research, and offers a generalizable method for designing efficient acquisition protocols in biophysical parameter mapping.
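A sketch of what SHAP-guided recursive feature elimination can look like, assuming a tree-based surrogate model and the shap library; the dataset, model choice, and scoring details are hypothetical stand-ins for the paper's framework:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

def shap_guided_rfe(X, y, feature_names, keep=8):
    """Iteratively drop the feature with the smallest mean |SHAP| value
    until `keep` features remain (a sketch of the guided-RFE idea; the
    paper's surrogate model and selection criterion may differ)."""
    active = list(range(X.shape[1]))
    while len(active) > keep:
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X[:, active], y)
        sv = shap.TreeExplainer(model).shap_values(X[:, active])
        importance = np.abs(sv).mean(axis=0)   # mean |SHAP| per feature
        active.pop(int(np.argmin(importance)))
    return [feature_names[i] for i in active]
```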
19. Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
Authors: Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, Aaron Zhao • Published: 2025-09-11 • Source: arXiv
LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference -- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories. This, in turn, generates significant off-chip memory traffic for the underlying hardware at the inference stage and causes the workload to be constrained by two memory walls, namely the bandwidth and capacity memory walls, preventing the on-chip compute units from achieving high utilization. In this paper, we introduce PLENA, a hardware-software co-designed system that applies three core optimization pathways to tackle these challenges. PLENA includes an efficient hardware implementation of compute and memory units supporting an asymmetric quantization scheme. PLENA also features a novel flattened systolic array architecture that has native support for FlashAttention to tackle these memory walls in the scenario of inference serving for long-context LLMs. Additionally, PLENA is developed with a complete stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an automated design space exploration flow. The simulated results show that PLENA achieves up to 8.5x higher utilization than existing accelerators, and delivers 2.24x higher throughput than the A100 GPU and 3.85x higher throughput than the TPU v6e, under the same multiplier count and memory settings. The full PLENA system will also be open-sourced.
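Back-of-envelope KV-cache arithmetic illustrates the capacity wall for long agentic contexts; the model configuration below is an illustrative 70B-class setup with grouped-query attention, not taken from the paper:

```python
# KV-cache size per sequence: 2 (K and V) * layers * kv_heads
# * head_dim * seq_len * bytes_per_element.
layers, kv_heads, head_dim = 80, 8, 128   # illustrative 70B-class config (GQA)
seq_len, dtype_bytes = 128_000, 2         # long agent context, fp16
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes
print(f"KV cache per sequence: {kv_bytes / 1e9:.1f} GB")  # ~41.9 GB
```

At roughly 42 GB per sequence before weights or activations, both the capacity wall and the bandwidth wall (every decoded token must stream this cache from off-chip memory) are immediate.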
20. OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection
Authors: Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany • Published: 2025-09-11 • Source: arXiv
Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, which restricts their effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically focused dataset specifically crafted for benchmarking detection against modern generative models. This dataset contains three million real images paired with descriptive captions, which are used to generate 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.
21. Evaluating Quantum Amplitude Estimation for Pricing Multi-Asset Basket Options
Authors: Muhammad Kashif, Shaf Khalid, Nouhaila Innan, Alberto Marchisio, Muhammad Shafique • Published: 2025-09-11 • Source: arXiv
Accurate and efficient pricing of multi-asset basket options poses a significant challenge, especially when dealing with complex real-world data. In this work, we investigate the role of quantum-enhanced uncertainty modeling in financial pricing options on real-world data. Specifically, we use quantum amplitude estimation and analyze the impact of varying the number of uncertainty qubits while keeping the number of assets fixed, as well as the impact of varying the number of assets while keeping the number of uncertainty qubits fixed. To provide a comprehensive evaluation, we establish and validate a hybrid quantum-classical comparison framework, benchmarking quantum approaches against classical Monte Carlo simulations and Black-Scholes methods. Beyond simply computing option prices, we emphasize the trade-off between accuracy and computational resources, offering insights into the potential advantages and limitations of quantum approaches for different problem scales. Our results contribute to understanding the feasibility of quantum methods in finance and guide the optimal allocation of quantum resources in hybrid quantum-classical workflows.
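The classical side of such a benchmark is a plain Monte Carlo basket-option pricer like the sketch below (all parameters illustrative). The motivation for quantum amplitude estimation is its quadratic advantage: estimation error scales as $O(1/M)$ in the number of quantum samples versus $O(1/\sqrt{M})$ for classical Monte Carlo.

```python
import numpy as np

def mc_basket_call(S0, K, r, sigma, corr, T, n_paths=100_000, seed=0):
    """Classical Monte Carlo price of a European basket call under
    correlated geometric Brownian motion (the benchmark side of a
    hybrid quantum-classical comparison; parameters are illustrative)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)
    Z = rng.standard_normal((n_paths, len(S0))) @ L.T   # correlated normals
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    payoff = np.maximum(ST.mean(axis=1) - K, 0.0)       # equal-weight basket
    return np.exp(-r * T) * payoff.mean()

price = mc_basket_call(S0=np.array([100.0, 100.0]), K=100.0, r=0.05,
                       sigma=np.array([0.20, 0.25]),
                       corr=np.array([[1.0, 0.3], [0.3, 1.0]]), T=1.0)
print(f"MC basket call price: {price:.2f}")
```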
22. MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
Authors: Channdeth Sok, David Luz, Yacine Haddam • Published: 2025-09-11 • Source: arXiv
Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG's span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
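A skeleton of the four-stage scoring loop, with the decomposition, mutation, and entailment components left as injectable stubs, since the paper's concrete models are not specified in the abstract:

```python
from typing import Callable, List, Tuple

def metarag_score(answer: str, context: str,
                  decompose: Callable[[str], List[str]],
                  mutate: Callable[[str], List[Tuple[str, bool]]],
                  entails: Callable[[str, str], bool],
                  penalty: float = 1.0) -> float:
    """Skeleton of the four MetaRAG stages (components stubbed; the
    paper's actual decomposition, mutation, and NLI models differ).
    mutate(factoid) yields (variant, expected) pairs: expected is True
    for synonym variants (should be entailed by the retrieved context)
    and False for antonym variants (should be contradicted)."""
    score = 0.0
    for factoid in decompose(answer):                  # stage 1: factoids
        for variant, expected in mutate(factoid):      # stage 2: mutations
            if entails(context, variant) != expected:  # stage 3: verify
                score += penalty                       # stage 4: aggregate
    return score  # higher = more evidence of hallucination
```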
23. Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
Authors: Hangyi Jia, Yuxi Qian, Hanwen Tong, Xinhui Wu, Lin Chen, Feng Wei • Published: 2025-09-11 • Source: arXiv
Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents for automating end-to-end machine learning (ML) workflows, including data analysis, feature engineering, model training, and competition solving. However, existing benchmarks remain limited in task coverage, domain diversity, difficulty modeling, and evaluation rigor, failing to capture the full capabilities of such agents in realistic settings. We present TAM Bench, a diverse, realistic, and structured benchmark for evaluating LLM-based agents on end-to-end ML tasks. TAM Bench features three key innovations: (1) A browser automation and LLM-based task acquisition system that automatically collects and structures ML challenges from platforms such as Kaggle, AIcrowd, and Biendata, spanning multiple task types and data modalities (e.g., tabular, text, image, graph, audio); (2) A leaderboard-driven difficulty modeling mechanism that estimates task complexity using participant counts and score dispersion, enabling scalable and objective task calibration; (3) A multi-dimensional evaluation framework incorporating performance, format compliance, constraint adherence, and task generalization. Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes -- Lite, Medium, and Full -- designed for varying evaluation scenarios. The Lite version, with 18 tasks and balanced coverage across modalities and difficulty levels, serves as a practical testbed for daily benchmarking and comparative studies.
24. Proactive AI Adoption can be Threatening: When Help Backfires
Authors: Dana Harari, Ofra Amir • Published: 2025-09-11 • Source: arXiv
Artificial intelligence (AI) assistants are increasingly embedded in workplace tools, raising the question of how initiative-taking shapes adoption. Prior work highlights trust and expectation mismatches as barriers, but the underlying psychological mechanisms remain unclear. Drawing on self-affirmation and social exchange theories, we theorize that unsolicited help elicits self-threat, reducing willingness to accept assistance, likelihood of future use, and performance expectancy. We report two vignette-based experiments (Study 1: $N=761$; Study 2: $N=571$, preregistered). Study 1 compared anticipatory and reactive help provided by an AI vs. a human, while Study 2 distinguished between offering help (suggesting it) and providing it (acting automatically). In Study 1, AI help was perceived as more threatening than human help. Across both studies, anticipatory help increased perceived threat and reduced adoption outcomes. Our findings identify self-threat as a mechanism explaining why proactive AI features may backfire and suggest design implications for AI initiative.
25. Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
Authors: Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang • Published: 2025-09-11 • Source: arXiv
Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain knowledge. MatCha encompasses four key stages of materials research, spanning 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap relative to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.
26. Altered Histories in Version Control System Repositories: Evidence from the Trenches
Authors: Solal Rapaport, Laurent Pautet, Samuel Tardieu, Stefano Zacchiroli • Published: 2025-09-11 • Source: arXiv
Version Control Systems (VCS) like Git allow developers to locally rewrite recorded history, e.g., to reorder and suppress commits or specific data in them. These alterations have legitimate use cases, but become problematic when performed on public branches that have downstream users: they break push/pull workflows, challenge the integrity and reproducibility of repositories, and create opportunities for supply chain attackers to sneak nefarious changes into them. We conduct the first large-scale investigation of Git history alterations in public code repositories. We analyze 111 million repositories archived by Software Heritage, which preserves VCS histories even across alterations. We find history alterations in 1.22 million repositories, for a total of 8.7 million rewritten histories. We categorize changes by where they happen (which repositories, which branches) and what is changed in them (files or commit metadata). Through two targeted case studies we show that altered histories recurrently change licenses retroactively, or are used to remove "secrets" (e.g., private keys) committed by mistake. As these behaviors correspond to bad practices (in terms of project governance or security management, respectively) that software recipients might want to avoid, we introduce GitHistorian, an automated tool that developers can use to spot and describe history alterations in public Git repositories.
27. Data-Driven Discovery of Emergent Dynamics in Reaction-Diffusion Systems from Sparse and Noisy Observations
Authors: Saumitra Dwivedi, Ricardo da Silva Torres, Ibrahim A. Hameed, Gunnar Tufte, Anniken Susanne T. Karlsen • Published: 2025-09-11 • Source: arXiv
Data-driven discovery of emergent dynamics is gaining popularity, particularly in the context of reaction-diffusion systems. These systems are widely studied across various fields, including neuroscience, ecology, epidemiology, and several other subject areas that deal with emergent dynamics. A current challenge in the discovery process relates to system identification when there is no prior knowledge of the underlying physics. We attempt to address this challenge by learning Soft Artificial Life (Soft ALife) models, such as Agent-based and Cellular Automata (CA) models, from observed data for reaction-diffusion systems. In this paper, we present findings on the applicability of a conceptual framework, the Data-driven Rulesets for Soft Artificial Life (DRSALife) model, to learn Soft ALife rulesets that accurately represent emergent dynamics in a reaction-diffusion system from observed data. This model has demonstrated promising results for Elementary CA Rule 30, Game of Life, and Vicsek Flocking problems in recent work. To our knowledge, this is one of the few studies that explore machine-based Soft ALife ruleset learning and system identification for reaction-diffusion dynamics without any prior knowledge of the underlying physics. Moreover, we provide comprehensive findings from experiments investigating the potential effects of using noisy and sparse observed datasets on learning emergent dynamics. Additionally, we successfully identify the structure and parameters of the underlying partial differential equations (PDEs) representing these dynamics. Experimental results demonstrate that the learned models are able to predict the emergent dynamics with good accuracy (74%) and exhibit quite robust performance when subjected to Gaussian noise and temporal sparsity.
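Game of Life, one of the benchmark rulesets mentioned above, is a concrete example of the kind of local CA update rule that DRSALife-style methods aim to recover from data (standard implementation, unrelated to the paper's learned model):

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One synchronous Game of Life update on a toroidal grid: a cell
    is alive next step iff it has 3 live neighbors, or is alive with 2."""
    n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(grid.dtype)
```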
28. Fusing Knowledge and Language: A Comparative Study of Knowledge Graph-Based Question Answering with LLMs
Authors: Vaibhav Chaudhary, Neha Soni, Narotam Singh, Amita Kapoor • Published: 2025-09-11 • Source: arXiv
Knowledge graphs, a powerful tool for structuring information through relational triplets, have recently become the new front-runner in enhancing question-answering systems. While traditional Retrieval Augmented Generation (RAG) approaches are proficient in fact-based and local context-based extraction from concise texts, they encounter limitations when addressing the thematic and holistic understanding of complex, extensive texts, requiring a deeper analysis of both text and context. This paper presents a comprehensive technical comparative study of three different methodologies for constructing knowledge graph triplets and integrating them with Large Language Models (LLMs) for question answering: spaCy, Stanford CoreNLP-OpenIE, and GraphRAG, all leveraging open source technologies. We evaluate the effectiveness, feasibility, and adaptability of these methods by analyzing their capabilities, state of development, and their impact on the performance of LLM-based question answering. Experimental results indicate that while OpenIE provides the most comprehensive coverage of triplets, GraphRAG demonstrates superior reasoning abilities among the three. We conclude with a discussion on the strengths and limitations of each method and provide insights into future directions for improving knowledge graph-based question answering.
29. The role of communication delays in the optimal control of spatially invariant systems
Authors: Luca Ballotta, Juncal Arbelaiz, Vijay Gupta, Luca Schenato, Mihailo R. Jovanović • Published: 2025-09-11 • Source: arXiv
We study optimal proportional feedback controllers for spatially invariant systems when the controller has access to delayed state measurements received from different spatial locations. We analyze how delays affect the spatial locality of the optimal feedback gain leveraging the problem decoupling in the spatial frequency domain. For the cases of expensive control and small delay, we provide exact expressions of the optimal controllers in the limit for infinite control weight and vanishing delay, respectively. In the expensive control regime, the optimal feedback control law decomposes into a delay-aware filtering of the delayed state and the optimal controller in the delay-free setting. Under small delays, the optimal controller is a perturbation of the delay-free one which depends linearly on the delay. We illustrate our analytical findings with a reaction-diffusion process over the real line and a multi-agent system coupled through circulant matrices, showing that delays reduce the effectiveness of optimal feedback control and may require each subsystem within a distributed implementation to communicate with farther-away locations.
30. Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Authors: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang • Published: 2025-09-11 • Source: arXiv
In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, leading to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed-seed.github.io/
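An illustrative re-weighting consistent with the behaviour the abstract describes; the exponential modulation and outcome-based sign below are assumptions, not EMPG's exact functional form:

```python
import torch

def empg_weights(step_entropy: torch.Tensor, correct: torch.Tensor,
                 k: float = 1.0) -> torch.Tensor:
    """Illustrative EMPG-style re-calibration (assumed exponential form):
    amplify confident (low-entropy) correct steps, penalize confident
    errors, and attenuate high-entropy (uncertain) steps."""
    confidence = torch.exp(-k * step_entropy)  # ~1 at low entropy, ~0 at high
    sign = correct.float() * 2.0 - 1.0         # +1 if correct, -1 if not
    return sign * confidence

# Schematic use in a policy-gradient step:
# loss = -(empg_weights(entropies, successes) * advantages * logprobs).sum()
```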
31. Improved Riemannian Potato Field: An Automatic Artifact Rejection Method for EEG
Authors: Davoud Hajhassani, Quentin Barthélemy, Jérémie Mattout, Marco Congedo • Published: 2025-09-11 • Source: arXiv
Electroencephalography (EEG) signal cleaning has long been a critical challenge in the research community. The presence of artifacts can significantly degrade EEG data quality, complicating analysis and potentially leading to erroneous interpretations. While various artifact rejection methods have been proposed, the gold standard remains manual visual inspection by human experts, a process that is time-consuming, subjective, and impractical for large-scale EEG studies. Existing techniques are often hindered by a strong reliance on manual hyperparameter tuning, sensitivity to outliers, and high computational costs. In this paper, we introduce the improved Riemannian Potato Field (iRPF), a fast and fully automated method for EEG artifact rejection that addresses key limitations of current approaches. We evaluate iRPF against several state-of-the-art artifact rejection methods on two publicly available EEG databases, labeled for various artifact types and comprising 226 EEG recordings. Our results demonstrate that iRPF outperforms all competitors across multiple metrics, with gains of up to 22% in recall, 102% in specificity, 54% in precision, and 24% in F1-score over Isolation Forest, Autoreject, Riemannian Potato, and Riemannian Potato Field, respectively. Statistical analysis confirmed the significance of these improvements (p < 0.001), with large effect sizes (Cohen's d > 0.8) in most comparisons. Additionally, on a typical EEG recording, iRPF performs artifact cleaning in under 8 milliseconds per epoch on a standard laptop, highlighting its efficiency for large-scale EEG data processing and real-time applications. iRPF offers a robust and data-driven artifact rejection solution for high-quality EEG pre-processing in brain-computer interfaces and clinical neuroimaging applications.
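For reference, the classical Riemannian potato that iRPF improves upon flags an epoch whose covariance matrix lies too far, in affine-invariant Riemannian distance, from a reference matrix; a minimal numpy/scipy sketch:

```python
import numpy as np
from scipy.linalg import eigh

def riemann_dist(C1, C2):
    """Affine-invariant Riemannian distance between SPD matrices:
    sqrt(sum(log(lambda_i)^2)) over generalized eigenvalues of (C2, C1)."""
    w = eigh(C2, C1, eigvals_only=True)
    return np.sqrt(np.sum(np.log(w) ** 2))

def potato_zscore(C, C_ref, mu, sigma):
    """Classical Riemannian-potato test (the baseline iRPF builds on):
    mu and sigma are the calibrated mean and spread of distances; an
    epoch is typically rejected when z exceeds a threshold such as 2.5."""
    return (riemann_dist(C, C_ref) - mu) / sigma
```

The field variant generalizes this by combining many such potatoes over channel subsets and frequency bands; iRPF's specific improvements are described in the paper.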
32. Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Authors: Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung • Published: 2025-09-11 • Source: arXiv
Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.
33. Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search
Authors: Shuocheng Li, Yihao Liu, Silin Du, Wenxuan Zeng, Zhe Xu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Dongmei Zhang • Published: 2025-09-11 • Source: arXiv
Large language models (LLMs) have shown great promise in automating data science workflows, but existing models still struggle with multi-step reasoning and tool use, which limits their effectiveness on complex data analysis tasks. To address this, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task-solution pairs that reflect authentic tool-use patterns in practical data science scenarios. To further enhance multi-step reasoning, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps. Experimental results show that Qwen2.5-7B and 14B-Instruct models trained on NbQA solve 77.82% and 86.38% of tasks on InfiAgent-DABench, respectively, matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks.
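A generic PUCT-style rule shows how a learned value and node visit counts can be combined during inference-time search; Jupiter's exact scoring rule may differ, so treat this as a sketch:

```python
import math

def select_child(children, c_puct: float = 1.0):
    """PUCT-style selection mixing a learned value with visit counts.
    Each child is a dict with keys: value (value-model estimate),
    visits (node visit count), prior (policy prior probability)."""
    total = sum(ch["visits"] for ch in children) + 1
    def score(ch):
        exploit = ch["value"]
        explore = c_puct * ch["prior"] * math.sqrt(total) / (1 + ch["visits"])
        return exploit + explore
    return max(children, key=score)
```

Favoring high-value, under-visited children is what lets such a search collect executable plans in few steps, the property the abstract highlights.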
34. Global Optimization of Stochastic Black-Box Functions with Arbitrary Noise Distributions using Wilson Score Kernel Density Estimation
Authors: Thorbjørn Mosekjær Iversen, Lars Carøe Sørensen, Simon Faarvang Mathiesen, Henrik Gordon Petersen • Published: 2025-09-11 • Source: arXiv
Many optimization problems in robotics involve the optimization of time-expensive black-box functions, such as those involving complex simulations or the evaluation of real-world experiments. Furthermore, these functions are often stochastic, as repeated experiments are subject to unmeasurable disturbances. Bayesian optimization can optimize such functions efficiently by deploying a probabilistic function estimator that provides confidence bounds, so that regions of the search space can be pruned away. Consequently, the success of Bayesian optimization depends on the function estimator's ability to provide informative confidence bounds. Existing function estimators require many function evaluations to infer the underlying confidence or depend on modeling of the disturbances. In this paper, it is shown that the confidence bounds provided by the Wilson Score Kernel Density Estimator (WS-KDE) apply as excellent bounds to any stochastic function whose output is confined to the closed interval [0, 1], regardless of the distribution of the output. This finding opens up the use of WS-KDE for stable global optimization on a wider range of cost functions. The properties of WS-KDE in the context of Bayesian optimization are demonstrated in simulation and applied to the problem of automated trap design for vibrational part feeders.
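The Wilson score interval at the core of WS-KDE is the textbook formula below; WS-KDE replaces the integer counts with kernel-weighted effective counts, a substitution only summarized here:

```python
import math

def wilson_bounds(successes: float, n: float, z: float = 1.96):
    """Wilson score interval for a proportion; in WS-KDE the counts are
    kernel-weighted (effective) rather than integer ones."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

print(wilson_bounds(8, 10))  # ~ (0.49, 0.94) at 95% confidence
```

Unlike the naive normal approximation, these bounds stay inside [0, 1] and remain informative at small effective sample sizes, which is what makes them useful for pruning in Bayesian optimization.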
35. Dynamic Structural Recovery Parameters Enhance Prediction of Visual Outcomes After Macular Hole Surgery
Authors: Yinzheng Zhao, Zhihao Zhao, Rundong Jiang, Louisa Sackewitz, Quanmin Liang, Mathias Maier, Daniel Zapp, Peter Charbel Issa, Mohammad Ali Nasseri • Published: 2025-09-11 • Source: arXiv
Purpose: To introduce novel dynamic structural parameters and evaluate their integration within a multimodal deep learning (DL) framework for predicting postoperative visual recovery in idiopathic full-thickness macular hole (iFTMH) patients. Methods: We utilized a publicly available longitudinal OCT dataset at five stages (preoperative, 2 weeks, 3 months, 6 months, and 12 months). A stage specific segmentation model delineated related structures, and an automated pipeline extracted quantitative, composite, qualitative, and dynamic features. Binary logistic regression models, constructed with and without dynamic parameters, assessed their incremental predictive value for best-corrected visual acuity (BCVA). A multimodal DL model combining clinical variables, OCT-derived features, and raw OCT images was developed and benchmarked against regression models. Results: The segmentation model achieved high accuracy across all timepoints (mean Dice > 0.89). Univariate and multivariate analyses identified base diameter, ellipsoid zone integrity, and macular hole area as significant BCVA predictors (P < 0.05). Incorporating dynamic recovery rates consistently improved logistic regression AUC, especially at the 3-month follow-up. The multimodal DL model outperformed logistic regression, yielding higher AUCs and overall accuracy at each stage. The difference is as high as 0.12, demonstrating the complementary value of raw image volume and dynamic parameters. Conclusions: Integrating dynamic parameters into the multimodal DL model significantly enhances the accuracy of predictions. This fully automated process therefore represents a promising clinical decision support tool for personalized postoperative management in macular hole surgery.