1. LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Authors: Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song • Published: 2025-08-21 • Source: arXiv
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
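To make the plan-based evaluation idea concrete, here is a minimal sketch of scoring an agent's tool-call trace against a ground-truth execution plan; the `Step` schema and field names are our illustration, not LiveMCP-101's actual format.

```python
# Hypothetical sketch of plan-based evaluation: compare an agent's tool-call
# trace against a reference execution plan instead of raw API outputs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    tool: str          # e.g. "web_search", "file_read" (illustrative names)
    goal: str          # normalized description of what the call must achieve

def plan_score(reference: list[Step], trace: list[Step]) -> float:
    """Fraction of reference steps matched in order by the agent's trace."""
    i = 0
    for step in trace:
        if i < len(reference) and step.tool == reference[i].tool \
                and step.goal == reference[i].goal:
            i += 1
    return i / len(reference)

ref = [Step("web_search", "find 2024 GDP"), Step("math", "compute growth rate")]
run = [Step("web_search", "find 2024 GDP"), Step("file_read", "open notes"),
       Step("math", "compute growth rate")]
assert plan_score(ref, run) == 1.0   # extra steps allowed, order must hold
```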
2. Understanding and Utilizing Dynamic Coupling in Free-Floating Space Manipulators for On-Orbit Servicing
Authors: Gargi Das, Daegyun Choi, Donghoon Kim • Published: 2025-08-21 • Source: arXiv
This study proposes a dynamic coupling-informed trajectory optimization algorithm for free-floating space manipulator systems (SMSs). Dynamic coupling between the base and the manipulator arms plays a critical role in influencing the system's behavior. While prior research has predominantly focused on minimizing this coupling, often overlooking its potential advantages, this work investigates how dynamic coupling can instead be leveraged to improve trajectory planning. Singular value decomposition (SVD) of the dynamic coupling matrix is employed to identify the dominant components governing coupling behavior. A quantitative metric is then formulated to characterize the strength and directionality of the coupling and is incorporated into a trajectory optimization framework. To assess the feasibility of the optimized trajectory, a sliding mode control-based tracking controller is designed to generate the required joint torque inputs. Simulation results demonstrate that explicitly accounting for dynamic coupling in trajectory planning enables more informed and potentially more efficient operation, offering new directions for the control of free-floating SMSs.
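As an illustration of the SVD-based analysis, here is a minimal numpy sketch with a made-up coupling matrix; the strength metric (dominant singular value over the total) is one plausible choice, not necessarily the paper's exact formulation.

```python
# Illustrative sketch (ours): quantify base-manipulator dynamic coupling
# from an example coupling matrix via singular value decomposition.
import numpy as np

J_c = np.array([[0.8, 0.1, 0.0],   # example 3x3 dynamic coupling matrix
                [0.2, 0.5, 0.1],
                [0.0, 0.1, 0.3]])

U, s, Vt = np.linalg.svd(J_c)
strength = s[0] / s.sum()          # how dominant the strongest coupling mode is
direction = U[:, 0]                # base-motion direction that mode excites
print(f"coupling strength {strength:.2f}, dominant direction {direction}")
```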
3. Non-negligible summands in tensor powers of some modular representations of finite $p$-groups
Authors: Kent B. Vashaw, Justin Zhang • Published: 2025-08-21 • Source: arXiv
Let $p>0$ be a prime, $G$ be a finite $p$-group and $\Bbbk$ be an algebraically closed field of characteristic $p$. Dave Benson has conjectured that if $p=2$ and $V$ is an odd-dimensional indecomposable representation of $G$ then all summands of the tensor product $V \otimes V^*$ except for $\Bbbk$ have even dimension. It is known that the analogous result for general $p$ is false. In this paper, we investigate the class of graded representations $V$ which have dimension coprime to $p$ and for which $V \otimes V^*$ has a non-trivial summand of dimension coprime to $p$, for a graded group scheme closely related to $\mathbb{Z}/p^r \mathbb{Z} \times \mathbb{Z}/p^s \mathbb{Z}$, where $r$ and $s$ are nonnegative integers and $p>2$. We produce an infinite family of such representations in characteristic 3 and show in particular that the tensor subcategory generated by any of these representations in the semisimplification contains the modulo $3$ reduction of the category of representations of the symmetric group $S_3$. Our results are compatible with a general version of Benson's conjecture due to Etingof.
4. $\zeta$-functions via contour integrals and universal sum rules
Authors: Guglielmo Fucci, Mateusz Piorkowski, Jonathan Stanfill • Published: 2025-08-21 • Source: arXiv
This work develops an analytic framework for the study of the $\zeta$-function associated with general sequences of complex numbers. We show that a contour integral representation, commonly used when studying spectral $\zeta$-functions associated with self-adjoint differential operators, can be extended far beyond its traditional setting. In contrast to representations utilizing integrals of $\theta$-functions, our method applies to arbitrary sequences of complex numbers with minimal assumptions. This leads to a set of universal identities, including sum rules and meromorphic properties, that hold across a broad class of $\zeta$-functions. Additionally, we discuss the connection to regularized (modified) Fredholm determinants of $p$-Schatten–von Neumann class operators. We illustrate the versatility of this representation by computing special values and residues of the $\zeta$-function for a variety of sequences of complex numbers, in particular, the zeros of Airy functions, parabolic cylinder functions, and confluent hypergeometric functions. Furthermore, we employ the adaptive Antoulas–Anderson (AAA) algorithm for rational interpolation in the study of the Airy $\zeta$-function.
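For orientation, the contour representation in question typically takes the following schematic form in the spectral setting (our notation; the paper's point is that much weaker assumptions suffice):

```latex
% Schematic contour representation (notation ours, not the paper's):
% if F is analytic with zeros exactly at the sequence {lambda_n}, then
\[
  \zeta(s) \;=\; \sum_{n} \lambda_n^{-s}
          \;=\; \frac{1}{2\pi i} \oint_{\gamma} \lambda^{-s}\,
                \frac{d}{d\lambda} \log F(\lambda)\, d\lambda ,
\]
% where the contour $\gamma$ encircles the zeros of $F$, the branch of
% $\lambda^{-s}$ is fixed, and suitable growth assumptions ensure convergence.
```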
5. Investigation of D-Wave quantum annealing for training Restricted Boltzmann Machines and mitigating catastrophic forgetting
Authors: Abdelmoula El-Yazizi, Yaroslav Koshka • Published: 2025-08-21 • Source: arXiv
We explore modest statistical differences between the sampling performance of the D-Wave quantum annealer (QA) and classical Markov chain Monte Carlo (MCMC) applied to Restricted Boltzmann Machines (RBMs), in order to explain, and possibly address, the absence of significant and consistent improvements in RBM trainability when D-Wave sampling was used in previous investigations. A novel hybrid sampling approach, combining the classical and the QA contributions, is investigated as a promising way to benefit from the modest differences between the two sampling methods. No improvements in the RBM training are achieved in this work, suggesting that the differences between QA-based and MCMC sampling, found mainly in the medium-to-low probability regions of the distribution that matter less for sample quality, are insufficient to benefit the training. Difficulties in achieving sufficiently high-quality embedding of RBMs into the lattice of the newer generation of D-Wave hardware may further complicate the task. On the other hand, the ability to generate sufficiently varied samples from lower-probability parts of the distribution has the potential to benefit other machine learning applications, such as the mitigation of catastrophic forgetting (CF) during incremental learning. The feasibility of using QA-generated patterns of desirable classes for CF mitigation by generative replay is demonstrated in this work for the first time. While the efficiency of CF mitigation using the D-Wave QA was comparable to that of the classical approach, both the speed of generating a large number of distinct desirable patterns and the potential for further improvement make this approach promising for a variety of challenging machine learning applications.
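A minimal sketch of the hybrid-sampling idea, assuming the QA and MCMC samplers each produce batches of binary RBM configurations; the function and mixing scheme are our illustration.

```python
# Conceptual sketch (assumptions ours): build a negative-phase batch for RBM
# training by mixing classical Gibbs/MCMC samples with annealer-drawn samples.
import numpy as np

rng = np.random.default_rng(0)

def hybrid_batch(mcmc_samples: np.ndarray, qa_samples: np.ndarray,
                 qa_fraction: float = 0.3) -> np.ndarray:
    """Draw a batch that takes qa_fraction of its rows from the QA sample set."""
    n = len(mcmc_samples)
    n_qa = int(qa_fraction * n)
    qa_rows = qa_samples[rng.choice(len(qa_samples), n_qa, replace=False)]
    mcmc_rows = mcmc_samples[rng.choice(len(mcmc_samples), n - n_qa, replace=False)]
    return np.vstack([qa_rows, mcmc_rows])

# e.g. 64 visible units, 128 binary configurations from each sampler
batch = hybrid_batch(rng.integers(0, 2, (128, 64)), rng.integers(0, 2, (128, 64)))
assert batch.shape == (128, 64)
```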
6. NiceWebRL: a Python library for human subject experiments with reinforcement learning environments
Authors: Wilka Carvalho, Vikram Goddla, Ishaan Sinha, Hoon Shin, Kunal Jha • Published: 2025-08-21 • Source: arXiv
We present NiceWebRL, a research tool that enables researchers to use machine reinforcement learning (RL) environments for online human subject experiments. NiceWebRL is a Python library that allows any Jax-based environment to be transformed into an online interface, supporting both single-agent and multi-agent environments. As such, NiceWebRL enables AI researchers to compare their algorithms to human performance, cognitive scientists to test ML algorithms as theories for human cognition, and multi-agent researchers to develop algorithms for human-AI collaboration. We showcase NiceWebRL with 3 case studies that demonstrate its potential to help develop Human-like AI, Human-compatible AI, and Human-assistive AI. In the first case study (Human-like AI), NiceWebRL enables the development of a novel RL model of cognition. Here, NiceWebRL facilitates testing this model against human participants in both a grid world and Craftax, a 2D Minecraft domain. In our second case study (Human-compatible AI), NiceWebRL enables the development of a novel multi-agent RL algorithm that can generalize to human partners in the Overcooked domain. Finally, in our third case study (Human-assistive AI), we show how NiceWebRL can allow researchers to study how an LLM can assist humans on complex tasks in XLand-Minigrid, an environment with millions of hierarchical tasks. The library is available at https://github.com/KempnerInstitute/nicewebrl.
7. Assessing the Reliability of Truncated Coupled Cluster Wavefunction: Estimating the Distance from the Exact Solution
Authors: Ádám Ganyecz, Zsolt Benedek, Klára Petrov, Gergely Barcza, András Olasz, Miklós A. Werner, Örs Legeza • Published: 2025-08-21 • Source: arXiv
A new approach is proposed to assess the reliability of truncated wavefunction methods by estimating their deviation from the full configuration interaction (FCI) wavefunction. While typical multireference diagnostics compare some derived property of the solution with the ideal picture of a single determinant, we address a more practical question: how far is the solution from the exact one? Using the density matrix renormalization group (DMRG) method to provide an approximate FCI solution for the self-consistently determined relevant active space, we compare the low-level CI expansions and one-body reduced density matrices to determine the distance between the two solutions ($\tilde{d}_\Phi$, $\tilde{d}_\gamma$). We demonstrate the applicability of the approach for the CCSD method by benchmarking on the W4-17 dataset, as well as on transition metal-containing species. We also show that the presented moderate-cost, purely wavefunction-based metric is unique in the sense that it does not correlate with any popular multireference measures. We also explored the use of CCSD natural orbitals ($\tilde{d}_{\gamma,\mathrm{NO}}$) and its effect on the active space size and the metric. The proposed diagnostic can also be applied to other wavefunction approximations, and it has the potential to provide a quality measure for post-Hartree-Fock procedures in general.
8. Large-dimensional Factor Analysis with Weighted PCA
Authors: Zhongyuan Lyu, Ming Yuan • Published: 2025-08-21 • Source: arXiv
Principal component analysis (PCA) is arguably the most widely used approach for large-dimensional factor analysis. While it is effective when the factors are sufficiently strong, it can be inconsistent when the factors are weak and/or the noise has a complex dependence structure. We argue that the inconsistency often stems from bias and introduce a general approach to restore consistency. Specifically, we propose a general weighting scheme for PCA and show that with a suitable choice of weighting matrices, it is possible to obtain consistent and asymptotically normal estimators under much weaker conditions than the usual PCA. While the optimal weight matrix may require knowledge about the factors and the covariance of the idiosyncratic noise that is not known a priori, we develop an agnostic approach to adaptively choose from a large class of weighting matrices that can be viewed as PCA for weighted linear combinations of auto-covariances among the observations. Theoretical and numerical results demonstrate the merits of our methodology over the usual PCA and other recently developed techniques for large-dimensional approximate factor models.
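A stylized numpy sketch of the underlying idea (our notation): eigendecompose a weighted combination of lagged auto-covariances rather than the lag-0 sample covariance; nonzero lags can suppress serially uncorrelated idiosyncratic noise. The weights below are arbitrary placeholders, not the paper's adaptive choice.

```python
# Stylized sketch (ours): loadings from a weighted sum of auto-covariances.
import numpy as np

def weighted_pca(X: np.ndarray, weights: dict[int, float], r: int) -> np.ndarray:
    """X: T x N panel. Returns an N x r loading estimate from
    sum_k w_k * Cov(x_t, x_{t-k}), symmetrized."""
    T, N = X.shape
    M = np.zeros((N, N))
    for lag, w in weights.items():
        C = X[lag:].T @ X[:T - lag] / (T - lag)   # lag-k auto-covariance
        M += w * (C + C.T) / 2                    # symmetrize
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:r]]    # top-r eigenvectors

rng = np.random.default_rng(1)
F = rng.normal(size=(500, 2)); L = rng.normal(size=(30, 2))
X = F @ L.T + rng.normal(scale=2.0, size=(500, 30))   # weak-ish factors
loadings = weighted_pca(X, weights={0: 0.5, 1: 1.0, 2: 1.0}, r=2)
```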
9. A Grant-free Coded Random Access Scheme for Near-field Communications
Authors: Enrico Testi, Giulia Torcolacci, Nicolò Decarli, Davide Dardari, Enrico Paolini • Published: 2025-08-21 • Source: arXiv
The industrial Internet of things (IIoT) is revolutionizing industrial processes by facilitating massive machine-type communications among countless interconnected devices. To efficiently handle the resulting large-scale and sporadic traffic, grant-free random access protocols, especially coded random access (CRA), have emerged as scalable and reliable solutions. At the same time, advancements in wireless hardware, including extremely large-scale MIMO arrays and high-frequency communication (e.g., mmWave, Terahertz), are pushing network operations into the near-field propagation regime, allowing for dense connectivity and enhanced spatial multiplexing. This paper proposes an innovative approach that combines near-field spatial multiplexing with the interference mitigation capabilities of CRA, utilizing an extremely large aperture array at the access point. This integration improves reliability and reduces access latency, offering a robust framework for IIoT connectivity in next-generation 6G networks.
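A toy simulation of the CRA mechanism referenced above (ours): users transmit packet replicas in random slots, and the receiver iteratively decodes singleton slots and cancels the corresponding replicas (successive interference cancellation).

```python
# Toy CRA round (ours): replica transmission + iterative SIC decoding.
import random

def cra_round(n_users=50, n_slots=100, repeats=2, seed=0):
    random.seed(seed)
    slots = [set() for _ in range(n_slots)]
    for u in range(n_users):
        for s in random.sample(range(n_slots), repeats):
            slots[s].add(u)                  # user u places a replica in slot s
    decoded = set()
    progress = True
    while progress:
        progress = False
        for s in range(n_slots):
            pending = slots[s] - decoded
            if len(pending) == 1:            # singleton slot: decode it ...
                decoded |= pending           # ... and cancel its replicas
                progress = True
    return len(decoded) / n_users

print(f"fraction of users resolved: {cra_round():.2f}")
```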
10. CM2LoD3: Reconstructing LoD3 Building Models Using Semantic Conflict Maps
Authors: Franz Hanke, Antonia Bieringer, Olaf Wysocki, Boris Jutzi • Published: 2025-08-21 • Source: arXiv
Detailed 3D building models are crucial for urban planning, digital twins, and disaster management applications. While Level of Detail (LoD) 1 and LoD2 building models are widely available, they lack the detailed facade elements essential for advanced urban analysis. In contrast, LoD3 models address this limitation by incorporating facade elements such as windows, doors, and underpasses. However, their generation has traditionally required manual modeling, making large-scale adoption challenging. In this contribution, CM2LoD3, we present a novel method for reconstructing LoD3 building models leveraging Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Unlike previous works, we concentrate on semantically segmenting real-world CMs with synthetically generated CMs from our developed Semantic Conflict Map Generator (SCMG). We also observe that an additional segmentation of textured models can be fused with CMs using confidence scores to further increase segmentation performance and thus 3D reconstruction accuracy. Experimental results demonstrate the effectiveness of our CM2LoD3 method in segmenting and reconstructing building openings, reaching 61% performance with uncertainty-aware fusion of segmented building textures. This research contributes to the advancement of automated LoD3 model reconstruction, paving the way for scalable and efficient 3D city modeling. Our project is available at: https://github.com/InFraHank/CM2LoD3
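A minimal sketch of what confidence-score fusion of the two segmentation sources could look like (our formulation, not necessarily CM2LoD3's exact rule): per pixel, the more confident source gets the larger weight.

```python
# Illustrative per-pixel fusion of conflict-map and texture segmentations.
import numpy as np

def fuse(prob_cm: np.ndarray, prob_tex: np.ndarray,
         conf_cm: np.ndarray, conf_tex: np.ndarray) -> np.ndarray:
    """All inputs HxW in [0,1]; returns a fused class probability per pixel."""
    w = conf_cm / (conf_cm + conf_tex + 1e-8)   # confidence-proportional weight
    return w * prob_cm + (1 - w) * prob_tex

rng = np.random.default_rng(0)
fused = fuse(rng.random((4, 4)), rng.random((4, 4)),
             conf_cm=np.full((4, 4), 0.8), conf_tex=np.full((4, 4), 0.4))
mask = (fused > 0.5).astype(np.uint8)           # final opening/facade mask
```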
11. Discrete Radar based on Modulo Arithmetic
Authors: Nishant Mehrotra, Sandesh Rao Mattu, Saif Khan Mohammed, Ronny Hadani, Robert Calderbank • Published: 2025-08-21 • Source: arXiv
Zak-OTFS is a modulation scheme in which signals are formed in the delay-Doppler (DD) domain, converted to the time domain for transmission and reception, then returned to the DD domain for processing. We describe how to use the same architecture for radar sensing. The intended delay resolution is $\frac{1}{B}$ where $B$ is the radar bandwidth, and the intended Doppler resolution is $\frac{1}{T}$ where $T$ is the transmission time. We form a radar waveform in the DD domain, illuminate the scattering environment, matched-filter the return, then correlate with delay and Doppler shifts of the transmitted waveform. This produces an image of the scattering environment, and the radar ambiguity function expresses the blurriness of this image. The possible delay and Doppler shifts generate the continuous Heisenberg-Weyl group, which has been widely studied in the theory of radar. We describe how to approach the problem of waveform design, not from the perspective of this continuous group, but from the perspective of a discrete group of delay and Doppler shifts, where the discretization is determined by the intended delay and Doppler resolution of the radar. We describe how to shape the ambiguity surface through symplectic transformations that normalize our discrete Heisenberg-Weyl group. The complexity of traditional continuous radar signal processing is $\mathcal{O}\big(B^2T^2\big)$. We describe how to reduce this complexity to $\mathcal{O}\big(BT\log T\big)$ by choosing the radar waveform to be a common eigenvector of a maximal commutative subgroup of our discrete Heisenberg-Weyl group. The theory of symplectic transformations also enables defining libraries of optimal radar waveforms with small peak-to-average power ratios.
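For concreteness, here is a numpy sketch of a discrete delay-Doppler cross-ambiguity surface on a resolution grid of the kind described above; it is illustrative only and does not implement the paper's fast eigenvector-based construction.

```python
# Discrete cross-ambiguity surface (ours):
# A[tau, nu] = (1/N) * sum_n r[n] conj(s[n - tau]) exp(-2*pi*i*nu*n/N),
# with cyclic delay shifts and all Doppler bins computed via one FFT per delay.
import numpy as np

def ambiguity(s: np.ndarray, r: np.ndarray) -> np.ndarray:
    N = len(s)
    A = np.zeros((N, N), dtype=complex)
    for tau in range(N):
        prod = r * np.conj(np.roll(s, tau))   # cyclic delay shift of s
        A[tau] = np.fft.fft(prod) / N         # all Doppler bins at once
    return A

rng = np.random.default_rng(0)
s = np.exp(2j * np.pi * rng.random(64))       # unit-modulus example waveform
A = ambiguity(s, s)                           # self-ambiguity surface
print(np.unravel_index(np.abs(A).argmax(), A.shape))   # peak at (0, 0)
```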
12. Mind and Motion Aligned: A Joint Evaluation IsaacSim Benchmark for Task Planning and Low-Level Policies in Mobile Manipulation
Authors: Nikita Kachaev, Andrei Spiridonov, Andrey Gorodetsky, Kirill Muravyev, Nikita Oskolkov, Aditya Narendra, Vlad Shakhuro, Dmitry Makarov, Aleksandr I. Panov, Polina Fedotova, Alexey K. Kovalev • Published: 2025-08-21 • Source: arXiv
Benchmarks are crucial for evaluating progress in robotics and embodied AI. However, a significant gap exists between benchmarks designed for high-level language instruction following, which often assume perfect low-level execution, and those for low-level robot control, which rely on simple, one-step commands. This disconnect prevents a comprehensive evaluation of integrated systems where both task planning and physical execution are critical. To address this, we propose Kitchen-R, a novel benchmark that unifies the evaluation of task planning and low-level control within a simulated kitchen environment. Built as a digital twin using the Isaac Sim simulator and featuring more than 500 complex language instructions, Kitchen-R supports a mobile manipulator robot. We provide baseline methods for our benchmark, including a task-planning strategy based on a vision-language model and a low-level control policy based on diffusion policy. We also provide a trajectory collection system. Our benchmark offers a flexible framework for three evaluation modes: independent assessment of the planning module, independent assessment of the control policy, and, crucially, an integrated evaluation of the whole system. Kitchen-R bridges a key gap in embodied AI research, enabling more holistic and realistic benchmarking of language-guided robotic agents.
13. Benchmarking Computer Science Survey Generation
Authors: Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu • Published: 2025-08-21 • Source: arXiv
Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE
14. CausalMesh: A Formally Verified Causal Cache for Stateful Serverless Computing
Authors: Haoran Zhang, Zihao Zhang, Shuai Mu, Sebastian Angel, Vincent Liu • Published: 2025-08-21 • Source: arXiv
Stateful serverless workflows consist of multiple serverless functions that access state on a remote database. Developers sometimes add a cache layer between the serverless runtime and the database to improve I/O latency. However, in a serverless environment, functions in the same workflow may be scheduled to different nodes with different caches, which can cause non-intuitive anomalies. This paper presents CausalMesh, a novel approach to causally consistent caching in environments where a computation may migrate from one machine to another, such as in serverless computing. CausalMesh is the first cache system that supports coordination-free and abort-free read/write operations and read transactions when clients roam among multiple servers. CausalMesh also supports read-write transactional causal consistency in the presence of client roaming, but at the cost of abort-freedom. We have formally verified CausalMesh's protocol in Dafny, and our experimental evaluation shows that CausalMesh has lower latency and higher throughput than existing proposals.
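A conceptual sketch of dependency-checked causal reads (our simplification, not CausalMesh's verified protocol): a replica exposes a value only once every causal dependency recorded with it is visible locally, so a roaming client never observes writes out of causal order.

```python
# Minimal causal cache sketch (ours): values carry their causal dependencies.
class CausalCache:
    def __init__(self):
        self.store = {}      # key -> (value, deps); deps: key -> required version
        self.versions = {}   # key -> latest version applied at this replica

    def write(self, key, value, deps):
        self.versions[key] = self.versions.get(key, 0) + 1
        self.store[key] = (value, dict(deps))
        return self.versions[key]

    def read(self, key):
        value, deps = self.store[key]
        missing = {k: v for k, v in deps.items()
                   if self.versions.get(k, 0) < v}
        if missing:          # a dependency is not yet applied here:
            return None      # wait/fetch rather than expose a premature value
        return value

c = CausalCache()
v1 = c.write("profile", "alice-v1", deps={})
c.write("feed", "post-1", deps={"profile": v1})   # post depends on the profile
assert c.read("feed") == "post-1"
```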
15. Reading Between the Lines: A Study of Thematic Bias in Book Recommender Systems
Authors: Nityaa Kalra, Savvina Daniil • Published: 2025-08-21 • Source: arXiv
Recommender systems help users discover new content, but can also reinforce existing biases, leading to unfair exposure and reduced diversity. This paper introduces and investigates thematic bias in book recommendations, defined as a disproportionate favouring or neglect of certain book themes. We adopt a multi-stage bias evaluation framework using the Book-Crossing dataset to evaluate thematic bias in recommendations and its impact on different user groups. Our findings show that thematic bias originates from content imbalances and is amplified by user engagement patterns. By segmenting users based on their thematic preferences, we find that users with niche and long-tail interests receive less personalised recommendations, whereas users with diverse interests receive more consistent recommendations. These findings suggest that recommender systems should be carefully designed to accommodate a broader range of user interests. By contributing to the broader goal of responsible AI, this work also lays the groundwork for extending thematic bias analysis to other domains.
16. Classification errors distort findings in automated speech processing: examples and solutions from child-development research
Authors: Lucas Gautheron, Evan Kidd, Anton Malko, Marvin Lavechin, Alejandrina Cristia • Published: 2025-08-21 • Source: arXiv
With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children's experience, behavior, and outcomes, with a sizable literature employing long-form audio recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimation of correlations and effect sizes in regressions). This paper proposes a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children's language experience and the association between children's production and their input. In both the most commonly used system (LENA) and an open-source alternative (the Voice Type Classifier from the ACLEW system), we find that classification errors can significantly distort estimates. For instance, automated annotations underestimated the negative effect of siblings on adult input by 20–80%, potentially placing it below statistical significance thresholds. We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution. Both the issue reported and our solution may apply to any classifier involving event detection and classification with non-zero error rates.
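A toy demonstration of the distortion mechanism (our numbers, not the paper's): plausible miss and false-alarm rates shrink a true between-group difference in event rates, biasing the estimated effect toward zero.

```python
# Toy demo (ours): classification errors attenuate a group difference.
import numpy as np

rng = np.random.default_rng(0)
true_rate = {"no_siblings": 0.30, "siblings": 0.20}   # true event probabilities
miss, false_alarm = 0.25, 0.10                        # classifier error rates

for group, p in true_rate.items():
    events = rng.random(100_000) < p
    detected = np.where(events,
                        rng.random(100_000) > miss,          # true event kept?
                        rng.random(100_000) < false_alarm)   # spurious detection?
    print(group, f"true {p:.2f} -> observed {detected.mean():.3f}")
# Observed rates ~0.295 vs ~0.230: the gap shrinks from 0.10 to ~0.065,
# i.e., the sibling effect is underestimated, as the abstract describes.
```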
17. Label Uncertainty for Ultrasound Segmentation
Authors: Malini Shivaram, Gautam Rajendrakumar Gare, Laura Hutchins, Jacob Duplantis, Thomas Deiss, Thales Nogueira Gomes, Thong Tran, Keyur H. Patel, Thomas H Fox, Amita Krishnan, Deva Ramanan, Bennett DeBoisblanc, Ricardo Rodriguez, John Galeotti • Published: 2025-08-21 • Source: arXiv
In medical imaging, inter-observer variability among radiologists often introduces label uncertainty, particularly in modalities where visual interpretation is subjective. Lung ultrasound (LUS) is a prime example: it frequently presents a mixture of highly ambiguous regions and clearly discernible structures, making consistent annotation challenging even for experienced clinicians. In this work, we introduce a novel approach to both labeling and training AI models using expert-supplied, per-pixel confidence values. Rather than treating annotations as absolute ground truth, we design a data annotation protocol that captures the confidence that radiologists have in each labeled region, modeling the inherent aleatoric uncertainty present in real-world clinical data. We demonstrate that incorporating these confidence values during training leads to improved segmentation performance. More importantly, we show that this enhanced segmentation quality translates into better performance on downstream clinically critical tasks: specifically, estimating S/F oxygenation ratio values, classifying S/F ratio change, and predicting 30-day patient readmission. While we empirically evaluate many methods for exposing the uncertainty to the learning model, we find that a simple approach that trains a model on binarized labels obtained with a 60% confidence threshold works well. Importantly, high thresholds work far better than a naive 50% threshold, indicating that training on very confident pixels is far more effective. Our study systematically investigates the impact of training with varying confidence thresholds, comparing not only segmentation metrics but also downstream clinical outcomes. These results suggest that label confidence is a valuable signal that, when properly leveraged, can significantly enhance the reliability and clinical utility of AI in medical imaging.
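The thresholding scheme is simple to state in code; a minimal sketch follows (the 60% threshold is from the paper, everything else is our illustration).

```python
# Binarize per-pixel expert confidence at a high threshold so that training
# only treats pixels the annotator was reasonably sure about as foreground.
import numpy as np

def binarize_labels(confidence: np.ndarray, threshold: float = 0.60) -> np.ndarray:
    """confidence: HxW in [0,1], annotator's belief that the pixel is foreground."""
    return (confidence >= threshold).astype(np.uint8)

conf = np.array([[0.95, 0.55],
                 [0.62, 0.10]])
print(binarize_labels(conf))          # [[1 0]
                                      #  [1 0]]
print(binarize_labels(conf, 0.50))    # the naive 50% variant keeps 0.55 too
```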
18. Multi-perspective monitoring of wildlife and human activities from camera traps and drones with deep learning models
Authors: Hao Chen, Fang Qiu, Li An, Douglas Stow, Eve Bohnett, Haitao Lyu, Shuang Tian • Published: 2025-08-21 • Source: arXiv
Wildlife and human activities are key components of landscape systems. Understanding their spatial distribution is essential for evaluating human-wildlife interactions and informing effective conservation planning. This study combined camera traps and drone imagery for multi-perspective monitoring of wildlife and human activities, capturing the spatial patterns of their distributions to identify overlapping activity zones and assess the degree of human-wildlife conflict. The study was conducted in Chitwan National Park (CNP), Nepal, and adjacent regions. Images collected by visible and near-infrared camera traps and thermal infrared drones from February to July 2022 were processed to create training and testing datasets, which were used to build deep learning models that automatically identify wildlife and human activities. Drone-collected thermal imagery was used for detecting targets to provide an additional monitoring perspective. Spatial pattern analysis was performed to identify animal and resident activity hotspots and to delineate potential human-wildlife conflict zones. Among the deep learning models tested, YOLOv11s achieved the highest performance, with a precision of 96.2%, recall of 92.3%, mAP50 of 96.7%, and mAP50-95 of 81.3%, making it the most effective for detecting objects in camera trap imagery. Drone-based thermal imagery, analyzed with an enhanced Faster R-CNN model, added a complementary aerial viewpoint to the camera trap detections. Spatial pattern analysis identified clear hotspots for both wildlife and human activities, and their overlapping patterns within certain areas of the CNP and buffer zones indicate potential conflict. This study reveals human-wildlife conflicts within the conserved landscape. Integrating multi-perspective monitoring with automated object detection enhances wildlife surveillance and landscape management.
19. Automated Modeling of Polarons: Defects and Reactivity on TiO$_2$(110) Surfaces
Authors: Firat Yalcin, Carla Verdi, Viktor C. Birschitzky, Matthias Meier, Michael Wolloch, Michele Reticcioli • Published: 2025-08-21 • Source: arXiv
Polarons are widespread in functional materials and are key to device performance in several technological applications. However, their effective impact on material behavior remains elusive, as condensed matter studies struggle to capture their intricate interplay with atomic defects in the crystal. In this work, we present an automated workflow for modeling polarons within density functional theory (DFT). Our approach enables a fully automatic identification of the most favorable polaronic configurations in the system. Machine learning techniques accelerate predictions, allowing for an efficient exploration of the defect-polaron configuration space. We apply this methodology to Nb-doped TiO$_2$(110) surfaces, providing new insights into the role of defects in surface reactivity. Using CO adsorbates as a probe, we find that Nb doping has minimal impact on reactivity, whereas oxygen vacancies contribute significantly depending on their local arrangement via the stabilization of polarons on the surface atomic layer. Our package streamlines the modeling of charge trapping and polaron localization with high efficiency, enabling systematic, large-scale investigations of polaronic effects across complex material systems.
20. Conformalized Exceptional Model Mining: Telling Where Your Model Performs (Not) Well
Authors: Xin Du, Sikun Yang, Wouter Duivesteijn, Mykola Pechenizkiy • Published: 2025-08-21 • Source: arXiv
Understanding the nuanced performance of machine learning models is essential for responsible deployment, especially in high-stakes domains like healthcare and finance. This paper introduces a novel framework, Conformalized Exceptional Model Mining, which combines the rigor of Conformal Prediction with the explanatory power of Exceptional Model Mining (EMM). The proposed framework identifies cohesive subgroups within data where model performance deviates exceptionally, highlighting regions of both high confidence and high uncertainty. We develop a new model class, mSMoPE (multiplex Soft Model Performance Evaluation), which quantifies uncertainty through conformal prediction's rigorous coverage guarantees. By defining a new quality measure, Relative Average Uncertainty Loss (RAUL), our framework isolates subgroups with exceptional performance patterns in multi-class classification and regression tasks. Experimental results across diverse datasets demonstrate the framework's effectiveness in uncovering interpretable subgroups that provide critical insights into model behavior. This work lays the groundwork for enhancing model interpretability and reliability, advancing the state-of-the-art in explainable AI and uncertainty quantification.
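For background, here is a minimal split-conformal sketch (ours) showing the per-example uncertainty signal, an interval half-width in the regression case, that a subgroup-discovery pass can then try to explain.

```python
# Split conformal prediction sketch (ours): calibrate a residual quantile,
# then emit prediction intervals with a finite-sample coverage guarantee.
import numpy as np

rng = np.random.default_rng(0)
resid_cal = np.abs(rng.normal(size=1000))     # |y - yhat| on a calibration set
alpha = 0.1                                   # target 90% coverage
k = int(np.ceil((1 - alpha) * (len(resid_cal) + 1)))
q = np.sort(resid_cal)[k - 1]                 # conformal quantile of residuals

yhat_test = rng.normal(size=5)
intervals = [(y - q, y + q) for y in yhat_test]   # ~90% marginal coverage
print(f"interval half-width (per-example uncertainty) = {q:.2f}")
```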
21. A Novel Mutation Based Method for Detecting FPGA Logic Synthesis Tool Bugs
Authors: Yi Zhang, He Jiang, Xiaochen Li, Shikai Guo, Peiyu Zou, Zun Wang • Published: 2025-08-21 • Source: arXiv
FPGA (Field-Programmable Gate Array) logic synthesis tools are key components in the EDA (Electronic Design Automation) toolchain. They convert hardware designs written in description languages such as Verilog into gate-level representations for FPGAs. However, defects in these tools may lead to unexpected behaviors and pose security risks. Therefore, it is crucial to harden these tools through testing. Although several methods have been proposed to automatically test FPGA logic synthesis tools, the challenge remains of insufficient semantic and logical complexity in test programs. In this paper, we propose VERMEI, a new method for testing FPGA logic synthesis tools. VERMEI consists of three modules: preprocessing, equivalent mutation, and bug identification. The preprocessing module identifies zombie logic (inactive code with no impact on the circuit output) in seed programs through simulation and coverage analysis. The equivalent mutation module generates equivalent variants of seed programs by pruning or inserting logic fragments in zombie areas. It uses Bayesian sampling to extract logic fragments from historical Verilog designs, making the generated variants have complex control flows and structures. The bug identification module, based on differential testing, compares the synthesized outputs of seed and variant programs to identify bugs. Experiments on Yosys, Vivado, and Quartus demonstrate that VERMEI outperforms the state-of-the-art methods. Within five months, VERMEI reported 15 bugs to vendors, 9 of which were confirmed as new.
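Schematically, the differential-testing loop looks as follows (our sketch; `synthesize` and `simulate` are placeholders for real Yosys/Vivado/Quartus drivers, which we do not reproduce here).

```python
# Schematic equivalent-mutation + differential-testing loop (ours).
import random

def equivalent_mutant(seed_src: str, zombie_spans: list, fragments: list) -> str:
    """Insert a logic fragment into inactive ('zombie') code, which should
    leave the synthesized circuit's observable behavior unchanged."""
    span = random.choice(zombie_spans)
    return seed_src[:span] + random.choice(fragments) + seed_src[span:]

def check(seed_src: str, mutant_src: str, synthesize, simulate, stimuli) -> bool:
    """Outputs must match on every stimulus; a mismatch signals a tool bug."""
    net_a, net_b = synthesize(seed_src), synthesize(mutant_src)
    return all(simulate(net_a, x) == simulate(net_b, x) for x in stimuli)

# Toy drivers so the loop runs end-to-end (identity "synthesis" that drops
# the inserted no-op, tuple-valued "simulation"):
ok = check("module m;", equivalent_mutant("module m;", [8], ["/*nop*/"]),
           synthesize=lambda s: s.replace("/*nop*/", ""),
           simulate=lambda net, x: (len(net), x),
           stimuli=[0, 1])
assert ok   # no mismatch, so no bug is reported for this pair
```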
22. The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech
Authors: Naama Rivlin-Angert, Guy Mor-Lan • Published: 2025-08-21 • Source: arXiv
We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches (1993-2023), Facebook posts (2018-2021), and leading news outlets, of which 1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline combining fine-tuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F$_1$ of 0.74 for binary PDD detection and a macro-F$_1$ of 0.67 for classification of delegitimization characteristics. Applying this classifier to longitudinal and cross-platform data, we find a marked rise in PDD over three decades, higher prevalence on social media than in parliamentary debate, greater use by male than female politicians, and stronger tendencies among right-leaning actors, with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for understanding democratic discourse.
23. Super-additive Cooperation in Language Model Agents
Authors: Filippo Tonini, Lukas Galke • Published: 2025-08-21 • Source: arXiv
With the prospect of autonomous artificial intelligence (AI) agents, studying their tendency for cooperative behavior becomes an increasingly relevant topic. This study is inspired by the super-additive cooperation theory, where the combined effects of repeated interactions and inter-group rivalry have been argued to be the cause for cooperative tendencies found in humans. We devised a virtual tournament where language model agents, grouped into teams, face each other in a Prisoner's Dilemma game. By simulating both internal team dynamics and external competition, we discovered that this blend substantially boosts both overall and initial, one-shot cooperation levels (the tendency to cooperate in one-off interactions). This research provides a novel framework for large language models to strategize and act in complex social scenarios and offers evidence for how intergroup competition can, counter-intuitively, result in more cooperative behavior. These insights are crucial for designing future multi-agent AI systems that can effectively work together and better align with human values. Source code is available at https://github.com/pippot/Superadditive-cooperation-LLMs.
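A minimal sketch of the game mechanics underlying the tournament (ours; the `agent` callables stand in for language-model policies):

```python
# Iterated Prisoner's Dilemma core (ours): standard payoff matrix, agents see
# the opponent's move history; teams would pair agents within and across groups.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(agent_a, agent_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = agent_a(hist_b), agent_b(hist_a)   # each sees the other's history
        pa, pb = PAYOFF[(a, b)]
        score_a += pa; score_b += pb
        hist_a.append(a); hist_b.append(b)
    return score_a, score_b

tit_for_tat = lambda opp: "C" if not opp or opp[-1] == "C" else "D"
defector = lambda opp: "D"
print(play(tit_for_tat, defector))   # (9, 14): one sucker payoff, then D vs D
```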
24. Universal Dancing by Luminous Robots under Sequential Schedulers
Authors: Caterina Feletti, Paola Flocchini, Debasish Pattanayak, Giuseppe Prencipe, Nicola Santoro • Published: 2025-08-21 • Source: arXiv
The Dancing problem requires a swarm of $n$ autonomous mobile robots to form a sequence of patterns, i.e., to perform a choreography. Existing work has proven that some crucial restrictions on choreographies and initial configurations (e.g., on repetitions of patterns, periodicity, symmetries, contractions/expansions) must hold for the Dancing problem to be solvable under certain robot models. Here, we prove that these necessary constraints can be dropped by considering the LUMI model (i.e., where robots are endowed with a light whose color can be chosen from a constant-size palette) under the quite unexplored sequential scheduler. We formalize the class of Universal Dancing problems, which require a swarm of $n$ robots starting from any initial configuration to perform a (periodic or finite) sequence of arbitrary patterns, provided only that each pattern consists of $n$ vertices (including multiplicities). However, we prove that, to be solvable under LUMI, the length of the feasible choreographies is bounded by the number of compositions of $n$ into the number of colors available to the robots. We provide an algorithm solving the Universal Dancing problem by exploiting the peculiar capability of sequential robots to implement a distributed counter mechanism. Even assuming non-rigid movements, our algorithm ensures spatial homogeneity of the performed choreography.
25. Direct Neutron Reactions in Storage Rings Utilizing a Supercompact Cyclotron Neutron Target
Authors: Ariel Tarifeño-Saldivia, César Domingo-Pardo, Iris Dillmann, Yuri A. Litvinov • Published: 2025-08-21 • Source: arXiv
We propose a new approach for a high-density free-neutron target, primarily aimed at nuclear astrophysics reaction studies in inverse kinematics with radioactive ions circulating in a storage ring. The target concept integrates four key subsystems: a neutron production source driven by a supercompact cyclotron utilizing $^9$Be($p,xn$) reactions, an optimized moderator/reflector assembly using either heavy-water or beryllium oxide with a graphite reflector shell to thermalize fast neutrons, a cryogenic liquid hydrogen moderator to maximize thermal neutron density in the interaction region, and beam pipe geometries that enable neutron-ion interactions while maintaining vacuum conditions for ion circulation. This integrated approach focuses on feasibility by incorporating readily available technology. Using a commercial supercompact cyclotron delivering 130 $\mu$A, the design achieves thermal neutron areal densities of $\sim3.4\times10^{6}$ n/cm$^2$ for a proof-of-concept demonstrator at the CRYRING ion-storage ring at GSI. This autonomous accelerator-target assembly design enables deployment at both in-flight and ISOL facilities to exploit their complementary production yields. Potential upgrades based on higher-energy and/or higher-current isochronous cyclotrons should enable an increase in areal density to $\sim$10$^9$ n/cm$^2$. In combination with a customized low-energy storage ring and a radioactive ion-beam facility, the proposed solution could deliver luminosities above 10$^{23}$ cm$^{-2}$ s$^{-1}$, thereby enabling neutron capture measurements of $\sim$mb cross sections within a few days of experiment. The proposed system represents a significant milestone towards enabling large neutron-capture surveys on exotic nuclei, thereby opening a new avenue for understanding the synthesis of heavy elements in our universe.
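As a plausibility check of the quoted figures (our arithmetic, taking 1 mb $= 10^{-27}$ cm$^2$):

```latex
% Event-rate estimate implied by the quoted luminosity and cross section:
\[
  R \;=\; \mathcal{L}\,\sigma
    \;\approx\; \bigl(10^{23}\ \mathrm{cm^{-2}\,s^{-1}}\bigr)
                \times \bigl(10^{-27}\ \mathrm{cm^{2}}\bigr)
    \;=\; 10^{-4}\ \mathrm{s^{-1}}
    \;\approx\; 9\ \text{captures per day},
\]
% consistent with measuring ~mb cross sections "within a few days".
```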
26. A Curated Dataset and Deep Learning Approach for Minor Dent Detection in Vehicles
Authors: Danish Zia Baig, Mohsin Kamal • Published: 2025-08-21 • Source: arXiv
Conventional car damage inspection techniques are labor-intensive, manual, and frequently overlook tiny surface imperfections such as microscopic dents. Machine learning provides an innovative solution to the increasing demand for quicker and more precise inspection methods. This paper uses the YOLOv8 object recognition framework to provide a deep learning-based solution for automatically detecting microscopic surface flaws, notably tiny dents, on car exteriors. Traditional automotive damage inspection procedures are manual, time-consuming, and frequently unreliable at detecting tiny flaws. To solve this, a bespoke dataset containing annotated photos of car surfaces under various lighting conditions, angles, and textures was created. To improve robustness, the YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, were trained using real-time data augmentation. Experimental results show that the technique has high detection accuracy and low inference latency, making it suited for real-time applications such as automated insurance evaluations and vehicle inspections. Evaluation metrics such as mean Average Precision (mAP), precision, recall, and F1-score verified the model's efficacy. With a precision of 0.86, recall of 0.84, and F1-score of 0.85, the YOLOv8m-t42 model outperformed the YOLOv8m-t4 model (precision: 0.81, recall: 0.79, F1-score: 0.80) in identifying microscopic surface defects. The mAP@0.5 for YOLOv8m-t42 stabilized at 0.60, with a somewhat lower mAP@0.5:0.95 of 0.20. Furthermore, YOLOv8m-t42's PR-curve area was 0.88, suggesting more consistent performance than YOLOv8m-t4 (0.82). YOLOv8m-t42 thus offers greater accuracy and is more appropriate for practical dent detection applications, even though it converges more slowly.
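The reported scores are internally consistent (our arithmetic):

```latex
% Consistency check of the reported precision/recall/F1 values:
\[
  F_1 \;=\; \frac{2PR}{P+R}
      \;=\; \frac{2 \times 0.86 \times 0.84}{0.86 + 0.84} \;\approx\; 0.85
      \qquad\text{(YOLOv8m-t42)},
\]
% and likewise 2(0.81)(0.79)/1.60 = 0.80 for YOLOv8m-t4, matching the text.
```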
27. Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation
Authors: Huy Hoang Nguyen, Johannes Huemer, Markus Murschitz, Tobias Glueck, Minh Nhat Vu, Andreas Kugi • Published: 2025-08-21 • Source: arXiv
The logistics and construction industries face persistent challenges in automating pallet handling, especially in outdoor environments with variable payloads, inconsistencies in pallet quality and dimensions, and unstructured surroundings. In this paper, we tackle automation of a critical step in pallet transport: the pallet pick-up operation. Our work is motivated by labor shortages, safety concerns, and inefficiencies in manually locating and retrieving pallets under such conditions. We present Lang2Lift, a framework that leverages foundation models for natural language-guided pallet detection and 6D pose estimation, enabling operators to specify targets through intuitive commands such as "pick up the steel beam pallet near the crane." The perception pipeline integrates Florence-2 and SAM-2 for language-grounded segmentation with FoundationPose for robust pose estimation in cluttered, multi-pallet outdoor scenes under variable lighting. The resulting poses feed into a motion planning module for fully autonomous forklift operation. We validate Lang2Lift on the ADAPT autonomous forklift platform, achieving 0.76 mIoU pallet segmentation accuracy on a real-world test dataset. Timing and error analysis demonstrate the system's robustness and confirm its feasibility for deployment in operational logistics and construction environments. Video demonstrations are available at https://eric-nguyen1402.github.io/lang2lift.github.io/
28. Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems
Authors: Frederik Vandeputte • Published: 2025-08-21 • Source: arXiv
Generative AI (GenAI) has emerged as a transformative technology, demonstrating remarkable capabilities across diverse application domains. However, GenAI faces several major challenges in developing reliable and efficient GenAI-empowered systems due to its unpredictability and inefficiency. This paper advocates for a paradigm shift: future GenAI-native systems should integrate GenAI's cognitive capabilities with traditional software engineering principles to create robust, adaptive, and efficient systems. We introduce foundational GenAI-native design principles centered around five key pillars -- reliability, excellence, evolvability, self-reliance, and assurance -- and propose architectural patterns such as GenAI-native cells, organic substrates, and programmable routers to guide the creation of resilient and self-evolving systems. Additionally, we outline the key ingredients of a GenAI-native software stack and discuss the impact of these systems from technical, user adoption, economic, and legal perspectives, underscoring the need for further validation and experimentation. Our work aims to inspire future research and encourage relevant communities to implement and refine this conceptual framework.
29. Multilateralism in the Global Governance of Artificial Intelligence
Authors: Michal Natorski • Published: 2025-08-21 • Source: arXiv
This chapter examines how international multilateralism addresses the emergence of the general-purpose technology of Artificial Intelligence. Specifically, it analyses two key features of AI multilateralism: its generalized principles and the coordination of state relations in the realm of AI. First, it distinguishes the generalized principles of AI multilateralism: epochal change, determinism, and dialectical understanding. Second, the adaptation of multilateralism to AI has led to the integration of AI issues into the agendas of existing cooperation frameworks and to the creation of new ad hoc frameworks focusing exclusively on AI. In both cases, AI multilateralism develops in the shadow of the state hierarchy in relations with other AI stakeholders. While AI multilateralism is multi-stakeholder, and the hierarchy between state and non-state actors may seem blurred, states retain their role as the decisive decision-makers in agenda-setting, negotiation, and implementation of soft-law international commitments.
30. Spiking Variational Graph Representation Inference for Video Summarization
Authors: Wenrui Li, Wei Han, Liang-Jian Deng, Ruiqin Xiong, Xiaopeng Fan • Published: 2025-08-21 • Source: arXiv
With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at https://github.com/liwrui/SpiVG.
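The ELBO referenced above has the standard form (general statement, not the paper's specific parameterization):

```latex
% Evidence Lower Bound (ELBO) in its standard variational-inference form:
\[
  \log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
  \;=\; \mathrm{ELBO}(\theta,\phi;x).
\]
% Maximizing it fits the latent structure of the fused features, while the KL
% term regularizes the posterior, the overfitting control mentioned above.
```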
31. Bladder Cancer Diagnosis with Deep Learning: A Multi-Task Framework and Online Platform
Authors: Jinliang Yu, Mingduo Xie, Yue Wang, Tianfan Fu, Xianglai Xu, Jiajun Wang • Published: 2025-08-21 • Source: arXiv
Clinical cystoscopy, the current standard for bladder cancer diagnosis, suffers from significant reliance on physician expertise, leading to variability and subjectivity in diagnostic outcomes. There is an urgent need for objective, accurate, and efficient computational approaches to improve bladder cancer diagnostics. Leveraging recent advancements in deep learning, this study proposes an integrated multi-task deep learning framework specifically designed for bladder cancer diagnosis from cystoscopic images. Our framework includes a robust classification model using EfficientNet-B0 enhanced with Convolutional Block Attention Module (CBAM), an advanced segmentation model based on ResNet34-UNet++ architecture with self-attention mechanisms and attention gating, and molecular subtyping using ConvNeXt-Tiny to classify molecular markers such as HER-2 and Ki-67. Additionally, we introduce a Gradio-based online diagnostic platform integrating all developed models, providing intuitive features including multi-format image uploads, bilingual interfaces, and dynamic threshold adjustments. Extensive experimentation demonstrates the effectiveness of our methods, achieving outstanding accuracy (93.28%), F1-score (82.05%), and AUC (96.41%) for classification tasks, and exceptional segmentation performance indicated by a Dice coefficient of 0.9091. The online platform significantly improved the accuracy, efficiency, and accessibility of clinical bladder cancer diagnostics, enabling practical and user-friendly deployment. The code is publicly available. Our multi-task framework and integrated online tool collectively advance the field of intelligent bladder cancer diagnosis by improving clinical reliability, supporting early tumor detection, and enabling real-time diagnostic feedback. These contributions mark a significant step toward AI-assisted decision-making in urology.
32. Transfer learning optimization based on evolutionary selective fine tuning
Authors: Jacinto Colan, Ana Davila, Yasuhisa Hasegawa • Published: 2025-08-21 • Source: arXiv
Deep learning has shown substantial progress in image analysis. However, the computational demands of large, fully trained models remain a consideration. Transfer learning offers a strategy for adapting pre-trained models to new tasks. Traditional fine-tuning often involves updating all model parameters, which can potentially lead to overfitting and higher computational costs. This paper introduces BioTune, an evolutionary adaptive fine-tuning technique that selectively fine-tunes layers to enhance transfer learning efficiency. BioTune employs an evolutionary algorithm to identify a focused set of layers for fine-tuning, aiming to optimize model performance on a given target task. Evaluation across nine image classification datasets from various domains indicates that BioTune achieves competitive or improved accuracy and efficiency compared to existing fine-tuning methods such as AutoRGN and LoRA. By concentrating the fine-tuning process on a subset of relevant layers, BioTune reduces the number of trainable parameters, potentially leading to decreased computational cost and facilitating more efficient transfer learning across diverse data characteristics and distributions.
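A stylized sketch of evolutionary layer selection (ours, not BioTune's implementation): individuals are bitmasks over layers (1 = fine-tune, 0 = freeze), and a simple mutate-and-select loop climbs toward masks with better validation fitness.

```python
# Evolutionary search over which layers to fine-tune (illustrative sketch).
import random

def evolve(n_layers: int, fitness, generations=30, pop=8, seed=0):
    random.seed(seed)
    population = [[random.randint(0, 1) for _ in range(n_layers)]
                  for _ in range(pop)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop // 2]                    # truncation selection
        children = [[bit ^ (random.random() < 0.1)      # per-bit flip mutation
                     for bit in random.choice(parents)]
                    for _ in range(pop // 2)]
        population = parents + children
    return max(population, key=fitness)

# Toy fitness: pretend only the last 3 of 12 layers help, minus a per-layer cost
# standing in for the computational price of each trainable layer.
fit = lambda mask: sum(mask[-3:]) - 0.1 * sum(mask)
print(evolve(12, fit))   # converges toward [0, ..., 0, 1, 1, 1]
```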
33. ExBigBang: A Dynamic Approach for Explainable Persona Classification through Contextualized Hybrid Transformer Analysis
Authors: Saleh Afzoon, Amin Beheshti, Nabi Rezvani, Farshad Khunjush, Usman Naseem, John McMahon, Zahra Fathollahi, Mahdieh Labani, Wathiq Mansoor, Xuyun Zhang • Published: 2025-08-21 • Source: arXiv
In user-centric design, persona development plays a vital role in understanding user behaviour, capturing needs, segmenting audiences, and guiding design decisions. However, the growing complexity of user interactions calls for a more contextualized approach to ensure designs align with real user needs. While earlier studies have advanced persona classification by modelling user behaviour, capturing contextual information, especially by integrating textual and tabular data, remains a key challenge. These models also often lack explainability, leaving their predictions difficult to interpret or justify. To address these limitations, we present ExBigBang (Explainable BigBang), a hybrid text-tabular approach that uses transformer-based architectures to model rich contextual features for persona classification. ExBigBang incorporates metadata, domain knowledge, and user profiling to embed deeper context into predictions. Through a cyclical process of user profiling and classification, our approach dynamically updates to reflect evolving user behaviours. Experiments on a benchmark persona classification dataset demonstrate the robustness of our model. An ablation study confirms the benefits of combining text and tabular data, while Explainable AI techniques shed light on the rationale behind the model's predictions.
34. Databelt: A Continuous Data Path for Serverless Workflows in the 3D Compute Continuum
Authors: Cynthia Marcelino, Leonard Guelmino, Thomas Pusztai, Stefan Nastic • Published: 2025-08-21 • Source: arXiv
Typically, serverless functions rely on remote storage services for managing state, which can result in increased latency and network communication overhead. In a dynamic environment such as the 3D (Edge-Cloud-Space) Compute Continuum, serverless functions face additional challenges due to frequent changes in network topology. As satellites move in and out of the range of ground stations, functions must make multiple hops to access cloud services, leading to high-latency state access and unnecessary data transfers. In this paper, we present Databelt, a state management framework for serverless workflows designed for the dynamic environment of the 3D Compute Continuum. Databelt introduces an SLO-aware state propagation mechanism that enables the function state to move continuously in orbit. Databelt proactively offloads function states to the most suitable node, such that when functions execute, the data is already present on the execution node or nearby, thus minimizing state access latency and reducing the number of network hops. Additionally, Databelt introduces a function state fusion mechanism that abstracts state management for functions sharing the same serverless runtime. When functions are fused, Databelt seamlessly retrieves their state as a group, reducing redundant network and storage operations and improving overall workflow efficiency. Our experimental results show that Databelt reduces workflow execution time by up to 66% and increases throughput by 50% compared to the baselines. Furthermore, our results show that Databelt's function state fusion reduces storage operation latency by up to 20% by eliminating repetitive storage requests for functions within the same runtime, ensuring efficient execution of serverless workflows in highly dynamic network environments such as the 3D Continuum.
35. Conflict-Aware Soft Prompting for Retrieval-Augmented Generation
Authors: Eunseong Choi, June Park, Hyeri Lee, Jongwuk Lee • Published: 2025-08-21 • Source: arXiv
Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM's parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.
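A conceptual sketch of the context-assessor idea (our simplification, not CARE's trained architecture): compress retrieved-context embeddings into a few memory-token vectors that are prepended to the model input as a soft prompt.

```python
# Toy compression of context-token embeddings into "memory tokens" (ours):
# mean-pool contiguous chunks; CARE instead trains an assessor end-to-end.
import numpy as np

def memory_tokens(context_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """context_emb: (seq_len, d). Mean-pool k contiguous chunks -> (k, d)."""
    chunks = np.array_split(context_emb, k)
    return np.stack([c.mean(axis=0) for c in chunks])

rng = np.random.default_rng(0)
ctx = rng.normal(size=(120, 64))        # embeddings of a retrieved passage
q = rng.normal(size=(16, 64))           # embeddings of the user question
mem = memory_tokens(ctx)                # 4 compact memory-token vectors
llm_input = np.vstack([mem, q])         # soft prompt prepended to the question
print(llm_input.shape)                  # (20, 64)
```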