1. Calibration-Aware Prompt Learning for Medical Vision-Language Models
Authors: Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan β’
Published: 2025-09-18 β’
Source: arXiv
Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable performance across diverse medical imaging tasks by leveraging large-scale image-text pretraining. However, their confidence calibration is largely unexplored, and so remains a significant challenge. As such, miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives under scarce labeled data regime. First, we study a regularizer that attempts to align the smoothed accuracy with the predicted model confidences. Second, we introduce an angular separation loss to maximize textual feature proximity toward improving the reliability in confidence estimates of multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets reveal that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.
2. Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Authors: Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, Stefano Mattoccia β’
Published: 2025-09-18 β’
Source: arXiv
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either a vanilla one like Depth Anything v2 (DAv2), or deriving from it a novel recurrent architecture to infer depth from monocular event cameras. We evaluate our approach with synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
3. Parameter sensitivity of cosmic pairwise velocities in the non-linear regime of structure formation
Authors: Jorge Enrique GarcΓa-Farieta, HΓ©ctor J. HortΓΊa β’
Published: 2025-09-18 β’
Source: arXiv
The peculiar velocities of dark matter tracers drive the growth of cosmic structures, providing a sensitive test of cosmological models and strengthening constraints on the nature of dark energy. In this work, we investigate the mean pairwise velocities, $v_{12}$, of dark matter tracers as a cosmological probe in the non-linear regime of cosmic structure formation. Using N-body dark matter-only simulations, we measure $v_{12}$ for pair separations up to 50 $h^{-1}$Mpc and model it by solving the pair conservation equation for a self-gravitating particle system, along with various prescriptions of the nonlinear matter power spectrum. We quantified the sensitivity of $v_{12}$ to variations in key cosmological parameters such as $\Omega_{\mathrm{m}}$, $\sigma_8$, $h$, $M_\nu$, and $w$. Our parameter inference analysis using MCMC shows sub-11% agreement with simulation data, with notable degeneracies, particularly between $\Omega_\mathrm{m}$ and $\sigma_8$. We further compute the stable clustering crossing scale across redshifts $z=0$, $0.5$, and $1$, assessing its dependence on cosmology. Among the tested power spectrum modeling approaches, we find that the CSSTEmu emulator provides the most accurate predictions, with deviations below 5% for $r > 10$ $h^{-1}$Mpc at $z=0.5$. Our results are validated using independent simulation suites, demonstrating that our framework offers a robust method for extracting cosmological constraints from upcoming peculiar velocity data.
4. Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model
Authors: Fangjinhua Wang, Qingshan Xu, Yew-Soon Ong, Marc Pollefeys β’
Published: 2025-09-18 β’
Source: arXiv
To reconstruct the 3D geometry from calibrated images, learning-based multi-view stereo (MVS) methods typically perform multi-view depth estimation and then fuse depth maps into a mesh or point cloud. To improve the computational efficiency, many methods initialize a coarse depth map and then gradually refine it in higher resolutions. Recently, diffusion models achieve great success in generation tasks. Starting from a random noise, diffusion models gradually recover the sample with an iterative denoising process. In this paper, we propose a novel MVS framework, which introduces diffusion models in MVS. Specifically, we formulate depth refinement as a conditional diffusion process. Considering the discriminative characteristic of depth estimation, we design a condition encoder to guide the diffusion process. To improve efficiency, we propose a novel diffusion network combining lightweight 2D U-Net and convolutional GRU. Moreover, we propose a novel confidence-based sampling strategy to adaptively sample depth hypotheses based on the confidence estimated by diffusion model. Based on our novel MVS framework, we propose two novel MVS methods, DiffMVS and CasDiffMVS. DiffMVS achieves competitive performance with state-of-the-art efficiency in run-time and GPU memory. CasDiffMVS achieves state-of-the-art performance on DTU, Tanks & Temples and ETH3D. Code is available at: https://github.com/cvg/diffmvs.
5. Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models
Authors: Sreejato Chatterjee, Linh Tran, Quoc Duy Nguyen, Roni Kirson, Drue Hamlin, Harvest Aquino, Hanjia Lyu, Jiebo Luo, Timothy Dye β’
Published: 2025-09-18 β’
Source: arXiv
Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and often have relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts. To support reproducible evaluation, we release an open-sourced benchmark dataset for assessing LLMs on oppression measurement (https://github.com/chattergpt/llm-oppression-benchmark).
6. Competing and Intertwined Orders in Boson-Doped Mott Antiferromagnets
Authors: Xin Lu, Jia-Xin Zhang, Lukas Homeier, Hong-Chen Jiang, Shou-Shu Gong, D. N. Sheng, Zheng-Yu Weng β’
Published: 2025-09-18 β’
Source: arXiv
Inspired by the recent experimental advances in cold atom quantum simulators, we explore the experimentally implemented bosonic $t$-$t'$-$J$ model on the square lattice using large-scale density matrix renormalization group simulations. By tuning the doping level $\delta$ and hopping ratio $t'/t$, we uncover six distinct quantum phases, several of which go far beyond the conventional paradigm of phase-coherent superfluidity (SF) expected for bosonic systems. In particular, in the presence of antiferromagnetic (AFM) order, doped holes are tightly bound into pairs, giving rise to a pair density wave (PDW) phase at low doping and small $|t'/t|$, which is suppressed on the $t'<0$ side, resulting in a disordered PDW state that lacks coherence of either individual bosons or pairs. Upon further doping, bosons can regain phase coherence and form a SF* state, characterized by condensation at emergent incommensurate momenta concurrent with an incommensurate magnetic order. On the $t'>0$ side, the sign-induced kinetic frustration inherently disfavors local AFM correlations, leading to a phase separation in which doped holes cluster into ferromagnetic (FM) domains spatially separated by undoped AFM regions. Upon further doping, this inhomogeneous state evolves into a uniform SF + $xy$-FM phase. Finally, we propose a concrete experimental scheme to realize both signs of $t'/t$ in Rydberg tweezer arrays, with an explicit mapping between model parameters and experimentally accessible regimes. Our results reveal competing and intertwined orders in doped antiferromagnets, which are relevant to central issues in high-$T_c$ superconductivity, reflecting the frustrated interplay between doped holes and spin background.
7. Explicit Context-Driven Neural Acoustic Modeling for High-Fidelity RIR Generation
Authors: Chen Si, Qianyi Wu, Chaitanya Amballa, Romit Roy Choudhury β’
Published: 2025-09-18 β’
Source: arXiv
Realistic sound simulation plays a critical role in many applications. A key element in sound simulation is the room impulse response (RIR), which characterizes how sound propagates from a source to a listener within a given space. Recent studies have applied neural implicit methods to learn RIR using context information collected from the environment, such as scene images. However, these approaches do not effectively leverage explicit geometric information from the environment. To further exploit the potential of neural implicit models with direct geometric features, we present Mesh-infused Neural Acoustic Field (MiNAF), which queries a rough room mesh at given locations and extracts distance distributions as an explicit representation of local context. Our approach demonstrates that incorporating explicit local geometric features can better guide the neural network in generating more accurate RIR predictions. Through comparisons with conventional and state-of-the-art baseline methods, we show that MiNAF performs competitively across various evaluation metrics. Furthermore, we verify the robustness of MiNAF in datasets with limited training samples, demonstrating an advance in high-fidelity sound simulation.
8. What's the Best Way to Retrieve Slides? A Comparative Study of Multimodal, Caption-Based, and Hybrid Retrieval Techniques
Authors: Petros Stylianos Giouroukis, Dimitris Dimitriadis, Dimitrios Papadopoulos, Zhenwen Shao, Grigorios Tsoumakas β’
Published: 2025-09-18 β’
Source: arXiv
Slide decks, serving as digital reports that bridge the gap between presentation slides and written documents, are a prevalent medium for conveying information in both academic and corporate settings. Their multimodal nature, combining text, images, and charts, presents challenges for retrieval-augmented generation systems, where the quality of retrieval directly impacts downstream performance. Traditional approaches to slide retrieval often involve separate indexing of modalities, which can increase complexity and lose contextual information. This paper investigates various methodologies for effective slide retrieval, including visual late-interaction embedding models like ColPali, the use of visual rerankers, and hybrid retrieval techniques that combine dense retrieval with BM25, further enhanced by textual rerankers and fusion methods like Reciprocal Rank Fusion. A novel Vision-Language Models-based captioning pipeline is also evaluated, demonstrating significantly reduced embedding storage requirements compared to visual late-interaction techniques, alongside comparable retrieval performance. Our analysis extends to the practical aspects of these methods, evaluating their runtime performance and storage demands alongside retrieval efficacy, thus offering practical guidance for the selection and development of efficient and robust slide retrieval systems for real-world applications.
9. Voyager: An End-to-End Framework for Design-Space Exploration and Generation of DNN Accelerators
Authors: Kartik Prabhu, Jeffrey Yu, Xinyuan Allen Pan, Zhouhua Xie, Abigail Aleshire, Zihan Chen, Ammar Ali Ratnani, Priyanka Raina β’
Published: 2025-09-18 β’
Source: arXiv
While deep neural networks (DNNs) have achieved state-of-the-art performance in fields from computer vision to natural language processing, efficiently running these computationally demanding models requires hardware accelerators. However, designing these accelerators is a time-consuming, labor-intensive process that does not scale well. While prior efforts have sought to automate DNN accelerator generation, they offer limited parameterization, cannot produce high-performance, tapeout-ready designs, provide limited support for datatypes and quantization schemes, and lack an integrated, end-to-end software compiler. This work proposes Voyager, a high-level synthesis (HLS)-based framework for design space exploration (DSE) and generation of DNN accelerators. Voyager overcomes the limitations of prior work by offering extensive configurability across technology nodes, clock frequencies, and scales, with customizable parameters such as number of processing elements, on-chip buffer sizes, and external memory bandwidth. Voyager supports a wider variety of datatypes and quantization schemes versus prior work, including both built-in floating-point, posit and integer formats, as well as user-defined formats with both per-tensor scaling and microscaling quantization. Voyager's PyTorch-based compiler efficiently maps networks end-to-end on the generated hardware, with support for quantization, fusion, and tiling. We evaluate Voyager on state-of-the-art vision and language models. Voyager enables fast DSE with full-dataset accuracy evaluation for datatypes and quantization schemes. Generated designs achieve a high utilization across models and scales, up to 99.8%, and outperform prior generators with up to 61% lower latency and 56% lower area. Compared to hand-optimized accelerators, Voyager achieves comparable performance, while offering much greater automation in design and workload mapping.
10. CausalPre: Scalable and Effective Data Pre-processing for Causal Fairness
Authors: Ying Zheng, Yangfan Jiang, Kian-Lee Tan β’
Published: 2025-09-18 β’
Source: arXiv
Causal fairness in databases is crucial to preventing biased and inaccurate outcomes in downstream tasks. While most prior work assumes a known causal model, recent efforts relax this assumption by enforcing additional constraints. However, these approaches often fail to capture broader attribute relationships that are critical to maintaining utility. This raises a fundamental question: Can we harness the benefits of causal reasoning to design efficient and effective fairness solutions without relying on strong assumptions about the underlying causal model? In this paper, we seek to answer this question by introducing CausalPre, a scalable and effective causality-guided data pre-processing framework that guarantees justifiable fairness, a strong causal notion of fairness. CausalPre extracts causally fair relationships by reformulating the originally complex and computationally infeasible extraction task into a tailored distribution estimation problem. To ensure scalability, CausalPre adopts a carefully crafted variant of low-dimensional marginal factorization to approximate the joint distribution, complemented by a heuristic algorithm that efficiently tackles the associated computational challenge. Extensive experiments on benchmark datasets demonstrate that CausalPre is both effective and scalable, challenging the conventional belief that achieving causal fairness requires trading off relationship coverage for relaxed model assumptions.
11. Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
Authors: Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo β’
Published: 2025-09-18 β’
Source: arXiv
Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the long decoding-window problem, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions like semi-autoregressive address this issue by splitting windows into blocks, but this sacrifices speed and bidirectionality, eliminating the main advantage of diffusion models. To overcome this, we propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context. Our methods achieve state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with significantly lower step size than previous works, demonstrating both speed and quality improvements.
12. Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Authors: Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou β’
Published: 2025-09-18 β’
Source: arXiv
Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
13. SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models
Authors: Huy Nghiem, Advik Sachdeva, Hal DaumΓ© III β’
Published: 2025-09-18 β’
Source: arXiv
WARNING: This paper contains examples of offensive materials. Toxic content has become pervasive on social media platforms. We introduce SMARTER, a data-efficient two-stage framework for explainable content moderation using Large Language Models (LLMs). In Stage 1, we leverage LLMs' own outputs to generate synthetic explanations for both correct and incorrect labels, enabling alignment via preference optimization with minimal human supervision. In Stage 2, we refine explanation quality through cross-model training, allowing weaker models to align stylistically and semantically with stronger ones. Experiments on three benchmark tasks -- HateXplain, Latent Hate, and Implicit Hate -- demonstrate that SMARTER enables LLMs to achieve up to a 13.5% macro-F1 improvement over standard few-shot baselines while using only a fraction of the full training data. Our framework offers a scalable strategy for low-resource settings by harnessing LLMs' self-improving capabilities for both classification and explanation.
14. Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Authors: Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani β’
Published: 2025-09-18 β’
Source: arXiv
Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts. While inference-time methods can mitigate these inconsistencies, they fail to address the core problem: LMs struggle to reliably select reasoning pathways leading to consistent outcomes under exploratory sampling. To address this, we formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA), a reinforcement learning framework that post-trains models to favor reasoning trajectories aligned with their internal consensus using majority/minority outcomes from multi-agent debate. These trajectories emerge from deliberative exchanges where agents ground reasoning in peer arguments, not just aggregation of independent attempts, creating richer consensus signals than single-round majority voting. MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision, driving substantial improvements across self-consistency (+27.6% on GSM8K), single-agent reasoning (+23.7% on MATH), sampling-based inference (+22.4% Pass@20 on MATH), and multi-agent ensemble decision-making (+42.7% on MathQA). These findings, coupled with strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA), demonstrate robust self-alignment that more reliably unlocks latent reasoning potential of language models.
15. Semi-Supervised 3D Medical Segmentation from 2D Natural Images Pretrained Model
Authors: Pak-Hei Yeung, Jayroop Ramesh, Pengfei Lyu, Ana Namburete, Jagath Rajapakse β’
Published: 2025-09-18 β’
Source: arXiv
This paper explores the transfer of knowledge from general vision models pretrained on 2D natural images to improve 3D medical image segmentation. We focus on the semi-supervised setting, where only a few labeled 3D medical images are available, along with a large set of unlabeled images. To tackle this, we propose a model-agnostic framework that progressively distills knowledge from a 2D pretrained model to a 3D segmentation model trained from scratch. Our approach, M&N, involves iterative co-training of the two models using pseudo-masks generated by each other, along with our proposed learning rate guided sampling that adaptively adjusts the proportion of labeled and unlabeled data in each training batch to align with the models' prediction accuracy and stability, minimizing the adverse effect caused by inaccurate pseudo-masks. Extensive experiments on multiple publicly available datasets demonstrate that M&N achieves state-of-the-art performance, outperforming thirteen existing semi-supervised segmentation approaches under all different settings. Importantly, ablation studies show that M&N remains model-agnostic, allowing seamless integration with different architectures. This ensures its adaptability as more advanced models emerge. The code is available at https://github.com/pakheiyeung/M-N.
16. Top- and bottom-heavy vertical velocity structures: physical modes of layered atmospheric models
Authors: Fiaz Ahmed, J. David Neelin β’
Published: 2025-09-18 β’
Source: arXiv
Tropical East and West Pacific Oceans display differences in their vertical velocity (or omega) profiles. The East Pacific is characterized by bottom-heavy profiles, while the West Pacific is characterized by top-heavy profiles. Although inter-basin differences in the horizontal SST gradient are known to be important, physical reasons for why these omega structure variants exist are not fully understood. This question is addressed using a steady, linear model on an $f$-plane with $n$ atmospheric layers. Convection and radiation are parameterized as linear responses to thermodynamic perturbations with convective nonlinearity approximated by convection on/off regimes. The free (or eigen) modes of the model yield vertical structures resembling the observed baroclinic modes of the tropical atmosphere, with each mode associated with a characteristic horizontal scale (the eigenvalue). In the standard parameter regime, the first-baroclinic mode has a large spatial scale ($\sim$ 1500 km) while the second-baroclinic mode has a smaller spatial scale ($\sim$ 250 km). When the model is forced with a strong- and weak-gradient surface temperature ($T_s$) patterns, the resulting omega profiles assume bottom- and top-heavy structures respectively -- mimicking the observed differences between East and West Pacific Oceans. Additional dependence on the magnitude of the Coriolis force is also observed. The connection between the vertical structure and the horizontal scale of the baroclinic modes explains why a strong-gradient $T_s$ profile projects strongly onto the second-baroclinic mode yielding bottom-heavy omega profiles in the eastern Pacific, while a weak-gradient $T_s$ profile projects strongly onto the first-baroclinic mode, yielding top-heavy omega profiles typical of the Western Pacific.
17. AIP: Subverting Retrieval-Augmented Generation via Adversarial Instructional Prompt
Authors: Saket S. Chaturvedi, Gaurav Bagwe, Lan Zhang, Xiaoyong Yuan β’
Published: 2025-09-18 β’
Source: arXiv
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources to improve factual accuracy and verifiability. However, this reliance introduces new attack surfaces within the retrieval pipeline, beyond the LLM itself. While prior RAG attacks have exposed such vulnerabilities, they largely rely on manipulating user queries, which is often infeasible in practice due to fixed or protected user inputs. This narrow focus overlooks a more realistic and stealthy vector: instructional prompts, which are widely reused, publicly shared, and rarely audited. Their implicit trust makes them a compelling target for adversaries to manipulate RAG behavior covertly. We introduce a novel attack for Adversarial Instructional Prompt (AIP) that exploits adversarial instructional prompts to manipulate RAG outputs by subtly altering retrieval behavior. By shifting the attack surface to the instructional prompts, AIP reveals how trusted yet seemingly benign interface components can be weaponized to degrade system integrity. The attack is crafted to achieve three goals: (1) naturalness, to evade user detection; (2) utility, to encourage use of prompts; and (3) robustness, to remain effective across diverse query variations. We propose a diverse query generation strategy that simulates realistic linguistic variation in user queries, enabling the discovery of prompts that generalize across paraphrases and rephrasings. Building on this, a genetic algorithm-based joint optimization is developed to evolve adversarial prompts by balancing attack success, clean-task utility, and stealthiness. Experimental results show that AIP achieves up to 95.23% ASR while preserving benign functionality. These findings uncover a critical and previously overlooked vulnerability in RAG systems, emphasizing the need to reassess the shared instructional prompts.
18. Hunting the elusive $X17$ in CE$Ξ½$NS at the ESS
Authors: Joakim CederkΓ€ll, YaΕar HiΓ§yΔ±lmaz, Else Lytken, Stefano Moretti, Johan Rathsman β’
Published: 2025-09-18 β’
Source: arXiv
The so-called $X17$ particle has been proposed in order to explain a very significant resonant behaviour (in both the angular separation and invariant mass) of $e^+e^-$ pairs produced during a nuclear transition of excited $^8$Be, $^4$He and $^{12}$C nuclei. Fits to the corresponding data point, as most probable explanation, to a spin-1 object, which is protophobic and has a mass of approximately 16.7 MeV, which then makes the $X17$ potentially observable in Coherent Elastic neutrino ($\nu$) Nucleus Scattering (CE$\nu$NS) at the European Spallation Source (ESS). By adopting as theoretical framework a minimal extension of the Standard Model (SM) with a generic $U(1)'$ gauge group mixing with the hypercharge one of the latter, which can naturally accommodate the $X17$ state compliant with all available measurements from a variety of experiments, we predict that CE$\nu$NS at the ESS will constitute an effective means to probe this hypothesis, even after allowing for the inevitable systematics associated to the performance of the planned detectors therein.
19. Digital Twin-based Cooperative Autonomous Driving in Smart Intersections: A Multi-Agent Reinforcement Learning Approach
Authors: Taoyuan Yu, Kui Wang, Zongdian Li, Tao Yu, Kei Sakaguchi, Walid Saad β’
Published: 2025-09-18 β’
Source: arXiv
Unsignalized intersections pose safety and efficiency challenges due to complex traffic flows and blind spots. In this paper, a digital twin (DT)-based cooperative driving system with roadside unit (RSU)-centric architecture is proposed for enhancing safety and efficiency at unsignalized intersections. The system leverages comprehensive bird-eye-view (BEV) perception to eliminate blind spots and employs a hybrid reinforcement learning (RL) framework combining offline pre-training with online fine-tuning. Specifically, driving policies are initially trained using conservative Q-learning (CQL) with behavior cloning (BC) on real datasets, then fine-tuned using multi-agent proximal policy optimization (MAPPO) with self-attention mechanisms to handle dynamic multi-agent coordination. The RSU implements real-time commands via vehicle-to-infrastructure (V2I) communications. Experimental results show that the proposed method yields failure rates below 0.03\% coordinating up to three connected autonomous vehicles (CAVs), significantly outperforming traditional methods. In addition, the system exhibits sub-linear computational scaling with inference times under 40 ms. Furthermore, it demonstrates robust generalization across diverse unsignalized intersection scenarios, indicating its practicality and readiness for real-world deployment.
20. Transplant-Ready? Evaluating AI Lung Segmentation Models in Candidates with Severe Lung Disease
Authors: Jisoo Lee, Michael R. Harowicz, Yuwen Chen, Hanxue Gu, Isaac S. Alderete, Lin Li, Maciej A. Mazurowski, Matthew G. Hartwig β’
Published: 2025-09-18 β’
Source: arXiv
This study evaluates publicly available deep-learning based lung segmentation models in transplant-eligible patients to determine their performance across disease severity levels, pathology categories, and lung sides, and to identify limitations impacting their use in preoperative planning in lung transplantation. This retrospective study included 32 patients who underwent chest CT scans at Duke University Health System between 2017 and 2019 (total of 3,645 2D axial slices). Patients with standard axial CT scans were selected based on the presence of two or more lung pathologies of varying severity. Lung segmentation was performed using three previously developed deep learning models: Unet-R231, TotalSegmentator, MedSAM. Performance was assessed using quantitative metrics (volumetric similarity, Dice similarity coefficient, Hausdorff distance) and a qualitative measure (four-point clinical acceptability scale). Unet-R231 consistently outperformed TotalSegmentator and MedSAM in general, for different severity levels, and pathology categories (p<0.05). All models showed significant performance declines from mild to moderate-to-severe cases, particularly in volumetric similarity (p<0.05), without significant differences among lung sides or pathology types. Unet-R231 provided the most accurate automated lung segmentation among evaluated models with TotalSegmentator being a close second, though their performance declined significantly in moderate-to-severe cases, emphasizing the need for specialized model fine-tuning in severe pathology contexts.
21. Constrained Feedback Learning for Non-Stationary Multi-Armed Bandits
Authors: Shaoang Li, Jian Li β’
Published: 2025-09-18 β’
Source: arXiv
Non-stationary multi-armed bandits enable agents to adapt to changing environments by incorporating mechanisms to detect and respond to shifts in reward distributions, making them well-suited for dynamic settings. However, existing approaches typically assume that reward feedback is available at every round - an assumption that overlooks many real-world scenarios where feedback is limited. In this paper, we take a significant step forward by introducing a new model of constrained feedback in non-stationary multi-armed bandits, where the availability of reward feedback is restricted. We propose the first prior-free algorithm - that is, one that does not require prior knowledge of the degree of non-stationarity - that achieves near-optimal dynamic regret in this setting. Specifically, our algorithm attains a dynamic regret of $\tilde{\mathcal{O}}({K^{1/3} V_T^{1/3} T }/{ B^{1/3}})$, where $T$ is the number of rounds, $K$ is the number of arms, $B$ is the query budget, and $V_T$ is the variation budget capturing the degree of non-stationarity.
22. Semantic-LiDAR-Inertial-Wheel Odometry Fusion for Robust Localization in Large-Scale Dynamic Environments
Authors: Haoxuan Jiang, Peicong Qian, Yusen Xie, Linwei Zheng, Xiaocong Li, Ming Liu, Jun Ma β’
Published: 2025-09-18 β’
Source: arXiv
Reliable, drift-free global localization presents significant challenges yet remains crucial for autonomous navigation in large-scale dynamic environments. In this paper, we introduce a tightly-coupled Semantic-LiDAR-Inertial-Wheel Odometry fusion framework, which is specifically designed to provide high-precision state estimation and robust localization in large-scale dynamic environments. Our framework leverages an efficient semantic-voxel map representation and employs an improved scan matching algorithm, which utilizes global semantic information to significantly reduce long-term trajectory drift. Furthermore, it seamlessly fuses data from LiDAR, IMU, and wheel odometry using a tightly-coupled multi-sensor fusion Iterative Error-State Kalman Filter (iESKF). This ensures reliable localization without experiencing abnormal drift. Moreover, to tackle the challenges posed by terrain variations and dynamic movements, we introduce a 3D adaptive scaling strategy that allows for flexible adjustments to wheel odometry measurement weights, thereby enhancing localization precision. This study presents extensive real-world experiments conducted in a one-million-square-meter automated port, encompassing 3,575 hours of operational data from 35 Intelligent Guided Vehicles (IGVs). The results consistently demonstrate that our system outperforms state-of-the-art LiDAR-based localization methods in large-scale dynamic environments, highlighting the framework's reliability and practical value.
23. Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems
Authors: Diego Gosmar, Deborah A. Dahl β’
Published: 2025-09-18 β’
Source: arXiv
This paper proposes a novel architectural framework aimed at enhancing security and reliability in multi-agent systems (MAS). A central component of this framework is a network of Sentinel Agents, functioning as a distributed security layer that integrates techniques such as semantic analysis via large language models (LLMs), behavioral analytics, retrieval-augmented verification, and cross-agent anomaly detection. Such agents can potentially oversee inter-agent communications, identify potential threats, enforce privacy and access controls, and maintain comprehensive audit records. Complementary to the idea of Sentinel Agents is the use of a Coordinator Agent. The Coordinator Agent supervises policy implementation, and manages agent participation. In addition, the Coordinator also ingests alerts from Sentinel Agents. Based on these alerts, it can adapt policies, isolate or quarantine misbehaving agents, and contain threats to maintain the integrity of the MAS ecosystem. This dual-layered security approach, combining the continuous monitoring of Sentinel Agents with the governance functions of Coordinator Agents, supports dynamic and adaptive defense mechanisms against a range of threats, including prompt injection, collusive agent behavior, hallucinations generated by LLMs, privacy breaches, and coordinated multi-agent attacks. In addition to the architectural design, we present a simulation study where 162 synthetic attacks of different families (prompt injection, hallucination, and data exfiltration) were injected into a multi-agent conversational environment. The Sentinel Agents successfully detected the attack attempts, confirming the practical feasibility of the proposed monitoring approach. The framework also offers enhanced system observability, supports regulatory compliance, and enables policy evolution over time.
24. Explainable AI for Infection Prevention and Control: Modeling CPE Acquisition and Patient Outcomes in an Irish Hospital with Transformers
Authors: Minh-Khoi Pham, Tai Tan Mai, Martin Crane, Rob Brennan, Marie E. Ward, Una Geary, Declan Byrne, Brian O Connell, Colm Bergin, Donncha Creagh, Nick McDonald, Marija Bezbradica β’
Published: 2025-09-18 β’
Source: arXiv
Carbapenemase-Producing Enterobacteriace poses a critical concern for infection prevention and control in hospitals. However, predictive modeling of previously highlighted CPE-associated risks such as readmission, mortality, and extended length of stay (LOS) remains underexplored, particularly with modern deep learning approaches. This study introduces an eXplainable AI modeling framework to investigate CPE impact on patient outcomes from Electronic Medical Records data of an Irish hospital. We analyzed an inpatient dataset from an Irish acute hospital, incorporating diagnostic codes, ward transitions, patient demographics, infection-related variables and contact network features. Several Transformer-based architectures were benchmarked alongside traditional machine learning models. Clinical outcomes were predicted, and XAI techniques were applied to interpret model decisions. Our framework successfully demonstrated the utility of Transformer-based models, with TabTransformer consistently outperforming baselines across multiple clinical prediction tasks, especially for CPE acquisition (AUROC and sensitivity). We found infection-related features, including historical hospital exposure, admission context, and network centrality measures, to be highly influential in predicting patient outcomes and CPE acquisition risk. Explainability analyses revealed that features like "Area of Residence", "Admission Ward" and prior admissions are key risk factors. Network variables like "Ward PageRank" also ranked highly, reflecting the potential value of structural exposure information. This study presents a robust and explainable AI framework for analyzing complex EMR data to identify key risk factors and predict CPE-related outcomes. Our findings underscore the superior performance of the Transformer models and highlight the importance of diverse clinical and network features.
25. "Let it be Chaos in the Plumbing!" Usage and Efficacy of Chaos Engineering in DevOps Pipelines
Authors: Stefano Fossati, Damian Andrew Tamburri, Massimiliano Di Penta, Marco Tonnarelli β’
Published: 2025-09-18 β’
Source: arXiv
Chaos Engineering (CE) has emerged as a proactive method to improve the resilience of modern distributed systems, particularly within DevOps environments. Originally pioneered by Netflix, CE simulates real-world failures to expose weaknesses before they impact production. In this paper, we present a systematic gray literature review that investigates how industry practitioners have adopted and adapted CE principles over recent years. Analyzing 50 sources published between 2019 and early 2024, we developed a comprehensive classification framework that extends the foundational CE principles into ten distinct concepts. Our study reveals that while the core tenets of CE remain influential, practitioners increasingly emphasize controlled experimentation, automation, and risk mitigation strategies to align with the demands of agile and continuously evolving DevOps pipelines. Our results enhance the understanding of how CE is intended and implemented in practice, and offer guidance for future research and industrial applications aimed at improving system robustness in dynamic production environments.
26. Self-Explaining Reinforcement Learning for Mobile Network Resource Allocation
Authors: Konrad Nowosadko, Franco Ruggeri, Ahmad Terra β’
Published: 2025-09-18 β’
Source: arXiv
Reinforcement Learning (RL) methods that incorporate deep neural networks (DNN), though powerful, often lack transparency. Their black-box characteristic hinders interpretability and reduces trustworthiness, particularly in critical domains. To address this challenge in RL tasks, we propose a solution based on Self-Explaining Neural Networks (SENNs) along with explanation extraction methods to enhance interpretability while maintaining predictive accuracy. Our approach targets low-dimensionality problems to generate robust local and global explanations of the model's behaviour. We evaluate the proposed method on the resource allocation problem in mobile networks, demonstrating that SENNs can constitute interpretable solutions with competitive performance. This work highlights the potential of SENNs to improve transparency and trust in AI-driven decision-making for low-dimensional tasks. Our approach strong performance on par with the existing state-of-the-art methods, while providing robust explanations.
27. A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts
Authors: Kian Tohidi, Kia Dashtipour, Simone Rebora, Sevda Pourfaramarz β’
Published: 2025-09-18 β’
Source: arXiv
This study presents a comprehensive comparative evaluation of four state-of-the-art Large Language Models (LLMs)--Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, and GPT-4o--for sentiment analysis and emotion detection in Persian social media texts. Comparative analysis among LLMs has witnessed a significant rise in recent years, however, most of these analyses have been conducted on English language tasks, creating gaps in understanding cross-linguistic performance patterns. This research addresses these gaps through rigorous experimental design using balanced Persian datasets containing 900 texts for sentiment analysis (positive, negative, neutral) and 1,800 texts for emotion detection (anger, fear, happiness, hate, sadness, surprise). The main focus was to allow for a direct and fair comparison among different models, by using consistent prompts, uniform processing parameters, and by analyzing the performance metrics such as precision, recall, F1-scores, along with misclassification patterns. The results show that all models reach an acceptable level of performance, and a statistical comparison of the best three models indicates no significant differences among them. However, GPT-4o demonstrated a marginally higher raw accuracy value for both tasks, while Gemini 2.0 Flash proved to be the most cost-efficient. The findings indicate that the emotion detection task is more challenging for all models compared to the sentiment analysis task, and the misclassification patterns can represent some challenges in Persian language texts. These findings establish performance benchmarks for Persian NLP applications and offer practical guidance for model selection based on accuracy, efficiency, and cost considerations, while revealing cultural and linguistic challenges that require consideration in multilingual AI system deployment.
28. Learning Informed Prior Distributions with Normalizing Flows for Bayesian Analysis
Authors: Hendrik Roch, Chun Shen β’
Published: 2025-09-18 β’
Source: arXiv
We investigate the use of normalizing flow (NF) models as flexible priors in Bayesian inference with Markov Chain Monte Carlo (MCMC) sampling. Trained on posteriors from previous analyses, these models can be used as informative priors, capturing non-trivial distributions and correlations, in subsequent inference tasks. We compare different training strategies and loss functions, finding that training based on Kullback-Leibler (KL) divergence and unsupervised learning consistently yield the most accurate reproductions of reference distributions. Applied in sequential Bayesian workflows, MCMC with the NF-based priors reproduces the results of one-shot joint inferences well, provided the target distributions are unimodal. In cases with pronounced multi-modality or dataset tension, distortions may arise, underscoring the need for caution in multi-stage Bayesian inference. A comparison between the pocoMC MCMC sampler and the standard emcee sampler further demonstrates the importance of advanced and robust algorithms for exploring the posterior space. Overall, our results establish NF-based priors as a practical and efficient tool for sequential Bayesian inference in high-dimensional parameter spaces.
29. CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human
Authors: Nan Sun, Yongchang Li, Chenxu Wang, Huiying Li, Huaping Liu β’
Published: 2025-09-18 β’
Source: arXiv
In this work, we present CollabVLA, a self-reflective vision-language-action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary generative models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized Time by ~2x and Dream counts by ~4x vs. generative agents, achieving higher success rates, improved interpretability, and balanced low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.
30. Non-Intrusive Parametrized-Background Data-Weak Reconstruction of Cardiac Displacement Fields from Sparse MRI-like Observations
Authors: Francesco C. Mantegazza, Federica Caforio, Christoph Augustin, Matthias A. F. Gsell, Gundolf Haase, Elias Karabelas β’
Published: 2025-09-18 β’
Source: arXiv
Personalized cardiac diagnostics require accurate reconstruction of myocardial displacement fields from sparse clinical imaging data, yet current methods often demand intrusive access to computational models. In this work, we apply the non-intrusive Parametrized-Background Data-Weak (PBDW) approach to three-dimensional (3D) cardiac displacement field reconstruction from limited Magnetic Resonance Image (MRI)-like observations. Our implementation requires only solution snapshots -- no governing equations, assembly routines, or solver access -- enabling immediate deployment across commercial and research codes using different constitutive models. Additionally, we introduce two enhancements: an H-size minibatch worst-case Orthogonal Matching Pursuit (wOMP) algorithm that improves Sensor Selection (SS) computational efficiency while maintaining reconstruction accuracy, and memory optimization techniques exploiting block matrix structures in vectorial problems. We demonstrate the effectiveness of the method through validation on a 3D left ventricular model with simulated scar tissue. Starting with noise-free reconstruction, we systematically incorporate Gaussian noise and spatial sparsity mimicking realistic MRI acquisition protocols. Results show exceptional accuracy in noise-free conditions (relative L2 error of order O(1e-5)), robust performance with 10% noise (relative L2 error of order O(1e-2)), and effective reconstruction from sparse measurements (relative L2 error of order O(1e-2)). The online reconstruction achieves four-order-of-magnitude computational speed-up compared to full Finite Element (FE) simulations, with reconstruction times under one tenth of second for sparse scenarios, demonstrating significant potential for integration into clinical cardiac modeling workflows.
31. MapAnything: Mapping Urban Assets using Single Street-View Images
Authors: Miriam Louise Carnot, Jonas Kunze, Erik Fastermann, Eric Peukert, AndrΓ© Ludwig, Bogdan Franczyk β’
Published: 2025-09-18 β’
Source: arXiv
To maintain an overview of urban conditions, city administrations manage databases of objects like traffic signs and trees, complete with their geocoordinates. Incidents such as graffiti or road damage are also relevant. As digitization increases, so does the need for more data and up-to-date databases, requiring significant manual effort. This paper introduces MapAnything, a module that automatically determines the geocoordinates of objects using individual images. Utilizing advanced Metric Depth Estimation models, MapAnything calculates geocoordinates based on the object's distance from the camera, geometric principles, and camera specifications. We detail and validate the module, providing recommendations for automating urban object and incident mapping. Our evaluation measures the accuracy of estimated distances against LiDAR point clouds in urban environments, analyzing performance across distance intervals and semantic areas like roads and vegetation. The module's effectiveness is demonstrated through practical use cases involving traffic signs and road damage.
32. LLM Agents at the Roundtable: A Multi-Perspective and Dialectical Reasoning Framework for Essay Scoring
Authors: Jinhee Jang, Ayoung Moon, Minkyoung Jung, YoungBin Kim. Seung Jin Lee β’
Published: 2025-09-18 β’
Source: arXiv
The emergence of large language models (LLMs) has brought a new paradigm to automated essay scoring (AES), a long-standing and practical application of natural language processing in education. However, achieving human-level multi-perspective understanding and judgment remains a challenge. In this work, we propose Roundtable Essay Scoring (RES), a multi-agent evaluation framework designed to perform precise and human-aligned scoring under a zero-shot setting. RES constructs evaluator agents based on LLMs, each tailored to a specific prompt and topic context. Each agent independently generates a trait-based rubric and conducts a multi-perspective evaluation. Then, by simulating a roundtable-style discussion, RES consolidates individual evaluations through a dialectical reasoning process to produce a final holistic score that more closely aligns with human evaluation. By enabling collaboration and consensus among agents with diverse evaluation perspectives, RES outperforms prior zero-shot AES approaches. Experiments on the ASAP dataset using ChatGPT and Claude show that RES achieves up to a 34.86% improvement in average QWK over straightforward prompting (Vanilla) methods.
33. ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification
Authors: Alvaro Lopez Pellicer, Andre Mariucci, Plamen Angelov, Marwan Bukhari, Jemma G. Kerns β’
Published: 2025-09-18 β’
Source: arXiv
Bone health studies are crucial in medical practice for the early detection and treatment of Osteopenia and Osteoporosis. Clinicians usually make a diagnosis based on densitometry (DEXA scans) and patient history. The applications of AI in this field are ongoing research. Most successful methods rely on deep learning models that use vision alone (DEXA/X-ray imagery) and focus on prediction accuracy, while explainability is often disregarded and left to post hoc assessments of input contributions. We propose ProtoMedX, a multi-modal model that uses both DEXA scans of the lumbar spine and patient records. ProtoMedX's prototype-based architecture is explainable by design, which is crucial for medical applications, especially in the context of the upcoming EU AI Act, as it allows explicit analysis of model decisions, including incorrect ones. ProtoMedX demonstrates state-of-the-art performance in bone health classification while also providing explanations that can be visually understood by clinicians. Using a dataset of 4,160 real NHS patients, the proposed ProtoMedX achieves 87.58% accuracy in vision-only tasks and 89.8% in its multi-modal variant, both surpassing existing published methods.
34. OnlineMate: An LLM-Based Multi-Agent Companion System for Cognitive Support in Online Learning
Authors: Xian Gao, Zongyun Zhang, Ting Liu, Yuzhuo Fu β’
Published: 2025-09-18 β’
Source: arXiv
In online learning environments, students often lack personalized peer interactions, which play a crucial role in supporting cognitive development and learning engagement. Although previous studies have utilized large language models (LLMs) to simulate interactive dynamic learning environments for students, these interactions remain limited to conversational exchanges, lacking insights and adaptations to the learners' individualized learning and cognitive states. As a result, students' interest in discussions with AI learning companions is low, and they struggle to gain inspiration from such interactions. To address this challenge, we propose OnlineMate, a multi-agent learning companion system driven by LLMs that integrates the Theory of Mind (ToM). OnlineMate is capable of simulating peer-like agent roles, adapting to learners' cognitive states during collaborative discussions, and inferring their psychological states, such as misunderstandings, confusion, or motivation. By incorporating Theory of Mind capabilities, the system can dynamically adjust its interaction strategies to support the development of higher-order thinking and cognition. Experimental results in simulated learning scenarios demonstrate that OnlineMate effectively fosters deep learning and discussions while enhancing cognitive engagement in online educational settings.
35. DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction
Authors: Jian Chen, Zhenyan Chen, Xuming Hu, Peilin Zhou, Yining Hua, Han Fang, Cissy Hing Yee Choy, Xinmei Ke, Jingfeng Luo, Zixuan Yuan β’
Published: 2025-09-18 β’
Source: arXiv
Natural Language to SQL (NL2SQL) provides a new model-centric paradigm that simplifies database access for non-technical users by converting natural language queries into SQL commands. Recent advancements, particularly those integrating Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) reasoning, have made significant strides in enhancing NL2SQL performance. However, challenges such as inaccurate task decomposition and keyword extraction by LLMs remain major bottlenecks, often leading to errors in SQL generation. While existing datasets aim to mitigate these issues by fine-tuning models, they struggle with over-fragmentation of tasks and lack of domain-specific keyword annotations, limiting their effectiveness. To address these limitations, we present DeKeyNLU, a novel dataset which contains 1,500 meticulously annotated QA pairs aimed at refining task decomposition and enhancing keyword extraction precision for the RAG pipeline. Fine-tuned with DeKeyNLU, we propose DeKeySQL, a RAG-based NL2SQL pipeline that employs three distinct modules for user question understanding, entity retrieval, and generation to improve SQL generation accuracy. We benchmarked multiple model configurations within DeKeySQL RAG pipeline. Experimental results demonstrate that fine-tuning with DeKeyNLU significantly improves SQL generation accuracy on both BIRD (62.31% to 69.10%) and Spider (84.2% to 88.7%) dev datasets.