1. Residual Off-Policy RL for Finetuning Behavior Cloning Policies
Authors: Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi
Published: 2025-09-23
Source: arXiv
Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from increasing offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-DoF systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results show state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world. Project website: https://residual-offpolicy-rl.github.io
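The core idea, a frozen behavior-cloned base policy plus a small learned per-step correction, can be sketched in a few lines. This is an illustrative sketch under assumed names (`ResidualPolicy`, `scale`, the toy lambdas), not the paper's actual implementation.

```python
import numpy as np

class ResidualPolicy:
    """Minimal sketch of residual action composition: a frozen base (BC)
    policy treated as a black box, plus a trainable residual correction.
    All names here are illustrative assumptions, not the paper's API."""

    def __init__(self, base_policy, residual, scale=0.1):
        self.base_policy = base_policy  # frozen behavior-cloned policy
        self.residual = residual        # residual network trained with off-policy RL
        self.scale = scale              # bounds the per-step correction

    def act(self, obs):
        base_action = self.base_policy(obs)
        correction = self.scale * self.residual(obs)
        return base_action + correction

# Toy usage: the base policy drives toward the origin; the residual nudges it.
base = lambda obs: -obs
residual = lambda obs: np.ones_like(obs)
policy = ResidualPolicy(base, residual, scale=0.05)
action = policy.act(np.array([1.0, -2.0]))
```

Keeping `scale` small means the RL agent only needs to explore a narrow band around an already competent policy, which is what makes sparse binary rewards workable.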
2. VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction
Authors: Weijie Wang, Yeqing Chen, Zeyu Zhang, Hengyu Liu, Haoxiao Wang, Zhiyuan Feng, Wenkang Qin, Zheng Zhu, Donny Y. Chen, Bohan Zhuang
Published: 2025-09-23
Source: arXiv
Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the number of input views, leads to view-biased density distributions, and introduces alignment errors, particularly when source views contain occlusions or low texture. To address these challenges, we introduce VolSplat, a new multi-view feed-forward paradigm that replaces pixel alignment with voxel-aligned Gaussians. By directly predicting Gaussians from a predicted 3D voxel grid, it overcomes pixel alignment's reliance on error-prone 2D feature matching, ensuring robust multi-view consistency. Furthermore, it enables adaptive control over Gaussian density based on 3D scene complexity, yielding more faithful Gaussian point clouds, improved geometric consistency, and enhanced novel-view rendering quality. Experiments on widely used benchmarks including RealEstate10K and ScanNet demonstrate that VolSplat achieves state-of-the-art performance while producing more plausible and view-consistent Gaussian reconstructions. In addition to superior results, our approach establishes a more scalable framework for feed-forward 3D reconstruction with denser and more robust representations, paving the way for further research in wider communities. The video results, code and trained models are available on our project page: https://lhmd.top/volsplat.
3. Biochemical Network Motifs Under Periodic Forcing: A Selective Catalogue of Transfer Functions and Frequency Response Properties
Authors: Nguyen H. N. Tran, Federico Frascoli, Andrew H. A. Clayton
Published: 2025-09-23
Source: arXiv
Understanding the function of network motifs, with the aim of gaining insight into how their combinations create the larger reaction networks that drive cellular functions, has been a longstanding pursuit of systems biology. One specific objective within this pursuit is understanding how individual motifs respond to pulsatile and oscillatory signals. This is especially relevant because biochemical networks are often activated by signals that, in nature, occur in the form of pulses and oscillations. A powerful analytical tool for studying such dynamics is the transfer function: a compact frequency-domain description of input-output dynamics. In this work, we derive transfer functions for a set of commonly studied network motifs and characterise their responses to pulsatile and oscillatory inputs. The novelty of this review does not lie in the introduction of new mathematical theorems or biological discoveries, but in bridging well-established frequency domain formalisms from control theory with the analysis of biochemical networks under periodic forcing. In doing so, our contributions are threefold: 1. A systematic derivation and compilation of transfer functions for common network motifs: consolidating results scattered across the literature and establishing a consistent formalism for motif-level transfer functions. 2. Contextualisation of these transfer functions within biological models: extending abstract transfer functions to concrete biological settings so that the results are readily applicable without extensive mathematical labour. 3. Resolution of ambiguity between biological and control-theoretic treatments of feedback: clarifying how feedback loops should be understood within the transfer function formalism and reconciling differences between biology literature and control-oriented literature. This is done by formalising the notion of an intrinsic transfer function.
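For the simplest motif, production driven by an input plus first-order degradation, dx/dt = k u(t) - gamma x(t), the transfer function is G(s) = k/(s + gamma), and its frequency response is easy to evaluate numerically. The parameter values below are illustrative, not taken from the review.

```python
import cmath

def first_order_tf(s, k=2.0, gamma=0.5):
    """Transfer function G(s) = k / (s + gamma) of the simple
    production-degradation motif dx/dt = k*u(t) - gamma*x(t)."""
    return k / (s + gamma)

def frequency_response(omega, k=2.0, gamma=0.5):
    """Gain and phase of the motif when forced by sin(omega * t):
    evaluate G on the imaginary axis at s = i*omega."""
    g = first_order_tf(1j * omega, k, gamma)
    return abs(g), cmath.phase(g)

# The motif acts as a low-pass filter: gain falls and phase lag grows
# as the forcing frequency increases.
gain_dc, _ = frequency_response(0.0)          # static gain k/gamma = 4
gain_hi, phase_hi = frequency_response(10.0)  # attenuated, phase-lagged
```

The same recipe (substitute s = i*omega, read off magnitude and phase) applies to any of the motif transfer functions catalogued in the paper.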
4. Approximating Electoral Control Problems
Authors: Huy Vu Bui, Michael C. Chavrimootoo, Trung Kien Le, Son M. Nguyen
Published: 2025-09-23
Source: arXiv
Much research in electoral control -- one of the most studied forms of electoral attacks, in which an entity running an election alters the structure of that election to yield a preferred outcome -- has focused on giving decision complexity results, e.g., membership in P, NP-completeness, or fixed-parameter tractability. Approximation algorithms, on the other hand, have received little attention in electoral control, despite their prevalence in the study of other forms of electoral attacks, such as manipulation and bribery. Early work established some preliminary results with respect to popular voting rules such as plurality, approval, and Condorcet. In this paper, we establish for each of the ``standard'' control problems under plurality, approval, and Condorcet, whether they are approximable, and we prove our results in both the weighted and unweighted voter settings. For each problem we study under either approval or Condorcet, we show that any approximation algorithm we give is optimal, unless P=NP. Our approximation algorithms leverage the fact that Covering Integer Programs (CIPs) can be approximated within a factor of $O(\log n)$. Under plurality, we give an $O(m)$-approximation algorithm, and give as lower bound $\Omega(m^{1/4})$, by using a known lower bound on the Minimum $k$-Union (M$k$U) problem. To our knowledge, this is the first application of M$k$U in computational social choice. We also generalize our $O(m)$-approximation algorithm to work with respect to an infinite family of voting rules using an axiomatic approach. Our work closes a long list of open problems established 18 years ago.
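The $O(\log n)$ guarantee for covering problems that the paper's algorithms invoke is exactly the one achieved by the classic greedy rule for set cover: repeatedly pick the set covering the most uncovered elements. The sketch below illustrates that bound, not the paper's control algorithms themselves; the voter-removal framing is an assumed example.

```python
def greedy_cover(universe, sets):
    """Classic greedy set cover: repeatedly pick the set covering the
    most uncovered elements. Achieves an O(log n) approximation, the
    same guarantee the abstract invokes for covering integer programs."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("universe not coverable by the given sets")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Hypothetical example: voters a..e must all be "handled", and each set
# is the group of voters affected by one candidate control action.
cover = greedy_cover("abcde", [set("abc"), set("cd"), set("de"), set("a")])
```

Here the greedy rule first takes the size-3 set, then the set covering the remaining {d, e}, using two actions where an exhaustive search would also need two.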
5. Random coverage of a manifold with boundary
Authors: Mathew D. Penrose, Xiaochuan Yang
Published: 2025-09-23
Source: arXiv
Let $A$ be a compact $d$-dimensional $C^2$ Riemannian manifold with boundary, embedded in ${\bf R}^m$ where $m \geq d \geq 2$, and let $B$ be a nice subset of $A$ (possibly $B=A$). Let $X_1,X_2, \ldots $ be independent random uniform points in $A$. Define the {\em coverage threshold} $R_n$ to be the smallest $r$ such that $B$ is covered by the geodetic balls of radius $r$ centred on $X_1,\ldots,X_n$. We obtain the limiting distribution of $R_n$ and also a strong law of large numbers for $R_n$ in the large-$n$ limit. For example, if $A$ has Riemannian volume 1 and its boundary has surface measure $|\partial A|$, and $B=A$, then if $d=3$ then ${\bf P}[n\pi R_n^3 - \log n - 2 \log (\log n) \leq x]$ converges to $\exp(-2^{-4}\pi^{5/3} |\partial A| e^{-2 x/3})$ and $(n \pi R_n^3)/(\log n) \to 1$ almost surely, while if $d=2$ then ${\bf P}[n \pi R_n^2 - \log n - \log (\log n) \leq x]$ converges to $\exp(- e^{-x}- |\partial A|\pi^{-1/2} e^{-x/2})$. We generalize to allow for multiple coverage. For the strong laws of large numbers, we can relax the requirement that the underlying density on $A$ be uniform. For the limiting distribution, we have a similar result for Poisson samples. Our results still hold if we use Euclidean rather than geodetic balls.
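The coverage threshold $R_n$ is straightforward to estimate empirically in the flat case. The sketch below takes $A = B = [0,1]^2$ with Euclidean balls, approximating $B$ by a grid and ignoring the manifold and boundary refinements that are the paper's actual subject; it is a crude illustration of the quantity studied, not of the theorems.

```python
import math
import random

def coverage_threshold(points, grid_step=0.05):
    """Empirical coverage threshold R_n on the unit square: the smallest
    radius r such that every grid point (a proxy for B = A = [0,1]^2)
    lies within r of some sample point. Euclidean balls, no boundary
    corrections; grid_step controls the discretization of B."""
    ticks = [i * grid_step for i in range(int(1 / grid_step) + 1)]
    return max(
        min(math.hypot(x - px, y - py) for px, py in points)
        for x in ticks for y in ticks
    )

random.seed(0)
n = 500
sample = [(random.random(), random.random()) for _ in range(n)]
r_n = coverage_threshold(sample)
# The d=2 strong law suggests n * pi * R_n^2 should be on the order of log n.
ratio = n * math.pi * r_n**2 / math.log(n)
```

With a few hundred points the ratio already sits near 1, consistent with the $(n \pi R_n^2)/\log n \to 1$ scaling (boundary effects inflate it slightly at small $n$).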
6. A Novel Site-Specific Inference Model for Urban Canyon Channels: From Measurements to Modeling
Authors: Junzhe Song, Ruisi He, Mi Yang, Zhengyu Zhang, Xinwen Chen, Xiaoying Zhang, Bo Ai
Published: 2025-09-23
Source: arXiv
With the rapid development of intelligent transportation and smart city applications, the urban canyon has become a critical scenario for the design and evaluation of wireless communication systems. Due to its unique environmental layout, the channel characteristics in urban canyons are strongly shaped by street geometry and building distribution, thereby exhibiting significant site-specific channel conditions. However, this feature has not been well captured in existing channel models. In this paper, we propose a site-specific channel inference model based on environmental geometry; the model is parameterized using sub-6GHz channel measurements. Multipath components (MPCs) are extracted and clustered according to geometric propagation mechanisms, which are explicitly derived from the influence of canyon width, thereby establishing an interpretable mapping between the physical environment and the statistical characteristics of MPCs. A step-by-step implementation scheme is presented. Subsequently, the proposed site-specific channel inference model is validated by comparing second-order channel statistics derived from the model and from measurements. The results show that the proposed model achieves high accuracy and robustness in different urban canyon scenarios.
7. DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture
Authors: Arijit Maji, Raghvendra Kumar, Akash Ghosh, Anushka, Nemil Shah, Abhilekh Borah, Vanshika Shah, Nishant Mishra, Sriparna Saha
Published: 2025-09-23
Source: arXiv
We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India's diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, and historical heritage, among many others. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models' ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.
8. SloPalSpeech: A 2,806-Hour Slovak Speech Corpus from Parliamentary Data
Authors: Erik Božík, Marek Šuppa
Published: 2025-09-23
Source: arXiv
Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model's WER dropped by up to 70\%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.
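The Word Error Rate used to evaluate the fine-tuned Whisper models is word-level Levenshtein distance divided by the number of reference words. A minimal sketch (the Slovak example strings are made up for illustration):

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum number of word substitutions, insertions,
    and deletions turning hypothesis into reference, divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution plus one deletion against a 3-word reference: WER = 2/3.
score = wer("dobrý deň svet", "dobrý den")
```

A "70% WER reduction" in the abstract's sense is relative: the fine-tuned model's score divided by the baseline's is about 0.3.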
9. Photo-Induced Enhancement of Critical Temperature in a Phase Competing Spin-Fermion System
Authors: Sankha Subhra Bakshi
Published: 2025-09-23
Source: arXiv
Ultrafast optical excitation is known to destabilize long-range order in correlated systems, yet experiments have also reported the emergence of metastable phases, in some cases with enhanced critical temperatures. The microscopic origin of such light-induced stabilization remains unresolved. Here we investigate this problem within a minimal spin-fermion framework: a double-exchange model at half filling, augmented by ferromagnetic superexchange on a square lattice. In equilibrium, the ordering temperature is set by the competition between kinetic-energy-driven antiferromagnetism and superexchange-induced ferromagnetism. Using quantum Landau-Lifshitz-Gilbert-Brown dynamics for localized spins combined with mean-field evolution of itinerant electrons, we demonstrate a nonthermal mechanism for stabilizing ordered phases. Photoexcitation creates a long-lived nonequilibrium carrier population that resists thermalization and reshapes the low-energy landscape, converting kinetic-energy-driven antiferromagnetism into ferromagnetism and enhancing the critical temperature. While model-specific, our results reveal a general microscopic pathway by which light can tip the balance between competing orders, suggesting routes toward optically engineered magnetism, charge-density-wave order, and superconductivity.
10. Moving by Looking: Towards Vision-Driven Avatar Motion Generation
Authors: Markos Diomataris, Berat Mert Albaba, Giorgio Becherini, Partha Ghosh, Omid Taheri, Michael J. Black
Published: 2025-09-23
Source: arXiv
The way we perceive the world fundamentally shapes how we move, whether it is how we navigate in a room or how we interact with other humans. Current human motion generation methods neglect this interdependency and use task-specific ``perception'' that differs radically from that of humans. We argue that the generation of human-like avatar behavior requires human-like perception. Consequently, in this work we present CLOPS, the first human avatar that solely uses egocentric vision to perceive its surroundings and navigate. Using vision as the primary driver of motion, however, gives rise to a significant challenge for training avatars: existing datasets have either isolated human motion, without the context of a scene, or lack scale. We overcome this challenge by decoupling the learning of low-level motion skills from learning of high-level control that maps visual input to motion. First, we train a motion prior model on a large motion capture dataset. Then, a policy is trained using Q-learning to map egocentric visual inputs to high-level control commands for the motion prior. Our experiments empirically demonstrate that egocentric vision can give rise to human-like motion characteristics in our avatars. For example, the avatars walk such that they avoid obstacles present in their visual field. These findings suggest that equipping avatars with human-like sensors, particularly egocentric vision, holds promise for training avatars that behave like humans.
11. AgentInit: Initializing LLM-based Multi-Agent Systems via Diversity and Expertise Orchestration for Effective and Efficient Collaboration
Authors: Chunhao Tian, Yutong Wang, Xuebo Liu, Zhexuan Wang, Liang Ding, Miao Zhang, Min Zhang
Published: 2025-09-23
Source: arXiv
Proper initialization is crucial for any system, particularly in multi-agent systems (MAS), where it plays a pivotal role in determining both the system's efficiency and effectiveness. However, existing MAS initialization methods do not fully account for the collaborative needs of the generated agents in subsequent stages. Inspired by the principles of effective team composition, we propose AgentInit, which aims to optimize the structure of agent teams. Specifically, in addition to multi-round interactions and reflections between agents during agent generation, AgentInit incorporates a Natural Language to Format mechanism to ensure consistency and standardization. Balanced team selection strategies using Pareto principles are subsequently applied to jointly consider agent team diversity and task relevance to promote effective and efficient collaboration and enhance overall system performance. Experiments show that AgentInit consistently outperforms state-of-the-art initialization methods and pre-defined strategies across various frameworks and tasks, achieving an overall performance improvement of up to 1.2 and 1.6, respectively, while also significantly reducing token consumption. Further analysis confirms its strong transferability to similar tasks and verifies the effectiveness of its key components, demonstrating its capability and adaptability as a reliable MAS initialization method. Source code and models are available at https://github.com/1737423697/AgentInit.
12. An on-chip Pixel Processing Approach with 2.4μs latency for Asynchronous Read-out of SPAD-based dToF Flash LiDARs
Authors: Yiyang Liu, Rongxuan Zhang, Istvan Gyongy, Alistair Gorman, Sarrah M. Patanwala, Filip Taneski, Robert K. Henderson
Published: 2025-09-23
Source: arXiv
We propose a fully asynchronous peak detection approach for SPAD-based direct time-of-flight (dToF) flash LiDAR, enabling pixel-wise event-driven depth acquisition without global synchronization. By allowing pixels to independently report depth once a sufficient signal-to-noise ratio is achieved, the method reduces latency, mitigates motion blur, and increases effective frame rate compared to frame-based systems. The framework is validated under two hardware implementations: an offline 256$\times$128 SPAD array with PC based processing and a real-time FPGA proof-of-concept prototype with 2.4$\upmu$s latency for on-chip integration. Experiments demonstrate robust depth estimation, reflectivity reconstruction, and dynamic event-based representation under both static and dynamic conditions. The results confirm that asynchronous operation reduces redundant background data and computational load, while remaining tunable via simple hyperparameters. These findings establish a foundation for compact, low-latency, event-driven LiDAR architectures suited to robotics, autonomous driving, and consumer applications. In addition, we have derived a semi-closed-form solution for the detection probability of the raw-peak finding based LiDAR systems that could benefit both conventional frame-based and proposed asynchronous LiDAR systems.
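The per-pixel event logic, accumulating photon arrivals into a timing histogram and emitting a depth event as soon as the peak clears a signal-to-noise threshold, can be sketched as below. The bin count, threshold rule, and toy arrival stream are illustrative assumptions, not the chip's actual pipeline.

```python
def pixel_event(timestamps, n_bins=16, snr_threshold=4.0):
    """Sketch of per-pixel asynchronous read-out: accumulate photon
    time-of-arrival counts into a histogram and report a depth event as
    soon as the peak bin sufficiently exceeds the mean background level.
    Returns (peak_bin, photons_used) or None if the SNR is never reached."""
    hist = [0] * n_bins
    for i, t in enumerate(timestamps):
        hist[t % n_bins] += 1
        peak = max(hist)
        background = (sum(hist) - peak) / (n_bins - 1)
        # Floor the background at one photon so the threshold is meaningful
        # before any noise statistics have accumulated.
        if peak >= snr_threshold * max(background, 1.0):
            return hist.index(peak), i + 1
    return None  # SNR never reached: the pixel stays silent

# Signal photons in bin 5 mixed with a few uniform background arrivals.
arrivals = [5, 3, 5, 11, 5, 5, 7, 5]
event = pixel_event(arrivals)
```

Because each pixel fires independently as soon as its own evidence suffices, bright, near targets report early while dark pixels simply keep integrating, which is the source of the latency and bandwidth savings the abstract describes.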
13. An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Authors: Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
Published: 2025-09-23
Source: arXiv
Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
14. LLMs as verification oracles for Solidity
Authors: Massimo Bartoletti, Enrico Lipparini, Livio Pompianu
Published: 2025-09-23
Source: arXiv
Ensuring the correctness of smart contracts is critical, as even subtle flaws can lead to severe financial losses. While bug detection tools able to spot common vulnerability patterns can serve as a first line of defense, most real-world exploits and losses stem from errors in the contract business logic. Formal verification tools such as SolCMC and the Certora Prover address this challenge, but their impact remains limited by steep learning curves and restricted specification languages. Recent works have begun to explore the use of large language models (LLMs) for security-related tasks such as vulnerability detection and test generation. Yet, a fundamental question remains open: can LLMs serve as verification oracles, capable of reasoning about arbitrary contract-specific properties? In this paper, we provide the first systematic evaluation of GPT-5, a state-of-the-art reasoning LLM, in this role. We benchmark its performance on a large dataset of verification tasks, compare its outputs against those of established formal verification tools, and assess its practical effectiveness in real-world auditing scenarios. Our study combines quantitative metrics with qualitative analysis, and shows that recent reasoning-oriented LLMs can be surprisingly effective as verification oracles, suggesting a new frontier in the convergence of AI and formal methods for secure smart contract development and auditing.
15. A Scoping Review of Mixed Initiative Visual Analytics in the Automation Renaissance
Authors: Shayan Monadjemi, Yuhan Guo, Kai Xu, Alex Endert, Anamaria Crisan
Published: 2025-09-23
Source: arXiv
Artificial agents are increasingly integrated into data analysis workflows, carrying out tasks that were primarily done by humans. Our research explores how the introduction of automation re-calibrates the dynamic between humans and automating technology. To explore this question, we conducted a scoping review encompassing twenty years of mixed-initiative visual analytic systems. To describe and contrast the relationship between humans and automation, we developed an integrated taxonomy to delineate the objectives of these mixed-initiative visual analytics tools, how much automation they support, and the assumed roles of humans. Here, we describe our qualitative approach of integrating existing theoretical frameworks with new codes we developed. Our analysis shows that the visualization research literature lacks consensus on the definition of mixed-initiative systems and explores a limited potential of the collaborative interaction landscape between people and automation. Our research provides a scaffold to advance the discussion of human-AI collaboration during visual data analysis.
16. PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation
Authors: Alexandre Piché, Ehsan Kamaloo, Rafael Pardinas, Dzmitry Bahdanau
Published: 2025-09-23
Source: arXiv
Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately $2\times$ faster learning compared to conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.
17. Investigating Traffic Accident Detection Using Multimodal Large Language Models
Authors: Ilhan Skender, Kailin Tong, Selim Solmaz, Daniel Watzenig
Published: 2025-09-23
Source: arXiv
Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure-based vision sensors offer scalable and efficient solutions for continuous real-time monitoring, facilitating automated detection of accidents directly from captured images. This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure-based accident data through controlled simulations; (2) Comparative performance analysis between Gemini 1.5 and 2.0, Gemma 3 and Pixtral models in accident identification and descriptive capabilities without prior fine-tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi-object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1-score of 0.71 and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real-world automated traffic monitoring systems.
18. Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Authors: Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang
Published: 2025-09-23
Source: arXiv
Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.
19. MAPPO for Edge Server Monitoring
Authors: Samuel Chamoun, Christian McDowell, Robin Buchanan, Kevin Chan, Eric Graves, Yin Sun
Published: 2025-09-23
Source: arXiv
In this paper, we consider a goal-oriented communication problem for edge server monitoring, where jobs arrive intermittently at multiple dispatchers and must be assigned to shared edge servers with finite queues and time-varying availability. Accurate knowledge of server status is critical for sustaining high throughput, yet remains challenging under dynamic workloads and partial observability. To address this challenge, each dispatcher maintains server knowledge through two complementary mechanisms: (i) active status queries that provide instantaneous updates at a communication cost, and (ii) job execution feedback that reveals server conditions opportunistically. We formulate a cooperative multi-agent distributed decision-making problem in which dispatchers jointly optimize query scheduling to balance throughput against communication overhead. To solve this problem, we propose a Multi-Agent Proximal Policy Optimization (MAPPO)-based algorithm that leverages centralized training with decentralized execution (CTDE) to learn distributed query-and-dispatch policies under partial and stale observations. Numerical evaluations show that MAPPO achieves superior throughput-cost tradeoffs and significantly outperforms baseline strategies, achieving on average a 30% improvement over the closest baseline.
20. A DyL-Unet framework based on dynamic learning for Temporally Consistent Echocardiographic Segmentation
Authors: Jierui Qu, Jianchun Zhao
Published: 2025-09-23
Source: arXiv
Accurate segmentation of cardiac anatomy in echocardiography is essential for cardiovascular diagnosis and treatment. Yet echocardiography is prone to deformation and speckle noise, causing frame-to-frame segmentation jitter. Even with high accuracy in single-frame segmentation, temporal instability can weaken functional estimates and impair clinical interpretability. To address these issues, we propose DyL-UNet, a dynamic learning-based temporal consistency U-Net segmentation architecture designed to achieve temporally stable and precise echocardiographic segmentation. The framework constructs an Echo-Dynamics Graph (EDG) through dynamic learning to extract dynamic information from videos. DyL-UNet incorporates multiple Swin-Transformer-based encoder-decoder branches for processing single-frame images. It further introduces Cardiac Phase-Dynamics Attention (CPDA) at the skip connections, which uses EDG-encoded dynamic features and cardiac-phase cues to enforce temporal consistency during segmentation. Extensive experiments on the CAMUS and EchoNet-Dynamic datasets demonstrate that DyL-UNet maintains segmentation accuracy comparable to existing methods while achieving superior temporal consistency, providing a reliable solution for automated clinical echocardiography.
21. Enhancing Noise Robustness for Neural Speech Codecs through Resource-Efficient Progressive Quantization Perturbation Simulation
Authors: Rui-Chen Zheng, Yang Ai, Hui-Peng Du, Zhen-Hua Ling β’
Published: 2025-09-23 β’
Source: arXiv
Noise robustness remains a critical challenge for deploying neural speech codecs in real-world acoustic scenarios where background noise is often inevitable. A key observation we make is that even slight input noise perturbations can cause unintended shifts in quantized codewords, thereby degrading the quality of reconstructed speech. Motivated by this finding, we propose a novel and resource-efficient training strategy to enhance the noise robustness of speech codecs by simulating such perturbations directly at the quantization level. Our approach introduces two core mechanisms: (1) a distance-weighted probabilistic top-K sampling strategy that replaces the conventional deterministic nearest-neighbor selection in residual vector quantization (RVQ); and (2) a progressive training scheme that introduces perturbations from the last to the first quantizer in a controlled manner. Crucially, our method is trained exclusively on clean speech, eliminating the need for any paired noisy-clean data. Experiments on two advanced neural speech codecs, Encodec and WavTokenizer, demonstrate that the proposed strategy substantially improves robustness under noisy conditions (for example, boosting UTMOS from 3.475 to 3.586 at 15 dB SNR on Encodec) while also enhancing coding quality for clean speech.
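The two mechanisms named in the abstract can be sketched concretely: replace nearest-neighbor codeword selection with distance-weighted sampling over the top-K candidates, applied progressively from the last quantizer toward the first. This is a minimal NumPy sketch under our own assumptions (squared Euclidean distances, a softmax temperature `tau`, and a `perturb_from` stage threshold); the paper's exact parameterization is not given in the abstract.

```python
import numpy as np

def topk_perturbed_quantize(x, codebook, k=4, tau=0.1, rng=None):
    """Distance-weighted probabilistic top-K codeword selection: sample among
    the k nearest codewords with probability decreasing in distance,
    simulating the codeword shifts that input noise would cause."""
    rng = rng or np.random.default_rng()
    d2 = np.sum((codebook - x) ** 2, axis=1)   # squared distance to each codeword
    top = np.argsort(d2)[:k]                   # the k nearest candidates
    logits = -d2[top] / tau                    # closer -> higher weight
    p = np.exp(logits - logits.max())
    p /= p.sum()
    idx = int(rng.choice(top, p=p))
    return idx, codebook[idx]

def residual_vq_train_pass(x, codebooks, k=4, tau=0.1, perturb_from=1, rng=None):
    """One RVQ pass where stages with index >= perturb_from use probabilistic
    selection (progressive scheme: move the threshold from the last stage
    toward the first as training proceeds)."""
    residual, codes = x.copy(), []
    for stage, cb in enumerate(codebooks):
        if stage >= perturb_from:
            idx, q = topk_perturbed_quantize(residual, cb, k, tau, rng)
        else:
            idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
            q = cb[idx]
        codes.append(idx)
        residual = residual - q
    return codes, x - residual   # selected codes and the reconstruction
```

With `k=1` the sampler degenerates to the conventional deterministic nearest-neighbor rule, so the perturbation strength is controlled entirely by `k` and `tau`.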
22. Fully Learnable Neural Reward Machines
Authors: Hazem Dewidar, Elena Umili β’
Published: 2025-09-23 β’
Source: arXiv
Non-Markovian Reinforcement Learning (RL) tasks present significant challenges, as agents must reason over entire trajectories of state-action pairs to make optimal decisions. A common strategy to address this is through symbolic formalisms, such as Linear Temporal Logic (LTL) or automata, which provide a structured way to express temporally extended objectives. However, these approaches often rely on restrictive assumptions -- such as the availability of a predefined Symbol Grounding (SG) function mapping raw observations to high-level symbolic representations, or prior knowledge of the temporal task. In this work, we propose a fully learnable version of Neural Reward Machines (NRM), which can learn both the SG function and the automaton end-to-end, removing any reliance on prior knowledge. Our approach is therefore as easily applicable as classic deep RL (DRL) approaches, while being far more explainable, because of the finite and compact nature of automata. Furthermore, we show that by integrating Fully Learnable Reward Machines (FLNRM) with DRL, our method outperforms previous approaches based on Recurrent Neural Networks (RNNs).
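For readers unfamiliar with reward machines, the object being learned can be illustrated with a hand-specified one: a finite automaton whose transitions emit rewards, turning a non-Markovian objective into a Markovian one over (environment state, machine state) pairs. FLNRM learns both the symbol grounding and this transition structure end-to-end; in the sketch below both are fixed by hand, and the task ("reach the key, then the door") is our own toy example.

```python
class RewardMachine:
    """A minimal reward machine: a finite automaton over high-level symbols
    whose transitions emit rewards. Missing edges self-loop with reward 0."""

    def __init__(self, transitions, initial=0):
        # transitions: {(state, symbol): (next_state, reward)}
        self.transitions = transitions
        self.initial = initial
        self.state = initial

    def reset(self):
        self.state = self.initial

    def step(self, symbol):
        self.state, reward = self.transitions.get(
            (self.state, symbol), (self.state, 0.0))
        return reward

# "Reach the key, then the door": reward only on key -> door, state 2 absorbing.
rm = RewardMachine({
    (0, "key"):  (1, 0.0),
    (1, "door"): (2, 1.0),
})
```

In the fully learnable setting, the lookup table becomes a differentiable transition tensor and the symbols come from a learned grounding function over raw observations.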
23. From latent factors to language: a user study on LLM-generated explanations for an inherently interpretable matrix-based recommender system
Authors: Maxime Manderlier, Fabian Lecron, Olivier Vu Thanh, Nicolas Gillis β’
Published: 2025-09-23 β’
Source: arXiv
We investigate whether large language models (LLMs) can generate effective, user-facing explanations from a mathematically interpretable recommendation model. The model is based on constrained matrix factorization, where user types are explicitly represented and predicted item scores share the same scale as observed ratings, making the model's internal representations and predicted scores directly interpretable. This structure is translated into natural language explanations using carefully designed LLM prompts. Many works in explainable AI rely on automatic evaluation metrics, which often fail to capture users' actual needs and perceptions. In contrast, we adopt a user-centered approach: we conduct a study with 326 participants who assessed the quality of the explanations across five key dimensions (transparency, effectiveness, persuasion, trust, and satisfaction) as well as the recommendations themselves. To evaluate how different explanation strategies are perceived, we generate multiple explanation types from the same underlying model, varying the input information provided to the LLM. Our analysis reveals that all explanation types are generally well received, with moderate statistical differences between strategies. User comments further underscore how participants react to each type of explanation, offering complementary insights beyond the quantitative results.
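The interpretable structure described above (explicit user types, predictions on the rating scale) can be sketched in a few lines. This is only an illustration of why the factorization is directly readable; the membership and profile values, and the function name, are our assumptions, and the paper's actual constrained training procedure is not shown.

```python
import numpy as np

def predict_scores(memberships, type_profiles):
    """Each user is a convex mixture of explicit user types (rows of
    `memberships` lie on the simplex), and each type has a rating profile
    on the observed rating scale, so predictions land on that same scale
    and each prediction decomposes into human-readable type contributions."""
    assert np.allclose(memberships.sum(axis=1), 1.0)
    return memberships @ type_profiles   # (n_users, n_items), rating-scale
```

Because a user's score is a weighted average of type profiles, an LLM prompt can verbalize it directly ("you are 50% type A, which rates this item 5/5, ...").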
24. LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions
Authors: Xixun Lin, Yucheng Ning, Jingwen Zhang, Yan Dong, Yilong Liu, Yongxuan Wu, Xiaohua Qi, Nan Sun, Yanmin Shang, Pengfei Cao, Lixin Zou, Xu Chen, Chuan Zhou, Jia Wu, Shirui Pan, Bin Wang, Yanan Cao, Kai Chen, Songlin Hu, Li Guo β’
Published: 2025-09-23 β’
Source: arXiv
Driven by the rapid advancements of Large Language Models (LLMs), LLM-based agents have emerged as powerful intelligent systems capable of human-like cognition, reasoning, and interaction. These agents are increasingly being deployed across diverse real-world applications, including student education, scientific research, and financial analysis. However, despite their remarkable potential, LLM-based agents remain vulnerable to hallucination issues, which can result in erroneous task execution and undermine the reliability of the overall system design. Addressing this critical challenge requires a deep understanding and a systematic consolidation of recent advances on LLM-based agents. To this end, we present the first comprehensive survey of hallucinations in LLM-based agents. By carefully analyzing the complete workflow of agents, we propose a new taxonomy that identifies different types of agent hallucinations occurring at different stages. Furthermore, we conduct an in-depth examination of eighteen triggering causes underlying the emergence of agent hallucinations. Through a detailed review of a large number of existing studies, we summarize approaches for hallucination mitigation and detection, and highlight promising directions for future research. We hope this survey will inspire further efforts toward addressing hallucinations in LLM-based agents, ultimately contributing to the development of more robust and reliable agent systems.
25. SAG-SCI: the Real-time, High-level Analysis Software for Array Control and Data Acquisition of the Cherenkov Telescope Array Observatory
Authors: Gabriele Panebianco, NicolΓ² Parmiggiani, Andrea Bulgarelli, Ambra Di Piano, Luca Castaldini, Valentina Fioretti, Giovanni De Cesare, Sami Caroff, Pierre Aubert, Gilles Maurin, Vincent Pollet, Thomas Vuillaume, Igor Oya, Cristian Vignali β’
Published: 2025-09-23 β’
Source: arXiv
The Cherenkov Telescope Array Observatory (CTAO) is going to be the leading observatory for very-high-energy gamma-rays over the next decades. Its unique sensitivity, wide field of view, and rapid slewing capability make the CTAO especially suited to study transient astrophysical phenomena. The CTAO will analyse its data in real-time, responding to external science alerts on transient events and issuing its own. The Science Alert Generation (SAG) automated pipeline, a component of the Array Control and Data Acquisition (ACADA) software, is designed to detect and issue candidate science alerts. In this work, we present the current development status of SAG-SCI, the SAG component responsible for the real-time, high-level analysis of CTAO data. The SAG-SCI pipelines receive gamma-ray data from multiple reconstruction lines, merge them, store them in a database, and trigger several parallel scientific analyses on the latest data. These analyses include estimating target significance and flux, producing sky maps and light curves, and conducting blind searches for sources within the field of view. We execute SAG-SCI on a set of simulated gamma-ray data, detecting the simulated sources and accurately reconstructing their flux and position. We also estimate the systematic errors introduced by the analysis and discuss the results in relation to the generation of candidate science alerts.
26. Native Mixed Reality Compositing on Meta Quest 3: A Quantitative Feasibility Study of ARM-Based SoCs and Thermal Headroom
Authors: Muhammad Kaif Laghari, Areeb Ahmed Shaikh, Faiz Khan, Aafia Gul Siddiqui β’
Published: 2025-09-23 β’
Source: arXiv
Current mixed reality (MR) content creation relies primarily on external PC-centric platforms and third-party cameras, limiting adoption for standalone virtual reality (VR) users. In this work, we investigate the feasibility of integrating an enhanced LIV SDK-like MR compositing pipeline into the Meta Quest 3 hardware, enabling native first-person physical perspective (FPP) MR content creation without external infrastructure. We conducted a simulation-based feasibility study using hardware specifications, developer documentation, and benchmarking with ARM-based SoCs, including Snapdragon 8 Gen 3 and MediaTek Dimensity 9300. The proposed approach pairs camera passthrough enhancement, using Meta's experimental Passthrough Camera API with on-device machine learning segmentation through Unity Sentis and FastSAM, with an optimized real-time compositing engine for standalone VR. Benchmarking results show that Quest 3's Snapdragon XR2 Gen 2 can support lightweight native MR compositing at 720p30 resolution using 95% resource utilization, leaving 5% thermal headroom for sustained runtime. Comparison with next-generation SoCs such as Snapdragon 8 Gen 3 demonstrates 34% headroom, enabling more robust MR experiences with 1.5-2x faster CPU/GPU performance and higher memory bandwidth. While current Quest 3 hardware supports basic native MR compositing, thermal limits restrict operation to 5-10 minutes before throttling. Experimental results confirm standalone MR content creation is possible on current hardware for short recordings, with new XR SoCs offering the headroom for extended sessions and improved quality. These findings lay groundwork for transitioning MR content creation from PC-based workflows to all-in-one VR devices, enhancing MR production for content creators and researchers.
27. LiDAR Point Cloud Image-based Generation Using Denoising Diffusion Probabilistic Models
Authors: Amirhesam Aghanouri, Cristina Olaverri-Monreal β’
Published: 2025-09-23 β’
Source: arXiv
Autonomous vehicles (AVs) are expected to revolutionize transportation by improving efficiency and safety. Their success relies on 3D vision systems that effectively sense the environment and detect traffic agents. Among the sensors AVs use to build a comprehensive view of their surroundings, LiDAR provides high-resolution depth data enabling accurate object detection, safe navigation, and collision avoidance. However, collecting real-world LiDAR data is time-consuming and often affected by noise and sparsity due to adverse weather or sensor limitations. This work applies a denoising diffusion probabilistic model (DDPM), enhanced with novel noise scheduling and time-step embedding techniques, to generate high-quality synthetic data for augmentation, thereby improving performance across a range of computer vision tasks, particularly in AV perception. These modifications affect the denoising process and the model's temporal awareness, allowing it to produce more realistic point clouds based on the projection. The proposed method was extensively evaluated under various configurations using the IAMCV and KITTI-360 datasets, with four performance metrics compared against state-of-the-art (SOTA) methods. The results demonstrate the model's superior performance over most existing baselines and its effectiveness in mitigating the effects of noisy and sparse LiDAR data, producing diverse point clouds with rich spatial relationships and structural detail.
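The two DDPM components the abstract says were modified, noise scheduling and time-step embedding, have well-known baseline forms that can be sketched for context. Below are the standard cosine schedule (Nichol & Dhariwal) and sinusoidal time-step embedding; the paper's specific variants are not described in the abstract, so these stand in only for the general mechanism.

```python
import numpy as np

def cosine_beta_schedule(T, s=0.008):
    """Cosine noise schedule: per-step noise variances beta_t derived from a
    squared-cosine cumulative signal level alpha_bar."""
    t = np.linspace(0, T, T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)

def timestep_embedding(t, dim):
    """Sinusoidal time-step embedding (as in DDPM / Transformer positional
    encodings), giving the denoiser awareness of the diffusion step t."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])
```

Changing the schedule shifts how signal is destroyed across steps; changing the embedding shifts how strongly the network can condition on the step, which is the "temporal awareness" the abstract refers to.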
28. The AI Literacy Heptagon: A Structured Approach to AI Literacy in Higher Education
Authors: Veronika Hackl, Alexandra Mueller, Maximilian Sailer β’
Published: 2025-09-23 β’
Source: arXiv
The integrative literature review addresses the conceptualization and implementation of AI Literacy (AIL) in Higher Education (HE) by examining recent research literature. Through an analysis of publications (2021-2024), we explore (1) how AIL is defined and conceptualized in current research, particularly in HE, and how it can be delineated from related concepts such as Data Literacy, Media Literacy, and Computational Literacy; (2) how various definitions can be synthesized into a comprehensive working definition, and (3) how scientific insights can be effectively translated into educational practice. Our analysis identifies seven central dimensions of AIL: technical, applicational, critical thinking, ethical, social, integrational, and legal. These are synthesized in the AI Literacy Heptagon, deepening conceptual understanding and supporting the structured development of AIL in HE. The study aims to bridge the gap between theoretical AIL conceptualizations and the practical implementation in academic curricula.
29. SmartWilds: Multimodal Wildlife Monitoring Dataset
Authors: Jenna Kline, Anirudh Potlapally, Bharath Pillai, Tanishka Wani, Rugved Katole, Vedant Patil, Penelope Covey, Hari Subramoni, Tanya Berger-Wolf, Christopher Stewart β’
Published: 2025-09-23 β’
Source: arXiv
We present the first release of SmartWilds, a multimodal wildlife monitoring dataset. SmartWilds is a synchronized collection of drone imagery, camera trap photographs and videos, and bioacoustic recordings collected during summer 2025 at The Wilds safari park in Ohio. This dataset supports multimodal AI research for comprehensive environmental monitoring, addressing critical needs in endangered species research, conservation ecology, and habitat management. Our pilot deployment captured four days of synchronized monitoring across three modalities in a 220-acre pasture containing Père David's deer, Sichuan takin, and Przewalski's horses, as well as species native to Ohio, including bald eagles, white-tailed deer, and coyotes. We provide a comparative analysis of sensor modality performance, demonstrating complementary strengths for land-use patterns, species detection, behavioral analysis, and habitat monitoring. This work establishes reproducible protocols for multimodal wildlife monitoring while contributing open datasets to advance conservation computer vision research. Future releases will include synchronized GPS tracking data from tagged individuals, citizen science data, and expanded temporal coverage across multiple seasons.
30. Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model
Authors: Xueyu Liu, Xiaoyi Zhang, Guangze Shi, Meilin Liu, Yexin Lai, Yongfei Wu, Mingqiang Wei β’
Published: 2025-09-23 β’
Source: arXiv
Prompt quality plays a critical role in the performance of the Segment Anything Model (SAM), yet existing approaches often rely on heuristic or manually crafted prompts, limiting scalability and generalization. In this paper, we propose Point Prompt Defender, an adversarial reinforcement learning framework that adopts an attack-for-defense paradigm to automatically optimize point prompts. We construct a task-agnostic point prompt environment by representing image patches as nodes in a dual-space graph, where edges encode both physical and semantic distances. Within this environment, an attacker agent learns to activate a subset of prompts that maximally degrade SAM's segmentation performance, while a defender agent learns to suppress these disruptive prompts and restore accuracy. Both agents are trained using Deep Q-Networks with a reward signal based on segmentation quality variation. During inference, only the defender is deployed to refine arbitrary coarse prompt sets, enabling enhanced SAM segmentation performance across diverse tasks without retraining. Extensive experiments show that Point Prompt Defender effectively improves SAM's robustness and generalization, establishing a flexible, interpretable, and plug-and-play framework for prompt-based segmentation.
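The dual-space graph that defines the agents' environment can be sketched directly: image patches are nodes, and an edge weight mixes physical (grid) distance with semantic (feature) distance. The mixing weight `alpha` and the normalizations below are our assumptions; the abstract does not give the paper's exact formulation.

```python
import numpy as np

def dual_space_graph(feats, grid_hw, alpha=0.5):
    """Combined edge-distance matrix over image patches: a convex mix of
    normalized physical distance on the patch grid and cosine (semantic)
    distance between patch features."""
    h, w = grid_hw
    n = h * w
    ys, xs = np.divmod(np.arange(n), w)                 # patch grid coordinates
    coords = np.stack([ys, xs], axis=1).astype(float)
    phys = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sem = 1.0 - f @ f.T                                 # cosine distance in [0, 2]
    phys /= max(phys.max(), 1e-8)                       # normalize both to [0, 1]
    sem = np.clip(sem / 2.0, 0, 1)
    return alpha * phys + (1 - alpha) * sem
```

On top of this graph, the attacker's action space is "activate a node as a prompt" and the defender's is "suppress one", with rewards driven by the change in SAM's segmentation quality.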
31. LongCat-Flash-Thinking Technical Report
Authors: Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chao Zhang, Chengcheng Han, Chenhui Yang, Chi Zhang, Chong Peng, Chuyu Zhang, Cong Chen, Fengcun Li, Gang Xu, Guoyuan Lin, Hao Jiang, Hao Liang, Haomin Fu, Haoxiang Ma, Hong Liu, Hongyan Hao, Hongyin Tang, Hongyu Zang, Hongzhi Ni, Hui Su, Jiahao Liu, Jiahuan Li, Jialin Liu, Jianfei Zhang, Jianhao Xu, Jianing Wang, Jiaqi Sun, Jiaqi Zhang, Jiarong Shi, Jiawei Yang, Jingang Wang, Jinrui Ding, Jun Kuang, Jun Xu, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Li Wei, Liang Shi, Lin Qiu, Lingbin Kong, Lingchuan Liu, Linsen Guo, Longfei An, Mai Xia, Meng Zhou, Mengshen Zhu, Peng Pei, Pengcheng Jia, Qi Gu, Qi Guo, Qiong Huang, Quan Chen, Quanchi Weng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shanglin Lei, Shuai Du, Shuaikang Liu, Shuang Zhou, Shuhao Hu, Siyu Xu, Songshan Gong, Tao Liang, Tianhao Hu, Wei He, Wei Shi, Wei Wang, Wei Wu, Wei Zhuo, Weifeng Tang, Wenjie Shi, Wenlong Zhu, Xi Su, Xiangcheng Liu, Xiangyu Xi, Xiangzhou Huang, Xiao Liu, Xiaochen Jiang, Xiaowei Shi, Xiaowen Shi, Xiaoyu Li, Xin Chen, Xinyue Zhao, Xuan Huang, Xuemiao Zhang, Xuezhi Cao, Xunliang Cai, Yajie Zhang, Yang Chen, Yang Liu, Yang Liu, Yang Zheng, Yaoming Wang, Yaqi Huo, Yerui Sun, Yifan Lu, Yiyang Li, Youshao Xiao, Yuanzhe Lei, Yuchen Xie, Yueqing Sun, Yufei Zhang, Yuhuai Wei, Yulei Qian, Yunke Zhao, Yuqing Ding, Yuwei Jiang, Zhaohua Yang, Zhengyu Chen, Zhijian Liu, Zhikang Xia, Zhongda Su, Ziran Li, Ziwen Wang, Ziyuan Zhuang, Zongyu Wang, Zunyuan Yang β’
Published: 2025-09-23 β’
Source: arXiv
We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19,653 to 6,965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.
32. On The Reproducibility Limitations of RAG Systems
Authors: Baiqiang Wang, Dongfang Zhao, Nathan R Tallent, Luanzheng Guo β’
Published: 2025-09-23 β’
Source: arXiv
Retrieval-Augmented Generation (RAG) is increasingly employed in generative AI-driven scientific workflows to integrate rapidly evolving scientific knowledge bases, yet its reliability is frequently compromised by non-determinism in its retrieval components. This paper introduces ReproRAG, a comprehensive benchmarking framework designed to systematically measure and quantify the reproducibility of vector-based retrieval systems. ReproRAG investigates sources of uncertainty across the entire pipeline, including different embedding models, precision, retrieval algorithms, hardware configurations, and distributed execution environments. Utilizing a suite of metrics, such as Exact Match Rate, Jaccard Similarity, and Kendall's Tau, the proposed framework effectively characterizes the trade-offs between reproducibility and performance. Our large-scale empirical study reveals critical insights; for instance, we observe that the choice of embedding model has a substantial impact on RAG reproducibility. The open-sourced ReproRAG framework provides researchers and engineers with practical tools to validate deployments, benchmark reproducibility, and make informed design decisions, thereby fostering more trustworthy AI for science.
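The three metrics named above have simple definitions over retrieved result lists, sketched here in plain Python (not ReproRAG's implementation): exact match across runs, set overlap of retrieved items, and rank agreement restricted to the items both runs returned.

```python
from itertools import combinations

def exact_match_rate(runs):
    """Fraction of queries whose retrieved list is identical across all runs.
    runs: list of runs; each run is a list (per query) of retrieved-ID lists."""
    baseline = runs[0]
    return sum(all(r[q] == baseline[q] for r in runs[1:])
               for q in range(len(baseline))) / len(baseline)

def jaccard(a, b):
    """Set overlap of two retrieved lists, ignoring rank order."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def kendall_tau(a, b):
    """Kendall's tau over the items common to both rankings:
    (concordant - discordant) pairs, normalized."""
    common = [x for x in a if x in b]
    pos = {x: i for i, x in enumerate(b)}
    concordant = discordant = 0
    for x, y in combinations(common, 2):    # (x, y) ordered as in `a`
        if pos[x] < pos[y]:
            concordant += 1
        else:
            discordant += 1
    n = concordant + discordant
    return (concordant - discordant) / n if n else 1.0
```

Exact match is the strictest notion of reproducibility; Jaccard tolerates reordering, and Kendall's tau isolates pure rank instability.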
33. Memory in Large Language Models: Mechanisms, Evaluation and Evolution
Authors: Dianxing Zhang, Wendong Li, Kani Song, Jiaye Lu, Gang Li, Liuchun Yang, Sheng Li β’
Published: 2025-09-23 β’
Source: arXiv
Under a unified operational definition, we define LLM memory as a persistent state written during pretraining, finetuning, or inference that can later be addressed and that stably influences outputs. We propose a four-part taxonomy (parametric, contextual, external, procedural/episodic) and a memory quadruple (location, persistence, write/access path, controllability). We link mechanism, evaluation, and governance via the chain write -> read -> inhibit/update. To avoid distorted comparisons across heterogeneous setups, we adopt a three-setting protocol (parametric only, offline retrieval, online retrieval) that decouples capability from information availability on the same data and timeline. On this basis we build a layered evaluation: parametric (closed-book recall, edit differential, memorization/privacy), contextual (position curves and the mid-sequence drop), external (answer correctness vs snippet attribution/faithfulness), and procedural/episodic (cross-session consistency and timeline replay, E MARS+). The framework integrates temporal governance and leakage auditing (freshness hits, outdated answers, refusal slices) and uncertainty reporting via inter-rater agreement plus paired tests with multiple-comparison correction. For updating and forgetting, we present DMM Gov: coordinating DAPT/TAPT, PEFT, model editing (ROME, MEND, MEMIT, SERAC), and RAG to form an auditable loop covering admission thresholds, rollout, monitoring, rollback, and change audits, with specs for timeliness, conflict handling, and long-horizon consistency. Finally, we give four testable propositions: minimum identifiability; a minimal evaluation card; causally constrained editing with verifiable forgetting; and when retrieval with small-window replay outperforms ultra-long-context reading. This yields a reproducible, comparable, and governable coordinate system for research and deployment.
34. Model selection meets clinical semantics: Optimizing ICD-10-CM prediction via LLM-as-Judge evaluation, redundancy-aware sampling, and section-aware fine-tuning
Authors: Hong-Jie Dai, Zheng-Hao Li, An-Tai Lu, Bo-Tsz Shain, Ming-Ta Li, Tatheer Hussain Mir, Kuang-Te Wang, Min-I Su, Pei-Kang Liu, Ming-Ju Tsai β’
Published: 2025-09-23 β’
Source: arXiv
Accurate International Classification of Diseases (ICD) coding is critical for clinical documentation, billing, and healthcare analytics, yet it remains a labour-intensive and error-prone task. Although large language models (LLMs) show promise in automating ICD coding, their challenges in base model selection, input contextualization, and training data redundancy limit their effectiveness. We propose a modular framework for ICD-10 Clinical Modification (ICD-10-CM) code prediction that addresses these challenges through principled model selection, redundancy-aware data sampling, and structured input design. The framework integrates an LLM-as-judge evaluation protocol with Plackett-Luce aggregation to assess and rank open-source LLMs based on their intrinsic comprehension of ICD-10-CM code definitions. We introduce embedding-based similarity measures and a redundancy-aware sampling strategy to remove semantically duplicated discharge summaries. We leverage structured discharge summaries from Taiwanese hospitals to evaluate contextual effects and examine section-wise content inclusion under universal and section-specific modelling paradigms. Experiments across two institutional datasets demonstrate that the selected base model after fine-tuning consistently outperforms baseline LLMs in internal and external evaluations. Incorporating more clinical sections consistently improves prediction performance. This study uses open-source LLMs to establish a practical and principled approach to ICD-10-CM code prediction. The proposed framework provides a scalable, institution-ready solution for real-world deployment of automated medical coding systems by combining informed model selection, efficient data refinement, and context-aware prompting.
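Plackett-Luce aggregation, used above to combine LLM-as-judge rankings into one model ranking, can be sketched with Hunter's (2004) MM algorithm. This is a generic implementation of the statistical model, not the paper's code, and it assumes full best-first rankings over the same candidate set.

```python
import numpy as np

def plackett_luce_mm(rankings, n_items, iters=200, tol=1e-9):
    """Fit Plackett-Luce worths from full rankings via the MM algorithm.
    rankings: list of best-first permutations of range(n_items).
    Returns worths normalized to sum to 1 (higher = ranked better)."""
    w = np.ones(n_items)
    # wins[i] = number of stages at which item i was the one chosen
    # (i.e., every placement except last place).
    wins = np.zeros(n_items)
    for r in rankings:
        for i in r[:-1]:
            wins[i] += 1
    for _ in range(iters):
        denom = np.zeros(n_items)
        for r in rankings:
            remaining = np.array(r)
            for t in range(len(r) - 1):
                # Every item still in contention at stage t shares this term.
                denom[remaining[t:]] += 1.0 / w[remaining[t:]].sum()
        new_w = np.where(denom > 0, wins / np.maximum(denom, 1e-12), w)
        new_w /= new_w.sum()
        if np.abs(new_w - w).max() < tol:
            return new_w
        w = new_w
    return w
```

Sorting the fitted worths gives the consensus order of the judged LLMs, which the framework then uses to pick the base model for fine-tuning.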
35. Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography
Authors: Gianmarco Spinaci, Lukas Klic, Giovanni Colavizza β’
Published: 2025-09-23 β’
Source: arXiv
This study evaluates the capabilities of Multimodal Large Language Models (LLMs) and Vision Language Models (VLMs) in the task of single-label classification of Christian iconography. The goal was to assess whether general-purpose VLMs (CLIP and SigLIP) and LLMs, such as GPT-4o and Gemini 2.5, can interpret iconography that is typically addressed by supervised classifiers, and to evaluate their performance. Two research questions guided the analysis: (RQ1) How do multimodal LLMs perform on image classification of Christian saints? and (RQ2) How does performance vary when enriching input with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets supporting Iconclass natively: ArtDL, ICONCLASS, and Wikidata, filtered to include the top 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines. Accuracy dropped significantly on the Wikidata dataset, where SigLIP reached the highest accuracy score, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal increments in accuracy. We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains. These results support the application of LLMs as metadata curation tools in digital humanities workflows, suggesting future research on prompt optimization and the expansion of the study to other classification strategies and models.
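The CLIP/SigLIP zero-shot conditions above all reduce to the same mechanism: embed each class label (or its Iconclass description) as text, embed the image, and pick the class with the highest cosine similarity. The sketch below shows that decision rule with placeholder embeddings standing in for a real encoder's outputs; the function name and shapes are our assumptions.

```python
import numpy as np

def zero_shot_classify(image_embs, class_text_embs, class_names):
    """Zero-shot classification by cosine similarity between image
    embeddings (n_images, d) and per-class text embeddings (n_classes, d)."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = img @ txt.T                       # (n_images, n_classes)
    return [class_names[i] for i in sims.argmax(axis=1)]
```

Condition (2) in the study corresponds to swapping the bare label text for the richer Iconclass description before embedding, which changes `class_text_embs` but nothing else.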