1. MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
Authors: Zilin Xiao, Qi Ma, Mengting Gu, Chun-cheng Jason Chen, Xintao Chen, Vicente Ordonez, Vijai Mohan β’
Published: 2025-09-22 β’
Source: arXiv
Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitively expensive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval, where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters.
2. UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Authors: Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen β’
Published: 2025-09-22 β’
Source: arXiv
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
3. ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation
Authors: Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, Kfir Aberman β’
Published: 2025-09-22 β’
Source: arXiv
Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: https://snap-research.github.io/composeme/.
4. DESI Strong Lens Foundry III: Keck Spectroscopy for Strong Lenses Discovered Using Residual Neural Networks
Authors: Shrihan Agarwal, Xiaosheng Huang, William Sheu, Christopher J. Storfer, Marcos Tamargo-Arizmendi, Suchitoto Tabares-Tarquinio, D. J. Schlegel, G. Aldering, A. Bolton, A. Cikota, Arjun Dey, A. Filipp, E. Jullo, K. J. Kwon, S. Perlmutter, Y. Shu, E. Sukay, N. Suzuki, J. Aguilar, S. Ahlen, S. BenZvi, D. Brooks, T. Claybaugh, P. Doel, J. E. Forero-Romero, E. GaztaΓ±aga, S. Gontcho A Gontcho, G. Gutierrez, K. Honscheid, M. Ishak, S. Juneau, R. Kehoe, T. Kisner, S. E. Koposov, A. Lambert, M. Landriau, L. Le Guillou, A. de la Macorra, A. Meisner, R. Miquel, J. Moustakas, A. D. Myers, C. Poppett, F. Prada, I. PΓ©rez-RΓ fols, G. Rossi, E. Sanchez, M. Schubnell, D. Sprayberry, G. TarlΓ©, B. A. Weaver, H. Zou β’
Published: 2025-09-22 β’
Source: arXiv
We present spectroscopic data of strong lenses and their source galaxies using the Keck Near-Infrared Echellette Spectrometer (NIRES) and the Dark Energy Spectroscopic Instrument (DESI), providing redshifts necessary for nearly all strong-lensing applications with these systems, especially the extraction of physical parameters from lensing modeling. These strong lenses were found in the DESI Legacy Imaging Surveys using Residual Neural Networks (ResNet) and followed up by our Hubble Space Telescope program, with all systems displaying unambiguous lensed arcs. With NIRES, we target eight lensed sources at redshifts difficult to measure in the optical range and determine the source redshifts for six, between $z_s$ = 1.675 and 3.332. DESI observed one of the remaining source redshifts, as well as an additional source redshift within the six systems. The two systems with non-detections by NIRES were observed for a considerably shorter 600s at high airmass. Combining NIRES infrared spectroscopy with optical spectroscopy from our DESI Strong Lensing Secondary Target Program, these results provide the complete lens and source redshifts for six systems, a resource for refining automated strong lens searches in future deep- and wide-field imaging surveys and addressing a range of questions in astrophysics and cosmology.
5. Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli β’
Published: 2025-09-22 β’
Source: arXiv
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.
6. Star-forming galaxies in the cosmic web in the last 11 Gyr
Authors: B. Jego, K. Kraljic, M. BΓ©thermin, R. DavΓ© β’
Published: 2025-09-22 β’
Source: arXiv
We investigate how the star formation activity of galaxies depends on their position within the cosmic web using the SIMBA cosmological simulation from redshift $z=3$ to $z=0$. While previous studies found that galaxies closer to filaments tend to be more massive and quenched, it remained unclear whether these trends reflect intrinsic environmental effects or changes in the galaxy population mix. To address this, we focus exclusively on star-forming galaxies, robustly selected using both the specific star formation rate (sSFR) and gas depletion timescale criteria, in order to isolate the direct impact of the cosmic web on star-forming galaxies. We reconstruct the 3D cosmic web skeleton using DisPerSE and compute each galaxy's distance to its nearest filament. After removing mass dependencies, we examine deviations in star formation rate (SFR), sSFR, molecular and atomic gas depletion timescales, and gas fractions as a function of this distance. We find a clear and redshift-dependent modulation of star formation with filament proximity: at high redshift ($z \gtrsim 2$), galaxies closer to filaments show enhanced SFR and gas accretion, reflecting efficient filament-fed growth. At $z=0$, we observe a V-shaped trend in the sSFR and depletion timescales, with minima at intermediate distances ($\sim 0.25$ cMpc) and a surprising upturn very close to the filament cores, suggesting a resumed accretion in the densest environments. These effects are not driven by mergers and are primarily associated with satellite galaxies at low redshift. Our results demonstrate that large-scale cosmic web proximity modulates star formation in star-forming galaxies through a combination of gas supply regulation and environmental processing, with different mechanisms dominating across cosmic time.
7. Thermal field theory correlators in the large-$N$ limit and the spectral duality relation
Authors: SaΕ‘o Grozdanov, Mile Vrbica β’
Published: 2025-09-22 β’
Source: arXiv
In Ref.~\cite{Grozdanov:2024wgo}, we derived a spectral duality relation applicable to the spectra of 3$d$ conformal field theories (CFTs) and their holographically dual 4$d$ black holes. In this work, we further elaborate on the properties of this duality relation and argue that the same relation can be applied to certain pairs of thermal correlator spectra in large-$N$ quantum field theories in any number of spacetime dimensions, provided the correlators are meromorphic functions with only simple poles and satisfy the thermal product formula. We discuss a rich set of properties that such retarded two-point functions must exhibit. We then show that the spectral duality relation and its implications apply to pairs of correlators in double-trace deformed CFTs and, more generally, to correlators in theories related by the Legendre transform. We illustrate, through several examples, how the spectrum of one correlator can be reconstructed from that of its dual correlation function. Notably, this includes cases relating the thermal spectra of scalar primary operators at ultraviolet and infrared fixed points, as well as current operators in a CFT$_3$ and its particle-vortex dual.
8. The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies
Authors: Jiaxu Zhou, Jen-tse Huang, Xuhui Zhou, Man Ho Lam, Xintao Wang, Hao Zhu, Wenxuan Wang, Maarten Sap β’
Published: 2025-09-22 β’
Source: arXiv
Large Language Models (LLMs) are increasingly used for social simulation, where populations of agents are expected to reproduce human-like collective behavior. However, we find that many recent studies adopt experimental designs that systematically undermine the validity of their claims. From a survey of over 40 papers, we identify six recurring methodological flaws: agents are often homogeneous (Profile), interactions are absent or artificially imposed (Interaction), memory is discarded (Memory), prompts tightly control outcomes (Minimal-Control), agents can infer the experimental hypothesis (Unawareness), and validation relies on simplified theoretical models rather than real-world data (Realism). For instance, GPT-4o and Qwen-3 correctly infer the underlying social experiment in 53.1% of cases when given instructions from prior work-violating the Unawareness principle. We formalize these six requirements as the PIMMUR principles and argue they are necessary conditions for credible LLM-based social simulation. To demonstrate their impact, we re-run five representative studies using a framework that enforces PIMMUR and find that the reported social phenomena frequently fail to emerge under more rigorous conditions. Our work establishes methodological standards for LLM-based multi-agent research and provides a foundation for more reliable and reproducible claims about "AI societies."
9. Functional effects models: Accounting for preference heterogeneity in panel data with machine learning
Authors: Nicolas SalvadΓ©, Tim Hillel β’
Published: 2025-09-22 β’
Source: arXiv
In this paper, we present a general specification for Functional Effects Models, which use Machine Learning (ML) methodologies to learn individual-specific preference parameters from socio-demographic characteristics, therefore accounting for inter-individual heterogeneity in panel choice data. We identify three specific advantages of the Functional Effects Model over traditional fixed, and random/mixed effects models: (i) by mapping individual-specific effects as a function of socio-demographic variables, we can account for these effects when forecasting choices of previously unobserved individuals (ii) the (approximate) maximum-likelihood estimation of functional effects avoids the incidental parameters problem of the fixed effects model, even when the number of observed choices per individual is small; and (iii) we do not rely on the strong distributional assumptions of the random effects model, which may not match reality. We learn functional intercept and functional slopes with powerful non-linear machine learning regressors for tabular data, namely gradient boosting decision trees and deep neural networks. We validate our proposed methodology on a synthetic experiment and three real-world panel case studies, demonstrating that the Functional Effects Model: (i) can identify the true values of individual-specific effects when the data generation process is known; (ii) outperforms both state-of-the-art ML choice modelling techniques that omit individual heterogeneity in terms of predictive performance, as well as traditional static panel choice models in terms of learning inter-individual heterogeneity. The results indicate that the FI-RUMBoost model, which combines the individual-specific constants of the Functional Effects Model with the complex, non-linear utilities of RUMBoost, performs marginally best on large-scale revealed preference panel data.
10. Hybrid Reputation Aggregation: A Robust Defense Mechanism for Adversarial Federated Learning in 5G and Edge Network Environments
Authors: Saeid Sheikhi, Panos Kostakos, Lauri Loven β’
Published: 2025-09-22 β’
Source: arXiv
Federated Learning (FL) in 5G and edge network environments face severe security threats from adversarial clients. Malicious participants can perform label flipping, inject backdoor triggers, or launch Sybil attacks to corrupt the global model. This paper introduces Hybrid Reputation Aggregation (HRA), a novel robust aggregation mechanism designed to defend against diverse adversarial behaviors in FL without prior knowledge of the attack type. HRA combines geometric anomaly detection with momentum-based reputation tracking of clients. In each round, it detects outlier model updates via distance-based geometric analysis while continuously updating a trust score for each client based on historical behavior. This hybrid approach enables adaptive filtering of suspicious updates and long-term penalization of unreliable clients, countering attacks ranging from backdoor insertions to random noise Byzantine failures. We evaluate HRA on a large-scale proprietary 5G network dataset (3M+ records) and the widely used NF-CSE-CIC-IDS2018 benchmark under diverse adversarial attack scenarios. Experimental results reveal that HRA achieves robust global model accuracy of up to 98.66% on the 5G dataset and 96.60% on NF-CSE-CIC-IDS2018, outperforming state-of-the-art aggregators such as Krum, Trimmed Mean, and Bulyan by significant margins. Our ablation studies further demonstrate that the full hybrid system achieves 98.66% accuracy, while the anomaly-only and reputation-only variants drop to 84.77% and 78.52%, respectively, validating the synergistic value of our dual-mechanism approach. This demonstrates HRA's enhanced resilience and robustness in 5G/edge federated learning deployments, even under significant adversarial conditions.
11. Matricial ranges, dilations, and unital contractive maps
Authors: Pankaj Dey, Atanu Dhang, Mithun Mukherjee β’
Published: 2025-09-22 β’
Source: arXiv
Let $J_n$ be the Jordan block of size $n$ with all eigen values zero. Arveson introduced the notion of the matricial range of an operator in his remarkable article called Subalgebras of $C^*$-algebras II (Acta Math, 128, 1972) and established that every unital positive map on the operator system generated by $J_2$ is completely positive. This describes the matricial range of $J_2$ as the set of all matrices with numerical radius at most $\frac{1}{2}$. Later, Choi and Li generalize this result of Arveson and prove that every unital positive map on the operator system generated by any $2\times 2$ matrix or any $3\times3$ matrix with a reducing subspace is completely positive. After fifty years of the above result of Arveson, the matricial range of $J_n$ for $n\geq 3$ has not been characterized. This article aims to investigate this long-standing open problem for $n=3$. We begin by establishing a structure theorem for a dilation of an operator $B$ satisfying $BB^*+B^*B=I$ and then investigate whether every $B\in\mathbb{M}_n$ satisfying $BB^*+B^*B\leq I_n$ admits a dilation $\widetilde{B}$ for which $\widetilde{B}\widetilde{B}^*+\widetilde{B}^*\widetilde{B}=I$. This study plays the central role to the development of this paper. We use this to prove that every unital contractive map on the operator system generated by $J_3$ is $2$-positive and obtain some partial results towards characterizing the matricial range of $J_3$. Next, we study unital contractive maps on operator systems generated by $4\times 4$ normal matrices, and show that this is equivalent to studying a unital contractive map on the operator system generated by $T=\text{diag}(\lambda,-1,i,-i)$, where $\Re{(\lambda)}\geq 0$. We prove that every unital contractive map on the operator system generated by $T=\text{diag}(1,-1,i,-i)$ is completely positive.
12. On the geometry and uniqueness of asymptotically locally hyperbolic static vacuum black holes
Authors: Brian Harvie, Ye-Kai Wang β’
Published: 2025-09-22 β’
Source: arXiv
We establish an inequality relating the surface gravity and topology of a horizon in a $3$-dimensional asymptotically locally hyperbolic static space with the geometry at infinity. Equality is achieved only by the Kottler black holes, and this rigidity leads to several new static black hole uniqueness theorems for a negative cosmological constant. First, the ADS-Schwarzschild black hole with critical surface gravity $\kappa=\sqrt{-\Lambda}$ is unique. Second, the toroidal Kottler black holes are unique in the absence of spherical horizons. Third, the hyperbolic Kottler black holes with mass $m>0$ are unique *if* the generalized Penrose inequality holds for the corresponding class of static spaces. Building on work of Ge-Wang-Wu-Xia, we then use this fact to obtain uniqueness for static ALH graphs with hyperbolic infinities. This inequality follows from a generalization of the Minkowski inequality in ADS-Schwarzschild space due to Brendle-Hung-Wang. Using optimal coefficients for the sub-static Heintze-Karcher inequality, we construct a new monotone quantity under inverse mean curvature flow (IMCF) in static spaces with ${\Lambda} < 0$. Another fundamental tool developed in this paper is a regularity theorem for IMCF in asymptotically locally hyperbolic manifolds. Specifically, we prove that a weak solution of IMCF in an ALH 3-manifold with horizon boundary is eventually smooth. This extends the regularity theorem for a spherical infinity due to Shi and Zhu.
13. Open-system quantum many-body scars: a theory
Authors: Lorenzo Gotta β’
Published: 2025-09-22 β’
Source: arXiv
In this work, we undertake the problem of formally introducing a notion of quantum many-body scarring in open quantum systems governed by the Lindblad equation. To this goal, we rely on the commutant-algebra framework for the description of strong symmetries to introduce the unconventional strong-symmetry structure leading to the existence of anomalous stationary states, which we dub open-system quantum many-body scars (OSQMBS), besides a typical infinite-temperature state. We provide several benchmarks of the theoretical predictions on the stationary-state manifold and on convergence to stationarity, as well as describe the time-evolution of off-diagonal coherences among the Hilbert space symmetry sectors identified by OSQMBS and their orthogonal complement. Moreover, we investigate the existence of asymptotic OSQMBS (AOSQMBS), the latter being states that, despite converging to the typical infinite-temperature state in the large-time limit, display anomalously-large relaxation time scales, which we thoroughly describe by means of the behavior of their fidelity as a function of time through proper scaling Ansaetze.
14. StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models
Authors: Haoxin Yang, Bangzhen Liu, Xuemiao Xu, Cheng Xu, Yuyang Yu, Zikai Huang, Yi Wang, Shengfeng He β’
Published: 2025-09-22 β’
Source: arXiv
The advancement of diffusion models has enhanced the realism of AI-generated content but also raised concerns about misuse, necessitating robust copyright protection and tampering localization. Although recent methods have made progress toward unified solutions, their reliance on post hoc processing introduces considerable application inconvenience and compromises forensic reliability. We propose StableGuard, a novel framework that seamlessly integrates a binary watermark into the diffusion generation process, ensuring copyright protection and tampering localization in Latent Diffusion Models through an end-to-end design. We develop a Multiplexing Watermark VAE (MPW-VAE) by equipping a pretrained Variational Autoencoder (VAE) with a lightweight latent residual-based adapter, enabling the generation of paired watermarked and watermark-free images. These pairs, fused via random masks, create a diverse dataset for training a tampering-agnostic forensic network. To further enhance forensic synergy, we introduce a Mixture-of-Experts Guided Forensic Network (MoE-GFN) that dynamically integrates holistic watermark patterns, local tampering traces, and frequency-domain cues for precise watermark verification and tampered region detection. The MPW-VAE and MoE-GFN are jointly optimized in a self-supervised, end-to-end manner, fostering a reciprocal training between watermark embedding and forensic accuracy. Extensive experiments demonstrate that StableGuard consistently outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.
15. On the Variational Costs of Changing Our Minds
Authors: David Hyland, Mahault Albarracin β’
Published: 2025-09-22 β’
Source: arXiv
The human mind is capable of extraordinary achievements, yet it often appears to work against itself. It actively defends its cherished beliefs even in the face of contradictory evidence, conveniently interprets information to conform to desired narratives, and selectively searches for or avoids information to suit its various purposes. Despite these behaviours deviating from common normative standards for belief updating, we argue that such 'biases' are not inherently cognitive flaws, but rather an adaptive response to the significant pragmatic and cognitive costs associated with revising one's beliefs. This paper introduces a formal framework that aims to model the influence of these costs on our belief updating mechanisms. We treat belief updating as a motivated variational decision, where agents weigh the perceived 'utility' of a belief against the informational cost required to adopt a new belief state, quantified by the Kullback-Leibler divergence from the prior to the variational posterior. We perform computational experiments to demonstrate that simple instantiations of this resource-rational model can be used to qualitatively emulate commonplace human behaviours, including confirmation bias and attitude polarisation. In doing so, we suggest that this framework makes steps toward a more holistic account of the motivated Bayesian mechanics of belief change and provides practical insights for predicting, compensating for, and correcting deviations from desired belief updating processes.
16. "I think this is fair'': Uncovering the Complexities of Stakeholder Decision-Making in AI Fairness Assessment
Authors: Lin Luo, Yuri Nakao, Mathieu Chollet, Hiroya Inakoshi, Simone Stumpf β’
Published: 2025-09-22 β’
Source: arXiv
Assessing fairness in artificial intelligence (AI) typically involves AI experts who select protected features, fairness metrics, and set fairness thresholds. However, little is known about how stakeholders, particularly those affected by AI outcomes but lacking AI expertise, assess fairness. To address this gap, we conducted a qualitative study with 30 stakeholders without AI expertise, representing potential decision subjects in a credit rating scenario, to examine how they assess fairness when placed in the role of deciding on features with priority, metrics, and thresholds. We reveal that stakeholders' fairness decisions are more complex than typical AI expert practices: they considered features far beyond legally protected features, tailored metrics for specific contexts, set diverse yet stricter fairness thresholds, and even preferred designing customized fairness. Our results extend the understanding of how stakeholders can meaningfully contribute to AI fairness governance and mitigation, underscoring the importance of incorporating stakeholders' nuanced fairness judgments.
17. Guided Multi-Fidelity Bayesian Optimization for Data-driven Controller Tuning with Digital Twins
Authors: Mahdi Nobar, JΓΌrg Keller, Alessandro Forino, John Lygeros, Alisa Rupenyan β’
Published: 2025-09-22 β’
Source: arXiv
We propose a \textit{guided multi-fidelity Bayesian optimization} framework for data-efficient controller tuning that integrates corrected digital twin (DT) simulations with real-world measurements. The method targets closed-loop systems with limited-fidelity simulations or inexpensive approximations. To address model mismatch, we build a multi-fidelity surrogate with a learned correction model that refines DT estimates from real data. An adaptive cost-aware acquisition function balances expected improvement, fidelity, and sampling cost. Our method ensures adaptability as new measurements arrive. The accuracy of DTs is re-estimated, dynamically adapting both cross-source correlations and the acquisition function. This ensures that accurate DTs are used more frequently, while inaccurate DTs are appropriately downweighted. Experiments on robotic drive hardware and supporting numerical studies demonstrate that our method enhances tuning efficiency compared to standard Bayesian optimization (BO) and multi-fidelity methods.
18. HICode: Hierarchical Inductive Coding with LLMs
Authors: Mian Zhong, Pristina Wang, Anjalie Field β’
Published: 2025-09-22 β’
Source: arXiv
Despite numerous applications for fine-grained corpus analysis, researchers continue to rely on manual labeling, which does not scale, or statistical tools like topic modeling, which are difficult to control. We propose that LLMs have the potential to scale the nuanced analyses that researchers typically conduct manually to large text corpora. To this effect, inspired by qualitative research methods, we develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations. Finally, we conduct a case study of litigation documents related to the ongoing opioid crisis in the U.S., revealing aggressive marketing strategies employed by pharmaceutical companies and demonstrating HICode's potential for facilitating nuanced analyses in large-scale data.
19. Medical priority fusion: achieving dual optimization of sensitivity and interpretability in nipt anomaly detection
Authors: Xiuqi Ge, Zhibo Yao, Yaosong Du β’
Published: 2025-09-22 β’
Source: arXiv
Clinical machine learning faces a critical dilemma in high-stakes medical applications: algorithms achieving optimal diagnostic performance typically sacrifice the interpretability essential for physician decision-making, while interpretable methods compromise sensitivity in complex scenarios. This paradox becomes particularly acute in non-invasive prenatal testing (NIPT), where missed chromosomal abnormalities carry profound clinical consequences yet regulatory frameworks mandate explainable AI systems. We introduce Medical Priority Fusion (MPF), a constrained multi-objective optimization framework that resolves this fundamental trade-off by systematically integrating Naive Bayes probabilistic reasoning with Decision Tree rule-based logic through mathematically-principled weighted fusion under explicit medical constraints. Rigorous validation on 1,687 real-world NIPT samples characterized by extreme class imbalance (43.4:1 normal-to-abnormal ratio) employed stratified 5-fold cross-validation with comprehensive ablation studies and statistical hypothesis testing using McNemar's paired comparisons. MPF achieved simultaneous optimization of dual objectives: 89.3% sensitivity (95% CI: 83.9-94.7%) with 80% interpretability score, significantly outperforming individual algorithms (McNemar's test, p < 0.001). The optimal fusion configuration achieved Grade A clinical deployment criteria with large effect size (d = 1.24), establishing the first clinically-deployable solution that maintains both diagnostic accuracy and decision transparency essential for prenatal care. This work demonstrates that medical-constrained algorithm fusion can resolve the interpretability-performance trade-off, providing a mathematical framework for developing high-stakes medical decision support systems that meet both clinical efficacy and explainability requirements.
20. Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent
Authors: Junyu Lu, Songxin Zhang, Zejian Xie, Zhuoyang Song, Jiaxing Zhang β’
Published: 2025-09-22 β’
Source: arXiv
Recent advances in GUI agents have achieved remarkable grounding and action-prediction performance, yet existing models struggle with unreliable reward signals and limited online trajectory generation. In this paper, we introduce Orcust, a framework that integrates Principle-Constrained Reward Modeling (PCRM) and Online VM-Grounded Trajectory Construction (OVTC) to enhance reasoning reliability and data efficiency in interactive GUI tasks. We leverages environment-verifiable and LLM-derived principle to enforce interpretable reward signals that constrain long chain-of-thought reasoning and rule-based feedback. OVTC spins up instrumented virtual machines to autonomously collect structured GUI interaction trajectories with explicit procedural and structural objectives, enabling the training of a stepwise reward model that robustly captures human preferences and adheres to task-specific constraints. Extensive experiments on standard GUI benchmarks covering perceptual grounding, foundational operations, and end-to-end task execution reveal that Orcust achieves state-of-the-art performance, improving by 22.2\% on ScreenSpot and 23.9\% on ScreenSpot-Pro over the base model (i.e. Qwen2.5-VL-7B). The results demonstrate Orcust's effectiveness in enhancing the reasoning, adaptability and scalability of GUI agents across various environments and task complexities.
21. Trainee Action Recognition through Interaction Analysis in CCATT Mixed-Reality Training
Authors: Divya Mereddy, Marcos Quinones-Grueiro, Ashwin T S, Eduardo Davalos, Gautam Biswas, Kent Etherton, Tyler Davis, Katelyn Kay, Jill Lear, Benjamin Goldberg β’
Published: 2025-09-22 β’
Source: arXiv
This study examines how Critical Care Air Transport Team (CCATT) members are trained using mixed-reality simulations that replicate the high-pressure conditions of aeromedical evacuation. Each team - a physician, nurse, and respiratory therapist - must stabilize severely injured soldiers by managing ventilators, IV pumps, and suction devices during flight. Proficient performance requires clinical expertise and cognitive skills, such as situational awareness, rapid decision-making, effective communication, and coordinated task management, all of which must be maintained under stress. Recent advances in simulation and multimodal data analytics enable more objective and comprehensive performance evaluation. In contrast, traditional instructor-led assessments are subjective and may overlook critical events, thereby limiting generalizability and consistency. However, AI-based automated and more objective evaluation metrics still demand human input to train the AI algorithms to assess complex team dynamics in the presence of environmental noise and the need for accurate re-identification in multi-person tracking. To address these challenges, we introduce a systematic, data-driven assessment framework that combines Cognitive Task Analysis (CTA) with Multimodal Learning Analytics (MMLA). We have developed a domain-specific CTA model for CCATT training and a vision-based action recognition pipeline using a fine-tuned Human-Object Interaction model, the Cascade Disentangling Network (CDN), to detect and track trainee-equipment interactions over time. These interactions automatically yield performance indicators (e.g., reaction time, task duration), which are mapped onto a hierarchical CTA model tailored to CCATT operations, enabling interpretable, domain-relevant performance evaluations.
22. Sight Over Site: Perception-Aware Reinforcement Learning for Efficient Robotic Inspection
Authors: Richard Kuhlmann, Jakob Wolfram, Boyang Sun, Jiaxu Xing, Davide Scaramuzza, Marc Pollefeys, Cesar Cadena β’
Published: 2025-09-22 β’
Source: arXiv
Autonomous inspection is a central problem in robotics, with applications ranging from industrial monitoring to search-and-rescue. Traditionally, inspection has often been reduced to navigation tasks, where the objective is to reach a predefined location while avoiding obstacles. However, this formulation captures only part of the real inspection problem. In real-world environments, the inspection targets may become visible well before their exact coordinates are reached, making further movement both redundant and inefficient. What matters more for inspection is not simply arriving at the target's position, but positioning the robot at a viewpoint from which the target becomes observable. In this work, we revisit inspection from a perception-aware perspective. We propose an end-to-end reinforcement learning framework that explicitly incorporates target visibility as the primary objective, enabling the robot to find the shortest trajectory that guarantees visual contact with the target without relying on a map. The learned policy leverages both perceptual and proprioceptive sensing and is trained entirely in simulation, before being deployed to a real-world robot. We further develop an algorithm to compute ground-truth shortest inspection paths, which provides a reference for evaluation. Through extensive experiments, we show that our method outperforms existing classical and learning-based navigation approaches, yielding more efficient inspection trajectories in both simulated and real-world settings. The project is avialable at https://sight-over-site.github.io/
23. AEAS: Actionable Exploit Assessment System
Authors: Xiangmin Shen, Wenyuan Cheng, Yan Chen, Zhenyuan Li, Yuqiao Gu, Lingzhi Wang, Wencheng Zhao, Dawei Sun, Jiashui Wang β’
Published: 2025-09-22 β’
Source: arXiv
Security practitioners face growing challenges in exploit assessment, as public vulnerability repositories are increasingly populated with inconsistent and low-quality exploit artifacts. Existing scoring systems, such as CVSS and EPSS, offer limited support for this task. They either rely on theoretical metrics or produce opaque probability estimates without assessing whether usable exploit code exists. In practice, security teams often resort to manual triage of exploit repositories, which is time-consuming, error-prone, and difficult to scale. We present AEAS, an automated system designed to assess and prioritize actionable exploits through static analysis. AEAS analyzes both exploit code and associated documentation to extract a structured set of features reflecting exploit availability, functionality, and setup complexity. It then computes an actionability score for each exploit and produces ranked exploit recommendations. We evaluate AEAS on a dataset of over 5,000 vulnerabilities derived from 600+ real-world applications frequently encountered by red teams. Manual validation and expert review on representative subsets show that AEAS achieves a 100% top-3 success rate in recommending functional exploits and shows strong alignment with expert-validated rankings. These results demonstrate the effectiveness of AEAS in supporting exploit-driven vulnerability prioritization.
24. Fine-Grained Detection of AI-Generated Text Using Sentence-Level Segmentation
Authors: Lekkala Sai Teja, Annepaka Yadagiri, and Partha Pakray, Chukhu Chunka, Mangadoddi Srikar Vardhan β’
Published: 2025-09-22 β’
Source: arXiv
Generation of Artificial Intelligence (AI) texts in important works has become a common practice that can be used to misuse and abuse AI at various levels. Traditional AI detectors often rely on document-level classification, which struggles to identify AI content in hybrid or slightly edited texts designed to avoid detection, leading to concerns about the model's efficiency, which makes it hard to distinguish between human-written and AI-generated texts. A sentence-level sequence labeling model proposed to detect transitions between human- and AI-generated text, leveraging nuanced linguistic signals overlooked by document-level classifiers. By this method, detecting and segmenting AI and human-written text within a single document at the token-level granularity is achieved. Our model combines the state-of-the-art pre-trained Transformer models, incorporating Neural Networks (NN) and Conditional Random Fields (CRFs). This approach extends the power of transformers to extract semantic and syntactic patterns, and the neural network component to capture enhanced sequence-level representations, thereby improving the boundary predictions by the CRF layer, which enhances sequence recognition and further identification of the partition between Human- and AI-generated texts. The evaluation is performed on two publicly available benchmark datasets containing collaborative human and AI-generated texts. Our experimental comparisons are with zero-shot detectors and the existing state-of-the-art models, along with rigorous ablation studies to justify that this approach, in particular, can accurately detect the spans of AI texts in a completely collaborative text. All our source code and the processed datasets are available in our GitHub repository.
25. Enhancing the NAO: Extending Capabilities of Legacy Robots for Long-Term Research
Authors: Austin Wilson, Sahar Kapasi, Zane Greene, Alexis E. Block β’
Published: 2025-09-22 β’
Source: arXiv
Many research groups face challenges when legacy (unsupported) robotic platforms lose manufacturer support and cannot accommodate modern sensing, speech, and interaction capabilities. We present the Enhanced NAO, a revitalized version of Aldebaran's NAO robot that uses upgraded microphones, RGB-D and thermal cameras, and additional compute resources in a fully self-contained package. This system combines cloud and local models for perception and dialogue, while preserving the NAO's expressive body and behaviors. In a pilot validation study, the Enhanced NAO delivered significantly higher conversational quality and stronger user preference compared to the NAO AI Edition, without increasing response latency. Key upgrades, such as beamforming microphones and low-latency audio processing, reduced artifacts like self-hearing and improved multi-party separation. Expanded visual and thermal sensing established a foundation for future interaction capabilities. Beyond the NAO, our framework provides a platform-agnostic strategy for extending the lifespan and research utility of legacy robots, ensuring they remain valuable tools for human-robot interaction.
26. Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA
Authors: Chenglin Li, Feng Han, FengTao, Ruilin Li, Qianglong Chen, Jingqi Tong, Yin Zhang, Jiaqi Wang β’
Published: 2025-09-22 β’
Source: arXiv
Large language models (LLMs) have shown promise in generating program workflows for visual tasks. However, previous approaches often rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these challenges, we introduce the FS-VisPR framework, an adaptive visual program reasoning approach that balances fast reasoning for simple queries with slow reasoning for difficult ones. First, we design efficient visual modules (e.g., key clip retrieval and subtitle retrieval) to support long-form video tasks. Then, we construct a diverse and high-quality fast-slow reasoning dataset with a strong LLM to align open-source language models' ability to generate visual program workflows as FS-LLM. Next, we design a fast-slow reasoning framework with FS-LLM: Simple queries are directly solved by VideoLLMs, while difficult ones invoke visual program reasoning, motivated by human-like reasoning processes. During this process, low-confidence fast-thinking answers will trigger a second-stage slow-reasoning process, and a fallback mechanism to fast reasoning is activated if the program execution fails. Moreover, we improve visual programs through parameter search during both training and inference. By adjusting the parameters of the visual modules within the program, multiple variants are generated: during training, programs that yield correct answers are selected, while during inference, the program with the highest confidence result is applied. Experiments show that FS-VisPR improves both efficiency and reliability in visual program workflows. It achieves 50.4% accuracy on LVBench, surpassing GPT-4o, matching the performance of Qwen2.5VL-72B on VideoMME.
27. Existence and Synthesis of Multi-Resolution Approximate Bisimulations for Continuous-State Dynamical Systems
Authors: Rudi Coppola, Yannik Schnitzer, Mirco Giacobbe, Alessandro Abate, Manuel Mazo Jr β’
Published: 2025-09-22 β’
Source: arXiv
We present a fully automatic framework for synthesising compact, finite-state deterministic abstractions of deterministic, continuous-state autonomous systems under locally specified resolution requirements. Our approach builds on multi-resolution approximate bisimulations, a generalisation of classical $\epsilon$-approximate bisimulations, that support state-dependent error bounds and subsumes both variable- and uniform-resolution relations. We show that some systems admit multi-resolution bisimulations but no $\epsilon$-approximate bisimulation. We prove the existence of multi-resolution approximately bisimilar abstractions for all incrementally uniformly bounded ($\delta$-UB) systems, thereby broadening the applicability of symbolic verification to a larger class of dynamics; as a trivial special case, this result also covers incrementally globally asymptotically stable ($\delta$-GAS) systems. The Multi-resolution Abstraction Synthesis Problem (MRASP) is solved via a scalable Counterexample-Guided Inductive Synthesis (CEGIS) loop, combining mesh refinement with counterexample-driven refinement. This ensures soundness for all $\delta$-UB systems, and ensures termination in certain special cases. Experiments on linear and nonlinear benchmarks, including non-$\delta$-GAS and non-differentiable cases, demonstrate that our algorithm yields abstractions up to 50\% smaller than Lyapunov-based grids while enforcing tighter, location-dependent error guarantees.
28. Empirical AI Ethics: Reconfiguring Ethics towards a Situated, Plural, and Transformative Approach
Authors: Paula Helm, Selin Gerlek β’
Published: 2025-09-22 β’
Source: arXiv
Mainstream AI ethics, with its reliance on top-down, principle-driven frameworks, fails to account for the situated realities of diverse communities affected by AI (Artificial Intelligence). Critics have argued that AI ethics frequently serves corporate interests through practices of 'ethics washing', operating more as a tool for public relations than as a means of preventing harm or advancing the common good. As a result, growing scepticism among critical scholars has cast the field as complicit in sustaining harmful systems rather than challenging or transforming them. In response, this paper adopts a Science and Technology Studies (STS) perspective to critically interrogate the field of AI ethics. It hence applies the same analytic tools STS has long directed at disciplines such as biology, medicine, and statistics to ethics. This perspective reveals a core tension between vertical (top-down, principle-based) and horizontal (risk-mitigating, implementation-oriented) approaches to ethics. By tracing how these models have shaped the discourse, we show how both fall short in addressing the complexities of AI as a socio-technical assemblage, embedded in practice and entangled with power. To move beyond these limitations, we propose a threefold reorientation of AI ethics. First, we call for a shift in foundations: from top-down abstraction to empirical grounding. Second, we advocate for pluralisation: moving beyond Western-centric frameworks toward a multiplicity of onto-epistemic perspectives. Finally, we outline strategies for reconfiguring AI ethics as a transformative force, moving from narrow paradigms of risk mitigation toward co-creating technologies of hope.
29. Automated Labeling of Intracranial Arteries with Uncertainty Quantification Using Deep Learning
Authors: Javier Bisbal, Patrick Winter, Sebastian Jofre, Aaron Ponce, Sameer A. Ansari, Ramez Abdalla, Michael Markl, Oliver Welin Odeback, Sergio Uribe, Cristian Tejos, Julio Sotelo, Susanne Schnell, David Marlevi β’
Published: 2025-09-22 β’
Source: arXiv
Accurate anatomical labeling of intracranial arteries is essential for cerebrovascular diagnosis and hemodynamic analysis but remains time-consuming and subject to interoperator variability. We present a deep learning-based framework for automated artery labeling from 3D Time-of-Flight Magnetic Resonance Angiography (3D ToF-MRA) segmentations (n=35), incorporating uncertainty quantification to enhance interpretability and reliability. We evaluated three convolutional neural network architectures: (1) a UNet with residual encoder blocks, reflecting commonly used baselines in vascular labeling; (2) CS-Net, an attention-augmented UNet incorporating channel and spatial attention mechanisms for enhanced curvilinear structure recognition; and (3) nnUNet, a self-configuring framework that automates preprocessing, training, and architectural adaptation based on dataset characteristics. Among these, nnUNet achieved the highest labeling performance (average Dice score: 0.922; average surface distance: 0.387 mm), with improved robustness in anatomically complex vessels. To assess predictive confidence, we implemented test-time augmentation (TTA) and introduced a novel coordinate-guided strategy to reduce interpolation errors during augmented inference. The resulting uncertainty maps reliably indicated regions of anatomical ambiguity, pathological variation, or manual labeling inconsistency. We further validated clinical utility by comparing flow velocities derived from automated and manual labels in co-registered 4D Flow MRI datasets, observing close agreement with no statistically significant differences. Our framework offers a scalable, accurate, and uncertainty-aware solution for automated cerebrovascular labeling, supporting downstream hemodynamic analysis and facilitating clinical integration.
30. Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues
Authors: Dongxu Lu, Johan Jeuring, Albert Gatt β’
Published: 2025-09-22 β’
Source: arXiv
Evaluating large language models (LLMs) in long-form, knowledge-grounded role-play dialogues remains challenging. This study compares LLM-generated and human-authored responses in multi-turn professional training simulations through human evaluation ($N=38$) and automated LLM-as-a-judge assessment. Human evaluation revealed significant degradation in LLM-generated response quality across turns, particularly in naturalness, context maintenance and overall quality, while human-authored responses progressively improved. In line with this finding, participants also indicated a consistent preference for human-authored dialogue. These human judgements were validated by our automated LLM-as-a-judge evaluation, where Gemini 2.0 Flash achieved strong alignment with human evaluators on both zero-shot pairwise preference and stochastic 6-shot construct ratings, confirming the widening quality gap between LLM and human responses over time. Our work contributes a multi-turn benchmark exposing LLM degradation in knowledge-grounded role-play dialogues and provides a validated hybrid evaluation framework to guide the reliable integration of LLMs in training simulations.
31. Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications
Authors: Selva TaΕ, Mahmut El Huseyni, Γzay Ezerceli, Reyhan Bayraktar, Fatma BetΓΌl TerzioΔlu β’
Published: 2025-09-22 β’
Source: arXiv
The widespread adoption of Large Language Models (LLMs) has been hindered by their tendency to hallucinate, generating plausible but factually incorrect information. While Retrieval-Augmented Generation (RAG) systems attempt to address this issue by grounding responses in external knowledge, hallucination remains a persistent challenge, particularly for morphologically complex, low-resource languages like Turkish. This paper introduces Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications. Building on the LettuceDetect framework, we formulate hallucination detection as a token-level classification task and fine-tune three distinct encoder architectures: a Turkish-specific ModernBERT, TurkEmbed4STS, and multilingual EuroBERT. These models were trained on a machine-translated version of the RAGTruth benchmark dataset containing 17,790 instances across question answering, data-to-text generation, and summarization tasks. Our experimental results show that the ModernBERT-based model achieves an F1-score of 0.7266 on the complete test set, with particularly strong performance on structured tasks. The models maintain computational efficiency while supporting long contexts up to 8,192 tokens, making them suitable for real-time deployment. Comparative analysis reveals that while state-of-the-art LLMs demonstrate high recall, they suffer from low precision due to over-generation of hallucinated content, underscoring the necessity of specialized detection mechanisms. By releasing our models and translated dataset, this work addresses a critical gap in multilingual NLP and establishes a foundation for developing more reliable and trustworthy AI applications for Turkish and other languages.
32. Development and validation of an AI foundation model for endoscopic diagnosis of esophagogastric junction adenocarcinoma: a cohort and deep learning study
Authors: Yikun Ma, Bo Li, Ying Chen, Zijie Yue, Shuchang Xu, Jingyao Li, Lei Ma, Liang Zhong, Duowu Zou, Leiming Xu, Yunshi Zhong, Xiaobo Li, Weiqun Ding, Minmin Zhang, Dongli He, Zhenghong Li, Ye Chen, Ye Zhao, Jialong Zhuo, Xiaofen Wu, Lisha Yi, Miaojing Shi, Huihui Sun β’
Published: 2025-09-22 β’
Source: arXiv
The early detection of esophagogastric junction adenocarcinoma (EGJA) is crucial for improving patient prognosis, yet its current diagnosis is highly operator-dependent. This paper aims to make the first attempt to develop an artificial intelligence (AI) foundation model-based method for both screening and staging diagnosis of EGJA using endoscopic images. In this cohort and learning study, we conducted a multicentre study across seven Chinese hospitals between December 28, 2016 and December 30, 2024. It comprises 12,302 images from 1,546 patients; 8,249 of them were employed for model training, while the remaining were divided into the held-out (112 patients, 914 images), external (230 patients, 1,539 images), and prospective (198 patients, 1,600 images) test sets for evaluation. The proposed model employs DINOv2 (a vision foundation model) and ResNet50 (a convolutional neural network) to extract features of global appearance and local details of endoscopic images for EGJA staging diagnosis. Our model demonstrates satisfactory performance for EGJA staging diagnosis across three test sets, achieving an accuracy of 0.9256, 0.8895, and 0.8956, respectively. In contrast, among representative AI models, the best one (ResNet50) achieves an accuracy of 0.9125, 0.8382, and 0.8519 on the three test sets, respectively; the expert endoscopists achieve an accuracy of 0.8147 on the held-out test set. Moreover, with the assistance of our model, the overall accuracy for the trainee, competent, and expert endoscopists improves from 0.7035, 0.7350, and 0.8147 to 0.8497, 0.8521, and 0.8696, respectively. To our knowledge, our model is the first application of foundation models for EGJA staging diagnosis and demonstrates great potential in both diagnostic accuracy and efficiency.
33. AutiHero: Leveraging Generative AI in Social Narratives to Engage Parents in Story-Driven Behavioral Guidance for Autistic Children
Authors: Jungeun Lee, Kyungah Lee, Inseok Hwang, SoHyun Park, Young-Ho Kim β’
Published: 2025-09-22 β’
Source: arXiv
Social narratives are known to help autistic children understand and navigate social situations through stories. To ensure effectiveness, however, the materials need to be customized to reflect each child's unique behavioral context, requiring considerable time and effort for parents to practice at home. We present AutiHero, a generative AI-based social narrative system for behavioral guidance, which supports parents to create personalized stories for their autistic children and read them together. AutiHero generates text and visual illustrations that reflect their children's interests, target behaviors, and everyday contexts. In a two-week deployment study with 16 autistic child-parent dyads, parents created 218 stories and read an average of 4.25 stories per day, demonstrating a high level of engagement. AutiHero also provided an effective, low-demanding means to guide children's social behaviors, encouraging positive change. We discuss the implications of generative AI-infused tools to empower parents in guiding their children's behaviors, fostering their social learning.
34. LIMI: Less is More for Agency
Authors: Yang Xiao, Mohan Jiang, Jie Sun, Keyu Li, Jifan Lin, Yumin Zhuang, Ji Zeng, Shijie Xia, Qishuo Hua, Xuefeng Li, Xiaojie Cai, Tongyu Wang, Yue Zhang, Liming Liu, Xia Wu, Jinlong Hou, Yuan Cheng, Wenjie Li, Xiang Wang, Dequan Wang, Pengfei Liu β’
Published: 2025-09-22 β’
Source: arXiv
We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement with environments and tools. This fundamental capability marks the dawn of the Age of AI Agency, driven by a critical industry shift: the urgent need for AI systems that don't just think, but work. While current AI excels at reasoning and generating responses, industries demand autonomous agents that can execute tasks, operate tools, and drive real-world outcomes. As agentic intelligence becomes the defining characteristic separating cognitive systems from productive workers, efficiently cultivating machine autonomy becomes paramount. Current approaches assume that more data yields better agency, following traditional scaling laws from language modeling. We fundamentally challenge this paradigm. LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles. Through strategic focus on collaborative software development and scientific research workflows, we show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Using only 78 carefully designed training samples, LIMI achieves 73.5% on comprehensive agency benchmarks, dramatically outperforming state-of-the-art models: Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI demonstrates 53.7% improvement over models trained on 10,000 samples-achieving superior agentic intelligence with 128 times fewer samples. Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.
35. A Multimodal Conversational Assistant for the Characterization of Agricultural Plots from Geospatial Open Data
Authors: Juan CaΓ±ada, RaΓΊl Alonso, Julio Molleda, Fidel DΓez β’
Published: 2025-09-22 β’
Source: arXiv
The increasing availability of open Earth Observation (EO) and agricultural datasets holds great potential for supporting sustainable land management. However, their high technical entry barrier limits accessibility for non-expert users. This study presents an open-source conversational assistant that integrates multimodal retrieval and large language models (LLMs) to enable natural language interaction with heterogeneous agricultural and geospatial data. The proposed architecture combines orthophotos, Sentinel-2 vegetation indices, and user-provided documents through retrieval-augmented generation (RAG), allowing the system to flexibly determine whether to rely on multimodal evidence, textual knowledge, or both in formulating an answer. To assess response quality, we adopt an LLM-as-a-judge methodology using Qwen3-32B in a zero-shot, unsupervised setting, applying direct scoring in a multi-dimensional quantitative evaluation framework. Preliminary results show that the system is capable of generating clear, relevant, and context-aware responses to agricultural queries, while remaining reproducible and scalable across geographic regions. The primary contributions of this work include an architecture for fusing multimodal EO and textual knowledge sources, a demonstration of lowering the barrier to access specialized agricultural information through natural language interaction, and an open and reproducible design.