🤖 AI Research Papers

August 14, 2025

🤖 AI-Generated Research Summary

Comprehensive Summary of 35 Research Papers on AI, LLMs, Agents, and Workflows

This summary synthesizes key insights from a diverse set of 35 recent research papers spanning large language models (LLMs), AI agents, workflow automation, and related domains. The analysis is structured to highlight key research trends, breakthrough findings, methodological approaches, applications and use cases, and future directions.


1. Key Research Trends

a. Advancements in Large Language Models (LLMs) and Multimodal Systems

b. AI Agents and Multi-Agent Systems

c. Retrieval-Augmented Generation (RAG) and Knowledge Integration

d. Evaluation, Benchmarking, and Transparency

e. Domain-Specific AI Applications


2. Breakthrough Findings


3. Methodological Approaches


4. Applications and Use Cases


5. Future Directions


Conclusion

This collection of papers reflects a vibrant, rapidly evolving AI research landscape. Key themes include the push for more efficient, transparent, and collaborative AI agents, the integration of retrieval and external knowledge, and the drive towards open, extensible frameworks. The field is also increasingly attentive to privacy, security, and societal impact, with a strong emphasis on practical applications across domains. Future research will likely focus on scalable memory architectures, privacy-preserving workflows, and robust human-AI collaboration, paving the way for more trustworthy and capable AI systems.

📚 arXiv (35 papers)
1. Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
Authors: Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen • Published: 2025-08-12 • Source: arXiv
Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work reveals a critical phenomenon, temporal oscillation, in which correct answers often emerge at intermediate denoising steps but are overwritten later. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
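A minimal sketch of the voting idea on toy data (a simple plurality vote over the decoded answer at each denoising step; the paper's actual aggregation may weight steps or cluster semantically equivalent answers):

```python
# Temporal Self-Consistency Voting, sketched: instead of trusting only the
# final denoising step, pick the answer that recurs most often across steps.
from collections import Counter

def temporal_self_consistency_vote(step_answers):
    """Return the answer most consistent across denoising steps."""
    counts = Counter(step_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Toy trace: the correct answer "42" appears mid-trajectory but is
# overwritten in the final steps (the "temporal oscillation" phenomenon).
steps = ["17", "42", "42", "42", "41", "41"]
print(temporal_self_consistency_vote(steps))  # -> "42"
```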
2. Efficient Statistical Estimation for Sequential Adaptive Experiments with Implications for Adaptive Designs
Authors: Wenxin Zhang, Mark van der Laan • Published: 2025-08-12 • Source: arXiv
Adaptive experimental designs have gained popularity in clinical trials and online experiments. Unlike traditional, fixed experimental designs, adaptive designs can dynamically adjust treatment randomization probabilities and other design features in response to data accumulated sequentially during the experiment. These adaptations are useful to achieve diverse objectives, including reducing uncertainty in the estimation of causal estimands or increasing participants' chances of receiving better treatments during the experiment. At the end of the experiment, it is often desirable to answer causal questions from the observed data. However, the adaptive nature of such experiments and the resulting dependence among observations pose significant challenges to providing valid statistical inference and efficient estimation of causal estimands. Building upon the Targeted Maximum Likelihood Estimator (TMLE) framework tailored for adaptive designs (van der Laan, 2008), we introduce a new adaptive-design-likelihood-based TMLE (ADL-TMLE) to estimate a variety of causal estimands from adaptive experiment data. We establish asymptotic normality and semiparametric efficiency of ADL-TMLE under relaxed positivity and design stabilization assumptions for adaptive experiments. Motivated by efficiency results, we further propose a novel adaptive design aimed at minimizing the variance of estimators based on data generated under that design. Using the average treatment effect as a representative example, simulation studies show that ADL-TMLE demonstrates superior variance-reduction performance across different types of adaptive experiments, and that the proposed adaptive design attains lower variance than the standard efficiency-oriented adaptive design. Finally, we generalize this estimation and design framework to broader settings with longitudinal structures.
3. Instrument-based quantum resources: quantification, hierarchies and towards constructing resource theories
Authors: Jatin Ghai, Arindam Mitra • Published: 2025-08-12 • Source: arXiv
Quantum resources are certain features of the quantum world that provide advantages in certain information-theoretic, thermodynamic, or any other useful operational tasks that are outside the realm of what classical theories can achieve. Quantum resource theories provide us with an elegant framework for studying these resources quantitatively and rigorously. While numerous state-based quantum resource theories have already been investigated, and to some extent, measurement-based resource theories have also been explored, instrument-based resource theories remain largely unexplored, with only a few notable exceptions. As quantum instruments are devices that provide both the classical outcomes of induced measurements and the post-measurement quantum states, they are quite important, especially for scenarios where multiple parties sequentially act on a quantum system. In this work, we study several instrument-based resource theories, namely (1) the resource theory of information preservability, (2) the resource theory of (strong) entanglement preservability, (3) the resource theory of (strong) incompatibility preservability, (4) the resource theory of traditional incompatibility, and (5) the resource theory of parallel incompatibility. Furthermore, we outline the hierarchies of these instrument-based resources and provide measures to quantify them. In short, we provide a detailed framework for several instrument-based quantum resource theories.
4. BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair
Authors: Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, Siheng Chen • Published: 2025-08-12 • Source: arXiv
Effective information seeking in the vast and ever-growing digital landscape requires balancing expansive search with strategic reasoning. Current large language model (LLM)-based agents struggle to achieve this balance due to limitations in search breadth and reasoning depth, where slow, serial querying restricts coverage of relevant sources and noisy raw inputs disrupt the continuity of multi-step reasoning. To address these challenges, we propose BrowseMaster, a scalable framework built around a programmatically augmented planner-executor agent pair. The planner formulates and adapts search strategies based on task constraints, while the executor conducts efficient, targeted retrieval to supply the planner with concise, relevant evidence. This division of labor preserves coherent, long-horizon reasoning while sustaining broad and systematic exploration, overcoming the trade-off that limits existing agents. Extensive experiments on challenging English and Chinese benchmarks show that BrowseMaster consistently outperforms open-source and proprietary baselines, achieving scores of 30.0 on BrowseComp-en and 46.5 on BrowseComp-zh, which demonstrates its strong capability in complex, reasoning-heavy information-seeking tasks at scale.
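A toy sketch of the planner-executor division of labor described above; `plan` and `search` are hypothetical stand-ins for the LLM planner and the tool-augmented executor, not BrowseMaster's actual prompts or tools:

```python
# Planner-executor pair, sketched: the planner emits sub-queries, the
# executor retrieves and compresses evidence, and the loop continues until
# the planner decides it has enough.
def plan(task, evidence):
    # Stand-in planner: emit the next sub-query, or None when done.
    if not evidence:
        return f"background facts about: {task}"
    return None  # stop once some evidence has been gathered

def search(query):
    # Stand-in executor: would call a search tool and compress results
    # into concise evidence snippets for the planner.
    return [f"snippet answering '{query}'"]

def browse(task, max_rounds=5):
    evidence = []
    for _ in range(max_rounds):
        query = plan(task, evidence)
        if query is None:
            break
        evidence.extend(search(query))
    return evidence

print(browse("who discovered pulsars"))
```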
5. A Review On Safe Reinforcement Learning Using Lyapunov and Barrier Functions
Authors: Dhruv S. Kushwaha, Zoleikha A. Biron • Published: 2025-08-12 • Source: arXiv
Reinforcement learning (RL) has proven to be particularly effective in solving complex decision-making problems for a wide range of applications. From a control theory perspective, RL can be considered an adaptive optimal control scheme. Lyapunov and barrier functions are the most commonly used certificates in control-theoretic approaches to guarantee, respectively, system stability for a proposed/derived controller and constraint satisfaction. However, compared to the theoretical guarantees available in control-theoretic methods, RL lacks guarantees of closed-loop stability for a computed policy and of constraint satisfaction. Safe reinforcement learning refers to a class of constrained problems where constraint violations lead to partial or complete system failure. The goal of this review is to provide an overview of safe RL techniques that use Lyapunov and barrier functions to guarantee this notion of safety (stability of the system under a computed policy, and constraint satisfaction during training and deployment). The different approaches are discussed in detail along with their benefits and shortcomings to provide critique and possible future research directions. A key motivation for this review is to survey current theoretical approaches to safety and stability guarantees in RL that parallel control-theoretic approaches based on Lyapunov and barrier functions. The review highlights the proven potential and promising scope of such methods for providing safety guarantees for complex dynamical systems with operational constraints, using both model-based and model-free RL.
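As a concrete illustration of the barrier-function idea this review surveys, here is a toy safety filter that overrides a learned action whenever a discrete-time barrier condition would be violated; the dynamics, barrier h, and decay rate are invented for the example:

```python
# Barrier-function safety filter, sketched: keep the state in the safe set
# {x : h(x) >= 0} by enforcing h(x') >= (1 - gamma) * h(x) at every step,
# picking the feasible action closest to the RL policy's choice.
def h(x):                 # barrier: safe while |x| <= 1
    return 1.0 - abs(x)

def step(x, a):           # toy scalar dynamics
    return x + 0.1 * a

def safe_action(x, a_rl, gamma=0.5):
    candidates = [a_rl] + [i / 10.0 for i in range(-10, 11)]
    feasible = [a for a in candidates
                if h(step(x, a)) >= (1.0 - gamma) * h(x)]
    if not feasible:
        return 0.0        # fall back to a default safe action
    return min(feasible, key=lambda a: abs(a - a_rl))

print(safe_action(x=0.95, a_rl=1.0))  # RL pushes toward the boundary; filter clips to 0.2
```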
6. Neutone SDK: An Open Source Framework for Neural Audio Processing
Authors: Christopher Mitcheltree, Bogdan Teleaga, Andrew Fyfe, Naotake Masuda, Matthias Schäfer, Alfie Bradic, Nao Tokui • Published: 2025-08-12 • Source: arXiv
Neural audio processing has unlocked novel methods of sound transformation and synthesis, yet integrating deep learning models into digital audio workstations (DAWs) remains challenging due to real-time / neural network inference constraints and the complexities of plugin development. In this paper, we introduce the Neutone SDK: an open source framework that streamlines the deployment of PyTorch-based neural audio models for both real-time and offline applications. By encapsulating common challenges such as variable buffer sizes, sample rate conversion, delay compensation, and control parameter handling within a unified, model-agnostic interface, our framework enables seamless interoperability between neural models and host plugins while allowing users to work entirely in Python. We provide a technical overview of the interfaces needed to accomplish this, as well as the corresponding SDK implementations. We also demonstrate the SDK's versatility across applications such as audio effect emulation, timbre transfer, and sample generation, as well as its adoption by researchers, educators, companies, and artists alike. The Neutone SDK is available at https://github.com/Neutone/neutone_sdk
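A generic sketch (not the Neutone SDK's actual API) of one problem the SDK is described as abstracting away: a model that expects fixed-size frames must serve hosts that deliver arbitrary buffer sizes, so audio is accumulated and processed in model-sized chunks:

```python
# Variable-buffer handling, sketched: accumulate host audio and run the
# model only on full frames, buffering any remainder for the next call.
import torch

class FixedFrameWrapper:
    def __init__(self, model, frame_size=512):
        self.model, self.frame_size = model, frame_size
        self.pending = torch.zeros(0)

    def process(self, buffer):          # buffer: 1-D tensor, any length
        self.pending = torch.cat([self.pending, buffer])
        out = []
        while self.pending.numel() >= self.frame_size:
            frame = self.pending[:self.frame_size]
            self.pending = self.pending[self.frame_size:]
            out.append(self.model(frame))
        return torch.cat(out) if out else torch.zeros(0)

gain = lambda x: 0.5 * x                  # stand-in "neural" audio effect
w = FixedFrameWrapper(gain)
print(w.process(torch.randn(700)).shape)  # torch.Size([512]); 188 samples buffered
```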
7. Complex Logical Instruction Generation
Authors: Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song • Published: 2025-08-12 • Source: arXiv
Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
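The verification idea is easy to picture: an instruction derived from a code function counts as followed only if the model's answer matches the function's output on test inputs. The toy function and checker below are illustrative, not the paper's generator:

```python
# Verifiable instructions from code, sketched: the source function is the
# ground truth; a model's answer to the natural-language version of this
# logic is checked by executing the function on the same input.
def reference(xs):
    # Source logic: sum the positive numbers, stopping at the first 0.
    total = 0
    for x in xs:
        if x == 0:
            break
        if x > 0:
            total += x
    return total

def follows_instruction(model_answer, test_input):
    return model_answer == reference(test_input)

print(follows_instruction(8, [3, 5, -2, 0, 9]))  # True: 3 + 5, stop at 0
```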
8. OpenCUA: Open Foundations for Computer-Use Agents
Authors: Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, Tao Yu • Published: 2025-08-12 • Source: arXiv
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
9. Spectral Efficiency Considerations for 6G
Authors: Joseph Boccuzzi • Published: 2025-08-12 • Source: arXiv
As wireless connectivity continues to evolve towards 6G, there is an ever-increasing demand to not only deliver higher throughput, lower latency, and improved reliability, but also to do so as efficiently as possible. To date, efficiency has been quantified mainly through Spectral Efficiency (SE) and Energy Efficiency (EE). In this paper we introduce a new system metric called Radio Resource Utilization Efficiency (RUE). This metric quantifies the efficiency of the available radio resources (Spectrum, Access Method, Time Slots, Data Symbols, etc.) used to deliver future 6G demands. We compare the system performance of Typical Cellular and Cell-Free Massive MIMO deployments as a vehicle to demonstrate the need for this new metric. We begin by providing a concise treatment of items impacting SE, organized into three categories: 5G Radio Resources, Practical Limitations (such as channel matrix rank deficiency), and Implementation Losses (SINR degradation). For the example Radio Access Technology configuration analyzed, we show 5G yields an RUE of 47% (revealing significant room for improvement when defining 6G). Practical limitation assumptions are compared to 5G Multi-User MIMO (MU-MIMO) measurements conducted in a commercialized deployment. SE losses are characterized to offer guidance to advanced algorithms employing Machine Learning (ML) based techniques. We present the benefits of increasing the transmission Bandwidth (BW) from 100 MHz to 1.6 GHz. We describe a Next Generation RAN architecture that can support 6G and AI-RAN.
10. Deep Neural Network Calibration by Reducing Classifier Shift with Stochastic Masking
Authors: Jiani Ni, He Zhao, Yibo Yang, Dandan Guo • Published: 2025-08-12 • Source: arXiv
In recent years, deep neural networks (DNNs) have shown competitive results in many fields. Despite this success, they often suffer from poor calibration, especially in safety-critical scenarios such as autonomous driving and healthcare, where unreliable confidence estimates can lead to serious consequences. Recent studies have focused on improving calibration by modifying the classifier, yet such efforts remain limited. Moreover, most existing approaches overlook calibration errors caused by underconfidence, which can be equally detrimental. To address these challenges, we propose MaC-Cal, a novel mask-based classifier calibration method that leverages stochastic sparsity to enhance the alignment between confidence and accuracy. MaC-Cal adopts a two-stage training scheme with adaptive sparsity, dynamically adjusting mask retention rates based on the deviation between confidence and accuracy. Extensive experiments show that MaC-Cal achieves superior calibration performance and robustness under data corruption, offering a practical and effective solution for reliable confidence estimation in DNNs.
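A toy sketch of the adaptive-sparsity intuition: raise the mask retention rate when the model is underconfident (confidence below accuracy) and lower it when overconfident. The update rule and numbers are illustrative, not MaC-Cal's actual schedule:

```python
# Adaptive stochastic masking, sketched: the retention rate tracks the gap
# between confidence and accuracy, and classifier weights are randomly
# zeroed at that rate.
import numpy as np

rng = np.random.default_rng(0)

def update_retention(rate, confidence, accuracy, lr=0.1):
    # overconfident (confidence > accuracy) -> keep fewer weights;
    # underconfident -> keep more. Illustrative rule, not the paper's.
    return float(np.clip(rate - lr * (confidence - accuracy), 0.1, 1.0))

def stochastic_mask(weights, retention_rate):
    return weights * (rng.random(weights.shape) < retention_rate)

rate = update_retention(0.9, confidence=0.85, accuracy=0.70)  # -> 0.885
w = rng.normal(size=6)
print(rate)
print(stochastic_mask(w, rate))   # classifier weights with random zeros
```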
11. Machine Learning Phonon Spectra for Fast and Accurate Optical Lineshapes of Defects
Authors: Mark E. Turiansky, John L. Lyons, Noam Bernstein • Published: 2025-08-12 • Source: arXiv
The optical properties of defects in solids produce rich physics, from gemstone coloration to single-photon emission for quantum networks. Essential to describing optical transitions is electron-phonon coupling, which can be predicted from first principles but requires computationally expensive evaluation of all phonon modes in simulation cells containing hundreds of atoms. We demonstrate that this bottleneck can be overcome using machine learning interatomic potentials with negligible accuracy loss. A key finding is that atomic relaxation data from routine first-principles calculations suffice as a dataset for fine-tuning, though additional data can further improve models. The efficiency of this approach enables studies of defect vibrational properties with high-level theory. We fine-tune to hybrid functional calculations to obtain highly accurate spectra, comparing with explicit calculations and experiments for various defects. Notably, we resolve fine details of local vibrational mode coupling in the luminescence spectrum of the T center in Si, a prominent quantum defect.
12. Tracing Large Scale Structure Morphology with Multiwavelength Line Intensity Maps
Authors: Manas Mohit Dosibhatla, Suman Majumdar, Chandra Shekhar Murmu, Samit Kumar Pal, Saswata Dasgupta, Satadru Bag • Published: 2025-08-12 • Source: arXiv
Line intensity mapping (LIM) is an emerging technique for probing the large scale structure (LSS) in the post-reionisation era. This captures the integrated flux of a particular spectral line emission from multiple sources within a patch of the sky without resolving them. Mapping different galaxy line emissions, such as the HI $21$-cm and CO rotational lines via LIM, can reveal complementary information about the bias with which the line emitters trace the underlying matter distribution and how different astrophysical phenomena affect the clustering pattern of these signals. The stage where the structures in the cosmic web merge to form a single connected structure is known as the percolation transition. Using mock HI $21$-cm and CO($1-0$) LIM signals in the post-reionisation universe, we explore the connectivity of structures through percolation analysis and compare it with that of the underlying galaxy distribution. We probe the relative contributions of voids, filaments, and sheets to the galaxy density and line intensity maps using a morphological measure known as the local dimension. The CO($1-0$) map exhibits an increased filamentary behaviour and larger contribution from sheets than the $21$-cm map. We attempt to explain such an emission of the CO($1-0$) line from biased environments. The upcoming SKA-Mid will produce tomographic intensity maps of the $21$-cm signal at $z \lesssim 3$ in Band-1. CO maps can be produced at these redshifts in phase 2 of SKA-Mid, where the frequency coverage is expected to increase up to $\sim 50$ GHz. We present forecasts for the recovery of the local dimensions of these intensity maps contaminated by instrumental noise, considering SKA-Mid observations.
13. AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
Authors: Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian • Published: 2025-08-12 • Source: arXiv
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. In addition, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
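A minimal sketch of the pipeline the abstract describes, with the LLM call stubbed out: propose test inputs, obtain expected outputs from a reference solution, and filter out inputs the solution rejects (a real pipeline would execute inside a multilingual sandbox):

```python
# Automatic test-case generation, sketched: model-proposed inputs plus a
# trusted reference solution yield (input, expected_output) pairs.
def llm_propose_inputs():
    return [[1, 2, 3], [], [5, -1]]       # stand-in for model-written inputs

def reference_solution(xs):
    return sorted(xs)

def build_test_cases():
    cases = []
    for inp in llm_propose_inputs():
        try:
            cases.append((inp, reference_solution(inp)))
        except Exception:
            continue                       # filter inputs the solution rejects
    return cases

print(build_test_cases())
```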
14. Chi-Geometry: A Library for Benchmarking Chirality Prediction of GNNs
Authors: Rylie Weaver, Massimiliano Lupo Pasini • Published: 2025-08-12 • Source: arXiv
We introduce Chi-Geometry - a library that generates graph data for testing and benchmarking GNNs' ability to predict chirality. Chi-Geometry generates synthetic graph samples with (i) user-specified geometric and topological traits to isolate certain types of samples and (ii) randomized node positions and species to minimize extraneous correlations. Each generated graph contains exactly one chiral center labeled either R or S, while all other nodes are labeled N/A (non-chiral). The generated samples are then combined into a cohesive dataset that can be used to assess a GNN's ability to predict chirality as a node classification task. Chi-Geometry allows more interpretable and less confounding benchmarking of GNNs for prediction of chirality in the graph samples which can guide the design of new GNN architectures with improved predictive performance. We illustrate Chi-Geometry's efficacy by using it to generate synthetic datasets for benchmarking various state-of-the-art (SOTA) GNN architectures. The conclusions of these benchmarking results guided our design of two new GNN architectures. The first GNN architecture established all-to-all connections in the graph to accurately predict chirality across all challenging configurations where previously tested SOTA models failed, but at a computational cost (both for training and inference) that grows quadratically with the number of graph nodes. The second GNN architecture avoids all-to-all connections by introducing a virtual node in the original graph structure of the data, which restores the linear scaling of training and inference computational cost with respect to the number of nodes in the graph, while still ensuring competitive accuracy in detecting chirality with respect to SOTA GNN architectures.
15. Deep Learning Models for Robust Facial Liveness Detection
Authors: Oleksandr Kuznetsov, Emanuele Frontoni, Luca Romeo, Riccardo Rosati, Andrea Maranesi, Alessandro Muscatello • Published: 2025-08-12 • Source: arXiv
In the rapidly evolving landscape of digital security, biometric authentication systems, particularly facial recognition, have emerged as integral components of various security protocols. However, the reliability of these systems is compromised by sophisticated spoofing attacks, where imposters gain unauthorized access by falsifying biometric traits. Current literature reveals a concerning gap: existing liveness detection methodologies - designed to counteract these breaches - fall short against advanced spoofing tactics employing deepfakes and other artificial intelligence-driven manipulations. This study introduces a robust solution through novel deep learning models addressing the deficiencies in contemporary anti-spoofing techniques. By innovatively integrating texture analysis and reflective properties associated with genuine human traits, our models distinguish authentic presence from replicas with remarkable precision. Extensive evaluations were conducted across five diverse datasets, encompassing a wide range of attack vectors and environmental conditions. Results demonstrate substantial advancement over existing systems, with our best model (AttackNet V2.2) achieving 99.9% average accuracy when trained on combined data. Moreover, our research unveils critical insights into the behavioral patterns of impostor attacks, contributing to a more nuanced understanding of their evolving nature. The implications are profound: our models do not merely fortify the authentication processes but also instill confidence in biometric systems across various sectors reliant on secure access.
16. Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages
Authors: Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan • Published: 2025-08-12 • Source: arXiv
Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM's embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
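A minimal PyTorch sketch of the Global Softmax variant as described: one learnable scalar per encoder layer, softmax-normalized, mixing all intermediate hidden states before projection into the LLM's embedding space (the dimensions are illustrative):

```python
# Global Softmax layer fusion, sketched: learn a weight per encoder layer
# and project the weighted sum of hidden states into the LLM's space.
import torch
import torch.nn as nn

class GlobalSoftmaxFusion(nn.Module):
    def __init__(self, num_layers, enc_dim, llm_dim):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, hidden_states):        # (layers, batch, seq, enc_dim)
        w = torch.softmax(self.layer_logits, dim=0)
        fused = (w.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)
        return self.proj(fused)               # (batch, seq, llm_dim)

fusion = GlobalSoftmaxFusion(num_layers=12, enc_dim=768, llm_dim=4096)
h = torch.randn(12, 2, 16, 768)              # stacked encoder layer outputs
print(fusion(h).shape)                        # torch.Size([2, 16, 4096])
```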
17. SPARC: Soft Probabilistic Adaptive multi-interest Retrieval Model via Codebooks for recommender system
Authors: Jialiang Shi, Yaguang Dou, Tian Qi • Published: 2025-08-12 • Source: arXiv
Modeling multi-interests has arisen as a core problem in real-world recommender systems (RS). Current multi-interest retrieval methods face two major challenges: 1) interests, typically extracted from predefined external knowledge, are invariant and fail to evolve dynamically with users' real-time consumption preferences; 2) online inference typically employs an over-exploitative strategy, mainly matching users' existing interests while lacking proactive exploration and discovery of novel and long-tail interests. To address these challenges, we propose a novel retrieval framework named SPARC (Soft Probabilistic Adaptive Retrieval Model via Codebooks). Our contribution is twofold. First, the framework utilizes a Residual Quantized Variational Autoencoder (RQ-VAE) to construct a discretized interest space. It achieves joint training of the RQ-VAE with the industrial large-scale recommendation model, mining behavior-aware interests that can perceive user feedback and evolve dynamically. Second, a probabilistic interest module predicts the probability distribution over the entire dynamic, discrete interest space. This facilitates an efficient "soft-search" strategy during online inference, revolutionizing the retrieval paradigm from "passive matching" to "proactive exploration" and thereby effectively promoting interest discovery. Online A/B tests on an industrial platform with tens of millions of daily active users have achieved substantial gains in business metrics: a +0.9% increase in user view duration, a +0.4% increase in user page views (PV), and a +22.7% improvement in PV500 (new content reaching 500 PVs in 24 hours). Offline evaluations were conducted on the open-source Amazon Product datasets; metrics such as Recall@K and Normalized Discounted Cumulative Gain@K (NDCG@K) also showed consistent improvement. Both online and offline experiments validate the efficacy and practical value of the proposed method.
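A toy numpy sketch of the residual-quantization step underlying an RQ-VAE's discrete interest space: each stage snaps the running residual to its nearest codeword, and the stage-wise indices form the discrete code. The codebooks here are random stand-ins for learned ones:

```python
# Residual quantization, sketched: stage-by-stage nearest-codeword lookup,
# subtracting the chosen codeword from the residual each time.
import numpy as np

rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 stages, 8 codes each

def residual_quantize(vec):
    residual, codes = vec.astype(float), []
    for book in codebooks:
        idx = int(np.argmin(((book - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - book[idx]
    return codes

print(residual_quantize(rng.normal(size=4)))  # three stage-wise indices, e.g. [5, 2, 7]
```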
18. Charged Proca Stars
Authors: Yahir Mio, Miguel Alcubierre • Published: 2025-08-12 • Source: arXiv
We consider self-gravitating stationary configurations of a charged massive complex Proca field, also known as charged Proca stars, in the particular case of spherical symmetry. We first present a general 3+1 decomposition of the Einstein-Maxwell-Proca system, starting from the action and field equations. We then restrict our system to the case of spherical symmetry and, after imposing a harmonic time dependence ansatz for the Proca field, we construct families of charged Proca stars for different values of the charge parameter $q$, and different values of the central Proca scalar potential $\varphi$. In a similar way to the case of scalar boson stars, one can define a critical charge $q=q_c$ that corresponds to the value for which the Coulomb repulsion of the charged Proca field exactly cancels their Newtonian gravitational attraction. We find that supercritical solutions can exist for a limited range of charges above the critical value $q>q_c$. We also consider the binding energy $E_B$ for the different families of solutions, and find that gravitationally bound solutions such that $E_B<0$ can only exist for subcritical charges, $q<q_c$.
19. READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Authors: Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Sultan Isali, Vasily Kalugin, Stanislav Ilyushin, Nuriza Aitassova, Yi Fei, Zeng Weidi • Published: 2025-08-12 • Source: arXiv
Large Language Models (LLMs) generate tokens autoregressively, with each token depending on the preceding context. This sequential nature makes the inference process inherently difficult to accelerate, posing a significant challenge for efficient deployment. In recent years, various methods have been proposed to address this issue, with the most effective approaches often involving the training of additional draft models. In this paper, we introduce READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel lossless speculative decoding method that enhances model-based approaches by leveraging self-repetitions in the text. Our algorithm expands the speculative decoding tree using tokens obtained through statistical search. This work focuses on large batch sizes (>= 8), an underexplored yet important area for industrial applications. We also analyze the key-value (KV) cache size during speculative decoding and propose an optimization to improve performance for large batches. As a result, READER outperforms existing speculative decoding methods. Notably, READER requires no additional training and can reuse pre-trained speculator models, increasing the speedup by over 40%. Our method demonstrates particularly strong performance on search-based tasks, such as retrieval-augmented generation, where we achieve more than 10x speedup.
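The retrieval idea can be sketched without any model: look up the current suffix in the already-generated context and propose whatever followed it last time as draft tokens for the verifier. The last-occurrence scan below is a simplification of READER's statistical search:

```python
# Retrieval-assisted drafting, sketched: self-repetition in the context
# supplies cheap draft tokens that the main model then verifies.
def retrieve_draft(tokens, suffix_len=2, draft_len=3):
    suffix = tokens[-suffix_len:]
    for i in range(len(tokens) - suffix_len - 1, -1, -1):
        if tokens[i:i + suffix_len] == suffix:
            return tokens[i + suffix_len:i + suffix_len + draft_len]
    return []

ctx = "the cat sat on the mat and the cat".split()
print(retrieve_draft(ctx))  # ['sat', 'on', 'the'] -- candidate draft tokens
```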
20. Finite-dimensional approximations of generalized squeezing
Authors: Sahel Ashhab, Felix Fischer, Davide Lonigro, Daniel Braak, Daniel Burgarth • Published: 2025-08-12 • Source: arXiv
We show unexpected behaviour in simulations of generalized squeezing performed with finite-dimensional truncations of the Fock space: even for extremely large dimension of the state space, the results depend on whether the truncation dimension is even or odd. This situation raises the question whether the simulation results are physically meaningful. We demonstrate that, in fact, the two truncation schemes correspond to two well-defined, distinct unitary evolutions whose generators are defined on different subsets of the infinite-dimensional Fock space. This is a consequence of the fact that the generalized squeezing Hamiltonian is not self-adjoint on states with finite excitations, but possesses multiple self-adjoint extensions. Furthermore, we present results on the spectrum of the squeezing operators corresponding to even and odd truncation size that elucidate the properties of the two different self-adjoint extensions corresponding to the even and odd truncation scheme. To make the squeezing operator applicable to a physical system, we must regularize it by other terms that depend on the specifics of the experimental implementation. We show that the addition of a Kerr interaction term in the Hamiltonian leads to uniquely converging simulations, with no dependence on the parity of the truncation size, and demonstrate that the Kerr term indeed renders the Hamiltonian self-adjoint and thus physically interpretable.
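The parity effect invites a quick numerical experiment. The numpy sketch below (not the authors' code) builds a cubic generalized-squeezing generator $H = i(a^3 - a^{\dagger 3})$ in truncations of dimension $N$ and $N+1$ and compares their spectra:

```python
# Even vs. odd Fock-space truncation, sketched: diagonalize the truncated
# generator at neighboring dimensions and compare the edge of the spectrum.
import numpy as np

def annihilation(dim):
    # matrix of the bosonic annihilation operator in a dim-level truncation
    return np.diag(np.sqrt(np.arange(1, dim)), k=1)

def squeezing_spectrum(dim, k=3):
    a = annihilation(dim)
    ak = np.linalg.matrix_power(a, k)
    H = 1j * (ak - ak.conj().T)   # Hermitian in any finite truncation
    return np.linalg.eigvalsh(H)

# The paper reports that results depend on this parity even at large dimension.
print(squeezing_spectrum(200)[-3:])   # even truncation
print(squeezing_spectrum(201)[-3:])   # odd truncation
```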
21. Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams
Authors: Zane Witherspoon, Thet Mon Aye, YingYing Hao • Published: 2025-08-12 • Source: arXiv
The rapid emergence of large language models (LLMs) has raised urgent questions across the modern workforce about this new technology's strengths, weaknesses, and capabilities. For privacy professionals, the question is whether these AI systems can provide reliable support on regulatory compliance, privacy program management, and AI governance. In this study, we evaluate ten leading open and closed LLMs, including models from OpenAI, Anthropic, Google DeepMind, Meta, and DeepSeek, by benchmarking their performance on industry-standard certification exams: CIPP/US, CIPM, CIPT, and AIGP from the International Association of Privacy Professionals (IAPP). Each model was tested using official sample exams in a closed-book setting and compared to IAPP's passing thresholds. Our findings show that several frontier models such as Gemini 2.5 Pro and OpenAI's GPT-5 consistently achieve scores exceeding the standards for professional human certification - demonstrating substantial expertise in privacy law, technical controls, and AI governance. The results highlight both the strengths and domain-specific gaps of current LLMs and offer practical insights for privacy officers, compliance leads, and technologists assessing the readiness of AI tools for high-stakes data governance roles. This paper provides an overview for professionals navigating the intersection of AI advancement and regulatory risk and establishes a machine benchmark based on human-centric evaluations.
22. Beyond Predictions: A Study of AI Strength and Weakness Transparency Communication on Human-AI Collaboration
Authors: Tina Behzad, Nikolos Gurney, Ning Wang, David V. Pynadath • Published: 2025-08-12 • Source: arXiv
The promise of human-AI teaming lies in humans and AI working together to achieve performance levels neither could accomplish alone. Effective communication between AI and humans is crucial for teamwork, enabling users to efficiently benefit from AI assistance. This paper investigates how AI communication impacts human-AI team performance. We examine AI explanations that convey an awareness of its strengths and limitations. To achieve this, we train a decision tree on the model's mistakes, allowing it to recognize and explain where and why it might err. Through a user study on an income prediction task, we assess the impact of varying levels of information and explanations about AI predictions. Our results show that AI performance insights enhance task performance, and conveying AI awareness of its strengths and weaknesses improves trust calibration. These findings highlight the importance of considering how information delivery influences user trust and reliance in AI-assisted decision-making.
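The mistake-modeling step is straightforward to sketch: fit a decision tree to predict where the primary model errs, then read its rules back as strength/weakness explanations. The data below is a synthetic stand-in for the income-prediction features:

```python
# Explaining AI weaknesses, sketched: a decision tree trained on
# (features, model_is_wrong) yields human-readable rules about failure regions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # stand-in user-facing features
model_is_wrong = (X[:, 0] > 0.5).astype(int)     # pretend failure pattern

tree = DecisionTreeClassifier(max_depth=2).fit(X, model_is_wrong)
print(export_text(tree, feature_names=["age", "hours", "education"]))
```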
23. Stochastic Decentralized Optimization of Non-Smooth Convex and Convex-Concave Problems over Time-Varying Networks
Authors: Maxim Divilkovskiy, Alexander Gasnikov • Published: 2025-08-12 • Source: arXiv
We study non-smooth stochastic decentralized optimization problems over time-varying networks, where objective functions are distributed across nodes and network connections may intermittently appear or break. Specifically, we consider two settings: (i) stochastic non-smooth (strongly) convex optimization, and (ii) stochastic non-smooth (strongly) convex-(strongly) concave saddle point optimization. Convex problems of this type commonly arise in deep neural network training, while saddle point problems are central to machine learning tasks such as the training of generative adversarial networks (GANs). Prior works have primarily focused on the smooth setting, or time-invariant network scenarios. We extend the existing theory to the more general non-smooth and stochastic setting over time-varying networks and saddle point problems. Our analysis establishes upper bounds on both the number of stochastic oracle calls and communication rounds, matching lower bounds for both convex and saddle point optimization problems.
24. MechaFormer: Sequence Learning for Kinematic Mechanism Design Automation
Authors: Diana Bolanos, Mohammadmehdi Ataei, Pradeep Kumar Jayaraman • Published: 2025-08-12 • Source: arXiv
Designing mechanical mechanisms to trace specific paths is a classic yet notoriously difficult engineering problem, characterized by a vast and complex search space of discrete topologies and continuous parameters. We introduce MechaFormer, a Transformer-based model that tackles this challenge by treating mechanism design as a conditional sequence generation task. Our model learns to translate a target curve into a domain-specific language (DSL) string, simultaneously determining the mechanism's topology and geometric parameters in a single, unified process. MechaFormer significantly outperforms existing baselines, achieving state-of-the-art path-matching accuracy and generating a wide diversity of novel and valid designs. We demonstrate a suite of sampling strategies that can dramatically improve solution quality and offer designers valuable flexibility. Furthermore, we show that the high-quality outputs from MechaFormer serve as excellent starting points for traditional optimizers, creating a hybrid approach that finds superior solutions with remarkable efficiency.
25. Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory
Authors: Sizhe Yuen, Francisco Gomez Medina, Ting Su, Yali Du, Adam J. Sobey • Published: 2025-08-12 • Source: arXiv
Multi-agent systems built on Large Language Models (LLMs) show exceptional promise for complex collaborative problem-solving, yet they face fundamental challenges stemming from context window limitations that impair memory consistency, role adherence, and procedural integrity. This paper introduces Intrinsic Memory Agents, a novel framework that addresses these limitations through structured agent-specific memories that evolve intrinsically with agent outputs. Specifically, our method maintains role-aligned memory templates that preserve specialized perspectives while focusing on task-relevant information. We benchmark our approach on the PDDL dataset, comparing its performance to existing state-of-the-art multi-agent memory approaches and showing an improvement of 38.6% with the highest token efficiency. In an additional evaluation on a complex data pipeline design task, we demonstrate that our approach produces higher-quality designs across 5 metrics: scalability, reliability, usability, cost-effectiveness, and documentation, with additional qualitative evidence of the improvements. Our findings suggest that addressing memory limitations through structured, intrinsic approaches can improve the capabilities of multi-agent LLM systems on structured planning tasks.
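A toy sketch of a role-aligned memory template in this spirit: each agent keeps a structured, role-specific memory updated from its own outputs rather than replaying the full chat history. The summarize stub marks where an LLM call would go, and the field names are invented:

```python
# Structured intrinsic memory, sketched: the template evolves with the
# agent's own outputs instead of accumulating raw dialogue.
def summarize(old, new_output):
    return (old + " | " + new_output)[-200:]   # stand-in for an LLM update

memory = {
    "role": "data_engineer",
    "decisions": [],
    "open_questions": [],
    "summary": "",
}

def update_memory(mem, agent_output, decision=None):
    if decision:
        mem["decisions"].append(decision)
    mem["summary"] = summarize(mem["summary"], agent_output)
    return mem

update_memory(memory, "Proposed a Kafka ingest layer.", decision="use Kafka")
print(memory["decisions"], "|", memory["summary"])
```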
26. A Framework for FAIR and CLEAR Ecological Data and Knowledge: Semantic Units for Synthesis and Causal Modelling
Authors: Lars Vogt, Birgitta König-Ries, Tim Alamenciak, Joshua I. Brian, Carlos Alberto Arnillas, Lotte Korell, Robert Frühstückl, Tina Heger • Published: 2025-08-12 • Source: arXiv
Ecological research increasingly relies on integrating heterogeneous datasets and knowledge to explain and predict complex phenomena. Yet, differences in data types, terminology, and documentation often hinder interoperability, reuse, and causal understanding. We present the Semantic Units Framework, a novel, domain-agnostic semantic modelling approach applied here to ecological data and knowledge in compliance with the FAIR (Findable, Accessible, Interoperable, Reusable) and CLEAR (Cognitively interoperable, semantically Linked, contextually Explorable, easily Accessible, human-Readable and -interpretable) Principles. The framework models data and knowledge as modular, logic-aware semantic units: single propositions (statement units) or coherent groups of propositions (compound units). Statement units can model measurements, observations, or universal relationships, including causal ones, and link to methods and evidence. Compound units group related statement units into reusable, semantically coherent knowledge objects. Implemented using RDF, OWL, and knowledge graphs, semantic units can be serialized as FAIR Digital Objects with persistent identifiers, provenance, and semantic interoperability. We show how universal statement units build ecological causal networks, which can be composed into causal maps and perspective-specific subnetworks. These support causal reasoning, confounder detection (back-door), effect identification with unobserved confounders (front-door), application of do-calculus, and alignment with Bayesian networks, structural equation models, and structural causal models. By linking fine-grained empirical data to high-level causal reasoning, the Semantic Units Framework provides a foundation for ecological knowledge synthesis, evidence annotation, cross-domain integration, reproducible workflows, and AI-ready ecological research.
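A hypothetical rdflib sketch of a single statement unit: one proposition (here, a measurement) reified under its own identifier so that method and provenance can be attached. The namespace and terms are illustrative, not the framework's actual vocabulary:

```python
# A statement unit as RDF, sketched: one proposition with its own IRI,
# ready to carry provenance and method links.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("https://example.org/semantic-units/")
g = Graph()
unit = EX["stmt-001"]                      # identifier of the statement unit

g.add((unit, RDF.type, EX.StatementUnit))
g.add((unit, EX.subject, EX.plot42))       # what the proposition is about
g.add((unit, EX.predicate, EX.aboveGroundBiomass))
g.add((unit, EX.value, Literal(12.7, datatype=XSD.double)))
g.add((unit, EX.measuredWith, EX.allometricMethod))   # method/provenance hook

print(g.serialize(format="turtle"))
```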
27. Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions
Authors: Miruna-Alexandra Gafencu, Reem Shaban, Yordanka Velikova, Mohammad Farid Azampour, Nassir Navab • Published: 2025-08-12 • Source: arXiv
Ultrasound (US) imaging is increasingly used in spinal procedures due to its real-time, radiation-free capabilities; however, its effectiveness is hindered by shadowing artifacts that obscure deeper tissue structures. Traditional approaches, such as CT-to-US registration, incorporate anatomical information from preoperative CT scans to guide interventions, but they are limited by complex registration requirements, differences in spine curvature, and the need for recent CT imaging. Recent shape completion methods can offer an alternative by reconstructing spinal structures in US data, while being pretrained on large set of publicly available CT scans. However, these approaches are typically offline and have limited reproducibility. In this work, we introduce a novel integrated system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization. Our robotic platform autonomously acquires US sweeps of the lumbar spine, extracts vertebral surfaces from ultrasound, and reconstructs the complete anatomy using a deep learning-based shape completion network. This framework provides interactive, real-time visualization with the capability to autonomously repeat scans and can enable navigation to target locations. This can contribute to better consistency, reproducibility, and understanding of the underlying anatomy. We validate our approach through quantitative experiments assessing shape completion accuracy and evaluations of multiple spine acquisition protocols on a phantom setup. Additionally, we present qualitative results of the visualization on a volunteer scan.
28. Sound Signal Synthesis with Auxiliary Classifier GAN, COVID-19 cough as an example
Authors: Yahya Sherif Solayman Mohamed Saleh, Ahmed Mohammed Dabbous, Lama Alkhaled, Hum Yan Chai, Muhammad Ehsan Rana, Hamam Mokayed • Published: 2025-08-12 • Source: arXiv
One of the fastest-growing domains in AI is healthcare. Given its importance, many researchers have sought to deploy ML models in the ever-demanding healthcare domain to aid doctors and increase accessibility. Delivering reliable models, however, demands a sizable amount of data, and the recent COVID-19 pandemic served as a reminder of how the rampant and unpredictable nature of healthcare crises makes training models difficult. To alleviate such scarcity, many published works have attempted to synthesize radiological and cough data to train better COVID-19 detection models on the respective data. To accommodate the time sensitivity expected during a pandemic, this work focuses on detecting COVID-19 through coughs using synthetic data to improve the accuracy of the classifier. The work begins by training a CNN on a balanced subset of the Coughvid dataset, establishing a baseline classification test accuracy of 72%. The paper demonstrates how an Auxiliary Classifier GAN (ACGAN) may be trained to conditionally generate novel synthetic Mel spectrograms of both healthy and COVID-19 coughs. These synthetic coughs are used to augment the training dataset of the CNN classifier, allowing it to reach a new test accuracy of 75%. The work highlights the expected messiness and inconsistency in training and offers insights into detecting and handling such shortcomings.
29. Reducing Cognitive Load in Multi-Agent Reinforcement Learning for Mathematical Problem Solving: Decoupling Reasoning and Code Generation
Authors: Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li • Published: 2025-08-12 • Source: arXiv
Current tool-integrated mathematical reasoning systems often adopt a single-agent paradigm, where one large language model handles problem reasoning, code generation, and code execution in an integrated workflow. While this design eases coordination, we hypothesize that it imposes cognitive load interference, as the agent must interleave long-horizon reasoning with precise program synthesis. We validate this hypothesis through a controlled comparison between a reasoning-only agent and a reasoning-plus-code agent, finding that the latter produces significantly fewer correct reasoning paths despite having tool-calling capabilities. To address this, we propose a dual-agent hybrid framework: a Reasoning Agent performs stepwise problem decomposition, and a Code Agent handles code generation and execution. Training combines imitation learning and reinforcement learning: the Code Agent receives strong rewards for matching intermediate ground-truth programs and weaker rewards for valid execution, while the Reasoning Agent is optimized chiefly via final-answer accuracy using advantage estimation to credit intermediate steps. This decoupled role design reduces cognitive interference and promotes stable reasoning-coding coordination.
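A toy sketch of the reward split described above: the Code Agent earns a strong reward for matching the ground-truth intermediate program and a weaker one for merely executing, while the Reasoning Agent is scored on final-answer accuracy. The reward values are illustrative:

```python
# Decoupled reward design, sketched: separate reward functions for the
# Code Agent and the Reasoning Agent.
def code_agent_reward(program, gt_program, executed_ok):
    if program == gt_program:
        return 1.0          # strong reward: matches the reference program
    return 0.2 if executed_ok else 0.0   # weak reward for valid execution

def reasoning_agent_reward(final_answer, gt_answer):
    return 1.0 if final_answer == gt_answer else 0.0

print(code_agent_reward("x + 1", "x + 1", True),
      reasoning_agent_reward(42, 42))
```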
30. Optimal Autonomous MLIP Dataset Building
Authors: Vincent G. Fletcher, Albert P. Bartók, Livia B. Pártay • Published: 2025-08-12 • Source: arXiv
We propose a novel approach for constructing training databases for Machine Learning Interatomic Potential (MLIP) models, specifically designed to capture phase properties across a wide range of conditions. The framework is uniquely appealing due to its ease of automation, its suitability for iterative learning, and its independence from prior knowledge of stable phases, avoiding bias towards pre-existing structural data. The approach uses Nested Sampling (NS) to explore the configuration space and generate thermodynamically relevant configurations, forming the database which undergoes ab-initio Density Functional Theory (DFT) evaluation. We use the Atomic Cluster Expansion (ACE) architecture to fit a model on the resulting database. To demonstrate the efficiency of the framework, we apply it to magnesium, developing a model capable of accurately describing behaviour across pressure and temperature ranges of 0-600 GPa and 0-8000 K, respectively. We benchmark the model's performance by calculating phonon spectra and elastic constants, as well as the pressure-temperature phase diagram within this region. The results showcase the power of the framework to produce robust MLIPs while maintaining transferability and generality, for reduced computational cost.
31. A bottleneck model with shared autonomous vehicles: Scale economies and price regulations
Authors: Koki Satsukawa, Yuki Takayama • Published: 2025-08-12 • Source: arXiv
This study examines how scale economies in the operation of shared autonomous vehicles (SAVs) affect the efficiency of a transportation system where SAVs coexist with normal vehicles (NVs). We develop a bottleneck model where commuters choose their departure times and mode of travel between SAVs and NVs, and analyze equilibria under three SAV-fare scenarios: marginal-cost pricing, average-cost pricing, and unregulated monopoly pricing. Marginal-cost pricing reduces commuting costs but results in financial deficits for the service provider. Average-cost pricing ensures financial sustainability but has contrasting effects depending on the timing of implementation due to the existence of multiple equilibria: when implemented too early, it discourages adoption of SAVs and increases commuting costs; when introduced after SAV adoption reaches the monopoly equilibrium level, it promotes high adoption and achieves substantial cost reductions without a deficit. We also show that expanding road capacity may increase commuting costs under average-cost pricing, demonstrating the Downs-Thomson paradox in transportation systems with SAVs. We next examine two optimal policies that reduce social cost (including the operator's profit): the first-best policy, which combines marginal-cost pricing with congestion tolls, and the second-best policy, which relies on fare regulation alone. Our analysis shows that these policies can limit excessive adoption by discouraging overuse of SAVs. This suggests that promoting SAV adoption does not always lower social cost.
32. The Roots of International Perceptions: Simulating US Attitude Changes Towards China with LLM Agents
Authors: Nicholas Sukiennik, Yichuan Xu, Yuqing Kan, Jinghua Piao, Yuwei Yan, Chen Gao, Yong Li • Published: 2025-08-12 • Source: arXiv
The rise of LLMs poses new possibilities in modeling opinion evolution, a long-standing task in simulation, by leveraging advanced reasoning abilities to recreate complex, large-scale human cognitive trends. While most prior works focus on opinion evolution surrounding specific isolated events or views within a single country, ours is the first to model the large-scale attitude evolution of a population representing an entire country towards another: US citizens' perspectives towards China. To tackle the challenges of this broad scenario, we propose a framework that integrates media data collection, user profile creation, and a cognitive architecture for opinion updates to successfully reproduce the real trend of US attitudes towards China over a 20-year period from 2005 to today. We also leverage LLMs' capabilities to introduce debiased media exposure, extracting neutral events from typically subjective news content, to uncover the roots of polarized opinion formation, as well as a devil's advocate agent to help explain the rare reversal from negative to positive attitudes towards China, corresponding with changes in the way Americans obtain information about the country. The simulation results, beyond validating our framework architecture, also reveal the impact of biased framing and selection bias in shaping attitudes. Overall, our work contributes a new paradigm for LLM-based modeling of cognitive behaviors in a large-scale, long-term, cross-border social context. It provides insights into the formation of international biases, offers valuable implications for media consumers seeking to understand the factors shaping their perspectives, and ultimately contributes to the larger social need for bias reduction and cross-cultural tolerance.
33. Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation
Authors: Yuechen Wang, Yuming Qiao, Dan Meng, Jun Yang, Haonan Lu, Zhenyu Yang, Xudong Zhang • Published: 2025-08-12 • Source: arXiv
Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios like news analysis and trending topics. However, existing approaches often suffer from rigid retrieval strategies and under-utilization of visual information. To bridge this gap, we propose E-Agent, an agent framework featuring two key innovations: a mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning, and a task executor employing tool-aware execution sequencing to implement optimized mRAG workflows. E-Agent adopts a one-time mRAG planning strategy that enables efficient information retrieval while minimizing redundant tool invocations. To rigorously assess the planning capabilities of mRAG systems, we introduce the Real-World mRAG Planning (RemPlan) benchmark. This novel benchmark contains both retrieval-dependent and retrieval-independent question types, systematically annotated with essential retrieval tools required for each instance. The benchmark's explicit mRAG planning annotations and diverse question design enhance its practical relevance by simulating real-world scenarios requiring dynamic mRAG decisions. Experiments across RemPlan and three established benchmarks demonstrate E-Agent's superiority: 13% accuracy gain over state-of-the-art mRAG methods while reducing redundant searches by 37%.
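A toy sketch of the one-time planning idea: the planner is invoked once to emit a complete tool sequence, which the executor then runs in order, avoiding repeated planner calls. The tools and fixed plan are invented stand-ins for E-Agent's trained planner:

```python
# One-time mRAG planning, sketched: plan once, then execute the tool
# sequence without further planner invocations.
TOOLS = {
    "image_caption": lambda q: f"caption for: {q}",
    "web_search": lambda q: f"search results for: {q}",
}

def plan_once(question):
    # Stand-in for the trained planner; a real one reasons over the
    # question and decides which tools (if any) are needed.
    return ["image_caption", "web_search"]

def execute(question):
    evidence = [TOOLS[t](question) for t in plan_once(question)]
    return evidence   # would be fed to the MLLM to produce the answer

print(execute("what event is shown in this photo?"))
```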
34. Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering
Authors: Yunfeng Ning, Mayi Xu, Jintao Wen, Qiankun Pi, Yuanyuan Zhu, Ming Zhong, Jiawei Jiang, Tieyun Qian • Published: 2025-08-12 • Source: arXiv
LLMs often suffer from hallucinations and outdated or incomplete knowledge. RAG is proposed to address these issues by integrating external knowledge, like that in KGs, into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potentially insecure data transmission, especially when using third-party LLM APIs lacking transparency and control. In this paper, we investigate the privacy-protected RAG scenario for the first time, where entities in KGs are anonymous for LLMs, thus preventing them from accessing entity semantics. Due to the loss of entity semantics, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) how can anonymous entities be converted into retrievable information? (2) how can question-relevant anonymous entities be retrieved? Hence, we propose a novel ARoG framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. It supplements meaningful semantics that can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. To guide LLMs to effectively retrieve knowledge from KGs, the two strategies strictly protect privacy from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy-robustness.
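A toy sketch of relation-centric abstraction: an anonymized entity id carries no semantics, so it is described by the labels of its adjacent relations, giving retrieval something meaningful to match questions against. The tiny KG below is invented:

```python
# Relation-centric abstraction, sketched: describe an anonymous entity by
# the relations around it rather than by its (meaningless) identifier.
kg = [  # (head, relation, tail), entities already anonymized
    ("E17", "directed", "E03"),
    ("E17", "born_in", "E09"),
    ("E03", "won_award", "E21"),
]

def abstract_entity(entity):
    relations = sorted({r for h, r, t in kg if entity in (h, t)})
    return f"thing with relations: {', '.join(relations)}"

print(abstract_entity("E17"))  # "thing with relations: born_in, directed"
```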
35. DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation
Authors: Stavros Doropoulos, Stavros Vologiannidis, Ioannis Magnisalis • Published: 2025-08-12 • Source: arXiv
The manual translation of unstructured team dialogue into the structured artifacts required for Information Technology (IT) project governance is a critical bottleneck in modern information systems management. We introduce DevNous, a Large Language Model (LLM)-based multi-agent expert system, to automate this unstructured-to-structured translation process. DevNous integrates directly into team chat environments, identifying actionable intents from informal dialogue and managing stateful, multi-turn workflows for core administrative tasks like automated task formalization and progress summary synthesis. To quantitatively evaluate the system, we introduce a new benchmark of 160 realistic, interactive conversational turns. The dataset was manually annotated with a multi-label ground truth and is publicly available. On this benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a multiset F1-Score of 0.845, providing strong evidence for its viability. The primary contributions of this work are twofold: (1) a validated architectural pattern for developing ambient administrative agents, and (2) the introduction of the first robust empirical baseline and public benchmark dataset for this challenging problem domain.
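The multiset F1 named above can be computed over label multisets, so repeated labels count with multiplicity. A small sketch with invented labels (the benchmark's actual label set is not shown here):

```python
# Multiset F1, sketched: precision and recall over multiset intersection.
from collections import Counter

def multiset_f1(predicted, gold):
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if not predicted or not gold:
        return 0.0
    p, r = overlap / len(predicted), overlap / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

print(multiset_f1(["create_task", "summarize"], ["create_task", "set_deadline"]))
# 0.5: one of two predictions correct, one of two gold labels recovered
```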