πŸ€– AI Research Papers

August 06, 2025

πŸ€– AI-Generated Research Summary

Comprehensive Summary of 35 Recent Papers on AI, LLMs, Agents, and Workflows

This summary synthesizes key insights from a diverse set of 35 research papers spanning large language models (LLMs), agents, multimodal AI, workflow automation, and their applications. The analysis is structured as follows:

  1. Key Research Trends
  2. Breakthrough Findings
  3. Methodological Approaches
  4. Applications and Use Cases
  5. Future Directions

Key Research Trends

1. Agentic and Multi-Agent Systems

2. Retrieval-Augmented Generation (RAG) and Safety

3. Multimodal and Domain-Specific AI

4. Reinforcement Learning (RL) and Optimization

5. Data Efficiency and Automated Data Generation


Breakthrough Findings


Methodological Approaches


Applications and Use Cases


Future Directions


Conclusion

This collection of papers highlights a vibrant, rapidly evolving landscape in AI research, with a strong focus on agentic workflows, multimodal intelligence, robust and secure LLM/RAG systems, and practical applications across healthcare, industry, education, and the arts. The field is moving toward more collaborative, context-aware, and trustworthy AI systems, with ongoing challenges in data efficiency, security, and fairness. Researchers and practitioners are encouraged to build on these trends, leveraging new methodologies and cross-domain insights to advance the state of the art.

πŸ“š arXiv (35 papers)
1. Can Large Vision-Language Models Understand Multimodal Sarcasm?
Authors: Xinyu Wang, Yue Zhang, Liqiang Jing β€’ Published: 2025-08-05 β€’ Source: arXiv
Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings, making it challenging for sentiment analysis and other emotion-sensitive tasks. While traditional sarcasm detection methods primarily focus on text, recent approaches have incorporated multimodal information. However, the application of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) remains underexplored. In this paper, we evaluate LVLMs in MSA tasks, specifically focusing on Multimodal Sarcasm Detection and Multimodal Sarcasm Explanation. Through comprehensive experiments, we identify key limitations, such as insufficient visual understanding and a lack of conceptual knowledge. To address these issues, we propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge to improve the model's ability to interpret and explain sarcasm in multimodal contexts. The experimental results on multiple models show the effectiveness of our proposed framework. The code is available at https://github.com/cp-cp/LVLM-MSA.
2. Optimized imaging prefiltering for enhanced image segmentation
Authors: Ronny Vallejos, Felipe Osorio, Sebastian Vidal, Grisel Britos β€’ Published: 2025-08-05 β€’ Source: arXiv
The Box-Cox transformation, introduced in 1964, is a widely used statistical tool for stabilizing variance and improving normality in data analysis. Its application in image processing, particularly for image enhancement, has gained increasing attention in recent years. This paper investigates the use of the Box-Cox transformation as a preprocessing step for image segmentation, with a focus on the estimation of the transformation parameter. We evaluate the effectiveness of the transformation by comparing various segmentation methods, highlighting its advantages for traditional machine learning techniques-especially in situations where no training data is available. The results demonstrate that the transformation enhances feature separability and computational efficiency, making it particularly beneficial for models like discriminant analysis. In contrast, deep learning models did not show consistent improvements, underscoring the differing impacts of the transformation across model types and image characteristics.
3. DiWA: Diffusion Policy Adaptation with World Models
Authors: Akshay L Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, Abhinav Valada β€’ Published: 2025-08-05 β€’ Source: arXiv
Fine-tuning diffusion policies with reinforcement learning (RL) presents significant challenges. The long denoising sequence for each action prediction impedes effective reward propagation. Moreover, standard RL methods require millions of real-world interactions, posing a major bottleneck for practical fine-tuning. Although prior work frames the denoising process in diffusion policies as a Markov Decision Process to enable RL-based updates, its strong dependence on environment interaction remains highly inefficient. To bridge this gap, we introduce DiWA, a novel framework that leverages a world model for fine-tuning diffusion-based robotic skills entirely offline with reinforcement learning. Unlike model-free approaches that require millions of environment interactions to fine-tune a repertoire of robot skills, DiWA achieves effective adaptation using a world model trained once on a few hundred thousand offline play interactions. This results in dramatically improved sample efficiency, making the approach significantly more practical and safer for real-world robot learning. On the challenging CALVIN benchmark, DiWA improves performance across eight tasks using only offline adaptation, while requiring orders of magnitude fewer physical interactions than model-free baselines. To our knowledge, this is the first demonstration of fine-tuning diffusion policies for real-world robotic skills using an offline world model. We make the code publicly available at https://diwa.cs.uni-freiburg.de.
4. Visual Execution and Validation of Finite-State Machines and Pushdown Automata
Authors: Marco T. MorazÑn, David Anthony K. Fields, Andrés M. Garced, Tijana Minić ‒ Published: 2025-08-05 ‒ Source: arXiv
In Formal Languages and Automata Theory courses, students find understanding nondeterministic finite-state and pushdown automata difficult. In many cases, this means that it is challenging for them to comprehend the operational semantics of such machines and, as a consequence, determine why a word is accepted or rejected. This is not entirely surprising, because students are mostly trained to design and implement deterministic programs. Comprehension of pushdown automata is further complicated, because reasoning about the stack is necessary. A common difficulty students face, for example, is understanding that two different computations on the same word may reach the same state with different stack values. To aid student understanding, we present two novel dynamic visualization tools for FSM -- a domain-specific programming language for the Automata Theory classroom -- to support the design of such machines. These tools visualize all computations that may be performed, respectively, by a nondeterministic finite-state machine or by a pushdown automata in a stepwise manner. In addition, these tools aid the machine verification process by allowing users to visually validate whether the properties a state represents hold when a machine transitions into it.
5. Design Support for Multitape Turing Machines
Authors: Marco T. MorazΓ‘n, Oliwia Kempinski, AndrΓ©s M. Garced β€’ Published: 2025-08-05 β€’ Source: arXiv
Many Formal Languages and Automata Theory courses introduce students to Turing machine extensions. One of the most widely-used extensions endows Turing machines with multiple tapes. Although multitape Turing machines are an abstraction to simplify Turing machine design, students find them no less challenging. To aid students in understanding these machines, the FSM programming language provides support for their definition and execution. This, however, has proven insufficient for many students to understand the operational semantics of such machines and to understand why such machines accept or reject a word. To address this problem, three visualization tools have been developed. The first is a dynamic visualization tool that simulates machine execution. The second is a static visualization tool that automatically renders a graphic for a multitape Turing machine's transition diagram. The third is a static visualization tool that automatically renders computation graphs for multitape Turing machines. This article presents these tools and illustrates how they are used to help students design and implement multitape Turing machines. In addition, empirical data is presented that suggests these tools are well-received and found useful by students.
6. Cross-patient Seizure Onset Zone Classification by Patient-Dependent Weight
Authors: Xuyang Zhao, Hidenori Sugano, Toshihisa Tanaka β€’ Published: 2025-08-05 β€’ Source: arXiv
Identifying the seizure onset zone (SOZ) in patients with focal epilepsy is essential for surgical treatment and remains challenging due to its dependence on visual judgment by clinical experts. The development of machine learning can assist in diagnosis and has made promising progress. However, unlike data in other fields, medical data is usually collected from individual patients, and each patient has different illnesses, physical conditions, and medical histories, which leads to differences in the distribution of each patient's data. This makes it difficult for a machine learning model to achieve consistently reliable performance in every new patient dataset, which we refer to as the "cross-patient problem." In this paper, we propose a method to fine-tune a pretrained model using patient-specific weights for every new test patient to improve diagnostic performance. First, the supervised learning method is used to train a machine learning model. Next, using the intermediate features of the trained model obtained through the test patient data, the similarity between the test patient data and each training patient's data is defined to determine the weight of each training patient to be used in the following fine-tuning. Finally, we fine-tune all parameters in the pretrained model with training data and patient weights. In the experiment, the leave-one-patient-out method is used to evaluate the proposed method, and the results show improved classification accuracy for every test patient, with an average improvement of more than 10%.
7. FPG-NAS: FLOPs-Aware Gated Differentiable Neural Architecture Search for Efficient 6DoF Pose Estimation
Authors: Nassim Ali Ousalah, Peyman Rostami, Anis Kacem, Enjie Ghorbel, Emmanuel Koumandakis, Djamila Aouada β€’ Published: 2025-08-05 β€’ Source: arXiv
We introduce FPG-NAS, a FLOPs-aware Gated Differentiable Neural Architecture Search framework for efficient 6DoF object pose estimation. Estimating 3D rotation and translation from a single image has been widely investigated yet remains computationally demanding, limiting applicability in resource-constrained scenarios. FPG-NAS addresses this by proposing a specialized differentiable NAS approach for 6DoF pose estimation, featuring a task-specific search space and a differentiable gating mechanism that enables discrete multi-candidate operator selection, thus improving architectural diversity. Additionally, a FLOPs regularization term ensures a balanced trade-off between accuracy and efficiency. The framework explores a vast search space of approximately 10\textsuperscript{92} possible architectures. Experiments on the LINEMOD and SPEED+ datasets demonstrate that FPG-NAS-derived models outperform previous methods under strict FLOPs constraints. To the best of our knowledge, FPG-NAS is the first differentiable NAS framework specifically designed for 6DoF object pose estimation.
8. Oxide Interface-Based Polymorphic Electronic Devices for Neuromorphic Computing
Authors: Soumen Pradhan, Kirill Miller, Fabian Hartmann, Merit Spring, Judith Gabel, Berengar Leikert, Silke Kuhn, Martin Kamp, Victor Lopez-Richard, Michael Sing, Ralph Claessen, Sven HΓΆfling β€’ Published: 2025-08-05 β€’ Source: arXiv
Aside from recent advances in artificial intelligence (AI) models, specialized AI hardware is crucial to address large volumes of unstructured and dynamic data. Hardware-based AI, built on conventional complementary metal-oxidesemiconductor (CMOS)-technology, faces several critical challenges including scaling limitation of devices [1, 2], separation of computation and memory units [3] and most importantly, overall system energy efficiency [4]. While numerous materials with emergent functionalities have been proposed to overcome these limitations, scalability, reproducibility, and compatibility remain critical obstacles [5, 6]. Here, we demonstrate oxide-interface based polymorphic electronic devices with programmable transistor, memristor, and memcapacitor functionalities by manipulating the quasi-two-dimensional electron gas in LaAlO3/SrTiO3 heterostructures [7, 8] using lateral gates. A circuit utilizing two polymorphic functionalities of transistor and memcapacitor exhibits nonlinearity and short-term memory, enabling implementation in physical reservoir computing. An integrated circuit incorporating transistor and memristor functionalities is utilized for the transition from short- to long-term synaptic plasticity and for logic operations, along with in-situ logic output storage. The same circuit with advanced reconfigurable synaptic logic operations presents high-level multi-input decision-making tasks, such as patient-monitoring in healthcare applications. Our findings pave the way for oxide-based monolithic integrated circuits in a scalable, silicon compatible, energy efficient single platform, advancing both the polymorphic and neuromorphic computings.
9. Theatre in the Loop: A Rehearsal-Based, Collaborative Workflow for Expressive Robotic Behaviours
Authors: Pavlos Panagiotidis, Victor Zhi Heung Ngo, Sean Myatt, Roma Patel, Rachel Ramchurn, Alan Chamberlain, Ayse Kucukyilmaz β€’ Published: 2025-08-05 β€’ Source: arXiv
In this paper, we propose theatre-in-the-loop, a framework for developing expressive robot behaviours tailored to artistic performance through a director-guided puppeteering workflow. Leveraging theatrical methods, we use narrative objectives to direct a puppeteer in generating improvised robotic gestures that convey specific emotions. These improvisations are captured and curated to build a dataset of reusable movement templates for standalone playback in future autonomous performances. Initial trials demonstrate the feasibility of this approach, illustrating how the workflow enables precise sculpting of robotic gestures into coherent emotional arcs while revealing challenges posed by the robot's mechanical constraints. We argue that this practice-led framework provides a model for interdisciplinary teams creating socially expressive robot behaviours, contributing to (1) theatre as an interactive training ground for human-robot interaction and (2) co-creation methodologies between humans and machines.
10. EditGarment: An Instruction-Based Garment Editing Dataset Constructed with Automated MLLM Synthesis and Semantic-Aware Evaluation
Authors: Deqiang Yin, Junyi Guo, Huanda Lu, Fangyu Wu, Dongming Lu β€’ Published: 2025-08-05 β€’ Source: arXiv
Instruction-based garment editing enables precise image modifications via natural language, with broad applications in fashion design and customization. Unlike general editing tasks, it requires understanding garment-specific semantics and attribute dependencies. However, progress is limited by the scarcity of high-quality instruction-image pairs, as manual annotation is costly and hard to scale. While MLLMs have shown promise in automated data synthesis, their application to garment editing is constrained by imprecise instruction modeling and a lack of fashion-specific supervisory signals. To address these challenges, we present an automated pipeline for constructing a garment editing dataset. We first define six editing instruction categories aligned with real-world fashion workflows to guide the generation of balanced and diverse instruction-image triplets. Second, we introduce Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and provides reliable supervision during construction. Using this pipeline, we construct a total of 52,257 candidate triplets and retain 20,596 high-quality triplets to build EditGarment, the first instruction-based dataset tailored to standalone garment editing. The project page is https://yindq99.github.io/EditGarment-project/.
11. ParticleSAM: Small Particle Segmentation for Material Quality Monitoring in Recycling Processes
Authors: Yu Zhou, Pelle Thielmann, Ayush Chamoli, Bruno Mirbach, Didier Stricker, Jason Rambach β€’ Published: 2025-08-05 β€’ Source: arXiv
The construction industry represents a major sector in terms of resource consumption. Recycled construction material has high reuse potential, but quality monitoring of the aggregates is typically still performed with manual methods. Vision-based machine learning methods could offer a faster and more efficient solution to this problem, but existing segmentation methods are by design not directly applicable to images with hundreds of small particles. In this paper, we propose ParticleSAM, an adaptation of the segmentation foundation model to images with small and dense objects such as the ones often encountered in construction material particles. Moreover, we create a new dense multi-particle dataset simulated from isolated particle images with the assistance of an automated data generation and labeling pipeline. This dataset serves as a benchmark for visual material quality control automation while our segmentation approach has the potential to be valuable in application areas beyond construction where small-particle segmentation is needed. Our experimental results validate the advantages of our method by comparing to the original SAM method both in quantitative and qualitative experiments.
12. VQA support to Arabic Language Learning Educational Tool
Authors: Khaled Bachir Delassi, Lakhdar Zeggane, Hadda Cherroun, Abdelhamid Haouhat, Kaoutar Bouzouad β€’ Published: 2025-08-05 β€’ Source: arXiv
We address the problem of scarcity of educational Arabic Language Learning tools that advocate modern pedagogical models such as active learning which ensures language proficiency. In fact, we investigate the design and evaluation of an AI-powered educational tool designed to enhance Arabic language learning for non-native speakers with beginner-to-intermediate proficiency level. The tool leverages advanced AI models to generate interactive visual quizzes, deploying Visual Question Answering as the primary activity. Adopting a constructivist learning approach, the system encourages active learning through real-life visual quizzes, and image-based questions that focus on improving vocabulary, grammar, and comprehension. The system integrates Vision-Language Pretraining models to generate contextually relevant image description from which Large Language Model generate assignments based on customized Arabic language Learning quizzes thanks to prompting. The effectiveness of the tool is evaluated through a manual annotated benchmark consisting of 1266 real-life visual quizzes, with human participants providing feedback. The results show a suitable accuracy rates, validating the tool's potential to bridge the gap in Arabic language education and highlighting the tool's promise as a reliable, AI-powered resource for Arabic learners, offering personalized and interactive learning experiences.
13. When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models
Authors: Dasol Choi Jihwan Lee, Minjae Lee, Minsuk Kahng β€’ Published: 2025-08-05 β€’ Source: arXiv
While prior research on text-to-image generation has predominantly focused on biases in human depictions, we investigate a more subtle yet pervasive phenomenon: demographic bias in generated objects (e.g., cars). We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring such biases. Our approach compares visual attributes of objects generated with demographic cues (e.g., "for young people'') to those from neutral prompts, across 2,700 images produced by three state-of-the-art models (GPT Image-1, Imagen 4, and Stable Diffusion) in five object categories. Through a comprehensive analysis, we uncover strong associations between specific demographic groups and visual attributes, such as recurring color patterns prompted by gender or ethnicity cues. These patterns reflect and reinforce not only well-known stereotypes but also more subtle and unintuitive biases. We also observe that some models generate less diverse outputs, which in turn amplifies the visual disparities compared to neutral prompts. Our proposed auditing framework offers a practical approach for testing, revealing how stereotypes still remain embedded in today's generative models. We see this as an essential step toward more systematic and responsible AI development.
14. On the Evaluation of Large Language Models in Multilingual Vulnerability Repair
Authors: Dong wang, Junji Yu, Honglin Shu, Michael Fu, Chakkrit Tantithamthavorn, Yasutaka Kamei, Junjie Chen β€’ Published: 2025-08-05 β€’ Source: arXiv
Various Deep Learning-based approaches with pre-trained language models have been proposed for automatically repairing software vulnerabilities. However, these approaches are limited to a specific programming language (C/C++). Recent advances in large language models (LLMs) offer language-agnostic capabilities and strong semantic understanding, exhibiting potential to overcome multilingual vulnerability limitations. Although some work has begun to explore LLMs' repair performance, their effectiveness is unsatisfactory. To address these limitations, we conducted a large-scale empirical study to investigate the performance of automated vulnerability repair approaches and state-of-the-art LLMs across seven programming languages. Results show GPT-4o, instruction-tuned with few-shot prompting, performs competitively against the leading approach, VulMaster. Additionally, the LLM-based approach shows superior performance in repairing unique vulnerabilities and is more likely to repair the most dangerous vulnerabilities. Instruction-tuned GPT-4o demonstrates strong generalization on vulnerabilities in previously unseen language, outperforming existing approaches. Analysis shows Go consistently achieves the highest effectiveness across all model types, while C/C++ performs the worst. Based on findings, we discuss the promise of LLM on multilingual vulnerability repair and the reasons behind LLM's failed cases. This work takes the first look at repair approaches and LLMs across multiple languages, highlighting the promising future of adopting LLMs for multilingual vulnerability repair.
15. Learning to Incentivize: LLM-Empowered Contract for AIGC Offloading in Teleoperation
Authors: Zijun Zhan, Yaxian Dong, Daniel Mawunyo Doe, Yuqing Hu, Shuai Li, Shaohua Cao, Zhu Han β€’ Published: 2025-08-05 β€’ Source: arXiv
With the rapid growth in demand for AI-generated content (AIGC), edge AIGC service providers (ASPs) have become indispensable. However, designing incentive mechanisms that motivate ASPs to deliver high-quality AIGC services remains a challenge, especially in the presence of information asymmetry. In this paper, we address bonus design between a teleoperator and an edge ASP when the teleoperator cannot observe the ASP's private settings and chosen actions (diffusion steps). We formulate this as an online learning contract design problem and decompose it into two subproblems: ASP's settings inference and contract derivation. To tackle the NP-hard setting-inference subproblem with unknown variable sizes, we introduce a large language model (LLM)-empowered framework that iteratively refines a naive seed solver using the LLM's domain expertise. Upon obtaining the solution from the LLM-evolved solver, we directly address the contract derivation problem using convex optimization techniques and obtain a near-optimal contract. Simulation results on our Unity-based teleoperation platform show that our method boosts the teleoperator's utility by $5 \sim 40\%$ compared to benchmarks, while preserving positive incentives for the ASP. The code is available at https://github.com/Zijun0819/llm4contract.
16. Reconstructing the Probability Measure of a Curie-Weiss Model Observing the Realisations of a Subset of Spins
Authors: Miguel Ballesteros, Ivan Naumkin, Gabor Toth β€’ Published: 2025-08-05 β€’ Source: arXiv
We study the problem of reconstructing the probability measure of the Curie-Weiss model from a sample of the voting behaviour of a subset of the population. While originally used to study phase transitions in statistical mechanics, the Curie-Weiss or mean-field model has been applied to study phenomena, where many agents interact with each other. It is useful to measure the degree of social cohesion in social groups, which manifests in the way the members of the group influence each others' decisions. In practice, statisticians often only have access to survey data from a representative subset of a population. As such, it is useful to provide methods to estimate social cohesion from such data. The estimators we study have some positive properties, such as consistency, asymptotic normality, and large deviation principles. The main advantages are that they require only a sample of votes belonging to a (possibly very small) subset of the population and have a low computational cost. Due to the wide application of models such as Curie-Weiss, these estimators are potentially useful in disciplines such as political science, sociology, automated voting, and preference aggregation.
17. Data Overdose? Time for a Quadruple Shot: Knowledge Graph Construction using Enhanced Triple Extraction
Authors: Taine J. Elliott, Stephen P. Levitt, Ken Nixon, Martin Bekker β€’ Published: 2025-08-05 β€’ Source: arXiv
The rapid expansion of publicly-available medical data presents a challenge for clinicians and researchers alike, increasing the gap between the volume of scientific literature and its applications. The steady growth of studies and findings overwhelms medical professionals at large, hindering their ability to systematically review and understand the latest knowledge. This paper presents an approach to information extraction and automatic knowledge graph (KG) generation to identify and connect biomedical knowledge. Through a pipeline of large language model (LLM) agents, the system decomposes 44 PubMed abstracts into semantically meaningful proposition sentences and extracts KG triples from these sentences. The triples are enhanced using a combination of open domain and ontology-based information extraction methodologies to incorporate ontological categories. On top of this, a context variable is included during extraction to allow the triple to stand on its own - thereby becoming `quadruples'. The extraction accuracy of the LLM is validated by comparing natural language sentences generated from the enhanced triples to the original propositions, achieving an average cosine similarity of 0.874. The similarity for generated sentences of enhanced triples were compared with generated sentences of ordinary triples showing an increase as a result of the context variable. Furthermore, this research explores the ability for LLMs to infer new relationships and connect clusters in the knowledge base of the knowledge graph. This approach leads the way to provide medical practitioners with a centralised, updated in real-time, and sustainable knowledge source, and may be the foundation of similar gains in a wide variety of fields.
18. AI on the Pulse: Real-Time Health Anomaly Detection with Wearable and Ambient Intelligence
Authors: Davide Gabrielli, Bardh Prenkaj, Paola Velardi, Stefano Faralli β€’ Published: 2025-08-05 β€’ Source: arXiv
We introduce AI on the Pulse, a real-world-ready anomaly detection system that continuously monitors patients using a fusion of wearable sensors, ambient intelligence, and advanced AI models. Powered by UniTS, a state-of-the-art (SoTA) universal time-series model, our framework autonomously learns each patient's unique physiological and behavioral patterns, detecting subtle deviations that signal potential health risks. Unlike classification methods that require impractical, continuous labeling in real-world scenarios, our approach uses anomaly detection to provide real-time, personalized alerts for reactive home-care interventions. Our approach outperforms 12 SoTA anomaly detection methods, demonstrating robustness across both high-fidelity medical devices (ECG) and consumer wearables, with a ~ 22% improvement in F1 score. However, the true impact of AI on the Pulse lies in @HOME, where it has been successfully deployed for continuous, real-world patient monitoring. By operating with non-invasive, lightweight devices like smartwatches, our system proves that high-quality health monitoring is possible without clinical-grade equipment. Beyond detection, we enhance interpretability by integrating LLMs, translating anomaly scores into clinically meaningful insights for healthcare professionals.
19. Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
Authors: Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan β€’ Published: 2025-08-05 β€’ Source: arXiv
Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks. Especially, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
20. A Closed-Loop Multi-Agent Framework for Aerodynamics-Aware Automotive Styling Design
Authors: Xinyu Jin, Shengmao Yan, Qingtao Wang, Shisong Deng, Yanzhen Jiang, Shuangyao Zhao β€’ Published: 2025-08-05 β€’ Source: arXiv
The core challenge in automotive exterior design is balancing subjective aesthetics with objective aerodynamic performance while dramatically accelerating the development cycle. To address this, we propose a novel, LLM-driven multi-agent framework that automates the end-to-end workflow from ambiguous requirements to 3D concept model performance validation. The workflow is structured in two stages: conceptual generation and performance validation. In the first stage, agents collaborate to interpret fuzzy design requirements, generate concept sketches, and produce photorealistic renderings using diffusion models. In the second stage, the renderings are converted to 3D point clouds, where a Drag Prediction Agent, built upon a lightweight surrogate model, provides near-instantaneous predictions of the drag coefficient and pressure fields, replacing time-consuming CFD simulations. The primary contribution of this work is the seamless integration of creative generation with a rapid engineering validation loop within a unified, automated system, which provides a new paradigm for efficiently balancing creative exploration with engineering constraints in the earliest stages of design.
21. When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Authors: Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin β€’ Published: 2025-08-05 β€’ Source: arXiv
As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under the rigorous StrongREJECT, LlamaGuard, as well as Human Evaluation safety evaluation framework, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.
22. Remini: Leveraging Chatbot-Mediated Mutual Reminiscence for Promoting Positive Affect and Feeling of Connectedness among Loved Ones
Authors: Zhuoqun Jiang, ShunYi Yeo, Wei Xuan Donovan Seow, Simon Perrault β€’ Published: 2025-08-05 β€’ Source: arXiv
Mutual reminiscence, defined as revisiting shared positive memories through reciprocal self-disclosure, strengthens emotional bonds, enhances well-being, and deepens intimacy. However, most technology-mediated reminiscence tools emphasize individual reflection or one-way storytelling, which overlooks the dynamic, interactive dialogue essential for meaningful mutual reminiscence. To address this limitation, we introduce Remini, a chatbot designed to support reciprocal self-disclosure between close partners such as couples, friends, or family members. Grounded in the Social Functions of Autobiographical Memory (SFAM) framework, Remini uses conversational AI to guide emotionally rich exchanges through five narrative phases: rapport building, memory narration, elaboration, reflection, and summary. In a mixed-method, both between- and within- subjects study (N = 48, 24 dyads), we compare Remini to a baseline chatbot that offers minimal memory-trigger prompts. Our findings show that structured guidance from Remini significantly improves positive affect, feeling of connection, and engagement. It also fosters more detailed narrative co-construction and greater reciprocal self-disclosure. Participant feedback highlights the practical value, perceived benefits, and design considerations of chatbot-mediated reminiscence. We contribute empirically grounded design implications for conversational agents that strengthen human connection through mutual reminiscence.
23. Adaptive AI Agent Placement and Migration in Edge Intelligence Systems
Authors: Xingdan Wang, Jiayi He, Zhiqing Tang, Jianxiong Guo, Jiong Lou, Liping Qian, Tian Wang, Weijia Jia β€’ Published: 2025-08-05 β€’ Source: arXiv
The rise of LLMs such as ChatGPT and Claude fuels the need for AI agents capable of real-time task handling. However, migrating data-intensive, multi-modal edge workloads to cloud data centers, traditionally used for agent deployment, introduces significant latency. Deploying AI agents at the edge improves efficiency and reduces latency. However, edge environments present challenges due to limited and heterogeneous resources. Maintaining QoS for mobile users necessitates agent migration, which is complicated by the complexity of AI agents coordinating LLMs, task planning, memory, and external tools. This paper presents the first systematic deployment and management solution for LLM-based AI agents in dynamic edge environments. We propose a novel adaptive framework for AI agent placement and migration in edge intelligence systems. Our approach models resource constraints and latency/cost, leveraging ant colony algorithms and LLM-based optimization for efficient decision-making. It autonomously places agents to optimize resource utilization and QoS and enables lightweight agent migration by transferring only essential state. Implemented on a distributed system using AgentScope and validated across globally distributed edge servers, our solution significantly reduces deployment latency and migration costs.
24. Key-Augmented Neural Triggers for Knowledge Sharing
Authors: Alex Wolf, Marco Edoardo Palma, Pooja Rani, Harald C. Gall β€’ Published: 2025-08-05 β€’ Source: arXiv
Repository-level code comprehension and knowledge sharing remain core challenges in software engineering. Large language models (LLMs) have shown promise by generating explanations of program structure and logic. However, these approaches still face limitations: First, relevant knowledge is distributed across multiple files within a repository, aka semantic fragmentation. Second, retrieval inefficiency and attention saturation degrade performance in RAG pipelines, where long, unaligned contexts overwhelm attention. Third, repository specific training data is scarce and often outdated. Finally, proprietary LLMs hinder industrial adoption due to privacy and deployment constraints. To address these issues, we propose Key-Augmented Neural Triggers (KANT), a novel approach that embeds knowledge anchors into both training and inference. Unlike prior methods, KANT enables internal access to repository specific knowledge, reducing fragmentation and grounding inference in localized context. Moreover, we synthesize specialized data directly from code. At inference, knowledge anchors replace verbose context, reducing token overhead and latency while supporting efficient, on premise deployment. We evaluate KANT via: a qualitative human evaluation of the synthesized dataset's intent coverage and quality across five dimensions; compare against SOTA baselines across five qualitative dimensions and inference speed; and replication across different LLMs to assess generalizability. Results show that the synthetic training data aligned with information-seeking needs. KANT achieved over 60% preference from human annotators and a LocalStack expert (preferring 79% of cases). Also, KANT reduced inference latency by up to 85% across all models. Overall, it is well-suited for scalable, low-latency, on-premise deployments, providing a strong foundation for code comprehension.
25. CTTS: Collective Test-Time Scaling
Authors: Zhende Song, Shengji Tang, Peng Ye, Jiayuan Fan, Tao Chen β€’ Published: 2025-08-05 β€’ Source: arXiv
Test-time scaling (TTS) has emerged as a promising research field for enhancing the effectiveness of large language models (LLMs) without extra training. However, most existing approaches, e.g., Best-of-N and Self-Consistency rely on a single agent interacting with a reward model (SA-SR), constrained by limited capabilities of a single test-time scaling (STTS) paradigm. On the other hand, recent works demonstrate that collective-agent methods can break through the upper bound of single-agent systems by orchestrating diverse models. Thus, in this paper, we take a first step towards exploring Collective Test-Time Scaling (CTTS). Consider the different interaction types of single and multiple models, we design three primary paradigms to investigate the optimal paradigm of CTTS: (1) single agent to multiple reward models (SA-MR); (2) multiple agents to single reward model (MA-SR); and (3) multiple agents to multiple reward models (MA-MR). Extensive experiments demonstrate that MA-MR consistently achieves the best performance. Based on this, we propose a novel framework named CTTS-MM that effectively leverages both multi-agent and multi-reward-model collaboration for enhanced inference. Specifically, for multi-agent collaboration, we propose an Agent Collaboration Search (ACS), which searches for the most effective combination of LLM agents from a large candidate pool; for multi-reward-model collaboration, we propose Mixture of Reword Models (MoR), which consists of a curated question pool and a Prior Reward model Ensemble Selection (PRES) to select the optimal combinations of reward models via Pair-wise Reward Ranking (PRR) metric. Experiments across seven mainstream benchmarks demonstrate that the proposed CTTS-MM consistently obtains superior performance. Code will be released at https://github.com/magent4aci/CTTS-MM.
26. Can Large Language Models Identify Materials from Radar Signals?
Authors: Jiangyou Zhu, Hongyu Deng, He Chen β€’ Published: 2025-08-05 β€’ Source: arXiv
Accurately identifying the material composition of objects is a critical capability for AI robots powered by large language models (LLMs) to perform context-aware manipulation. Radar technologies offer a promising sensing modality for material recognition task. When combined with deep learning, radar technologies have demonstrated strong potential in identifying the material of various objects. However, existing radar-based solutions are often constrained to closed-set object categories and typically require task-specific data collection to train deep learning models, largely limiting their practical applicability. This raises an important question: Can we leverage the powerful reasoning capabilities of pre-trained LLMs to directly infer material composition from raw radar signals? Answering this question is non-trivial due to the inherent redundancy of radar signals and the fact that pre-trained LLMs have no prior exposure to raw radar data during training. To address this, we introduce LLMaterial, the first study to investigate the feasibility of using LLM to identify materials directly from radar signals. First, we introduce a physics-informed signal processing pipeline that distills high-redundancy radar raw data into a set of compact intermediate parameters that encapsulate the material's intrinsic characteristics. Second, we adopt a retrieval-augmented generation (RAG) strategy to provide the LLM with domain-specific knowledge, enabling it to interpret and reason over the extracted intermediate parameters. Leveraging this integration, the LLM is empowered to perform step-by-step reasoning on the condensed radar features, achieving open-set material recognition directly from raw radar signals. Preliminary results show that LLMaterial can effectively distinguish among a variety of common materials, highlighting its strong potential for real-world material identification applications.
27. Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation
Authors: Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu β€’ Published: 2025-08-05 β€’ Source: arXiv
Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A \renyi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available: https://github.com/wang2226/PAD.
28. ADSeeker: A Knowledge-Infused Framework for Anomaly Detection and Reasoning
Authors: Kai Zhang, Zekai Zhang, Xihe Sun, Jingmeng Nie, Qinghui Chen, Han Hao, Jianyuan Guo, Jinglin Zhang β€’ Published: 2025-08-05 β€’ Source: arXiv
Automatic vision inspection holds significant importance in industry inspection. While multimodal large language models (MLLMs) exhibit strong language understanding capabilities and hold promise for this task, their performance remains significantly inferior to that of human experts. In this context, we identify two key challenges: (i) insufficient integration of anomaly detection (AD) knowledge during pre-training, and (ii) the lack of technically precise and conte-aware language generation for anomaly reasoning. To address these issues, we propose ADSeeker, an anomaly task assistant designed to enhance inspection performance through knowledge-grounded reasoning. ADSeeker leverages a curated visual document knowledge base, SEEK-MVTec&VisA (SEEK-M&V), which we construct to address the limitations of existing resources that rely solely on unstructured text. SEEK-M&V includes semantic-rich descriptions and image-document pairs, enabling more comprehensive anomaly understanding. To effectively retrieve and utilize this knowledge, we introduce the Query Image-Knowledge Retrieval-Augmented Generation (Q2K RAG) framework. To further enhance the performance in zero-shot anomaly detection (ZSAD), ADSeeker leverages the Hierarchical Sparse Prompt mechanism and type-level features to efficiently extract anomaly patterns. Furthermore, to tackle the challenge of limited in industry anomaly detection (IAD) data, we introduce the largest-scale AD dataset, Multi-type Anomaly (MulA), encompassing 72 multi-scale defect types across 26 Categories. Extensive experiments show that our plug-and-play framework, ADSeeker, achieves state-of-the-art zero-shot performance on several benchmark datasets.
29. Highlight & Summarize: RAG without the jailbreaks
Authors: Giovanni Cherubin, Andrew Paverd β€’ Published: 2025-08-04 β€’ Source: arXiv
Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. For example, when interacting with a chatbot, malicious users can input specially crafted prompts to cause the LLM to generate undesirable content or perform a completely different task from its intended purpose. Existing mitigations for such attacks typically rely on hardening the LLM's system prompt or using a content classifier trained to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. In this paper, we present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user's question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user's question and extracts relevant passages ("highlights") from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe several possible instantiations of H&S and evaluate their generated responses in terms of correctness, relevance, and response quality. Surprisingly, when using an LLM-based highlighter, the majority of H&S responses are judged to be better than those of a standard RAG pipeline.
30. Decomposed Reasoning with Reinforcement Learning for Relevance Assessment in UGC Platforms
Authors: Xiaowei Yuan, Lei Jin, Haoxin Zhang, Yan Gao, Yi Wu, Yao Hu, Ziyang Huang, Jun Zhao, Kang Liu β€’ Published: 2025-08-04 β€’ Source: arXiv
Retrieval-augmented generation (RAG) plays a critical role in user-generated content (UGC) platforms, but its effectiveness depends heavily on accurate relevance assessment of query-document pairs. Despite recent advances in applying large language models (LLMs) to relevance modeling, UGC platforms present unique challenges: 1) ambiguous user intent due to sparse user feedback in RAG scenarios, and 2) substantial noise introduced by informal and unstructured language. To address these issues, we propose the Reinforced Reasoning Model for Relevance Assessment (R3A), which introduces a decomposed reasoning framework over queries and candidate documents before scoring. R3A first leverages auxiliary high-ranked documents within the platform to infer latent query intent. It then performs verbatim fragment extraction to justify relevance decisions, thereby reducing errors caused by noisy UGC. Based on a reinforcement learning framework, R3A is optimized to mitigate distortions arising from ambiguous queries and unstructured content. Experimental results show that R3A significantly outperforms existing baseline methods in terms of relevance accuracy, across both offline benchmarks and online experiments.
31. Simple Methods Defend RAG Systems Well Against Real-World Attacks
Authors: Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi, Brenda Curtis, Lyle H. Ungar, JoΓ£o Sedoc β€’ Published: 2025-08-04 β€’ Source: arXiv
Ensuring safety and in-domain responses for Retrieval-Augmented Generation (RAG) systems is paramount in safety-critical applications, yet remains a significant challenge. To address this, we evaluate four methodologies for Out-Of-Domain (OOD) query detection: GPT-4o, regression-based, Principal Component Analysis (PCA)-based, and Neural Collapse (NC), to ensure the RAG system only responds to queries confined to the system's knowledge base. Specifically, our evaluation explores two novel dimensionality reduction and feature separation strategies: \textit{PCA}, where top components are selected using explained variance or OOD separability, and an adaptation of \textit{Neural Collapse Feature Separation}. We validate our approach on standard datasets (StackExchange and MSMARCO) and real-world applications (Substance Use and COVID-19), including tests against LLM-simulated and actual attacks on a COVID-19 vaccine chatbot. Through human and LLM-based evaluations of response correctness and relevance, we confirm that an external OOD detector is crucial for maintaining response relevance.
32. Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning
Authors: Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu β€’ Published: 2025-08-04 β€’ Source: arXiv
Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.
33. CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
Authors: Xiaoya Li, Xiaofei Sun, Albert Wang, Chris Shum, Jiwei Li β€’ Published: 2025-08-04 β€’ Source: arXiv
Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN's effectiveness across six widely-used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and tied for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN's success reach well beyond ANNS optimization: It validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement.Code can be found at https://github.com/deepreinforce-ai/CRINN
34. Contextually Aware E-Commerce Product Question Answering using RAG
Authors: Praveen Tangarajan, Anand A. Rajasekar, Manish Rathi, Vinay Rao Dandin, Ozan Ersoy β€’ Published: 2025-08-04 β€’ Source: arXiv
E-commerce product pages contain a mix of structured specifications, unstructured reviews, and contextual elements like personalized offers or regional variants. Although informative, this volume can lead to cognitive overload, making it difficult for users to quickly and accurately find the information they need. Existing Product Question Answering (PQA) systems often fail to utilize rich user context and diverse product information effectively. We propose a scalable, end-to-end framework for e-commerce PQA using Retrieval Augmented Generation (RAG) that deeply integrates contextual understanding. Our system leverages conversational history, user profiles, and product attributes to deliver relevant and personalized answers. It adeptly handles objective, subjective, and multi-intent queries across heterogeneous sources, while also identifying information gaps in the catalog to support ongoing content improvement. We also introduce novel metrics to measure the framework's performance which are broadly applicable for RAG system evaluations.
35. SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
Authors: Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu β€’ Published: 2025-08-03 β€’ Source: arXiv
Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.