AI/LLM/Agent/Workflow Papers

📚 arXiv (35 papers)

1. LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation

Authors: Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu • Published: 2025-08-05 • Source: arXiv

Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.

🔗 View Paper 📄 PDF

2. MaLV-OS: Rethinking the Operating System Architecture for Machine Learning in Virtualized Clouds

Authors: Stella Bitchebe, Oana Balmau • Published: 2025-08-05 • Source: arXiv

A large body of research has employed Machine Learning (ML) models to develop learned operating systems (OSes) and kernels. The latter dynamically adapts to the job load and dynamically adjusts resources (CPU, IO, memory, network bandwidth) allocation to respond to the actual user demand. What this work has in common is that it utilizes ML to improve kernel decisions. To this day, and to the best of our knowledge, no work has taken the opposite direction, i.e., using OS to improve ML. While some work proposes applying system-level optimizations to ML algorithms, they do not tailor the OS to adapt to the ML context. To address this limitation, we take an orthogonal approach in this paper by leveraging the OS to enhance the performance of ML models and algorithms. We explore the path towards an ML-specialized OS, MaLV-OS. MaLV-OS rethinks the OS architecture to make it specifically tailored to ML workloads, especially in virtualized clouds, which are now widely used to run ML applications. MaLV-OS envisioned architecture includes (1) a micro-kernel, Micro-LAKE, which allows kernel space applications to use the GPU, and (2) an MLaaS (ML as a Service) subsystem that gathers ML models to help Micro-LAKE with memory management and CPU scheduling. MaLV-OS architecture also offloads system-sensitive parts of the models to the OS, to lighten the model complexity and programming, and speed up its execution. Finally, MaLV-OS integrates an open-source GPU virtualization software, merged directly into the hypervisor. For more flexibility, MaLV-OS vision is to enable the virtual machine to dynamically select MLaaS policies that can improve the performance of the model the user is running. Because MLaaS is designed as loadable kernel modules, the MaLV-OS architecture enables the dynamic addition of new capabilities to the MLaaS subsystem.

🔗 View Paper 📄 PDF

3. A New Approach to Partial Conjunction Analysis in Neuroimaging

Authors: Monitirtha Dey, Anna Vesely, Thorsten Dickhaus • Published: 2025-08-05 • Source: arXiv

The problem of identifying the brain regions activated through a particular cognitive task is pivotal in neuroimaging. This problem becomes even more complex if we have several cognitive tasks or several subjects. In this paper, we view this problem as a partial conjunction (PC) hypotheses testing problem, i.e., we are testing whether a specific brain region is activated in at least $\gamma$ (for some pre-fixed $\gamma$) subjects. We propose the application of a recent advance in the simultaneous statistical inference literature to activation localization in neuroimaging. We apply the recently proposed CoFilter method to neuroimaging data to discover brain regions activated in at least $\gamma$ subjects. Our proposal has two distinct advantages. First, it alleviates the conservativeness displayed by the traditional multiple testing procedures in testing PC hypotheses by eliminating many of the conservative PC $p$-values. Second, it is especially suitable for several high-dimensional studies, each of which examines a large number of null hypotheses. We also compare the performance of our proposal with existing methods for testing PC hypotheses through extensive simulation studies on neuroimaging data and a real dataset.

🔗 View Paper 📄 PDF

4. Graded chain conditions and graded Jacobson radical of groupoid graded modules

Authors: Zaqueu Cristiano, Wellington Marques de Souza, Javier Sánchez • Published: 2025-08-05 • Source: arXiv

In this work, we continue to lay the groundwork for the theory of groupoid graded rings and modules. The main topics we address include graded chain conditions, the graded Jacobson radical, and the gr-socle for graded modules. We present several descending (ascending) chain conditions for graded modules and we refer to the most general one as $\Gamma_0$-artinian ($\Gamma_0$-noetherian). We show that $\Gamma_0$-artinian (resp. $\Gamma_0$-noetherian) modules share many properties with artinian (noetherian) modules in the classical theory. However, we present an example of a right $\Gamma_0$-artinian ring that is not right $\Gamma_0$-noetherian. Following the pattern of the classical case, we examine the basic properties of the graded Jacobson radical and the gr-socle for groupoid graded modules. We also establish some fundamental properties of the graded Jacobson radical of groupoid graded rings. Finally, we introduce the notion of gr-semilocal ring, which simultaneously generalizes the concepts of semilocal ring and semilocal (small) category.

🔗 View Paper 📄 PDF

5. Efficient Morphology-Aware Policy Transfer to New Embodiments

Authors: Michael Przystupa, Hongyao Tang, Martin Jagersand, Santiago Miret, Mariano Phielipp, Matthew E. Taylor, Glen Berseth • Published: 2025-08-05 • Source: arXiv

Morphology-aware policy learning is a means of enhancing policy sample efficiency by aggregating data from multiple agents. These types of policies have previously been shown to help generalize over dynamic, kinematic, and limb configuration variations between agent morphologies. Unfortunately, these policies still have sub-optimal zero-shot performance compared to end-to-end finetuning on morphologies at deployment. This limitation has ramifications in practical applications such as robotics because further data collection to perform end-to-end finetuning can be computationally expensive. In this work, we investigate combining morphology-aware pretraining with parameter efficient finetuning (PEFT) techniques to help reduce the learnable parameters necessary to specialize a morphology-aware policy to a target embodiment. We compare directly tuning sub-sets of model weights, input learnable adapters, and prefix tuning techniques for online finetuning. Our analysis reveals that PEFT techniques in conjunction with policy pre-training generally help reduce the number of samples to necessary to improve a policy compared to training models end-to-end from scratch. We further find that tuning as few as less than 1% of total parameters will improve policy performance compared the zero-shot performance of the base pretrained a policy.

🔗 View Paper 📄 PDF

6. Maximally non-projective measurements are not always symmetric informationally complete

Authors: Gabriele Cobucci, Raphael Brinster, Shishir Khandelwal, Hermann Kampermann, Dagmar Bruß, Nikolai Wyderka, Armin Tavakoli • Published: 2025-08-05 • Source: arXiv

Whereas standard quantum measurements are projective, the most general notion of a measurement is represented by positive operator-valued measures (POVMs). It is therefore natural to consider how accurately an experimenter with access only to projective measurements and classical processing can simulate POVMs. The most well-known class of non-projective measurements is called symmetric informationally complete (SIC). Such measurements are both ubiquitous in the broader scope of quantum information theory and known to be the most strongly non-projective measurements in qubit systems. Here, we show that beyond qubit systems, the SIC property is in general not associated with the most non-projective measurement. For this, we put forward a semidefinite programming criterion for detecting genuinely non-projective measurements. This method allows us to determine quantitative simulability thresholds for generic POVMs and to put forward a conjecture on which qutrit and ququart measurements that are most strongly non-projective.

🔗 View Paper 📄 PDF

7. Probing the Gaps in ChatGPT Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired

Authors: Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Anhong Guo • Published: 2025-08-05 • Source: arXiv

Recent advancements in large multimodal models have provided blind or visually impaired (BVI) individuals with new capabilities to interpret and engage with the real world through interactive systems that utilize live video feeds. However, the potential benefits and challenges of such capabilities to support diverse real-world assistive tasks remain unclear. In this paper, we present findings from an exploratory study with eight BVI participants. Participants used ChatGPT's Advanced Voice with Video, a state-of-the-art live video AI released in late 2024, in various real-world scenarios, from locating objects to recognizing visual landmarks, across unfamiliar indoor and outdoor environments. Our findings indicate that current live video AI effectively provides guidance and answers for static visual scenes but falls short in delivering essential live descriptions required in dynamic situations. Despite inaccuracies in spatial and distance information, participants leveraged the provided visual information to supplement their mobility strategies. Although the system was perceived as human-like due to high-quality voice interactions, assumptions about users' visual abilities, hallucinations, generic responses, and a tendency towards sycophancy led to confusion, distrust, and potential risks for BVI users. Based on the results, we discuss implications for assistive video AI agents, including incorporating additional sensing capabilities for real-world use, determining appropriate intervention timing beyond turn-taking interactions, and addressing ecological and safety concerns.

🔗 View Paper 📄 PDF

8. Fast radio bursts as cosmic lightning

Authors: Parsa Kafashi, Sohrab Rahvar • Published: 2025-08-05 • Source: arXiv

We propose a new model for the origin of Fast Radio Bursts (FRBs), attributing these phenomena to sudden discharges of accumulated electric charge in the accretion disk of compact objects such as black holes. Our framework demonstrates how Compton scattering within the disk plasma generates charge separation, creating a capacitor-like system stabilized by the equilibrium between radiation pressure and electrostatic forces. We detail the discharge process through destabilizing mechanisms in this capacitor, resulting in radiative emission. We compare our model's prediction on radiation signatures with observational data, using FRB2018725A as an example to obtain key quantitative relationships. Additionally, we estimate the total charge buildup via Compton scattering for a stellar-mass black hole, constrained by the best-fit between our model and observations, and determine the corresponding electron density in the accretion disk for this mechanism to operate.

🔗 View Paper 📄 PDF

9. Demystifying Sequential Recommendations: Counterfactual Explanations via Genetic Algorithms

Authors: Domiziano Scarcelli, Filippo Betello, Giuseppe Perelli, Fabrizio Silvestri, Gabriele Tolomei • Published: 2025-08-05 • Source: arXiv

Sequential Recommender Systems (SRSs) have demonstrated remarkable effectiveness in capturing users' evolving preferences. However, their inherent complexity as "black box" models poses significant challenges for explainability. This work presents the first counterfactual explanation technique specifically developed for SRSs, introducing a novel approach in this space, addressing the key question: What minimal changes in a user's interaction history would lead to different recommendations? To achieve this, we introduce a specialized genetic algorithm tailored for discrete sequences and show that generating counterfactual explanations for sequential data is an NP-Complete problem. We evaluate these approaches across four experimental settings, varying between targeted-untargeted and categorized-uncategorized scenarios, to comprehensively assess their capability in generating meaningful explanations. Using three different datasets and three models, we are able to demonstrate that our methods successfully generate interpretable counterfactual explanation while maintaining model fidelity close to one. Our findings contribute to the growing field of Explainable AI by providing a framework for understanding sequential recommendation decisions through the lens of "what-if" scenarios, ultimately enhancing user trust and system transparency.

🔗 View Paper 📄 PDF

10. MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation

Authors: Wenlong Wu, Haofen Wang, Bohan Li, Peixuan Huang, Xinzhe Zhao, Lei Liang • Published: 2025-08-05 • Source: arXiv

Retrieval Augmented Generation (RAG) has emerged as a promising solution to address hallucination issues in Large Language Models (LLMs). However, the integration of multiple retrieval sources, while potentially more informative, introduces new challenges that can paradoxically exacerbate hallucination problems. These challenges manifest primarily in two aspects: the sparse distribution of multi-source data that hinders the capture of logical relationships and the inherent inconsistencies among different sources that lead to information conflicts. To address these challenges, we propose MultiRAG, a novel framework designed to mitigate hallucination in multi-source retrieval-augmented generation through knowledge-guided approaches. Our framework introduces two key innovations: (1) a knowledge construction module that employs multi-source line graphs to efficiently aggregate logical relationships across different knowledge sources, effectively addressing the sparse data distribution issue; and (2) a sophisticated retrieval module that implements a multi-level confidence calculation mechanism, performing both graph-level and node-level assessments to identify and eliminate unreliable information nodes, thereby reducing hallucinations caused by inter-source inconsistencies. Extensive experiments on four multi-domain query datasets and two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the reliability and efficiency of knowledge retrieval in complex multi-source scenarios. \textcolor{blue}{Our code is available in https://github.com/wuwenlong123/MultiRAG.

🔗 View Paper 📄 PDF

11. Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Authors: Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li, Yang Liu, Guanhua Chen • Published: 2025-08-05 • Source: arXiv

The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using large language models, a paradigm known as "LLMas-a-judge." However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and taskrelevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a lightweight and efficient framework for enhancing LLM-as-a-Judge alignment with human scoring, via internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer scoretoken logits and computing the expected score from a softmax-based distribution, with the LLM backbone kept frozen. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the effectiveness of our method.

🔗 View Paper 📄 PDF

12. Marito: Structuring and Building Open Multilingual Terminologies for South African NLP

Authors: Vukosi Marivate, Isheanesu Dzingirai, Fiskani Banda, Richard Lastrucci, Thapelo Sindane, Keabetswe Madumo, Kayode Olaleye, Abiodun Modupe, Unarine Netshifhefhe, Herkulaas Combrink, Mohlatlego Nakeng, Matome Ledwaba • Published: 2025-08-05 • Source: arXiv

The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. \emph{Marito} addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational \emph{Marito} dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. \emph{Marito} provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.

🔗 View Paper 📄 PDF

13. Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

Authors: Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel • Published: 2025-08-05 • Source: arXiv

Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.

🔗 View Paper 📄 PDF

14. BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice

Authors: Yuanpeng Li, Qi Long, Zhiyuan Yao, Jian Xu, Lintao Xie, Xu He, Lu Geng, Xin Han, Yueyan Chen, Wenbo Duan • Published: 2025-08-05 • Source: arXiv

As enterprise codebases continue to grow in scale and complexity, the volume of lint errors far exceeds engineers' manual remediation capacity, leading to continuous accumulation of technical debt and hindered development efficiency. This paper presents BitsAI-Fix, an automated lint error remediation workflow based on Large Language Models (LLMs), designed to address this critical challenge in industrial-scale environments. BitsAI-Fix employs tree-sitter for context expansion and generates search-and-replace format patches through specially trained LLMs, followed by lint scan re-verification to output final remediation results. Additionally, our approach introduces an innovative progressive reinforcement learning (RL) training strategy that can automatically acquire verifiable training data during the project cold-start phase and continuously iterate the model by collecting online samples through feedback after system deployment. Furthermore, we designed a targeted rule-based reward mechanism that combines format rewards and correctness rewards while penalizing redundant modifications. We also propose a "code diff matching" methodology to continuously track online effectiveness. In production deployment at ByteDance, our solution has supported over 5,000 engineers, resolved more than 12,000 static analysis issues, achieved approximately 85% remediation accuracy, with around 1,000 weekly active adopters. This work demonstrates the practical feasibility of LLM-based code remediation solutions in enterprise environments and serves as a reference for automated code fix in large-scale industrial scenarios.

🔗 View Paper 📄 PDF

15. A Genetic Algorithm Framework for Optimizing Three-Impulse Orbital Transfers with Poliastro Simulation

Authors: Phuc Hao Do, Tran Duc Le • Published: 2025-08-05 • Source: arXiv

Orbital maneuver planning is a critical aspect of mission design, aimed at minimizing propellant consumption, which is directly correlated with the total velocity change ($\Delta V$). While analytical solutions like the Hohmann and Bi-elliptic transfers offer optimal strategies for specific cases, they lack the flexibility for more general optimization problems. This paper presents a computational framework that couples a Genetic Algorithm (GA) with the Poliastro orbital mechanics library to autonomously discover fuel-optimal, three-impulse transfer trajectories between coplanar circular orbits. We validate this framework across two distinct scenarios: a low-energy transfer from Low Earth Orbit (LEO) to a Geostationary Orbit (GEO), and a high-energy transfer to a distant orbit with a radius 20 times that of LEO. Our results demonstrate the framework's remarkable adaptability. For the LEO-to-GEO transfer, the GA precisely converges to the classical Hohmann transfer, achieving an identical $\Delta V$ of 3853.96 m/s and validating the method's accuracy. Conversely, for the high-energy transfer, the GA identifies a superior Bi-elliptic trajectory that yields a significant $\Delta V$ saving of 213.47 m/s compared to the Hohmann transfer. This fuel efficiency, however, necessitates a trade-off, extending the mission duration from approximately 1 day to over 140 years. This work demonstrates an accessible and powerful toolchain for the rapid prototyping of optimal trajectories, showcasing how combining evolutionary algorithms with open-source libraries provides a robust method for solving complex astrodynamics problems and quantifying their critical design trade-offs.

🔗 View Paper 📄 PDF

16. An Auditable Agent Platform For Automated Molecular Optimisation

Authors: Atabey Ünlü, Phil Rohr, Ahmet Celebi • Published: 2025-08-05 • Source: arXiv

Drug discovery frequently loses momentum when data, expertise, and tools are scattered, slowing design cycles. To shorten this loop we built a hierarchical, tool using agent framework that automates molecular optimisation. A Principal Researcher defines each objective, a Database agent retrieves target information, an AI Expert generates de novo scaffolds with a sequence to molecule deep learning model, a Medicinal Chemist edits them while invoking a docking tool, a Ranking agent scores the candidates, and a Scientific Critic polices the logic. Each tool call is summarised and stored causing the full reasoning path to remain inspectable. The agents communicate through concise provenance records that capture molecular lineage, to build auditable, molecule centered reasoning trajectories and reuse successful transformations via in context learning. Three cycle research loops were run against AKT1 protein using five large language models. After ranking the models by mean docking score, we ran 20 independent scale ups on the two top performers. We then compared the leading LLMs' binding affinity results across three configurations, LLM only, single agent, and multi agent. Our results reveal an architectural trade off, the multi agent setting excelled at focused binding optimization, improving average predicted binding affinity by 31%. In contrast, single agent runs generated molecules with superior drug like properties at the cost of less potent binding scores. Unguided LLM runs finished fastest, yet their lack of transparent tool signals left the validity of their reasoning paths unverified. These results show that test time scaling, focused feedback loops and provenance convert general purpose LLMs into auditable systems for molecular design, and suggest that extending the toolset to ADMET and selectivity predictors could push research workflows further along the discovery pipeline.

🔗 View Paper 📄 PDF

17. Multi-Objective Infeasibility Diagnosis for Routing Problems Using Large Language Models

Authors: Kai Li, Ruihao Zheng, Xinye Hao, Zhenkun Wang • Published: 2025-08-05 • Source: arXiv

In real-world routing problems, users often propose conflicting or unreasonable requirements, which result in infeasible optimization models due to overly restrictive or contradictory constraints, leading to an empty feasible solution set. Existing Large Language Model (LLM)-based methods attempt to diagnose infeasible models, but modifying such models often involves multiple potential adjustments that these methods do not consider. To fill this gap, we introduce Multi-Objective Infeasibility Diagnosis (MOID), which combines LLM agents and multi-objective optimization within an automatic routing solver, to provide a set of representative actionable suggestions. Specifically, MOID employs multi-objective optimization to consider both path cost and constraint violation, generating a set of trade-off solutions, each encompassing varying degrees of model adjustments. To extract practical insights from these solutions, MOID utilizes LLM agents to generate a solution analysis function for the infeasible model. This function analyzes these distinct solutions to diagnose the original infeasible model, providing users with diverse diagnostic insights and suggestions. Finally, we compare MOID with several LLM-based methods on 50 types of infeasible routing problems. The results indicate that MOID automatically generates multiple diagnostic suggestions in a single run, providing more practical insights for restoring model feasibility and decision-making compared to existing methods.

🔗 View Paper 📄 PDF

18. Agentic AI in 6G Software Businesses: A Layered Maturity Model

Authors: Muhammad Zohaib, Muhammad Azeem Akbar, Sami Hyrynsalmi, Arif Ali Khan • Published: 2025-08-05 • Source: arXiv

The emergence of agentic AI systems in 6G software businesses presents both strategic opportunities and significant challenges. While such systems promise increased autonomy, scalability, and intelligent decision-making across distributed environments, their adoption raises concerns regarding technical immaturity, integration complexity, organizational readiness, and performance-cost trade-offs. In this study, we conducted a preliminary thematic mapping to identify factors influencing the adoption of agentic software within the context of 6G. Drawing on a multivocal literature review and targeted scanning, we identified 29 motivators and 27 demotivators, which were further categorized into five high-level themes in each group. This thematic mapping offers a structured overview of the enabling and inhibiting forces shaping organizational readiness for agentic transformation. Positioned as a feasibility assessment, the study represents an early phase of a broader research initiative aimed at developing and validating a layered maturity model grounded in CMMI model with the software architectural three dimensions possibly Data, Business Logic, and Presentation. Ultimately, this work seeks to provide a practical framework to help software-driven organizations assess, structure, and advance their agent-first capabilities in alignment with the demands of 6G.

🔗 View Paper 📄 PDF

19. Can We Fix Social Media? Testing Prosocial Interventions using Generative Social Simulation

Authors: Maik Larooij, Petter Törnberg • Published: 2025-08-05 • Source: arXiv

Social media platforms have been widely linked to societal harms, including rising polarization and the erosion of constructive debate. Can these problems be mitigated through prosocial interventions? We address this question using a novel method - generative social simulation - that embeds Large Language Models within Agent-Based Models to create socially rich synthetic platforms. We create a minimal platform where agents can post, repost, and follow others. We find that the resulting following-networks reproduce three well-documented dysfunctions: (1) partisan echo chambers; (2) concentrated influence among a small elite; and (3) the amplification of polarized voices - creating a 'social media prism' that distorts political discourse. We test six proposed interventions, from chronological feeds to bridging algorithms, finding only modest improvements - and in some cases, worsened outcomes. These results suggest that core dysfunctions may be rooted in the feedback between reactive engagement and network growth, raising the possibility that meaningful reform will require rethinking the foundational dynamics of platform architecture.

🔗 View Paper 📄 PDF

20. A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning

Authors: Michael K. Chen • Published: 2025-08-05 • Source: arXiv

General logical reasoning, defined as the ability to reason deductively on domain-agnostic tasks, continues to be a challenge for large language models (LLMs). Current LLMs fail to reason deterministically and are not interpretable. As such, there has been a recent surge in interest in neurosymbolic AI, which attempts to incorporate logic into neural networks. We first identify two main neurosymbolic approaches to improving logical reasoning: (i) the integrative approach comprising models where symbolic reasoning is contained within the neural network, and (ii) the hybrid approach comprising models where a symbolic solver, separate from the neural network, performs symbolic reasoning. Both contain AI systems with promising results on domain-specific logical reasoning benchmarks. However, their performance on domain-agnostic benchmarks is understudied. To the best of our knowledge, there has not been a comparison of the contrasting approaches that answers the following question: Which approach is more promising for developing general logical reasoning? To analyze their potential, the following best-in-class domain-agnostic models are introduced: Logic Neural Network (LNN), which uses the integrative approach, and LLM-Symbolic Solver (LLM-SS), which uses the hybrid approach. Using both models as case studies and representatives of each approach, our analysis demonstrates that the hybrid approach is more promising for developing general logical reasoning because (i) its reasoning chain is more interpretable, and (ii) it retains the capabilities and advantages of existing LLMs. To support future works using the hybrid approach, we propose a generalizable framework based on LLM-SS that is modular by design, model-agnostic, domain-agnostic, and requires little to no human input.

🔗 View Paper 📄 PDF

21. Taggus: An Automated Pipeline for the Extraction of Characters' Social Networks from Portuguese Fiction Literature

Authors: Tiago G Canário, Catarina Duarte, Flávio L. Pinheiro, João L. M. Pereira • Published: 2025-08-05 • Source: arXiv

Automatically identifying characters and their interactions from fiction books is, arguably, a complex task that requires pipelines that leverage multiple Natural Language Processing (NLP) methods, such as Named Entity Recognition (NER) and Part-of-speech (POS) tagging. However, these methods are not optimized for the task that leads to the construction of Social Networks of Characters. Indeed, the currently available methods tend to underperform, especially in less-represented languages, due to a lack of manually annotated data for training. Here, we propose a pipeline, which we call Taggus, to extract social networks from literary fiction works in Portuguese. Our results show that compared to readily available State-of-the-Art tools -- off-the-shelf NER tools and Large Language Models (ChatGPT) -- the resulting pipeline, which uses POS tagging and a combination of heuristics, achieves satisfying results with an average F1-Score of $94.1\%$ in the task of identifying characters and solving for co-reference and $75.9\%$ in interaction detection. These represent, respectively, an increase of $50.7\%$ and $22.3\%$ on results achieved by the readily available State-of-the-Art tools. Further steps to improve results are outlined, such as solutions for detecting relationships between characters. Limitations on the size and scope of our testing samples are acknowledged. The Taggus pipeline is publicly available to encourage development in this field for the Portuguese language.2

🔗 View Paper 📄 PDF

22. Compressing Chain-of-Thought in LLMs via Step Entropy

Authors: Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu • Published: 2025-08-05 • Source: arXiv

Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80\% of low-entropy intermediate steps can be pruned with minor degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed COTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures.

🔗 View Paper 📄 PDF

23. From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format

Authors: Mehdi Akbari Gurabi, Lasse Nitz, Radu-Mihai Castravet, Roman Matzutt, Avikarsha Mandal, Stefan Decker • Published: 2025-08-05 • Source: arXiv

Existing cybersecurity playbooks are often written in heterogeneous, non-machine-readable formats, which limits their automation and interoperability across Security Orchestration, Automation, and Response platforms. This paper explores the suitability of Large Language Models, combined with Prompt Engineering, to automatically translate legacy incident response playbooks into the standardized, machine-readable CACAO format. We systematically examine various Prompt Engineering techniques and carefully design prompts aimed at maximizing syntactic accuracy and semantic fidelity for control flow preservation. Our modular transformation pipeline integrates a syntax checker to ensure syntactic correctness and features an iterative refinement mechanism that progressively reduces syntactic errors. We evaluate the proposed approach on a custom-generated dataset comprising diverse legacy playbooks paired with manually created CACAO references. The results demonstrate that our method significantly improves the accuracy of playbook transformation over baseline models, effectively captures complex workflow structures, and substantially reduces errors. It highlights the potential for practical deployment in automated cybersecurity playbook transformation tasks.

🔗 View Paper 📄 PDF

24. Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science

Authors: Jiayan Nan, Wenquan Ma, Wenlong Wu, Yize Chen • Published: 2025-08-05 • Source: arXiv

Large Language Models (LLMs) demonstrate remarkable capabilities, yet their inability to maintain persistent memory in long contexts limits their effectiveness as autonomous agents in long-term interactions. While existing memory systems have made progress, their reliance on arbitrary granularity for defining the basic memory unit and passive, rule-based mechanisms for knowledge extraction limits their capacity for genuine learning and evolution. To address these foundational limitations, we present Nemori, a novel self-organizing memory architecture inspired by human cognitive principles. Nemori's core innovation is twofold: First, its Two-Step Alignment Principle, inspired by Event Segmentation Theory, provides a principled, top-down method for autonomously organizing the raw conversational stream into semantically coherent episodes, solving the critical issue of memory granularity. Second, its Predict-Calibrate Principle, inspired by the Free-energy Principle, enables the agent to proactively learn from prediction gaps, moving beyond pre-defined heuristics to achieve adaptive knowledge evolution. This offers a viable path toward handling the long-term, dynamic workflows of autonomous agents. Extensive experiments on the LoCoMo and LongMemEval benchmarks demonstrate that Nemori significantly outperforms prior state-of-the-art systems, with its advantage being particularly pronounced in longer contexts.

🔗 View Paper 📄 PDF

25. Industrial LLM-based Code Optimization under Regulation: A Mixture-of-Agents Approach

Authors: Mari Ashiga, Vardan Voskanyan, Fateme Dinmohammadi, Jingzhi Gong, Paul Brookes, Matthew Truscott, Rafail Giavrimis, Mike Basios, Leslie Kanthan, Wei Jie • Published: 2025-08-05 • Source: arXiv

Recent advancements in Large Language Models (LLMs) for code optimization have enabled industrial platforms to automate software performance engineering at unprecedented scale and speed. Yet, organizations in regulated industries face strict constraints on which LLMs they can use - many cannot utilize commercial models due to data privacy regulations and compliance requirements, creating a significant challenge for achieving high-quality code optimization while maintaining cost-effectiveness. We address this by implementing a Mixture-of-Agents (MoA) approach that directly synthesizes code from multiple specialized LLMs, comparing it against TurinTech AI's vanilla Genetic Algorithm (GA)-based ensemble system and individual LLM optimizers using real-world industrial codebases. Our key contributions include: (1) First MoA application to industrial code optimization using real-world codebases; (2) Empirical evidence that MoA excels with open-source models, achieving 14.3% to 22.2% cost savings and 28.6% to 32.2% faster optimization times for regulated environments; (3) Deployment guidelines demonstrating GA's advantage with commercial models while both ensembles outperform individual LLMs; and (4) Real-world validation across 50 code snippets and seven LLM combinations, generating over 8,700 variants, addresses gaps in industrial LLM ensemble evaluation. This provides actionable guidance for organizations balancing regulatory compliance with optimization performance in production environments.

🔗 View Paper 📄 PDF

26. GUI-ReRank: Enhancing GUI Retrieval with Multi-Modal LLM-based Reranking

Authors: Kristian Kolthoff, Felix Kretzer, Christian Bartelt, Alexander Maedche, Simone Paolo Ponzetto • Published: 2025-08-05 • Source: arXiv

GUI prototyping is a fundamental component in the development of modern interactive systems, which are now ubiquitous across diverse application domains. GUI prototypes play a critical role in requirements elicitation by enabling stakeholders to visualize, assess, and refine system concepts collaboratively. Moreover, prototypes serve as effective tools for early testing, iterative evaluation, and validation of design ideas with both end users and development teams. Despite these advantages, the process of constructing GUI prototypes remains resource-intensive and time-consuming, frequently demanding substantial effort and expertise. Recent research has sought to alleviate this burden through NL-based GUI retrieval approaches, which typically rely on embedding-based retrieval or tailored ranking models for specific GUI repositories. However, these methods often suffer from limited retrieval performance and struggle to generalize across arbitrary GUI datasets. In this work, we present GUI-ReRank, a novel framework that integrates rapid embedding-based constrained retrieval models with highly effective MLLM-based reranking techniques. GUI-ReRank further introduces a fully customizable GUI repository annotation and embedding pipeline, enabling users to effortlessly make their own GUI repositories searchable, which allows for rapid discovery of relevant GUIs for inspiration or seamless integration into customized LLM-based RAG workflows. We evaluated our approach on an established NL-based GUI retrieval benchmark, demonstrating that GUI-ReRank significantly outperforms SOTA tailored LTR models in both retrieval accuracy and generalizability. Additionally, we conducted a comprehensive cost and efficiency analysis of employing MLLMs for reranking, providing valuable insights regarding the trade-offs between retrieval effectiveness and computational resources. Video: https://youtu.be/_7x9UCh82ug

🔗 View Paper 📄 PDF

27. Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation

Authors: Zizhong Li, Haopeng Zhang, Jiawei Zhang • Published: 2025-08-05 • Source: arXiv

While large language models (LLMs) have achieved remarkable success in providing trustworthy responses for knowledge-intensive tasks, they still face critical limitations such as hallucinations and outdated knowledge. To address these issues, the retrieval-augmented generation (RAG) framework enhances LLMs with access to external knowledge via a retriever, enabling more accurate and real-time outputs about the latest events. However, this integration brings new security vulnerabilities: the risk that malicious content in the external database can be retrieved and used to manipulate model outputs. Although prior work has explored attacks on RAG systems, existing approaches either rely heavily on access to the retriever or fail to jointly consider both retrieval and generation stages, limiting their effectiveness, particularly in black-box scenarios. To overcome these limitations, we propose Token-level Precise Attack on the RAG (TPARAG), a novel framework that targets both white-box and black-box RAG systems. TPARAG leverages a lightweight white-box LLM as an attacker to generate and iteratively optimize malicious passages at the token level, ensuring both retrievability and high attack success in generation. Extensive experiments on open-domain QA datasets demonstrate that TPARAG consistently outperforms previous approaches in retrieval-stage and end-to-end attack effectiveness. These results further reveal critical vulnerabilities in RAG pipelines and offer new insights into improving their robustness.

🔗 View Paper 📄 PDF

28. SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting

Authors: Mohammed Ali, Abdelrahman Abdallah, Adam Jatowt • Published: 2025-08-05 • Source: arXiv

The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, requires high-quality, domain-specific question-answering (QA) datasets to excel at particular domains. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline for generating a comprehensive QA datasets from corporate sustainability reports and annual reports. Our approach integrates semantic chunk classification, a hybrid span extraction pipeline combining fine-tuned Named Entity Recognition (NER), rule-based methods, and LLM-driven refinement, alongside a specialized table-to-paragraph transformation. With over 195,000 diverse factoid and non-factoid QA pairs, SustainableQA is an effective resource for developing and benchmarking advanced knowledge assistants capable of navigating complex sustainability compliance

🔗 View Paper 📄 PDF

29. MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation

Authors: Haiyang Li • Published: 2025-08-05 • Source: arXiv

Large Language Models (LLMs) have demonstrated impressive capabilities in code generation. However, current evaluation datasets suffer from issues such as the lack of runnable test cases, deviation from the distribution of real-world code, and the ability to evaluate only the Python language. These limitations undermine the credibility of the evaluation results. To address these limitations, we introduce \textbf{MRG-Bench} (Multi-language Repository-level Code Generation Benchmark), a novel dataset that provides a more accurate evaluation of LLMs in practical repository-level code generation tasks. MRG-Bench has three main features: (1) practical data sourced from real-world code repositories that align to the practical distribution, (2) multiple programming languages support, including Python, Java, and Go, and (3) project-level runnable test cases to assess the quality of the generated code. Based on MRG-Bench, we conducted extensive experiments including large language models, long-context models, and RAG-related methods. These evaluation results demonstrate that \textbf{current repository-level code generation techniques suffer from significant performance deficiencies}. To further investigate why models fail, we designed novel experiments to annotate the underlying causes of generation errors. The results explicitly show that the majority of methods suffer from "\textbf{difficulty in understanding user requirements}," failing to comprehend their assigned tasks accurately. Moreover, the impact of different repository-level contexts on this issue exhibits significant disparities across different programming languages, suggesting that, in practice, specialized contextual information needs to be designed for different languages.

🔗 View Paper 📄 PDF

30. A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering

Authors: Ziruo Yi, Jinyu Liu, Ting Xiao, Mark V. Albert • Published: 2025-08-04 • Source: arXiv

Radiology visual question answering (RVQA) provides precise answers to questions about chest X-ray images, alleviating radiologists' workload. While recent methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have shown promising progress in RVQA, they still face challenges in factual accuracy, hallucinations, and cross-modal misalignment. We introduce a multi-agent system (MAS) designed to support complex reasoning in RVQA, with specialized agents for context understanding, multimodal reasoning, and answer validation. We evaluate our system on a challenging RVQA set curated via model disagreement filtering, comprising consistently hard cases across multiple MLLMs. Extensive experiments demonstrate the superiority and effectiveness of our system over strong MLLM baselines, with a case study illustrating its reliability and interpretability. This work highlights the potential of multi-agent approaches to support explainable and trustworthy clinical AI applications that require complex reasoning.

🔗 View Paper 📄 PDF

31. Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation

Authors: Kennedy Edemacu, Vinay M. Shashidhar, Micheal Tuape, Dan Abudu, Beakcheol Jang, Jong Wook Kim • Published: 2025-08-04 • Source: arXiv

Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is the PoisonedRAG in which the injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we propose a new property to uncover distinct properties to differentiate between adversarial and clean texts in the knowledge data source. Next, we employ this property to filter out adversarial texts from clean ones in the design of our proposed approaches. Evaluation of these methods using benchmark datasets demonstrate their effectiveness, with performances close to those of the original RAG systems.

🔗 View Paper 📄 PDF

32. Meta-RAG on Large Codebases Using Code Summarization

Authors: Vali Tawosia, Salwa Alamir, Xiaomo Liu, Manuela Veloso • Published: 2025-08-04 • Source: arXiv

Large Language Model (LLM) systems have been at the forefront of applied Artificial Intelligence (AI) research in a multitude of domains. One such domain is software development, where researchers have pushed the automation of a number of code tasks through LLM agents. Software development is a complex ecosystem, that stretches far beyond code implementation and well into the realm of code maintenance. In this paper, we propose a multi-agent system to localize bugs in large pre-existing codebases using information retrieval and LLMs. Our system introduces a novel Retrieval Augmented Generation (RAG) approach, Meta-RAG, where we utilize summaries to condense codebases by an average of 79.8\%, into a compact, structured, natural language representation. We then use an LLM agent to determine which parts of the codebase are critical for bug resolution, i.e. bug localization. We demonstrate the usefulness of Meta-RAG through evaluation with the SWE-bench Lite dataset. Meta-RAG scores 84.67 % and 53.0 % for file-level and function-level correct localization rates, respectively, achieving state-of-the-art performance.

🔗 View Paper 📄 PDF

33. ReMoMask: Retrieval-Augmented Masked Motion Generation

Authors: Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang • Published: 2025-08-04 • Source: arXiv

Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: Generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.

🔗 View Paper 📄 PDF

34. ASINT: Learning AS-to-Organization Mapping from Internet Metadata

Authors: Yongzhe Xu, Weitong Li, Eeshan Umrani, Taejoong Chung • Published: 2025-08-04 • Source: arXiv

Accurately mapping Autonomous Systems (ASNs) to their owning or operating organizations underpins Internet measurement research and security applications. Yet existing approaches commonly rely solely on WHOIS or PeeringDB, missing important relationships (e.g., cross-regional aliases, parent-child ownership) and failing to unify organizations scattered across different RIR identifiers. We introduce ASINT, an end-to-end pipeline that fuses bulk registry data with unstructured Web sources, then employs retrieval-augmented generation (RAG) to guide large language model (LLM) inference. Through a multi-stage procedure, ASINT merges ASNs into "organization families," capturing nuanced ties beyond the scope of simpler heuristics. ASINT maps 111,470 ASNs to 81,233 organization families; compared to both AS2ORG+ and AS-Sibling, ASINT identifies more cross-regional groupings (e.g., operator aliases, rebrands) that other datasets overlook. Moreover, our refined mappings enhance multiple security and measurement tasks: ASINT exposes 27.5% more intra-organizational RPKI misconfigurations, cuts false-positive hijack alarms by 9.4%, and lowers erroneous IP leasing inferences by 5.9%. Finally, ASINT supports periodic updates and cost-sensitive LLM selection, demonstrating that broader Web evidence can provide a more accurate, evolving view of the Internet's organizational structure.

🔗 View Paper 📄 PDF

35. Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction

Authors: Karan Reddy, Mayukha Pal • Published: 2025-08-04 • Source: arXiv

Standard transformer-based language models, while powerful for general text, often struggle with the fine-grained syntax and entity relationships in complex technical, engineering documents. To address this, we propose the Contextual Graph Transformer (CGT), a hybrid neural architecture that combines Graph Neural Networks (GNNs) and Transformers for domain-specific question answering. CGT constructs a dynamic graph over input tokens using sequential, skip-gram, and semantic similarity edges, which is processed by GATv2Conv layers for local structure learning. These enriched embeddings are then passed to a Transformer encoder to capture global dependencies. Unlike generic large models, technical domains often require specialized language models with stronger contextualization and structure awareness. CGT offers a parameter-efficient solution for such use cases. Integrated into a Retrieval-Augmented Generation (RAG) pipeline, CGT outperforms baselines like GPT-2 and BERT, achieving 24.7% higher accuracy than GPT-2 with 62.4% fewer parameters. This gain stems from CGTs ability to jointly model structural token interactions and long-range semantic coherence. The model is trained from scratch using a two-phase approach: pretraining on general text followed by fine-tuning on domain-specific manuals. This highlights CGTs adaptability to technical language, enabling better grounding, entity tracking, and retrieval-augmented responses in real-world applications.

🔗 View Paper 📄 PDF

🤖 AI Research Papers

🤖 AI-Generated Research Summary

Comprehensive Summary of 35 Research Papers on AI, LLMs, Agents, and Workflows

1. Key Research Trends

2. Breakthrough Findings

3. Methodological Approaches

4. Applications and Use Cases

5. Future Directions

Conclusion