🤖 AI Research Papers

August 08, 2025

🤖 AI-Generated Research Summary

Comprehensive Summary of Recent Research in AI, LLMs, Agents, and Workflows

This summary synthesizes key insights from 35 recent research papers spanning AI, large language models (LLMs), agents, and workflow automation. The analysis is structured to highlight key research trends, breakthrough findings, methodological approaches, applications and use cases, and future directions.


1. Key Research Trends

a. Rise of Multi-Agent and Collaborative AI Systems

b. Integration of LLMs with External Knowledge and Perception

c. Explainability, Transparency, and Bias Detection

d. Robustness, Continual Learning, and Adaptation

e. Benchmarking and Evaluation

f. Workflow Automation and OS-level Agents


2. Breakthrough Findings


3. Methodological Approaches


4. Applications and Use Cases


5. Future Directions


Conclusion

The reviewed papers collectively signal a shift towards more robust, interpretable, and agentic AI systems that can operate in complex, real-world environments. The integration of LLMs with external knowledge, the rise of multi-agent collaboration, and the focus on explainability and robustness are shaping the next generation of AI research and applications. For practitioners and researchers, these trends highlight the importance of designing AI systems that are not only powerful but also transparent, adaptable, and aligned with human values and workflows.

📚 arXiv (35 papers)
1. BEVCon: Advancing Bird's Eye View Perception with Contrastive Learning
Authors: Ziyang Leng, Jiawei Yang, Zhicheng Ren, Bolei Zhou • Published: 2025-08-06 • Source: arXiv
We present BEVCon, a simple yet effective contrastive learning framework designed to improve Bird's Eye View (BEV) perception in autonomous driving. BEV perception offers a top-down-view representation of the surrounding environment, making it crucial for 3D object detection, segmentation, and trajectory prediction tasks. While prior work has primarily focused on enhancing BEV encoders and task-specific heads, we address the underexplored potential of representation learning in BEV models. BEVCon introduces two contrastive learning modules: an instance feature contrast module for refining BEV features and a perspective view contrast module that enhances the image backbone. The dense contrastive learning designed on top of detection losses leads to improved feature representations across both the BEV encoder and the backbone. Extensive experiments on the nuScenes dataset demonstrate that BEVCon achieves consistent performance gains, achieving up to +2.4% mAP improvement over state-of-the-art baselines. Our results highlight the critical role of representation learning in BEV perception and offer a complementary avenue to conventional task-specific optimizations.
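BEVCon's instance-feature and perspective-view modules build on standard contrastive objectives. As a rough illustration of the underlying idea, not the authors' implementation, an InfoNCE-style loss over L2-normalized features can be sketched as:

```python
import numpy as np

def info_nce_loss(anchor, candidates, temperature=0.1):
    """Minimal InfoNCE-style contrastive loss.

    anchor: (d,) query feature. candidates: (n, d) features where row 0
    is the positive pair and the remaining rows are negatives. The
    temperature default is illustrative, not from the paper.
    """
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ a / temperature          # cosine similarities, scaled
    logits -= logits.max()                # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])              # cross-entropy on the positive
```

Minimizing this pulls the anchor toward its positive and pushes it away from negatives, which is the representation-learning pressure BEVCon adds on top of the detection losses.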
2. A colossal dielectric response of HfxZr1-xO2 nanoparticles
Authors: Oleksandr S. Pylypchuk, Victor V. Vainberg, Vladimir N. Poroshin, Oksana V. Leshchenko, Victor N. Pavlikov, Irina V. Kondakova, Serhii E. Ivanchenko, Lesya P. Yurchenko, Lesya Demchenko, Anna O. Diachenko, Myroslav V. Karpets, Mikhail P. Trubitsyn, Eugene A. Eliseev, Anna N. Morozovska • Published: 2025-08-06 • Source: arXiv
We reveal a colossal dielectric response of small (5 - 10 nm) oxygen-deficient HfxZr1-xO2 nanoparticles (x = 1 - 0.4), prepared by solid-state organonitrate synthesis. The effective dielectric permittivity of the pressed HfxZr1-xO2 nanopowders has a pronounced maximum at 38 - 88 C, whose shape can be fitted by a Curie-Weiss type dependence modified for the diffuse ferroelectric-paraelectric phase transition. The maximal value of the dielectric permittivity increases from 1.5*10^3 (for x = 1) to 1.5*10^5 (for x = 0.4) at low frequencies (~4 Hz); at high frequencies (~500 kHz) it is much smaller, changing from 7 (for x = 1) to 20 (for x = 0.4). The position of the dielectric permittivity maximum shows almost no frequency dispersion, while the shape and width of the maximum change in a complex way as frequency increases. The temperature dependences of the dielectric permittivity and resistivity are nearly mirror images of each other: all their features, such as the position and shape of maxima, plateaus, minima, and inflections, almost coincide after mirror reflection about the temperature axis. These correlations between resistivity and dielectric permittivity are well described by the Heywang barrier model combined with the variable-range hopping conduction model for semiconducting ferroelectrics. The ferroelectric-like behavior of the dielectric permittivity is explained by the Landau-Ginzburg-Devonshire approach and density functional theory calculations, which reveal that small HfxZr1-xO2 nanoparticles can become ferroelectric in the presence of oxygen vacancies. The obtained results may be useful for developing silicon-compatible ferroelectric nanomaterials based on HfxZr1-xO2 nanoparticles.
3. Modeling non-Newtonian fluids in a thin domain perforated with cylinders of small diameter
Authors: María Anguiano, Francisco J. Suárez-Grau • Published: 2025-08-06 • Source: arXiv
We consider the flow of a generalized Newtonian fluid through a thin porous medium of height $h_\varepsilon$ perforated with $\varepsilon$-periodically distributed solid cylinders of very small diameter $\varepsilon\delta_\varepsilon$, where the small parameters $\varepsilon, \delta_\varepsilon$ and $h_\varepsilon$ are devoted to tend to zero. We assume that the fluid is described by the 3D incompressible Stokes system with a non-linear power law viscosity of flow index $1
4. MienCap: Realtime Performance-Based Facial Animation with Live Mood Dynamics
Authors: Ye Pan, Ruisi Zhang, Jingying Wang, Nengfu Chen, Yilin Qiu, Yu Ding, Kenny Mitchell • Published: 2025-08-06 • Source: arXiv
Our purpose is to improve performance-based animation that can drive believable, perceptually valid 3D stylized characters. By combining traditional blendshape animation techniques with multiple machine learning models, we present both non-real-time and real-time solutions that drive character expressions in a geometrically consistent and perceptually valid way. For the non-real-time system, we propose a 3D emotion transfer network that uses a 2D human image to generate stylized 3D rig parameters. For the real-time system, we propose a blendshape adaptation network that generates character rig parameter motions with geometric consistency and temporal stability. We demonstrate the effectiveness of our system by comparing it to the commercial product Faceware. Results reveal that ratings of the recognition, intensity, and attractiveness of expressions depicted by animated characters via our systems are statistically higher than for Faceware. Our system may be incorporated into the animation pipeline, providing animators with a tool for creating the expressions they wish to use more quickly and accurately.
5. Time-dependent response of protoplanetary disk temperature to an FU Ori-type luminosity outburst
Authors: S. I. Laznevoi, V. V. Akimkin, Ya. N. Pavlyuchenkov, V. B. Il'in, Á. Kóspál, P. Ábrahám • Published: 2025-08-06 • Source: arXiv
Context. The most prominent cases of young star variability are accretion outbursts in FU Ori-type systems. The high power of such outbursts causes dramatic changes in the physical and chemical structure of the surrounding protoplanetary disk. As characteristic thermal timescales in the disk are comparable to the duration of the outburst, the response of its thermal structure is inherently time dependent. Aims. We analyzed how the disk thermal structure evolves under the substantial, yet transient, heating of the outburst. To cover different possible physical mechanisms driving the outburst, we examined two scenarios: one in which the increased accretion rate is confined to a compact sub-au inner region and another where it affects the entire disk. Methods. To model the disk temperature response to the outburst we performed time-dependent radiation transfer using the HURAKAN code. The disk structure and the luminosity profile roughly correspond to those of the FU Ori system itself, which went into outburst about 90 years ago and reached a luminosity of 450 L_Sun. Results. We find that optically thick disk regions require several years to become fully heated during the outburst and a decade to cool after it. The upper layers and outer parts of the disk, which are optically thin to thermal radiation, are heated and cooled almost instantaneously. This creates an unusual radial temperature profile during the early heating phase, with minima at several au for both the fully active and compact active disk scenarios. During the cooling phase, the upper layers are colder than the midplane in both scenarios. Near- and mid-infrared SEDs show a significant and almost instantaneous rise of 1 - 2 orders of magnitude during the outburst, while the millimeter flux changes only by a factor of a few and is slightly delayed with respect to the central region's luminosity profile.
6. TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction
Authors: Zewei Zhou, Seth Z. Zhao, Tianhui Cai, Zhiyu Huang, Bolei Zhou, Jiaqi Ma • Published: 2025-08-06 • Source: arXiv
End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.
7. ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models
Authors: Yansheng Gao, Yufei Zheng, Jinghan Qu, Zixi Zhu, Yukuan Zhang, Shengsheng Wang • Published: 2025-08-06 • Source: arXiv
Prompt tuning has emerged as an efficient and effective technique for adapting vision-language models (VLMs) with low computational overhead. However, existing methods often overlook the vulnerability of prompt-tuned VLMs to weak semantic perturbations-such as subtle image or text noise-that degrade their generalization to unseen classes. To address this limitation, we propose ANPrompt, a novel prompt tuning framework designed to enhance robustness under such perturbations. ANPrompt first constructs weak noise text features by fusing original and noise-perturbed text embeddings, which are then clustered to form noise prompts. These noise prompts are integrated with learnable prompt tokens to generate anti-noise prompts, which are injected into the deeper layers of both image and text encoders. To further capture the noise-aware visual semantics, ANPrompt computes the Noise-Resistant Visual Prompt Prototype (NRVPP) by averaging the output prompt tokens from the vision encoder. Finally, ANPrompt introduces alignment, robustness, and anti-noise objectives by computing a Weak semantic noise Alignment Loss (WALoss) alongside the standard cross-entropy and sim loss. Experiments across 11 benchmarks demonstrate that ANPrompt consistently outperforms existing prompt tuning approaches, achieving superior robustness to semantic noise and improved generalization to novel categories.
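The first step ANPrompt describes, fusing original and noise-perturbed text embeddings into weak-noise features, can be illustrated with a minimal sketch. The function name, Gaussian noise model, and defaults below are assumptions for illustration, not the paper's code:

```python
import numpy as np

def weak_noise_features(text_emb, sigma=0.01, n_views=4, seed=0):
    """Fuse an embedding with lightly perturbed copies of itself.

    Illustrative stand-in for ANPrompt's weak-noise text features:
    average the clean embedding with n_views Gaussian-perturbed views.
    sigma controls the (weak) perturbation strength.
    """
    rng = np.random.default_rng(seed)
    views = text_emb + rng.normal(0.0, sigma, (n_views, *text_emb.shape))
    return (text_emb + views.sum(axis=0)) / (n_views + 1)
```

Features built this way stay close to the clean embedding while encoding its local noise neighborhood, which is the kind of signal the noise prompts are then clustered from.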
8. Robustly Learning Monotone Single-Index Models
Authors: Puqian Wang, Nikos Zarifis, Ilias Diakonikolas, Jelena Diakonikolas • Published: 2025-08-06 • Source: arXiv
We consider the basic problem of learning Single-Index Models with respect to the square loss under the Gaussian distribution in the presence of adversarial label noise. Our main contribution is the first computationally efficient algorithm for this learning task, achieving a constant factor approximation, that succeeds for the class of all monotone activations with bounded moment of order $2 + \zeta$, for $\zeta > 0$. This class in particular includes all monotone Lipschitz functions and even discontinuous functions like (possibly biased) halfspaces. Prior work for the case of unknown activation either does not attain constant factor approximation or succeeds for a substantially smaller family of activations. The main conceptual novelty of our approach lies in developing an optimization framework that steps outside the boundaries of usual gradient methods and instead identifies a useful vector field to guide the algorithm updates by directly leveraging the problem structure, properties of Gaussian spaces, and regularity of monotone functions.
9. PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment
Authors: Gustav Hanning, Kalle Åström, Viktor Larsson • Published: 2025-08-06 • Source: arXiv
Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training with the optimization end-to-end, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics. For the evaluation we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground truth 3D cuboids. In thorough experiments we validate our approach and significantly outperform the competition. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allow us to easily extend to multi-room estimation, e.g. larger apartments or offices. Code and model weights are available at https://github.com/ghanning/PixCuboid.
10. Charged Higgs Signatures at Future Electron-Proton Colliders
Authors: Baradhwaj Coleppa, Gokul B. Krishna • Published: 2025-08-06 • Source: arXiv
In this work, we present a detailed collider phenomenology study of the charged Higgs boson within a Beyond the Standard Model (BSM) framework featuring an extended gauge and scalar sector. The charged Higgs can decay via conventional modes, such as $H^- \to \bar{t}b$ and $H^- \to W^-h$, as well as through exotic channels like $H^- \to W'Z$ (or $WZ'$). These decays lead to distinct final-state topologies determined by the nature of the intermediate particles. We perform a comprehensive phenomenological analysis at future electron-proton colliders, namely the LHeC and FCC-eh, considering the luminosity projections provided in their design reports. Our results indicate that the conventional decay modes of the charged Higgs boson can achieve observable sensitivity and even discovery prospects at sufficiently high luminosities. In contrast, the exotic decay channel $H^- \to W'Z$ does not exhibit any viable discovery potential. These findings highlight the complementarity of future electron-proton colliders in probing extended Higgs sectors, particularly through conventional charged Higgs signatures.
11. X-SAM: From Segment Anything to Any Segmentation
Authors: Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang • Published: 2025-08-06 • Source: arXiv
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
12. Coarse and pointwise tangent fields
Authors: Guy C. David, Sylvester Eriksson-Bique, Raanan Schul • Published: 2025-08-06 • Source: arXiv
Alberti, Csörnyei and Preiss introduced a notion of a "pointwise (weak) tangent field" for a subset of Euclidean space -- a field that contains almost every tangent line of every curve passing through the set -- and showed that all area-zero sets in the plane admit one-dimensional tangent fields. We extend their results in two distinct directions. First, a special case of our pointwise result shows that each doubling subset of Hilbert space admits a pointwise tangent field in this sense, with dimension bounded by the Nagata (or Assouad) dimension of the set. Second, inspired by the Analyst's Traveling Salesman Theorem of Jones, we introduce new, "coarse" notions of tangent field for subsets of Hilbert space, which take into account both large and small scale structure. We show that doubling subsets of Hilbert space admit such coarse tangent fields, again with dimension bounded by the Nagata (or Assouad) dimension of the set. For porous sets in the plane, this result can be viewed as a quantitative version of the Alberti-Csörnyei-Preiss result, though our results hold in all (even infinite) dimensions.
13. EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts
Authors: Kushin Mukherjee, Donghao Ren, Dominik Moritz, Yannick Assogba • Published: 2025-08-06 • Source: arXiv
Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.
14. A Scalable Pretraining Framework for Link Prediction with Efficient Adaptation
Authors: Yu Song, Zhigang Hua, Harry Shomer, Yan Xie, Jingzhe Liu, Bo Long, Hui Liu • Published: 2025-08-06 • Source: arXiv
Link Prediction (LP) is a critical task in graph machine learning. While Graph Neural Networks (GNNs) have significantly advanced LP performance recently, existing methods face key challenges including limited supervision from sparse connectivity, sensitivity to initialization, and poor generalization under distribution shifts. We explore pretraining as a solution to address these challenges. Unlike node classification, LP is inherently a pairwise task, which requires the integration of both node- and edge-level information. In this work, we present the first systematic study on the transferability of these distinct modules and propose a late fusion strategy to effectively combine their outputs for improved performance. To handle the diversity of pretraining data and avoid negative transfer, we introduce a Mixture-of-Experts (MoE) framework that captures distinct patterns in separate experts, facilitating seamless application of the pretrained model on diverse downstream datasets. For fast adaptation, we develop a parameter-efficient tuning strategy that allows the pretrained model to adapt to unseen datasets with minimal computational overhead. Experiments on 16 datasets across two domains demonstrate the effectiveness of our approach, achieving state-of-the-art performance on low-resource link prediction while obtaining competitive results compared to end-to-end trained methods, with over 10,000x lower computational overhead.
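The Mixture-of-Experts routing the paper relies on can be illustrated generically: a gating network produces softmax weights over per-expert link scores. This is a textbook MoE combiner, not a reproduction of the paper's architecture:

```python
import numpy as np

def moe_link_score(expert_scores, gate_logits):
    """Combine per-expert link scores with a softmax gate.

    expert_scores: (k,) link-probability estimates from k experts.
    gate_logits: (k,) gating-network outputs for this input pair.
    Returns the gate-weighted score. Generic MoE combination; the
    paper's expert and gate designs are not reproduced here.
    """
    g = np.exp(gate_logits - gate_logits.max())  # stable softmax
    gate = g / g.sum()
    return float(gate @ expert_scores)
```

The gate lets different experts dominate on different downstream graphs, which is how an MoE can absorb diverse pretraining data without negative transfer.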
15. RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
Authors: Baihui Xiao, Chengjian Feng, Zhijian Huang, Feng Yan, Yujie Zhong, Lin Ma • Published: 2025-08-06 • Source: arXiv
Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose RoboTron-Sim that improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Second, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn real-world challenging driving skills from HASS, via adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments on nuScenes show that RoboTron-Sim improves driving performance in challenging scenarios by around 50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of RoboTron-Sim in better managing rare high-risk driving scenarios. Project page: https://stars79689.github.io/RoboTron-Sim/
16. HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs
Authors: Dengzhao Fang, Jingtong Gao, Chengcheng Zhu, Yu Li, Xiangyu Zhao, Yi Chang • Published: 2025-08-06 • Source: arXiv
Recommender systems are indispensable for helping users navigate the immense item catalogs of modern online platforms. Recently, generative recommendation has emerged as a promising paradigm, unifying the conventional retrieve-and-rank pipeline into an end-to-end model capable of dynamic generation. However, existing generative methods are fundamentally constrained by their unsupervised tokenization, which generates semantic IDs suffering from two critical flaws: (1) they are semantically flat and uninterpretable, lacking a coherent hierarchy, and (2) they are prone to representation entanglement (i.e., "ID collisions"), which harms recommendation accuracy and diversity. To overcome these limitations, we propose HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations. First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags, yielding more uniform and disentangled IDs. Crucially, the trained codebooks can predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation. Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap. This mechanism not only resolves the critical ID collision problem but also promotes recommendation diversity by ensuring a more comprehensive utilization of the item representation space. These high-quality, disentangled IDs provide a powerful foundation for downstream generative models. Extensive experiments on three public benchmarks validate HiD-VAE's superior performance against state-of-the-art methods. The code is available at https://anonymous.4open.science/r/HiD-VAE-84B2.
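A loss that "directly penalizes latent space overlap" can take several forms; one generic sketch, not HiD-VAE's exact objective, penalizes off-diagonal cosine similarity among the code embeddings so that no two codes collapse onto each other:

```python
import numpy as np

def uniqueness_loss(codes):
    """Mean squared off-diagonal cosine similarity of code vectors.

    codes: (k, d) matrix of code embeddings. Zero when all codes are
    mutually orthogonal, one when they are all identical. A generic
    de-collision penalty in the spirit of a uniqueness loss; the
    paper's exact formulation may differ.
    """
    c = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sim = c @ c.T                       # pairwise cosine similarities
    k = len(codes)
    off = sim - np.eye(k)               # zero out the diagonal
    return (off ** 2).sum() / (k * (k - 1))
```

Adding such a term to the VAE objective pushes semantic IDs apart, which is the mechanism claimed to reduce ID collisions and improve diversity.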
17. A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature
Authors: Faruk Alpay, Bugra Kilictas, Hamdi Alakkad • Published: 2025-08-06 • Source: arXiv
The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, filters them for relevance, extracts metadata, hyper-parameters and reported results, clusters topics, produces retrieval-augmented summaries and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually-annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies -- AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an autoregressive music model on the Lakh MIDI dataset -- confirm that the extracted settings support faithful reproduction, achieving test perplexities within 1--3% of the original reports.
18. Near instantaneous O(1) Analog Solver Circuit for Linear Symmetric Positive-Definite Systems
Authors: Osama Abdelaleim, Arun Prakash, Ayhan Irfanoglu, Veljko Milutinovic • Published: 2025-08-06 • Source: arXiv
Accelerating the solution of linear systems of equations is critical due to their central role in numerous applications, such as scientific simulations, data analytics, and machine learning. This paper presents a general-purpose analog direct solver circuit designed to accelerate the solution of positive definite symmetric linear systems of equations. The proposed design leverages non-inverting operational amplifier configurations to create a negative resistance circuit, effectively modeling any symmetric system. The paper details the principles behind the design, optimizations of the system architecture, and numerical results that demonstrate the robustness of the design. The findings reveal that the proposed system solves diagonally dominant symmetric matrices with O(1) complexity, achieving the theoretical maximum speed as the circuit relies solely on resistors. For non-diagonally dominant symmetric positive-definite systems, the solution speed depends on matrix properties such as eigenvalues and the maximum off-diagonal term, but remains independent of matrix size.
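For context, the conventional digital baseline such an analog circuit competes with is an O(n^3) direct solve. A minimal NumPy sketch via Cholesky factorization, unrelated to the authors' circuit design, shows the cost being sidestepped:

```python
import numpy as np

def solve_spd(A, b):
    """Solve A x = b for symmetric positive-definite A via Cholesky.

    Runs in O(n^3) on a conventional machine -- the scaling that an
    O(1) analog solver aims to avoid for this class of systems.
    """
    L = np.linalg.cholesky(A)        # A = L L^T, requires SPD input
    y = np.linalg.solve(L, b)        # forward substitution
    return np.linalg.solve(L.T, y)   # back substitution
```

Cholesky is the standard direct method for SPD systems because it exploits symmetry to halve the work of a general LU factorization.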
19. Assortativity in geometric and scale-free networks
Authors: Marc Kaufmann, Ulysse Schaller, Thomas Bläsius, Johannes Lengler • Published: 2025-08-06 • Source: arXiv
The assortative behavior of a network is the tendency of similar (or dissimilar) nodes to connect to each other. This tendency can have an influence on various properties of the network, such as its robustness or the dynamics of spreading processes. In this paper, we study degree assortativity both in real-world networks and in several generative models for networks with heavy-tailed degree distribution based on latent spaces. In particular, we study Chung-Lu Graphs and Geometric Inhomogeneous Random Graphs (GIRGs). Previous research on assortativity has primarily focused on measuring the degree assortativity in real-world networks using the Pearson assortativity coefficient, despite reservations against this coefficient. We rigorously confirm these reservations by mathematically proving that the Pearson assortativity coefficient does not measure assortativity in any network with sufficiently heavy-tailed degree distributions, which is typical for real-world networks. Moreover, we find that other single-valued assortativity coefficients also do not sufficiently capture the wiring preferences of nodes, which often vary greatly by node degree. We therefore take a more fine-grained approach, analyzing a wide range of conditional and joint weight and degree distributions of connected nodes, both numerically in real-world networks and mathematically in the generative graph models. We provide several methods of visualizing the results. We show that the generative models are assortativity-neutral, while many real-world networks are not. Therefore, we also propose an extension of the GIRG model which retains the manifold desirable properties induced by the degree distribution and the latent space, but also exhibits tunable assortativity. We analyze the resulting model mathematically, and give a fine-grained quantification of its assortativity.
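The Pearson assortativity coefficient the paper critiques is simple to compute: correlate the degrees at the two endpoints of every edge, counting each edge in both directions (Newman's standard definition). A self-contained sketch:

```python
import numpy as np

def degree_assortativity(edges):
    """Pearson degree assortativity of an undirected simple graph.

    edges: iterable of (u, v) pairs. Builds the degree sequence, then
    correlates endpoint degrees over all edges, each counted in both
    directions.
    """
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    x, y = [], []
    for u, v in edges:
        x += [deg[u], deg[v]]
        y += [deg[v], deg[u]]
    return float(np.corrcoef(np.array(x, float), np.array(y, float))[0, 1])
```

A star graph scores -1 (hubs connect only to leaves), a path scores -0.5; the paper's point is that under sufficiently heavy-tailed degree distributions this single number stops being informative.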
20. Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration
Authors: Nuo Chen, Yicheng Tong, Jiaying Wu, Minh Duc Duong, Qian Wang, Qingyun Zou, Bryan Hooi, Bingsheng He • Published: 2025-08-06 • Source: arXiv
While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.
21. Visual Bias and Interpretability in Deep Learning for Dermatological Image Analysis
Authors: Enam Ahmed Taufik, Abdullah Khondoker, Antara Firoz Parsa, Seraj Al Mahmud Mostafa • Published: 2025-08-06 • Source: arXiv
Accurate skin disease classification is a critical yet challenging task due to high inter-class similarity, intra-class variability, and complex lesion textures. While deep learning-based computer-aided diagnosis (CAD) systems have shown promise in automating dermatological assessments, their performance is highly dependent on image pre-processing and model architecture. This study proposes a deep learning framework for multi-class skin disease classification, systematically evaluating three image pre-processing techniques: standard RGB, CMY color space transformation, and Contrast Limited Adaptive Histogram Equalization (CLAHE). We benchmark the performance of pre-trained convolutional neural networks (DenseNet201, EfficientNetB5) and transformer-based models (ViT, Swin Transformer, DinoV2 Large) using accuracy and F1-score as evaluation metrics. Results show that DinoV2 with RGB pre-processing achieves the highest accuracy (up to 93%) and F1-scores across all variants. Grad-CAM visualizations applied to RGB inputs further reveal precise lesion localization, enhancing interpretability. These findings underscore the importance of effective pre-processing and model choice in building robust and explainable CAD systems for dermatology.
22. Policy Design in Zero-Trust Distributed Networks: Challenges and Solutions
Authors: Fannya R. Sandjaja, Ayesha A. Majeed, Abdullah Abdullah, Gyan Wickremasinghe, Karen Rafferty, Vishal Sharma β€’ Published: 2025-08-06 β€’ Source: arXiv
Traditional security architectures are becoming more vulnerable to distributed attacks due to their significant dependence on trust. This vulnerability will escalate further as agentic AI is implemented within these systems, since more components must be secured across a similarly distributed space. These scenarios can be observed in consumer technologies, such as the dense Internet of Things (IoT). Here, zero-trust architecture (ZTA) can be seen as a potential solution, which relies on a key principle of not giving users explicit trust, instead always verifying their privileges whenever a request is made. However, the overall security in ZTA is managed through its policies, and unverified policies can lead to unauthorized access. Thus, this paper explores challenges and solutions for ZTA policy design in the context of distributed networks, which is referred to as zero-trust distributed networks (ZTDN). This is followed by a case study on formal verification of policies using UPPAAL. Subsequently, the importance of accountability and responsibility in the system's security is discussed.
23. RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection
Authors: Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Shuchang Lyu, Baoyuan Wu, Guangliang Cheng β€’ Published: 2025-08-06 β€’ Source: arXiv
The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX's effectiveness in identifying real or fake imagery and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.
24. Argumentative Debates for Transparent Bias Detection [Technical Report]
Authors: Hamed Ayoobi, Nico Potyka, Anna Rapberger, Francesca Toni β€’ Published: 2025-08-06 β€’ Source: arXiv
As the use of AI systems in society grows, addressing potential biases that emerge from data or are learned by models is essential to prevent systematic disadvantages against specific groups. Several notions of (un)fairness have been proposed in the literature, alongside corresponding algorithmic methods for detecting and mitigating unfairness, but, with very few exceptions, these tend to ignore transparency. Instead, interpretability and explainability are core requirements for algorithmic fairness, even more so than for other algorithmic solutions, given the human-oriented nature of fairness. In this paper, we contribute a novel interpretable, explainable method for bias detection relying on debates about the presence of bias against individuals, based on the values of protected features for the individuals and others in their neighbourhoods. Our method builds upon techniques from formal and computational argumentation, whereby debates result from arguing about biases within and across neighbourhoods. We provide formal, quantitative, and qualitative evaluations of our method, highlighting its strengths in performance against baselines, as well as its interpretability and explainability.
25. OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
Authors: Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu β€’ Published: 2025-08-06 β€’ Source: arXiv
The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that automate tasks on computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., graphical user interfaces (GUIs)) provided by operating systems (OS) have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview to the domain.
26. TRAIL: Joint Inference and Refinement of Knowledge Graphs with Large Language Models
Authors: Xinkui Zhao, Haode Li, Yifan Zhang, Guanjie Cheng, Yueshen Xu β€’ Published: 2025-08-06 β€’ Source: arXiv
Recent advances in large language models (LLMs) have unlocked powerful reasoning and decision-making capabilities. However, their inherent dependence on static parametric memory fundamentally limits their adaptability, factual accuracy, and interpretability in knowledge-intensive scenarios. Knowledge graphs (KGs), as structured repositories of explicit relational knowledge, offer a promising approach for augmenting LLMs with external, interpretable memory. Nevertheless, most existing methods that combine LLMs with KGs treat reasoning and knowledge updating as separate processes, resulting in suboptimal utilization of new information and hindering real-time updates. In this work, we propose TRAIL: a novel, unified framework for Thinking, Reasoning, And Incremental Learning that couples joint inference and dynamic KG refinement with large language models. TRAIL enables LLM agents to iteratively explore, update, and refine knowledge graphs during the reasoning process, employing a confidence-driven mechanism for the generation, validation, and pruning of new facts. This plug-and-play architecture facilitates seamless integration with various LLMs, supporting continual adaptation without the need for retraining. Extensive experiments on multiple benchmarks demonstrate that TRAIL outperforms existing KG-augmented and retrieval-augmented LLM baselines by 3% to 13%. More importantly, these results represent a significant step toward developing adaptive, memory-augmented language models capable of continual learning and reliable, transparent reasoning.
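TRAIL's confidence-driven generate/validate/prune loop can be caricatured in a few lines. The class, thresholds, and update rule below are illustrative guesses at the general mechanism, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    head: str
    relation: str
    tail: str
    confidence: float

class TinyKG:
    """Toy confidence-gated knowledge graph in the spirit of TRAIL's
    generate/validate/prune loop (names and thresholds are invented)."""

    def __init__(self, accept=0.7, prune=0.3):
        self.facts = []
        self.accept = accept
        self.prune = prune

    def propose(self, fact):
        # Only facts above the acceptance threshold enter the graph.
        if fact.confidence >= self.accept:
            self.facts.append(fact)
            return True
        return False

    def reinforce(self, head, relation, tail, delta):
        # Reasoning feedback adjusts confidence; low-confidence facts are pruned.
        for f in self.facts:
            if (f.head, f.relation, f.tail) == (head, relation, tail):
                f.confidence = max(0.0, min(1.0, f.confidence + delta))
        self.facts = [f for f in self.facts if f.confidence > self.prune]

kg = TinyKG()
kg.propose(Fact("Paris", "capital_of", "France", 0.9))
kg.propose(Fact("Paris", "capital_of", "Spain", 0.2))  # rejected at the gate
kg.reinforce("Paris", "capital_of", "France", -0.7)    # contradicted, then pruned
print(len(kg.facts))  # 0
```

The point of the pattern is that the graph stays a live, revisable memory during reasoning rather than a static retrieval target.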
27. Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion
Authors: Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su β€’ Published: 2025-08-06 β€’ Source: arXiv
Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.
28. Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection
Authors: Damian Gnieciak, Tomasz Szandala β€’ Published: 2025-08-06 β€’ Source: arXiv
Modern software relies on a multitude of automated testing and quality assurance tools to prevent errors, bugs and potential vulnerabilities. This study sets out to provide a head-to-head, quantitative and qualitative evaluation of six automated approaches: three industry-standard rule-based static code-analysis tools (SonarQube, CodeQL and Snyk Code) and three state-of-the-art large language models hosted on the GitHub Models platform (GPT-4.1, Mistral Large and DeepSeek V3). Using a curated suite of ten real-world C# projects that embed 63 vulnerabilities across common categories such as SQL injection, hard-coded secrets and outdated dependencies, we measure classical detection accuracy (precision, recall, F1-score), analysis latency, and the developer effort required to vet true positives. The language-based scanners achieve higher mean F1 scores (0.797, 0.753 and 0.750) than their static counterparts, which score 0.260, 0.386 and 0.546, respectively. The LLMs' advantage originates from superior recall, confirming an ability to reason across broader code contexts. However, this benefit comes with substantial trade-offs: DeepSeek V3 exhibits the highest false-positive ratio, and all language models mislocate issues at line-or-column granularity due to tokenisation artefacts. Overall, language models successfully rival traditional static analysers in finding real vulnerabilities. Still, their noisier output and imprecise localisation limit their standalone use in safety-critical audits. We therefore recommend a hybrid pipeline: employ language models early in development for broad, context-aware triage, while reserving deterministic rule-based scanners for high-assurance verification. The open benchmark and JSON-based result harness released with this paper lay a foundation for reproducible, practitioner-centric research into next-generation automated code security.
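The precision/recall/F1 trade-off that drives the benchmark's headline numbers is standard; a minimal sketch (the counts are made up for illustration, not drawn from the study):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, the metric the benchmark reports."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def scores(tp, fp, fn):
    # Precision penalises false alarms; recall penalises missed vulnerabilities.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f1(p, r)

# Hypothetical noisy LLM scanner: high recall, extra false positives.
p, r, f = scores(tp=55, fp=30, fn=8)
print(round(f, 3))  # 0.743
```

Because F1 is a harmonic mean, the LLMs' recall advantage can outweigh their higher false-positive counts, which is exactly the pattern the study reports.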
29. Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models
Authors: Md Raisul Kibria, SΓ©bastien Lafond, Janan Arslan β€’ Published: 2025-08-06 β€’ Source: arXiv
Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.
30. Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
Authors: Jinxing Zhou, Yanghao Zhou, Mingfei Han, Tong Wang, Xiaojun Chang, Hisham Cholakkal, Rao Muhammad Anwer β€’ Published: 2025-08-06 β€’ Source: arXiv
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R²-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R²-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.
31. Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models
Authors: Yinan Yu, Alex Gonzalez-Caceres, Samuel Scheidegger, Sanjay Somanath, Alexander Hollberg β€’ Published: 2025-08-06 β€’ Source: arXiv
Renovating existing buildings is essential for climate impact. Early-phase renovation planning requires simulations based on thermal 3D models at Level of Detail (LoD) 3, which include features like windows. However, scalable and accurate identification of such features remains a challenge. This paper presents the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that generates LoD3 thermal models by extracting geometries from images using both computer vision and deep learning. Unlike existing methods relying on segmentation and projection, SI3FP directly models geometric primitives in the orthographic image plane, providing a unified interface while reducing perspective distortions. SI3FP supports both sparse (e.g., Google Street View) and dense (e.g., hand-held camera) data sources. Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, demonstrating sufficient accuracy for early-stage renovation analysis. The pipeline facilitates large-scale energy renovation planning and has broader applications in urban development and planning.
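The window-to-wall ratio that SI3FP estimates to roughly 5% error is a simple geometric quantity; a toy computation (the areas below are invented, and this is not the SI3FP pipeline, which extracts the geometry from images):

```python
def window_to_wall_ratio(window_areas, facade_area):
    """WWR: summed glazed area over total facade area (square metres).

    In an LoD3 thermal model this ratio feeds energy simulations, since
    glazing dominates heat loss and solar gain on a facade.
    """
    return sum(window_areas) / facade_area

# Three windows on a 20 m^2 facade (illustrative numbers).
wwr = window_to_wall_ratio([1.2, 1.2, 2.0], facade_area=20.0)
print(round(wwr, 2))  # 0.22
```

The hard part SI3FP solves is producing the window rectangles themselves from sparse street-level imagery; the downstream ratio is trivial once the geometry exists.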
32. Continual Multiple Instance Learning for Hematologic Disease Diagnosis
Authors: Zahra Ebrahimi, Raheleh Salehi, Nassir Navab, Carsten Marr, Ario Sadafi β€’ Published: 2025-08-06 β€’ Source: arXiv
The dynamic environment of laboratories and clinics, with streams of data arriving on a daily basis, requires regular updates of trained machine learning models for consistent performance. Continual learning is supposed to help train models without catastrophic forgetting. However, state-of-the-art methods are ineffective for multiple instance learning (MIL), which is often used in single-cell-based hematologic disease diagnosis (e.g., leukemia detection). Here, we propose the first continual learning method tailored specifically to MIL. Our method is rehearsal-based over a selection of single instances from various bags. We use a combination of the instance attention score and distance from the bag mean and class mean vectors to carefully select which samples and instances to store in exemplary sets from previous tasks, preserving the diversity of the data. Using the real-world input of one month of data from a leukemia laboratory, we study the effectiveness of our approach in a class incremental scenario, comparing it to well-known continual learning methods. We show that our method considerably outperforms state-of-the-art methods, providing the first continual learning approach for MIL. This enables the adaptation of models to shifting data distributions over time, such as those caused by changes in disease occurrence or underlying genetic alterations.
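The exemplar-selection idea, combining instance attention with distance from the bag mean to keep a diverse rehearsal buffer, might be sketched as follows. The scoring rule here is an illustrative guess, not the paper's exact formula (which also uses class mean vectors):

```python
import math

def select_exemplars(bag, attention, k=2):
    """Pick k instances from a bag by ranking on attention score plus
    Euclidean distance from the bag mean, so both salient and atypical
    instances survive into the rehearsal buffer (illustrative rule)."""
    dim = len(bag[0])
    mean = [sum(x[d] for x in bag) / len(bag) for d in range(dim)]

    def dist(x):
        return math.sqrt(sum((x[d] - mean[d]) ** 2 for d in range(dim)))

    ranked = sorted(range(len(bag)),
                    key=lambda i: attention[i] + dist(bag[i]),
                    reverse=True)
    return ranked[:k]

# Four instance feature vectors; the last is both salient and far from the mean.
bag = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
attention = [0.1, 0.2, 0.2, 0.9]
print(select_exemplars(bag, attention, k=2))  # [3, 0]
```

Storing a handful of such instances per past task is what lets the MIL model rehearse old classes without keeping whole bags around.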
33. Deliberative Reasoning Network: An Uncertainty-Driven Paradigm for Belief-Tracked Inference with Pretrained Language Models
Authors: Anran Xu, Jincheng Wang, Baigen Cai, Tao Wen β€’ Published: 2025-08-06 β€’ Source: arXiv
Large language models often fail at logical reasoning when semantic heuristics conflict with decisive evidence - a phenomenon we term cognitive traps. To address this fundamental limitation, we introduce the Deliberative Reasoning Network (DRN), a novel paradigm that reframes logical reasoning from probability maximization to uncertainty minimization. Instead of asking "Which answer is most likely?", DRN asks "Which hypothesis has the most internally consistent evidence?". DRN achieves intrinsic interpretability by explicitly tracking belief states and quantifying epistemic uncertainty for competing hypotheses through an iterative evidence synthesis process. We validate our approach through two complementary architectures - a bespoke discriminative model that embodies the core uncertainty minimization principle, and a lightweight verification module that enhances existing generative LLMs. Evaluated on LCR-1000, our new adversarial reasoning benchmark designed to expose cognitive traps, the bespoke DRN achieves up to 15.2% improvement over standard baselines. When integrated as a parameter-efficient verifier with Mistral-7B, our hybrid system boosts accuracy from 20% to 80% on the most challenging problems. Critically, DRN demonstrates strong zero-shot generalization, improving TruthfulQA performance by 23.6% without additional training, indicating that uncertainty-driven deliberation learns transferable reasoning principles. We position DRN as a foundational, verifiable System 2 reasoning component for building more trustworthy AI systems.
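DRN's reframing from probability maximization to uncertainty minimization can be illustrated with a toy entropy-based selector. The names and numbers are invented; the actual model tracks belief states through an iterative evidence-synthesis process rather than a single entropy comparison:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) as a proxy for epistemic uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def deliberate(evidence_by_hypothesis):
    """Pick the hypothesis whose evidence is most internally consistent,
    i.e. has the lowest entropy, instead of the highest raw probability.
    A toy rendering of DRN's 'which hypothesis has the most internally
    consistent evidence?' question."""
    scores = {h: entropy(ev) for h, ev in evidence_by_hypothesis.items()}
    return min(scores, key=scores.get)

# Hypothesis A: evidence agrees strongly; hypothesis B: evidence is split.
print(deliberate({"A": [0.9, 0.1], "B": [0.5, 0.5]}))  # A
```

The contrast with a standard LLM head is that a semantically attractive but internally contradictory answer (the paper's "cognitive trap") scores poorly here, because its evidence distribution stays high-entropy.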
34. Modelling and Classifying the Components of a Literature Review
Authors: Francisco BolaΓ±os, Angelo Salatino, Francesco Osborne, Enrico Motta β€’ Published: 2025-08-06 β€’ Source: arXiv
Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.
35. Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
Authors: Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen β€’ Published: 2025-08-06 β€’ Source: arXiv
Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.