1. Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Authors: Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, Guanghui Ren
Published: 2025-08-07
Source: arXiv
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
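GE-Act's decoder is described here only at a high level. As a rough illustration of the underlying technique, conditional flow matching trains a network to predict the velocity of a straight-line path from noise to the target action chunk; a minimal PyTorch sketch, in which the MLP decoder, tensor shapes, and conditioning scheme are hypothetical stand-ins rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class FlowMatchingActionDecoder(nn.Module):
    """Toy velocity network: predicts the flow velocity for a chunk of
    actions, conditioned on a video-latent vector and the flow time t."""
    def __init__(self, action_dim=7, horizon=16, latent_dim=256, hidden=512):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + latent_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, x_t, latent, t):
        inp = torch.cat([x_t.flatten(1), latent, t[:, None]], dim=-1)
        return self.net(inp).view(-1, self.horizon, self.action_dim)

def flow_matching_loss(model, actions, latent):
    """Conditional flow matching on linear noise-to-data paths."""
    x0 = torch.randn_like(actions)              # noise endpoint
    t = torch.rand(actions.shape[0])            # flow time in [0, 1]
    tt = t[:, None, None]
    x_t = (1 - tt) * x0 + tt * actions          # point on the linear path
    v_target = actions - x0                     # constant path velocity
    return ((model(x_t, latent, t) - v_target) ** 2).mean()

# Toy usage with random tensors standing in for GE-Base latents.
model = FlowMatchingActionDecoder()
loss = flow_matching_loss(model, torch.randn(8, 16, 7), torch.randn(8, 256))
loss.backward()
```

At inference time, integrating the learned velocity field from a noise sample (e.g. with a few Euler steps) would yield an action trajectory conditioned on the GE-Base latent.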
2. Partial projected ensembles and spatiotemporal structure of information scrambling
Authors: Saptarshi Mandal, Pieter W. Claeys, Sthitadhi Roy
Published: 2025-08-07
Source: arXiv
Thermalisation and information scrambling in out-of-equilibrium quantum many-body systems are deeply intertwined: local subsystems dynamically approach thermal density matrices while their entropies track information spreading. Projected ensembles--ensembles of pure states conditioned on measurement outcomes of complementary subsystems--provide higher-order probes of thermalisation, converging at late times to universal maximum-entropy ensembles. In this work, we introduce the partial projected ensemble (PPE) as a framework to study how the spatiotemporal structure of scrambling is imprinted on projected ensembles. The PPE consists of an ensemble of mixed states induced on a subsystem by measurements on a spatially separated part of its complement, tracing out the remainder, naturally capturing scenarios involving discarded outcomes or noise-induced losses. We show that statistical fluctuations of the PPE faithfully track the causal lightcone of information spreading, revealing how scrambling dynamics are encoded in ensemble structure. In addition, we demonstrate that the probabilities of bit-string probabilities (PoPs) associated with the PPE exhibit distinct dynamical regimes and provide an experimentally accessible probe of scrambling. Both PPE fluctuations and PoPs display exponential sensitivity to the size of the discarded region, reflecting exponential degradation of quantum correlations under erasure. We substantiate these findings using the non-integrable kicked Ising chain, combining numerics in the ergodic regime with exact results at its self-dual point. We extend our analysis to a many-body localised (MBL) regime numerically, along with analytic results for the $\ell$-bit model. The linear and logarithmic lightcones characteristic of ergodic and MBL regimes emerge naturally from PPE dynamics, establishing it as a powerful tool for probing scrambling and deep thermalisation.
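The PPE construction itself is mechanical enough to state in a few lines of linear algebra: measure region B in the computational basis, trace out the discarded region C, and collect the conditional mixed states on subsystem A with their Born probabilities. A toy numpy sketch, with an arbitrary qubit tripartition and a random state standing in for a kicked-Ising time-evolved one:

```python
import numpy as np

def partial_projected_ensemble(psi, n_a, n_b, n_c):
    """Build the PPE on subsystem A: measure B in the computational
    basis, trace out C, and collect (probability, mixed state) pairs."""
    psi = psi.reshape(2**n_a, 2**n_b, 2**n_c)   # qubit ordering A-B-C
    ensemble = []
    for b in range(2**n_b):
        m = psi[:, b, :]             # amplitudes compatible with outcome b
        rho = m @ m.conj().T         # unnormalized state on A after Tr_C
        p = np.real(np.trace(rho))   # Born probability of outcome b
        if p > 1e-12:
            ensemble.append((p, rho / p))
    return ensemble

# Toy example: a random pure state on 2 + 2 + 1 qubits.
rng = np.random.default_rng(0)
n_a, n_b, n_c = 2, 2, 1
dim = 2 ** (n_a + n_b + n_c)
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)
ppe = partial_projected_ensemble(psi, n_a, n_b, n_c)
print(sum(p for p, _ in ppe))        # probabilities sum to 1
```

Statistical fluctuations of the PPE (for instance, the weighted spread of these states around the ensemble average) are the quantities the paper tracks across the causal lightcone.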
3. GAP: Gaussianize Any Point Clouds with Text Guidance
Authors: Weiqi Zhang, Junsheng Zhou, Haotian Geng, Wenyuan Zhang, Yu-Shen Liu
Published: 2025-08-07
Source: arXiv
3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffusion-based inpainting strategy that specifically targets hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: https://weiqi-zhang.github.io/GAP.
4. On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
Published: 2025-08-07
Source: arXiv
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing their limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes the gradient update for each token by dynamically rescaling the objective function with that token's probability. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
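The "single-line code change" admits a compact illustration: scale each token's cross-entropy term by the model's own stop-gradient probability of that token. A minimal PyTorch sketch of this idea; the masking and normalization details are assumptions, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, targets, ignore_index=-100):
    """SFT cross-entropy with each token's term rescaled by the
    model's detached probability of that token (the DFT idea)."""
    logp = F.log_softmax(logits, dim=-1)                       # (B, T, V)
    tok_logp = logp.gather(-1, targets.clamp_min(0).unsqueeze(-1)).squeeze(-1)
    mask = (targets != ignore_index).float()
    scale = tok_logp.exp().detach()           # p(token); no gradient flows
    return -(scale * tok_logp * mask).sum() / mask.sum().clamp_min(1.0)
```

Relative to plain SFT (drop `scale`), low-probability tokens are down-weighted, which counteracts the problematic implicit reward structure described in the abstract.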
5. Learning to Reason for Factuality
Authors: Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih
Published: 2025-08-07
Source: arXiv
Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers factual precision, response detail level, and answer relevance, and applies online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in overall response helpfulness.
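The abstract names the three reward terms but not their functional form. A schematic sketch of such a composite reward, in which the weights, the saturating detail term, and the sub-scorer inputs are placeholders rather than the paper's definition:

```python
def factuality_reward(precision, num_claims, relevance,
                      w_p=1.0, w_d=0.5, w_r=0.5):
    """Schematic composite reward for long-form factuality RL:
    - precision: fraction of supported claims, in [0, 1]
    - num_claims: detail level, passed through a saturating transform
    - relevance: answer-relevance score, in [0, 1]
    Rewarding detail and relevance alongside precision guards against
    the reward hacking (terse or evasive answers) that a
    precision-only reward invites."""
    detail = num_claims / (num_claims + 10.0)   # 10 claims earns half credit
    return w_p * precision + w_d * detail + w_r * relevance

print(factuality_reward(precision=0.9, num_claims=12, relevance=0.8))
```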
6. Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Authors: Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
Published: 2025-08-07
Source: arXiv
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
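The voting step is easy to picture: rasterize every sampled prediction onto a coarse grid and keep the cells where the most samples agree. A minimal sketch under assumed details (grid resolution, box-format predictions, and centroid tie-breaking are illustrative choices, not the paper's exact procedure):

```python
import numpy as np

def region_consistency(boxes, width, height, cell=10):
    """Vote sampled boxes (x1, y1, x2, y2) onto a coarse grid and return
    the center of the highest-agreement region as the grounding point."""
    votes = np.zeros((height // cell, width // cell), dtype=int)
    for x1, y1, x2, y2 in boxes:
        votes[int(y1) // cell:int(y2) // cell + 1,
              int(x1) // cell:int(x2) // cell + 1] += 1
    ys, xs = np.where(votes == votes.max())      # consensus cells
    return (xs.mean() + 0.5) * cell, (ys.mean() + 0.5) * cell

# Three sampled predictions that overlap on the same button.
boxes = [(100, 40, 180, 80), (110, 45, 185, 85), (90, 50, 175, 78)]
print(region_consistency(boxes, width=1920, height=1080))  # ~(145, 65)
```

GUI-RCPO would then turn each sample's agreement with this consensus region into a reward signal for test-time RL on unlabeled screenshots.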
7. Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity
Authors: Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li, Liang Pan, Dahua Lin, Ziwei Liu
Published: 2025-08-07
Source: arXiv
Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.
8. The Mpemba Effect in Pure Water Has a Stochastic Origin. Experimental and Theoretical Resolution of the Paradox
Authors: Andrei A. Klimov, Alexei V. Finkelstein
Published: 2025-08-07
Source: arXiv
The "Mpemba effect" is the name given to the assertion that hot water freezes quicker than cold water1 or, in a modern and more general form, that the system that is initially more distant from its equilibrium state comes to this state earlier2. This counterintuitive statement seems to breach fundamental thermodynamic and kinetic laws; however, numerous experiments3-10 with classical and quantum systems demonstrate this paradoxical Mpemba effect, leading to extensive discssions in prominent scientific jornals2,5,9,12-14. However, the fundamental physical mechanisms behind this effect have remained elusive14. Here we performed the water freezing experiments under carefully controlled conditions, and found that the Mpemba effect only occurred when the freezer temperature was very close to the temperature of ice nucleation. In this case, the range of freezing times for both hot and cold water was so great that it exceeded the delayed cooling of the initially hotter liquid, and therefore sometimes the hot water froze before the cold water. Our theoretical analysis of this fact shows that the Mpemba paradox associated with water freezing is rooted in the stochastic nature of ice nucleation, typical of first-order phase transitions. We anticipate our assay to be a starting point for reconsidering the famous Mpemba paradox in water and other systems undergoing similar phase transitions.
9. LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
Authors: Tao Sun, Oliver Liu, JinJin Li, Lan Ma
Published: 2025-08-07
Source: arXiv
Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., ``Relevant'' vs. ``Not Relevant'', is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies across scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice for building such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt at binary image-text relevancy evaluation with MLLMs. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy dataset that covers various tasks. Experimental results validate the effectiveness of our framework.
10. Asymptotically-tight packing and covering with transversal bases in Rota's basis conjecture
Authors: Richard Montgomery, Lisa Sauermann
Published: 2025-08-07
Source: arXiv
In 1989, Rota conjectured that, given any $n$ bases $B_1,\dots,B_n$ of a vector space of dimension $n$, or more generally a matroid of rank $n$, it is possible to rearrange these into $n$ disjoint transversal bases. Here, a transversal basis is a basis consisting of exactly one element from each of the original bases $B_1,\dots,B_n$. Two natural approaches to this conjecture are, to ask in this setting a) how many disjoint transversal bases can we find and b) how few transversal bases do we need to cover all the elements of $B_1,\dots,B_n$? In this paper, we give asymptotically-tight answers to both of these questions. For a), we show that there are always $(1-o(1))n$ disjoint transversal bases, improving a result of Buci\'c, Kwan, Pokrovskiy, and Sudakov that $(1/2-o(1))n$ disjoint transversal bases always exist. For b), we show that $B_1\cup\dots \cup B_n$ can be covered by $(1+o(1))n$ transversal bases, improving a result of Aharoni and Berger using instead $2n$ transversal bases, and a subsequent result of the Polymath project on Rota's basis conjecture using $2n-2$ transversal bases.
11. Unveiling the Lithium-Ion Transport Mechanism in Li2ZrCl6 Solid-State Electrolyte via Deep Learning-Accelerated Molecular Dynamics Simulations
Authors: Hanzeng Guo, Volodymyr Koverga, Selva Chandrasekaran Selvaraj, Anh T. Ngo
Published: 2025-08-07
Source: arXiv
Lithium zirconium chlorides (LZCs) represent a promising class of cost-effective solid electrolytes for next-generation all-solid-state batteries. The unique crystal structure of LZCs plays a crucial role in facilitating lithium-ion mobility, which is central to their electrochemical performance. To understand the underlying mechanism governing ion transport, we employed deep learning-accelerated molecular dynamics simulations of Li2ZrCl6 (trigonal α-LZC and monoclinic β-LZC), focusing specifically on the zirconium coordination environment. Our results reveal that disordered α-LZC exhibits the highest ionic conductivity, while β-LZC demonstrates significantly lower conductivity, closely aligning with experimental findings. Detailed analysis shows substantial differences in lithium-ion dynamics: α-LZC phases display pronounced collective-diffusion-driven anisotropic interlayer transport, whereas lithium mobility in β-LZC is largely determined by isotropic translations and individual diffusion dominated by intralayer migration. Across all phases, lithium migration proceeds via a site-to-site hopping mechanism, where variations in site residence times critically impact the overall ionic conductivity. Analysis of local structural organization confirms that particular zirconium arrangements in the LZC phases create ion channels with varied energy barriers, influencing the dynamic behavior: in α-LZC phases, the interlayer hopping barrier is lower than the intralayer barrier, facilitating faster ion transport. Disordered α-LZC, with its loose zirconium arrangement, presents the lowest energy barrier, enhancing conductivity. Conversely, β-LZC features a higher overall barrier, with intralayer hopping favored over interlayer hopping, resulting in slower ion migration.
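For context on the transport quantities discussed here, lithium diffusivity is conventionally extracted from MD trajectories via the mean-squared displacement and the Einstein relation. A generic post-processing sketch (not the authors' pipeline; the random walk stands in for real unwrapped Li coordinates):

```python
import numpy as np

def diffusion_coefficient(positions, dt_ps, fit_lo=0.2, fit_hi=0.8):
    """Einstein-relation diffusivity from an MD trajectory.
    positions: (n_frames, n_atoms, 3) unwrapped coordinates in Angstrom."""
    n = positions.shape[0]
    lags = np.arange(1, n // 2)
    msd = np.array([                  # MSD averaged over time origins/atoms
        np.mean(np.sum((positions[lag:] - positions[:-lag]) ** 2, axis=-1))
        for lag in lags
    ])
    t = lags * dt_ps
    lo, hi = int(len(lags) * fit_lo), int(len(lags) * fit_hi)
    slope = np.polyfit(t[lo:hi], msd[lo:hi], 1)[0]    # A^2 / ps
    return slope / 6.0                # D = slope / (2 * dim), dim = 3

traj = np.cumsum(np.random.default_rng(2).normal(
    scale=0.05, size=(2000, 32, 3)), axis=0)          # toy random walk
print(diffusion_coefficient(traj, dt_ps=0.1), "A^2/ps")
```

Decomposing the same displacements into in-plane and out-of-plane components is how the inter- versus intralayer anisotropy described above would show up in practice.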
12. Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis
Authors: Kunyu Feng, Yue Ma, Xinhua Zhang, Boshi Liu, Yikuang Yuluo, Yinhan Zhang, Runtao Liu, Hongyu Liu, Zhiyuan Qin, Shanhui Mo, Qifeng Chen, Zeyu Wang
Published: 2025-08-07
Source: arXiv
With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. It then constructs 3D layouts with the MLLM-Generator and leverages Vision-Language Models (VLMs) to semantically refine the resulting multi-view scenes with the MLLM-Optimizer. Finally, it uses the MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.
13. Discrepancy-Aware Contrastive Adaptation in Medical Time Series Analysis
Authors: Yifan Wang, Hongfeng Ai, Ruiqi Li, Maowei Jiang, Ruiyuan Kang, Jiahua Dong, Cheng Jiang, Chenzhong Li
Published: 2025-08-07
Source: arXiv
In medical time series disease diagnosis, two key challenges are identified. First, the high annotation cost of medical data leads to overfitting in models trained on label-limited, single-center datasets. To address this, we propose incorporating external data from related tasks and leveraging AE-GAN to extract prior knowledge, providing valuable references for downstream tasks. Second, many existing studies employ contrastive learning to derive more generalized medical sequence representations for diagnostic tasks, usually relying on manually designed diverse positive and negative sample pairs. However, these approaches are complex, lack generalizability, and fail to adaptively capture disease-specific features across different conditions. To overcome this, we introduce LMCF (Learnable Multi-views Contrastive Framework), a framework that integrates a multi-head attention mechanism and adaptively learns representations from different views through inter-view and intra-view contrastive learning strategies. Additionally, the pre-trained AE-GAN is used to reconstruct discrepancies in the target data as disease probabilities, which are then integrated into the contrastive learning process. Experiments on three target datasets demonstrate that our method consistently outperforms seven other baselines, highlighting its significant impact on healthcare applications such as the diagnosis of myocardial infarction, Alzheimer's disease, and Parkinson's disease. We release the source code at xxxxx.
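The inter-view contrastive term can be pictured as a standard InfoNCE loss over paired embeddings of the same recording under two learned views. A minimal sketch; the symmetric form and temperature are common conventions, not necessarily LMCF's exact objective:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Inter-view InfoNCE: row i of z1 and row i of z2 embed the same
    recording (positives); all other pairs in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                   # (B, B) similarity matrix
    labels = torch.arange(z1.shape[0], device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
```

In the paper's setup, the AE-GAN's reconstruction discrepancies (read as disease probabilities) additionally feed into this contrastive process.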
14. MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media
Authors: Rui Lu, Jinhe Bi, Yunpu Ma, Feng Xiao, Yuntao Du, Yijun Tian
Published: 2025-08-07
Source: arXiv
Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV-Debate, a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents (a surface analyst, a deep reasoner, a modality-contrast analyst, and a social contextualist) to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection-gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines. This work highlights the promise of multi-agent debate in advancing reliable social intent detection in safety-critical online contexts.
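The reflection-gain criterion suggests a simple control loop: keep debating while the aggregate confidence still improves by more than a threshold. A sketch of that reading (the agent stubs, the confidence aggregate, and the majority vote are assumptions, not the paper's exact rule):

```python
def mv_debate(agents, content, max_rounds=4, gain_threshold=0.02):
    """Iterative multi-agent debate with a reflection-gain stopping rule.
    `agents` maps a view name to a callable returning (verdict, confidence)."""
    history, prev_score = [], 0.0
    for _ in range(max_rounds):
        round_views = {name: agent(content, history)
                       for name, agent in agents.items()}
        history.append(round_views)
        score = sum(c for _, c in round_views.values()) / len(round_views)
        if score - prev_score < gain_threshold:
            break                   # reflection gain too small: stop early
        prev_score = score
    votes = [v for v, _ in history[-1].values()]
    return max(set(votes), key=votes.count)    # majority verdict

stub = lambda content, hist: ("harmful" if "scam" in content else "benign", 0.8)
agents = {"surface": stub, "reasoner": stub,
          "contrast": stub, "context": stub}
print(mv_debate(agents, "limited-time crypto scam, act now!"))  # harmful
```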
15. Adapting Vision-Language Models Without Labels: A Comprehensive Survey
Authors: Hao Dong, Lijun Sheng, Jian Liang, Ran He, Eleni Chatzi, Olga Fink
Published: 2025-08-07
Source: arXiv
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. However, their performance often remains suboptimal when directly applied to specific downstream scenarios without task-specific adaptation. To enhance their utility while preserving data efficiency, recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. Despite the growing interest in this area, there remains a lack of a unified, task-oriented survey dedicated to unsupervised VLM adaptation. To bridge this gap, we present a comprehensive and structured overview of the field. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms: Data-Free Transfer (no data), Unsupervised Domain Transfer (abundant data), Episodic Test-Time Adaptation (batch data), and Online Test-Time Adaptation (streaming data). Within this framework, we analyze core methodologies and adaptation strategies associated with each paradigm, aiming to establish a systematic understanding of the field. Additionally, we review representative benchmarks across diverse applications and highlight open challenges and promising directions for future research. An actively maintained repository of relevant literature is available at https://github.com/tim-learn/Awesome-LabelFree-VLMs.
16. PRvL: Quantifying the Capabilities and Risks of Large Language Models for PII Redaction
Authors: Leon Garza, Anantaa Kotal, Aritran Piplai, Lavanya Elluri, Prajit Das, Aman Chadha
Published: 2025-08-07
Source: arXiv
Redacting Personally Identifiable Information (PII) from unstructured text is critical for ensuring data privacy in regulated domains. While earlier approaches have relied on rule-based systems and domain-specific Named Entity Recognition (NER) models, these methods fail to generalize across formats and contexts. Recent advances in Large Language Models (LLMs) offer a promising alternative: LLMs have demonstrated strong performance in tasks that require contextual language understanding, including the redaction of PII in free-form text, and prior work suggests that with appropriate adaptation they can become effective contextual privacy learners. However, the consequences of architectural and training choices for PII redaction remain underexplored. In this work, we present a comprehensive analysis of LLMs as privacy-preserving PII redaction systems. We evaluate a range of LLM architectures and training strategies for their effectiveness in PII redaction. Our analysis measures redaction performance, semantic preservation, and PII leakage, and compares these outcomes against latency and computational cost. The results provide practical guidance for configuring LLM-based redactors that are accurate, efficient, and privacy-aware. To support reproducibility and real-world deployment, we release PRvL, an open-source suite of fine-tuned models and evaluation tools for general-purpose PII redaction. PRvL is built entirely on open-source LLMs and supports multiple inference settings for flexibility and compliance. It is designed to be easily customized for different domains and fully operable within secure, self-managed environments. This enables data owners to perform redactions without relying on third-party services or exposing sensitive content beyond their own infrastructure.
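Among the reported measurements, span-level PII leakage is the most mechanical to compute. A toy sketch of one plausible definition (PRvL's actual metric may differ, e.g. by using entity matching rather than verbatim substrings):

```python
def pii_leakage(redacted_texts, gold_spans):
    """Fraction of annotated PII strings that still appear verbatim
    in the redacted output (lower is better)."""
    leaked = total = 0
    for text, spans in zip(redacted_texts, gold_spans):
        for span in spans:
            total += 1
            leaked += span.lower() in text.lower()
    return leaked / max(total, 1)

docs = ["Contact [NAME] at [EMAIL].", "Call Jane Doe on 555-0100."]
spans = [["John Smith", "js@example.com"], ["Jane Doe", "555-0100"]]
print(pii_leakage(docs, spans))   # 0.5: the second doc leaks both spans
```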
17. Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
Authors: Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín
Published: 2025-08-07
Source: arXiv
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog and to find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated availability to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. Our extensive evaluations in simulation and the real world -- on a physical robot with 18 unique human participants over 27 hours -- demonstrate the ability of our method to effectively collaborate with diverse human users, yielding significantly better task success and user experience than a pure LLM baseline and other agent-allocation models. See additional videos and materials at https://robin-lab.cs.utexas.edu/MicoBot/.
18. When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework
Authors: Haoyu Liu, Chaoyu Gong, Mengke He, Jiate Li, Kai Han, Siqiang Luo
Published: 2025-08-07
Source: arXiv
The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. Remarkably, SSTGNN accomplishes these results with up to 42.4$\times$ fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.
19. Leveraging AI to Accelerate Clinical Data Cleaning: A Comparative Study of AI-Assisted vs. Traditional Methods
Authors: Matthew Purri, Amit Patel, Erik Deurrell
Published: 2025-08-07
Source: arXiv
Clinical trial data cleaning represents a critical bottleneck in drug development, with manual review processes struggling to manage exponentially increasing data volumes and complexity. This paper presents Octozi, an artificial intelligence-assisted platform that combines large language models with domain-specific heuristics to transform clinical data review. In a controlled experimental study with experienced clinical reviewers (n=10), we demonstrate that AI assistance increased data cleaning throughput by 6.03-fold while simultaneously decreasing cleaning errors from 54.67% to 8.48% (a 6.44-fold improvement). Crucially, the system reduced false positive queries by 15.48-fold, minimizing unnecessary site burden. These improvements were consistent across reviewers regardless of experience level, suggesting broad applicability. Our findings indicate that AI-assisted approaches can address fundamental inefficiencies in clinical trial operations, potentially accelerating drug development timelines and reducing costs while maintaining regulatory compliance. This work establishes a framework for integrating AI into safety-critical clinical workflows and demonstrates the transformative potential of human-AI collaboration in pharmaceutical clinical trials.
20. Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation
Authors: Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare, Soundar Srinivasan
Published: 2025-08-07
Source: arXiv
The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another's task completion, are typically designed for narrow, domain-specific settings. To address this gap, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent's output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively, compared to the GPT-4o based LLM-as-a-Judge baseline. This demonstrates the potential of our proposed general-purpose evaluation framework.
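The decompose-validate-aggregate loop bottoms out in a simple aggregation step. A sketch of one plausible policy (the dataclass, the all-must-pass rule, and the evidence field are placeholders, not the framework's actual interface):

```python
from dataclasses import dataclass

@dataclass
class SubTaskVerdict:
    description: str
    passed: bool
    evidence: str    # e.g. pointer into the agent's output/reasoning trace

def judge(verdicts, rule="all"):
    """Aggregate per-sub-task validations into a final task verdict."""
    if rule == "all":                 # strict: every step must check out
        return all(v.passed for v in verdicts)
    return sum(v.passed for v in verdicts) / len(verdicts) >= 0.5

steps = [SubTaskVerdict("parsed the question", True, "step 1 log"),
         SubTaskVerdict("retrieved the cited table", True, "step 3 log"),
         SubTaskVerdict("final answer matches", False, "output diff")]
print(judge(steps))   # False: the final-answer check failed
```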
21. AutoIAD: Manager-Driven Multi-Agent Collaboration for Automated Industrial Anomaly Detection
Authors: Dongwei Ji, Bingzhang Hu, Yi Zhou
Published: 2025-08-07
Source: arXiv
Industrial anomaly detection (IAD) is critical for manufacturing quality control, but conventionally requires significant manual effort for various application scenarios. This paper introduces AutoIAD, a multi-agent collaboration framework, specifically designed for end-to-end automated development of industrial visual anomaly detection. AutoIAD leverages a Manager-Driven central agent to orchestrate specialized sub-agents (including Data Preparation, Data Loader, Model Designer, Trainer) and integrates a domain-specific knowledge base, which intelligently handles the entire pipeline using raw industrial image data to develop a trained anomaly detection model. We construct a comprehensive benchmark using MVTec AD datasets to evaluate AutoIAD across various LLM backends. Extensive experiments demonstrate that AutoIAD significantly outperforms existing general-purpose agentic collaboration frameworks and traditional AutoML frameworks in task completion rate and model performance (AUROC), while effectively mitigating issues like hallucination through iterative refinement. Ablation studies further confirm the crucial roles of the Manager central agent and the domain knowledge base module in producing robust and high-quality IAD solutions.
22. Towards Human-Centric Evaluation of Interaction-Aware Automated Vehicle Controllers: A Framework and Case Study
Authors: Federico Scarì, Olger Siebinga, Arkady Zgonnikov
Published: 2025-08-07
Source: arXiv
As automated vehicles (AVs) increasingly integrate into mixed-traffic environments, evaluating their interaction with human-driven vehicles (HDVs) becomes critical. In most research focused on developing new AV control algorithms (controllers), the performance of these algorithms is assessed solely based on performance metrics such as collision avoidance or lane-keeping efficiency, while largely overlooking the human-centred dimensions of interaction with HDVs. This paper proposes a structured evaluation framework that addresses this gap by incorporating metrics grounded in the human-robot interaction literature. The framework spans four key domains: a) interaction effect, b) interaction perception, c) interaction effort, and d) interaction ability. These domains capture both the performance of the AV and its impact on human drivers around it. To demonstrate the utility of the framework, we apply it to a case study evaluating how a state-of-the-art AV controller interacts with human drivers in a merging scenario in a driving simulator. Measuring HDV-HDV interactions as a baseline, this study included one representative metric per domain: a) perceived safety, b) subjective ratings, specifically how participants perceived the other vehicle's driving behaviour (e.g., aggressiveness or predictability), c) driver workload, and d) merging success. The results showed that incorporating metrics covering all four domains in the evaluation of AV controllers can illuminate critical differences in driver experience when interacting with AVs. This highlights the need for a more comprehensive evaluation approach. Our framework offers researchers, developers, and policymakers a systematic method for assessing AV behaviour beyond technical performance, fostering the development of AVs that are not only functionally capable but also understandable, acceptable, and safe from a human perspective.
23. Deconstructing the Crystal Ball: From Ad-Hoc Prediction to Principled Startup Evaluation with the SAISE Framework
Authors: Seyed Mohammad Ali Jafari, Ali Mobini Dehkordi, Ehsan Chitsaz, Yadollah Yaghoobzadeh
Published: 2025-08-07
Source: arXiv
The integration of Artificial Intelligence (AI) into startup evaluation represents a significant technological shift, yet the academic research underpinning this transition remains methodologically fragmented. Existing studies often employ ad-hoc approaches, leading to a body of work with inconsistent definitions of success, atheoretical features, and a lack of rigorous validation. This fragmentation severely limits the comparability, reliability, and practical utility of current predictive models. To address this critical gap, this paper presents a comprehensive systematic literature review of 57 empirical studies. We deconstruct the current state-of-the-art by systematically mapping the features, algorithms, data sources, and evaluation practices that define the AI-driven startup prediction landscape. Our synthesis reveals a field defined by a central paradox: a strong convergence on a common toolkit -- venture databases and tree-based ensembles -- but a stark divergence in methodological rigor. We identify four foundational weaknesses: a fragmented definition of "success," a divide between theory-informed and data-driven feature engineering, a chasm between common and best-practice model validation, and a nascent approach to data ethics and explainability. In response to these findings, our primary contribution is the proposal of the Systematic AI-driven Startup Evaluation (SAISE) Framework. This novel, five-stage prescriptive roadmap is designed to guide researchers from ad-hoc prediction toward principled evaluation. By mandating a coherent, end-to-end methodology that emphasizes stage-aware problem definition, theory-informed data synthesis, principled feature engineering, rigorous validation, and risk-aware interpretation, the SAISE framework provides a new standard for conducting more comparable, robust, and practically relevant research in this rapidly maturing domain.
24. Large Language Models Transform Organic Synthesis From Reaction Prediction to Automation
Authors: Kartar Kumar Lohana Tharwani, Rajesh Kumar, Sumita, Numan Ahmed, Yong Tang
Published: 2025-08-07
Source: arXiv
Large language models (LLMs) are beginning to reshape how chemists plan and run reactions in organic synthesis. Trained on millions of reported transformations, these text-based models can propose synthetic routes, forecast reaction outcomes and even instruct robots that execute experiments without human supervision. Here we survey the milestones that turned LLMs from speculative tools into practical lab partners. We show how coupling LLMs with graph neural networks, quantum calculations and real-time spectroscopy shrinks discovery cycles and supports greener, data-driven chemistry. We discuss limitations, including biased datasets, opaque reasoning and the need for safety gates that prevent unintentional hazards. Finally, we outline community initiatives (open benchmarks, federated learning and explainable interfaces) that aim to democratize access while keeping humans firmly in control. These advances chart a path towards rapid, reliable and inclusive molecular innovation powered by artificial intelligence and automation.
25. Physical Adversarial Camouflage through Gradient Calibration and Regularization
Authors: Jiawei Liang, Siyuan Liang, Jianjie Huang, Chenxi Si, Ming Zhang, Xiaochun Cao
Published: 2025-08-07
Source: arXiv
The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling-point densities across distances prevent gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely sampled to unsampled texture points. Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles and distances show that our method significantly exceeds the state of the art, with an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles. Furthermore, empirical evaluation in real-world scenarios highlights the need for more robust system design.
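One plausible reading of "prioritizes and orthogonalizes gradients based on loss values" is a Gram-Schmidt-style pass over per-view gradients ordered by loss, so lower-priority views cannot overwrite higher-priority updates. A sketch of that interpretation (the authors' exact rule may differ):

```python
import torch

def decorrelate_gradients(grads, losses):
    """Order per-view gradients by loss (highest first) and project each
    later gradient off the span of earlier ones, removing redundant or
    conflicting components before summing into one texture update."""
    order = sorted(range(len(grads)), key=lambda i: -losses[i])
    basis, total = [], torch.zeros_like(grads[0])
    for i in order:
        g = grads[i].clone()
        for b in basis:
            g -= (g @ b) * b             # remove component along b
        norm = g.norm()
        if norm > 1e-8:
            basis.append(g / norm)
        total += g
    return total

g = [torch.randn(100) for _ in range(4)]   # e.g. 4 viewing angles
print(decorrelate_gradients(g, losses=[0.9, 1.3, 0.7, 1.1]).shape)
```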
26. Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms
Authors: Jie Xiao, Shaoduo Gan, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai
Published: 2025-08-07
Source: arXiv
Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today's distributed training systems. We present Echo, an RL system that cleanly decouples these two phases across heterogeneous "inference" and "training" swarms while preserving statistical efficiency. Echo introduces two lightweight synchronization protocols: a sequential pull mode that refreshes sampler weights on every API call for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training three representative RL workloads with Qwen3-4B, Qwen2.5-7B and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while off-loading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs could achieve datacentre-grade performance using decentralised, heterogeneous resources.
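The asynchronous push-pull mode hinges on streaming version-tagged rollouts through a replay buffer so the trainer can bound staleness. A minimal sketch of such a buffer (the Rollout record, staleness rule, and max_lag value are illustrative, not Echo's actual interface):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Rollout:
    policy_version: int
    trajectory: list          # token ids, rewards, etc.

class VersionedReplayBuffer:
    """Accepts rollouts streamed from inference workers; the trainer
    drains only rollouts within `max_lag` versions of its own policy."""
    def __init__(self, max_lag=2, maxlen=10000):
        self.buf, self.max_lag = deque(maxlen=maxlen), max_lag

    def push(self, rollout: Rollout):
        self.buf.append(rollout)

    def drain(self, trainer_version: int):
        fresh = [r for r in self.buf
                 if trainer_version - r.policy_version <= self.max_lag]
        self.buf.clear()      # stale rollouts are dropped
        return fresh

buf = VersionedReplayBuffer(max_lag=1)
buf.push(Rollout(policy_version=4, trajectory=[]))
buf.push(Rollout(policy_version=6, trajectory=[]))
print(len(buf.drain(trainer_version=6)))   # 1: the version-4 rollout is stale
```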
27. CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation
Authors: Hamza Kalisch, Fabian Hörst, Jens Kleesiek, Ken Herrmann, Constantin Seibold
Published: 2025-08-07
Source: arXiv
As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture the fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial absolute improvement of 7.9% in F1 score over current state-of-the-art methods. The code is publicly available at https://github.com/hakal104/CT-GRAPH.
28. Affecta-Context: The Context-Guided Behavior Adaptation Framework
Authors: Morten Roed Frederiksen, Kasper Støy
Published: 2025-08-07
Source: arXiv
This paper presents Affecta-context, a general framework to facilitate behavior adaptation for social robots. The framework uses information about the physical context to guide its behaviors in human-robot interactions. It consists of two parts: one that represents encountered contexts and one that learns to prioritize between behaviors through human-robot interactions. As physical contexts are encountered, the framework clusters them by their measured physical properties. In each context, the framework learns to prioritize between behaviors to optimize the physical attributes of the robot's behavior in line with its current environment and the preferences of the users it interacts with. This paper illustrates the abilities of the Affecta-context framework by enabling a robot to autonomously learn the prioritization of discrete behaviors. This was achieved by training across 72 interactions in two different physical contexts with 6 different human test participants. The paper demonstrates the trained Affecta-context framework by verifying the robot's ability to generalize over the input and to match its behaviors to a previously unvisited physical context.
29. PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation
Authors: Kang Liu, Zhuoqi Ma, Zikang Fang, Yunan Li, Kun Xie, Qiguang Miao
Published: 2025-08-07
Source: arXiv
Chest X-ray report generation aims to reduce radiologists' workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge -- including clinical context (e.g., symptoms, medical history) and the most recent prior image -- which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder's hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. Code and checkpoints will be released upon acceptance.
30. VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test
Authors: Meiqi Wu, Yaxuan Kang, Xuchen Li, Shiyu Hu, Xiaotang Chen, Yunfeng Kang, Weiqiang Wang, Kaiqi Huang
Published: 2025-08-07
Source: arXiv
The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants' mental states through their sketches. Specifically, through sketches with the theme of "a person picking an apple from a tree (PPAT)", it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists' understanding of an individual's mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, the PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we make the following efforts: (1) providing an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) offering a Visual-Semantic depression assessment method based on LLMs (VS-LLM); (3) demonstrating experimentally that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to research on mental state assessment based on element recognition in PPAT sketches. Our datasets and codes are available at https://github.com/wmeiqi/VS-LLM.
31. Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction
Authors: Sahar Salimpour, Lei Fu, Farhad Keramat, Leonardo Militano, Giovanni Toffetti, Harry Edelman, Jorge Peña Queralta
Published: 2025-08-07
Source: arXiv
Foundation models, including large language models (LLMs) and vision-language models (VLMs), have recently enabled novel approaches to robot autonomy and human-robot interfaces. In parallel, vision-language-action models (VLAs) and large behavior models (LBMs) are increasing the dexterity and capabilities of robotic systems. This survey focuses on those works advancing towards agentic applications and architectures, ranging from initial efforts exploring GPT-style interfaces to tooling, to more complex systems where AI agents act as coordinators, planners, perception actors, or generalist interfaces. Such agentic architectures allow robots to reason over natural language instructions, invoke APIs, plan task sequences, or assist in operations and diagnostics. In addition to peer-reviewed research, and given the fast-evolving nature of the field, we highlight community-driven projects, ROS packages, and industrial frameworks that show emerging trends. We propose a taxonomy for classifying model-integration approaches and present a comparative analysis of the roles that agents play in different solutions in today's literature.
32. Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue
Authors: Sukannya Purkayastha, Nils Dycke, Anne Lauscher, Iryna Gurevych
Published: 2025-08-07
Source: arXiv
Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. (Code and data: https://github.com/UKPLab/arxiv2025-meta-review-as-dialog)
33. RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding
Authors: Tianchen Fang, Guiru Liu
Published: 2025-08-07
Source: arXiv
Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.
34. Deep Learning Based Dynamic Environment Reconstruction for Vehicular ISAC Scenarios
Authors: Junzhe Song, Ruisi He, Mi Yang, Zhengyu Zhang, Bingcheng Liu, Jiahui Han, Haoxiang Zhang, Bo Ai
Published: 2025-08-07
Source: arXiv
Integrated Sensing and Communication (ISAC) technology plays a critical role in future intelligent transportation systems by enabling vehicles to perceive and reconstruct the surrounding environment through reuse of wireless signals, thereby reducing or even eliminating the need for additional sensors such as LiDAR or radar. However, existing ISAC-based reconstruction methods often lack the ability to track dynamic scenes with sufficient accuracy and temporal consistency, limiting their real-world applicability. To address this limitation, we propose a deep learning-based framework for vehicular environment reconstruction using ISAC channels. We first establish a joint channel-environment dataset based on multimodal measurements from real-world urban street scenarios. Then, a multistage deep learning network is developed to reconstruct the environment: a scene decoder identifies the environmental context, such as buildings and trees; a cluster-center decoder predicts coarse spatial layouts by localizing dominant scattering centers; and a point cloud decoder recovers the fine-grained geometry and structure of the surrounding environment. Experimental results demonstrate that the proposed method achieves high-quality dynamic environment reconstruction with a Chamfer Distance of 0.29 and an F-score@1% of 0.87. In addition, complexity analysis demonstrates the efficiency and practical applicability of the method in real-time scenarios. This work provides a pathway toward low-cost, ISAC-based environment reconstruction for future intelligent transportation.
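Both reported metrics are standard point-cloud measures, so they are worth spelling out. A brute-force sketch (fine for small clouds; `tau` would be 1% of the scene scale for F-score@1%):

```python
import numpy as np

def chamfer_and_fscore(pred, gt, tau):
    """Symmetric Chamfer distance and F-score at threshold tau for
    point clouds pred (N, 3) and gt (M, 3)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    d_pg, d_gp = d.min(axis=1), d.min(axis=0)   # nearest-neighbour distances
    chamfer = d_pg.mean() + d_gp.mean()
    precision, recall = (d_pg < tau).mean(), (d_gp < tau).mean()
    f = 2 * precision * recall / max(precision + recall, 1e-12)
    return chamfer, f

rng = np.random.default_rng(3)
gt = rng.uniform(size=(500, 3))
pred = gt + rng.normal(scale=0.01, size=gt.shape)   # noisy reconstruction
print(chamfer_and_fscore(pred, gt, tau=0.01))
```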
35. QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering
Authors: Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, Qing Li
Published: 2025-08-07
Source: arXiv
Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query's subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.
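Architecturally, the domain and search routers amount to dispatch logic placed in front of the retrieval agents. A toy sketch of the control flow (the keyword predicate and agent stubs are placeholders for the learned routers and real search agents):

```python
def search_router(query, has_image):
    """Pick a retrieval strategy for the query; a stand-in for the
    learned search router described in the paper."""
    needs_text = any(w in query.lower() for w in ("who", "when", "price"))
    if has_image and needs_text:
        return "hybrid"       # orchestrate image + text search agents
    return "image" if has_image else "text"

def answer(query, image=None):
    strategy = search_router(query, image is not None)
    retrievers = {"text": lambda: ["text passages"],
                  "image": lambda: ["visually similar items"],
                  "hybrid": lambda: ["text passages", "visual matches"]}
    evidence = retrievers[strategy]()
    return f"[{strategy}] answer grounded in {evidence}"

print(answer("Who makes this chair?", image=object()))  # hybrid route
```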