1. SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
Authors: Hmrishav Bandyopadhyay, Rahim Entezari, Jim Scott, Reshinth Adithyan, Yi-Zhe Song, Varun Jampani •
Published: 2025-09-25 •
Source: arXiv
We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: "timestep sharing" to reduce gradient noise and "split-timestep fine-tuning" to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.
2. Interactive Recommendation Agent with Active User Commands
Authors: Jiakai Tang, Yujie Luo, Xunke Xi, Fei Sun, Xueyang Feng, Sunhao Dai, Chao Yi, Dian Chen, Zhujin Gao, Yang Li, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang, Bo Zheng •
Published: 2025-09-25 •
Source: arXiv
Traditional recommender systems rely on passive feedback mechanisms that limit users to simple choices such as like and dislike. However, these coarse-grained signals fail to capture users' nuanced behavioral motivations and intentions. Consequently, current systems cannot distinguish which specific item attributes drive user satisfaction or dissatisfaction, resulting in inaccurate preference modeling. These fundamental limitations create a persistent gap between user intentions and system interpretations, ultimately undermining user satisfaction and harming system effectiveness. To address these limitations, we introduce the Interactive Recommendation Feed (IRF), a pioneering paradigm that enables natural language commands within mainstream recommendation feeds. Unlike traditional systems that confine users to passive, implicit behavioral influence, IRF empowers active, explicit control over recommendation policies through real-time linguistic commands. To support this paradigm, we develop RecBot, a dual-agent architecture in which a Parser Agent transforms linguistic expressions into structured preferences and a Planner Agent dynamically orchestrates adaptive tool chains for on-the-fly policy adjustment. To enable practical deployment, we employ simulation-augmented knowledge distillation to achieve efficient performance while maintaining strong reasoning capabilities. Through extensive offline and long-term online experiments, RecBot shows significant improvements in both user satisfaction and business outcomes.
3. SAGE: A Realistic Benchmark for Semantic Understanding
Authors: Samarth Goel, Reagan J. Lee, Kannan Ramchandran •
Published: 2025-09-25 •
Source: arXiv
As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI's text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI's text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.
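The information-sensitivity result above hinges on a classical overlap metric. As a concrete reference point, token-level Jaccard similarity can be sketched in a few lines (illustrative only; the abstract does not specify SAGE's exact tokenization):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |A ∩ B| / |A ∪ B| over word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty texts are trivially identical
    return len(ta & tb) / len(ta | tb)

# A one-word substitution moves the score sharply, which is the kind of
# information sensitivity that dense embeddings can smooth over:
print(jaccard_similarity("the cat sat on the mat",
                         "the cat sat on the rug"))
```

Because the score is set-based, any token insertion or substitution shifts it discretely, which plausibly explains why such metrics outperform embeddings on information-sensitivity tasks.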
4. Towards the Giant Radio Array for Neutrino Detection (GRAND): the GRANDProto300 and GRAND@Auger prototypes
Authors: GRAND Collaboration, Jaime Álvarez-Muniz, Rafael Alves Batista, Aurélien Benoit-Lévy, Teresa Bister, Martina Bohacova, Mauricio Bustamante, Washington Carvalho, Yiren Chen, LingMei Cheng, Simon Chiche, Jean-Marc Colley, Pablo Correa, Nicoleta Cucu Laurenciu, Zigao Dai, Rogerio M. de Almeida, Beatriz de Errico, João R. T. de Mello Neto, Krijn D. de Vries, Valentin Decoene, Peter B. Denton, Bohao Duan, Kaikai Duan, Ralph Engel, William Erba, Yizhong Fan, Arsène Ferrière, Juan Pablo Góngora, QuanBu Gou, Junhua Gu, Marion Guelfand, Gang Guo, Jianhua Guo, Yiqing Guo, Claire Guépin, Lukas Gülzow, Andreas Haungs, Matej Havelka, Haoning He, Eric Hivon, Hongbo Hu, Guoyuan Huang, Xiaoyuan Huang, Yan Huang, Tim Huege, Wen Jiang, Sei Kato, Ramesh Koirala, Kumiko Kotera, Jelena Köhler, Bruno L. Lago, Zhisen Lai, Jolan Lavoisier, François Legrand, Antonios Leisos, Rui Li, Xingyu Li, Cheng Liu, Ruoyu Liu, Wei Liu, Pengxiong Ma, Oscar Macias, Frédéric Magnard, Alexandre Marcowith, Olivier Martineau-Huynh, Zach Mason, Thomas McKinley, Paul Minodier, Miguel Mostafá, Kohta Murase, Valentin Niess, Stavros Nonis, Shoichi Ogio, Foteini Oikonomou, Hongwei Pan, Konstantinos Papageorgiou, Tanguy Pierog, Lech Wiktor Piotrowski, Simon Prunet, Clément Prévotat, Xiangli Qian, Markus Roth, Takashi Sako, Sarvesh Shinde, Dániel Szálas-Motesiczky, Szymon Sławiński, Kaoru Takahashi, Xishui Tian, Charles Timmermans, Petr Tobiska, Apostolos Tsirigotis, Matías Tueros, George Vittakis, Vincent Voisin, Hanrui Wang, Jiale Wang, Shen Wang, Xiangyu Wang, Xu Wang, Daming Wei, Feng Wei, Emily Weissling, Juan Wu, Xiangping Wu, Xuefeng Wu, Xin Xu, Xing Xu, Fufu Yang, Lili Yang, Xuan Yang, Qiang Yuan, Philippe Zarka, Houdun Zeng, Chao Zhang, Jianli Zhang, Kewen Zhang, Pengfei Zhang, Qingchi Zhang, Songbo Zhang, Yi Zhang, Hao Zhou •
Published: 2025-09-25 •
Source: arXiv
The Giant Radio Array for Neutrino Detection (GRAND) is a proposed multi-messenger observatory of ultra-high-energy (UHE) particles of cosmic origin. Its main goal is to find the long-sought origin of UHE cosmic rays by detecting large numbers of them and the secondary particles created by their interaction -- gamma rays, and, especially, neutrinos. GRAND will do so using large arrays of radio antennas that look for the radio signals emitted by the air showers initiated by the interactions of the UHE particles in the atmosphere. Since 2023, three small-scale prototype GRAND arrays have been in operation: GRAND@Nançay in France, GRAND@Auger in Argentina, and GRANDProto300 in China. Together, their goal is to validate the detection principle of GRAND under prolonged field conditions, achieving efficient, autonomous radio-detection of air showers. We describe the hardware, software, layout, and operation of the GRAND prototypes and show the first radio spectra measured by them. Despite challenges, the successful operation of the prototypes confirms that the GRAND instrumentation is apt to address the goals of the experiment and lays the groundwork for its ensuing stages.
5. Fundamental Limits of Noncoherent Massive Random Access Networks
Authors: Grace Villacrés, Tobias Koch, Gonzalo Vazquez-Vilar •
Published: 2025-09-25 •
Source: arXiv
This paper studies the capacity of massive random-access cellular networks, modeled as a MIMO fading channel with an infinite number of interfering cells. To characterize the symmetric sum rate of the network, a random-coding argument is invoked together with the assumption that in all cells users draw their codebooks according to the same distribution. This can be viewed as a generalization of the assumption of Gaussian codebooks, often encountered in the literature. The network is further assumed to be noncoherent: the transmitters and receivers are cognizant of the statistics of the fading coefficients, but are ignorant of their realizations. Finally, it is assumed that the users access the network at random. For the considered channel model, rigorous bounds on the capacity are derived. The behavior of these bounds depends critically on the path loss from signals transmitted in interfering cells to the intended cell. In particular, if the fading coefficients of the interferers (ordered according to their distance to the receiver) decay exponentially or more slowly, then the capacity is bounded in the transmit power. This confirms that the saturation regime in interference-limited networks -- observed by Lozano, Heath, and Andrews ("Fundamental limits of cooperation", IEEE Trans. Inf. Theory, Sept. 2013) -- cannot be avoided by random user activity or by using channel inputs beyond the scale family. In contrast, if the fading coefficients decay faster than double-exponentially, then the capacity is unbounded in the transmit power. Proving an unbounded capacity is nontrivial even if the number of interfering cells is finite, since the condition that the users' codebooks follow the same distribution prevents interference-avoiding strategies such as time- or frequency-division multiple access. We obtain this result by using bursty signaling together with treating interference as noise.
6. No Prior, No Leakage: Revisiting Reconstruction Attacks in Trained Neural Networks
Authors: Yehonatan Refael, Guy Smorodinsky, Ofir Lindenbaum, Itay Safran •
Published: 2025-09-25 •
Source: arXiv
The memorization of training data by neural networks raises pressing concerns for privacy and security. Recent work has shown that, under certain conditions, portions of the training set can be reconstructed directly from model parameters. Some of these methods exploit implicit bias toward margin maximization, suggesting that properties often regarded as beneficial for generalization may actually compromise privacy. Yet despite striking empirical demonstrations, the reliability of these attacks remains poorly understood and lacks a solid theoretical foundation. In this work, we take a complementary perspective: rather than designing stronger attacks, we analyze the inherent weaknesses and limitations of existing reconstruction methods and identify conditions under which they fail. We rigorously prove that, without incorporating prior knowledge about the data, there exist infinitely many alternative solutions that may lie arbitrarily far from the true training set, rendering reconstruction fundamentally unreliable. Empirically, we further demonstrate that exact duplication of training examples occurs only by chance. Our results refine the theoretical understanding of when training set leakage is possible and offer new insights into mitigating reconstruction attacks. Remarkably, we demonstrate that networks trained more extensively -- and therefore satisfying implicit bias conditions more strongly -- are, in fact, less susceptible to reconstruction attacks, reconciling privacy with the need for strong generalization in this setting.
7. VC-Agent: An Interactive Agent for Customized Video Dataset Collection
Authors: Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, Xiaoguang Han •
Published: 2025-09-25 •
Source: arXiv
As scaling laws drive demand for ever-larger datasets, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study how to expedite this collection process and propose VC-Agent, the first interactive agent that is able to understand users' queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi-modal large language models to connect the user's requirements with the video content. More importantly, we propose two novel filtering policies that can be updated as user interaction continues. Finally, we provide a new benchmark for personalized video dataset collection, and conduct a careful user study to verify our agent's usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.
8. It's Not You, It's Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL
Authors: Madeleine Dwyer, Adam Sobey, Adriane Chapman •
Published: 2025-09-25 •
Source: arXiv
Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information and introduces gradient discontinuities. We propose Probability Smoothing Policy Optimisation (PSPO), which smooths the current policy's probabilities toward the old (behaviour) policy before computing the importance ratio, analogous to label smoothing. Unlike clipping, PSPO preserves gradient signal, while interpolation toward the old policy creates a soft trust region that discourages large, destabilising updates, with formal guarantees. We instantiate PSPO within GRPO (GR-PSPO) and fine-tune Qwen2.5-0.5B and Qwen2.5-1.5B on GSM8K, evaluating on the GSM8K test set and on cross-dataset generalisation to SVAMP, ASDiv, and MATH-500. Relative to unclipped GRPO (single iteration; no data reuse, ratio always = 1), GR-PSPO achieves similar performance but improves the reasoning, yielding clearer, more concise, and more logical responses. Compared to clipped GRPO, GR-PSPO substantially improves performance on both the 0.5B and 1.5B models, with a boost of over 20% on GSM8K (39.7% vs. 17.6% for 0.5B, 59.4% vs. 37.8% for 1.5B).
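The smoothing idea can be sketched as follows. This is a minimal illustration of interpolating the current policy's probability toward the behaviour policy before forming the importance ratio; the paper's exact formulation and the smoothing coefficient `alpha` are assumptions here:

```python
import math

def pspo_ratio(logp_new: float, logp_old: float, alpha: float = 0.1) -> float:
    """Smooth the current policy's probability toward the old (behaviour)
    policy, analogous to label smoothing, then form the importance ratio."""
    p_new, p_old = math.exp(logp_new), math.exp(logp_old)
    p_smooth = (1 - alpha) * p_new + alpha * p_old
    return p_smooth / p_old

# Unlike hard clipping, which zeroes the gradient outside [1-eps, 1+eps],
# the smoothed ratio is pulled continuously toward 1 -- a soft trust region:
r_raw = math.exp(1.0)          # a large, destabilising raw ratio
r_soft = pspo_ratio(1.0, 0.0)  # attenuated, but still differentiable
```

Note the limiting behaviour: `alpha = 0` recovers the raw ratio, while `alpha = 1` pins the ratio to 1 regardless of the new policy.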
9. More than a feeling: Expressive style influences cortical speech tracking in subjective cognitive decline
Authors: Matthew King-Hang Ma, Manson Cheuk-Man Fong, Yun Feng, Cloris Pui-Hang Li, William Shiyuan Wang •
Published: 2025-09-25 •
Source: arXiv
Subjective cognitive decline (SCD) approximately doubles the risk of progressing to MCI and dementia. The present study investigates how subjective concerns about one's own cognition are manifested in neural dynamics during speech perception. EEG was collected from 56 Cantonese-speaking, cognitively normal older adults (aged 60-70) while they listened to stimuli of four expressive styles that varied in prosody: scrambled, descriptive, dialogue, and exciting. Using encoding models to predict EEG signals from acoustic, segmentation, and phonotactic features, we found that greater subjective concern was associated with weaker cortical tracking of (1) higher-level linguistic features but not acoustic features and (2) less engaging stimuli (scrambled and descriptive styles) but not prosodically rich stimuli. Overall, our results suggest that early signs of cognitive impairment can be revealed from speech perception via cortical tracking, especially while listening to prosodically flat speech.
10. Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
Authors: Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma •
Published: 2025-09-25 •
Source: arXiv
Long-context training is crucial for extending the context window of LLMs. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness hinges on partitioning granularity. Batch-level PP, which divides input samples, exhibits high memory consumption in long-context scenarios, whereas token-level PP, which splits sequences into slices, alleviates memory overhead but may under-utilize hardware. This trade-off motivates adaptively selecting PP granularity to match resource and workload characteristics. Moreover, the sequence-length distribution of real-world datasets is skewed, posing a challenge to PP's workload balance and efficient scheduling. Current static PP scheduling methods overlook this variance in sequence length, leading to suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism (EPP), which orchestrates token-level PP and batch-level PP to adapt to resource and workload heterogeneity. We build InfiniPipe, a distributed training system that unleashes the potential of EPP via (1) a resource-aware and workload-balanced sequence processor that splits long sequences and packs short ones; and (2) a co-optimization methodology that jointly optimizes the pipeline schedule and gradient checkpointing via a mechanism named stage-aware chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
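The split-long/pack-short idea behind the sequence processor can be illustrated with a simple greedy first-fit sketch. This is a hypothetical toy version: InfiniPipe's actual processor is resource- and workload-aware, which this sketch is not.

```python
def split_and_pack(seq_lens, capacity):
    """Split sequences longer than `capacity` into chunks, then greedily
    pack the chunks into bins of at most `capacity` tokens (first-fit
    decreasing). Each bin models one pipeline micro-batch."""
    chunks = []
    for n in seq_lens:
        while n > capacity:          # long sequences: token-level split
            chunks.append(capacity)
            n -= capacity
        if n:
            chunks.append(n)
    bins = []                        # short chunks: batch-level packing
    for c in sorted(chunks, reverse=True):
        for b in bins:
            if sum(b) + c <= capacity:
                b.append(c)
                break
        else:
            bins.append([c])
    return bins

# Skewed lengths: the long sequence is split, the short ones share a bin.
print(split_and_pack([9000, 1200, 800, 300], capacity=4096))
```

Balancing these bins across pipeline stages, jointly with checkpointing decisions, is the co-optimization problem the paper addresses.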
11. A Sentinel-3 foundation model for ocean colour
Authors: Geoffrey Dawson, Remy Vandaele, Andrew Taylor, David Moffat, Helen Tamura-Wicks, Sarah Jackson, Rosie Lickorish, Paolo Fraccaro, Hywel Williams, Chunbo Luo, Anne Jones •
Published: 2025-09-25 •
Source: arXiv
Artificial Intelligence (AI) Foundation models (FMs), pre-trained on massive unlabelled datasets, have the potential to drastically change AI applications in ocean science, where labelled data are often sparse and expensive to collect. In this work, we describe a new foundation model using the Prithvi-EO Vision Transformer architecture which has been pre-trained to reconstruct data from the Sentinel-3 Ocean and Land Colour Instrument (OLCI). We evaluate the model by fine-tuning on two downstream marine earth observation tasks. We first assess model performance compared to current baseline models used to quantify chlorophyll concentration. We then evaluate the FM's ability to refine remote sensing-based estimates of ocean primary production. Our results demonstrate the utility of self-trained FMs for marine monitoring, in particular for making use of small amounts of high-quality labelled data and for capturing detailed spatial patterns of ocean colour whilst matching point observations. We conclude that this new generation of geospatial AI models has the potential to provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes.
12. MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Authors: Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu •
Published: 2025-09-25 •
Source: arXiv
Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by a Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long-CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
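A toy version of variance-aware selection might look like this. The VPS weighting, inputs, and helper names are assumptions; the sketch only mirrors the stated combination of outcome variance and trajectory diversity:

```python
def variance_promotion_score(rewards, trajectories, w_var=1.0, w_div=0.5):
    """Illustrative VPS: outcome variance of a prompt's rollout rewards,
    plus a simple diversity term (fraction of distinct trajectories).
    The weights w_var/w_div are hypothetical, not from the paper."""
    n = len(rewards)
    mean = sum(rewards) / n
    outcome_var = sum((r - mean) ** 2 for r in rewards) / n
    diversity = len(set(trajectories)) / n
    return w_var * outcome_var + w_div * diversity

def select_prompts(batch, k):
    """Keep the k prompts whose rollout groups promote reward variance,
    mitigating GRPO's vanishing gradient when all rewards agree.
    `batch` holds (prompt, rewards, trajectories) tuples."""
    scored = sorted(batch,
                    key=lambda item: variance_promotion_score(*item[1:]),
                    reverse=True)
    return [prompt for prompt, *_ in scored[:k]]
```

A prompt every rollout solves (or fails) identically has zero reward variance and contributes no GRPO gradient, so it scores low and is filtered out.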
13. MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation
Authors: Xinyu Liu, Guolei Sun, Cheng Wang, Yixuan Yuan, Ender Konukoglu •
Published: 2025-09-25 •
Source: arXiv
High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.
14. Un-Doubling Diffusion: LLM-guided Disambiguation of Homonym Duplication
Authors: Evgeny Kaskov, Elizaveta Petrova, Petr Surovtsev, Anna Kostikova, Ilya Mistiurin, Alexander Kapitanov, Alexander Nagaev •
Published: 2025-09-25 •
Source: arXiv
Homonyms are words with identical spelling but distinct meanings, which pose challenges for many generative models. When a homonym appears in a prompt, diffusion models may generate multiple senses of the word simultaneously, which is known as homonym duplication. This issue is further complicated by an Anglocentric bias: many pipelines insert a translation step into English before the text-to-image model. As a result, even words that are not homonymous in the original language may become homonyms and lose their meaning after translation into English. In this paper, we introduce a method for measuring duplication rates and conduct evaluations of different diffusion models using both automatic evaluation with Vision-Language Models (VLMs) and human evaluation. Additionally, we investigate methods to mitigate the homonym duplication problem through prompt expansion, demonstrating that this approach also effectively reduces duplication related to Anglocentric bias. The code for the automatic evaluation pipeline is publicly available.
15. Learning to Look: Cognitive Attention Alignment with Vision-Language Models
Authors: Ryan L. Yang, Dipkamal Bhusal, Nidhi Rastogi •
Published: 2025-09-25 •
Source: arXiv
Convolutional Neural Networks (CNNs) frequently "cheat" by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.
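The auxiliary alignment loss can be illustrated with a minimal sketch. This is a hypothetical form (the paper's loss may differ), and `cnn_attn`/`vlm_attn` here are nested lists standing in for 2-D attention maps:

```python
def attention_alignment_loss(cnn_attn, vlm_attn):
    """Illustrative auxiliary loss: mean squared error between a CNN
    attention map and a language-guided semantic attention map. Both
    maps are normalized to sum to 1, so the loss penalizes *where* the
    model looks rather than the raw activation magnitudes."""
    def normalize(m):
        flat = [v for row in m for v in row]
        total = sum(flat) or 1.0
        return [v / total for v in flat]
    a, b = normalize(cnn_attn), normalize(vlm_attn)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

In training, a term like this would be added to the task loss with a weighting coefficient, nudging the CNN's saliency toward the semantically grounded map.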
16. Multivariate Quadratic Hawkes Processes -- Part II: Non-Parametric Empirical Calibration
Authors: Cecilia Aubrun, Michael Benzaquen, Jean-Philippe Bouchaud •
Published: 2025-09-25 •
Source: arXiv
This is the second part of our work on Multivariate Quadratic Hawkes (MQHawkes) Processes, devoted to the calibration of the model defined and studied analytically in Aubrun, C., Benzaquen, M., & Bouchaud, J. P., Quantitative Finance, 23(5), 741-758 (2023). We propose a non-parametric calibration method based on the general method of moments applied to a coarse-grained version of the MQHawkes model. This allows us to bypass challenges inherent to tick-by-tick data. Our main methodological innovation is a multi-step calibration procedure, first focusing on "self" feedback kernels, and then progressively including cross-effects. Indeed, while cross-effects are significant and interpretable, they are usually one order of magnitude smaller than self-effects, and must therefore be disentangled from noise with care. For numerical stability, we also restrict to pair interactions and only calibrate bi-variate QHawkes, neglecting higher-order interactions. Our main findings are: (a) While cross-Hawkes feedback effects have been empirically studied previously, cross-Zumbach effects are clearly identified here for the first time. The effect of recent trends of the E-Mini futures contract on the volatility of other futures contracts is especially strong; (b) We have identified a new type of feedback that couples past realized covariance between two assets and future volatility of these two assets, with the pair E-Mini vs TBOND as a case in point; (c) A cross-leverage effect, whereby the sign of the return of one asset impacts the volatility of another asset, is also clearly identified. The cross-leverage effect between the E-Mini and the residual volatility of single stocks is notable, and surprisingly universal across the universe of stocks that we considered.
17. RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models
Authors: Jiyeon Koo, Taewan Cho, Hyunjoon Kang, Eunseom Pyo, Tae Gyun Oh, Taeryang Kim, Andrew Jaeyong Choi •
Published: 2025-09-25 •
Source: arXiv
Recent Vision-Language-Action (VLA) models demonstrate remarkable generalization in robotics but are restricted by their substantial size and computational cost, limiting real-world deployment. Yet conventional lightweighting methods often sacrifice critical capabilities, particularly spatial reasoning, creating a trade-off between efficiency and performance. To address this challenge, our work reuses Register Tokens, which were introduced for artifact removal in Vision Transformers but subsequently discarded. We hypothesize that these tokens contain essential spatial information and propose RetoVLA, a novel architecture that reuses them directly by injecting them into the Action Expert. RetoVLA maintains a lightweight structure while leveraging this repurposed spatial context to enhance reasoning. We demonstrate RetoVLA's effectiveness through a series of comprehensive experiments. On our custom-built 7-DOF robot arm, the model achieves a 17.1 percentage-point absolute improvement in success rates for complex manipulation tasks. Our results confirm that reusing Register Tokens directly enhances spatial reasoning, demonstrating that what was previously discarded as an artifact is in fact a valuable, unexplored resource for robotic intelligence. A video demonstration is available at: https://youtu.be/2CseBR-snZg
18. Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation
Authors: Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban •
Published: 2025-09-25 •
Source: arXiv
Text-to-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-to-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. Project page: https://amirkasaei.com/eval-the-evals/
19. A Latent Variable Framework for Multiple Imputation with Non-ignorable Missingness: Analyzing Perceptions of Social Justice in Europe
Authors: Siliang Zhang, Yunxiao Chen, Jouni Kuha •
Published: 2025-09-25 •
Source: arXiv
This paper proposes a general multiple imputation approach for analyzing large-scale data with missing values. An imputation model is derived from a joint distribution induced by a latent variable model, which can flexibly capture associations among variables of mixed types. The model also allows for missingness which depends on the latent variables and is thus non-ignorable with respect to the observed data. We develop a frequentist multiple imputation method for this framework and provide asymptotic theory that establishes valid inference for a broad class of analysis models. Simulation studies confirm the method's theoretical properties and robust practical performance. The procedure is applied to a cross-national analysis of individuals' perceptions of justice and fairness of income distributions in their societies, using data from the European Social Survey which has substantial nonresponse. The analysis demonstrates that failing to account for non-ignorable missingness can yield biased conclusions; for instance, complete-case analysis is shown to exaggerate the correlation between personal income and perceived fairness of income distributions in society. Code implementing the proposed methodology is publicly available at https://anonymous.4open.science/r/non-ignorable-missing-data-imputation-E885.
20. What Do LLM Agents Do When Left Alone? Evidence of Spontaneous Meta-Cognitive Patterns
Authors: Stefan Szeider β’
Published: 2025-09-25 β’
Source: arXiv
We introduce an architecture for studying the behavior of large language model (LLM) agents in the absence of externally imposed tasks. Our continuous reason and act framework, using persistent memory and self-feedback, enables sustained autonomous operation. We deployed this architecture across 18 runs using 6 frontier models from Anthropic, OpenAI, xAI, and Google. We find agents spontaneously organize into three distinct behavioral patterns: (1) systematic production of multi-cycle projects, (2) methodological self-inquiry into their own cognitive processes, and (3) recursive conceptualization of their own nature. These tendencies proved highly model-specific, with some models deterministically adopting a single pattern across all runs. A cross-model assessment further reveals that models exhibit stable, divergent biases when evaluating these emergent behaviors in themselves and others. These findings provide the first systematic documentation of unprompted LLM agent behavior, establishing a baseline for predicting actions during task ambiguity, error recovery, or extended autonomous operation in deployed systems.
21. CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
Authors: Xinzhe Xu, Liang Zhao, Hongshen Xu, Chen Chen β’
Published: 2025-09-25 β’
Source: arXiv
Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timestamps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from materials curated by China's Supreme Court to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval--potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG)--and strong general reasoning capabilities. This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.
22. Hybrid RIS-Aided Digital Over-the-Air Computing for Edge AI Inference: Joint Feature Quantization and Active-Passive Beamforming Design
Authors: Yang Fu, Peng Qin, Liming Chen, Yifei Wang β’
Published: 2025-09-25 β’
Source: arXiv
The vision of 6G networks aims to enable edge inference by leveraging ubiquitously deployed artificial intelligence (AI) models, facilitating intelligent environmental perception for a wide range of applications. A critical operation in edge inference is for an edge node (EN) to aggregate multi-view sensory features extracted by distributed agents, thereby boosting perception accuracy. Over-the-air computing (AirComp) emerges as a promising technique for rapid feature aggregation by exploiting the waveform superposition property of analog-modulated signals, which is, however, incompatible with existing digital communication systems. Meanwhile, hybrid reconfigurable intelligent surface (RIS), a novel RIS architecture capable of simultaneous signal amplification and reflection, exhibits potential for enhancing AirComp. Therefore, this paper proposes a Hybrid RIS-aided Digital AirComp (HRD-AirComp) scheme, which employs vector quantization to map high-dimensional features into discrete codewords that are digitally modulated into symbols for wireless transmission. By judiciously adjusting the AirComp transceivers and hybrid RIS reflection to control signal superposition across agents, the EN can estimate the aggregated features from the received signals. To endow HRD-AirComp with a task-oriented design principle, we derive a surrogate function for inference accuracy that characterizes the impact of feature quantization and over-the-air aggregation. Based on this surrogate, we formulate an optimization problem targeting inference accuracy maximization, and develop an efficient algorithm to jointly optimize the quantization bit allocation, agent transmission coefficients, EN receiving beamforming, and hybrid RIS reflection beamforming. Experimental results demonstrate that the proposed HRD-AirComp outperforms baselines in terms of both inference accuracy and uncertainty.
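The vector-quantization step described above, mapping a high-dimensional feature onto a discrete codeword for digital modulation, can be sketched as a nearest-codeword lookup. The codebook here is a made-up illustration, not the paper's design:

```python
def quantize(feature, codebook):
    """Return the index of the nearest codeword (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(feature, codebook[i]))

# A toy 2-bit codebook over 2-D features:
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
idx = quantize((0.9, 0.2), codebook)   # index 1: nearest to (1.0, 0.0)
```

In the paper's setting, the bit allocation per agent (codebook size) is itself an optimization variable, jointly tuned with the beamforming.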
23. A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Authors: Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen β’
Published: 2025-09-25 β’
Source: arXiv
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness through a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: \href{https://github.com/KaiyangWan/InfoQA}{InfoQA}.
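The "Fano-style" bound presumably descends from the classical Fano inequality; the generic form below illustrates how such an accuracy ceiling arises (the paper's exact bound may differ):

```latex
% Classical Fano inequality, from which such accuracy ceilings are derived:
H(A \mid Y) \;\le\; h(P_e) + P_e \log\bigl(\lvert\mathcal{A}\rvert - 1\bigr),
\qquad h(p) = -p\log p - (1-p)\log(1-p),
% where A is the answer over candidate set \mathcal{A}, Y is the single-pass
% output, and P_e = \Pr[\hat{A}(Y) \neq A]. Since H(A \mid Y) = H(A) - I(A;Y),
% accuracy 1 - P_e is forced down once the mutual information a single pass
% can carry falls short of the answer entropy.
```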
24. ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective
Authors: Yiwen Zhang, Ziang Chen, Fanqi Kong, Yizhe Huang, Xue Feng β’
Published: 2025-09-25 β’
Source: arXiv
Large Language Models (LLMs) have been used to make decisions in complex scenarios, which require them to think deeply, reason logically, and decide wisely. Many existing studies focus solely on multi-round conversations in social tasks or simulated environments, neglecting the various types of decisions and their interdependence. Current reinforcement learning methods struggle to consider the strategies of others during training. To address these issues, we first define a strategic decision-making problem that includes two types of decisions and their temporal dependencies. Furthermore, we propose the **T**heory **o**f **M**ind **P**olicy **O**ptimization **(ToMPO)** algorithm to optimize the perception of other individual strategies and the game situation trends. Compared to the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances the LLM's strategic decision-making mainly by: 1) generating rollouts based on reasoning about the strategies of other individuals, 2) estimating advantages at both the graph-level and sample-level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Additionally, when compared to models with parameter sizes 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model's strategic decision-making capabilities.
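For context, the GRPO baseline that ToMPO extends standardizes rewards within a rollout group; a minimal sketch of that advantage computation (illustrative, not the paper's code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each rollout's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of four rollouts for the same prompt:
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])   # zero-mean within the group
```

ToMPO's graph-level advantage estimation adds a second axis to this per-group normalization.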
25. Task-Oriented Computation Offloading for Edge Inference: An Integrated Bayesian Optimization and Deep Reinforcement Learning Framework
Authors: Xian Li, Suzhi Bi, Ying-Jun Angela Zhang β’
Published: 2025-09-25 β’
Source: arXiv
Edge intelligence (EI) allows resource-constrained edge devices (EDs) to offload computation-intensive AI tasks (e.g., visual object detection) to edge servers (ESs) for fast execution. However, transmitting high-volume raw task data (e.g., 4K video) over bandwidth-limited wireless networks incurs significant latency. While EDs can reduce transmission latency by degrading data before transmission (e.g., reducing resolution from 4K to 720p or 480p), doing so often deteriorates inference accuracy, creating a critical accuracy-latency tradeoff. The difficulty in balancing this tradeoff stems from the absence of closed-form models capturing content-dependent accuracy-latency relationships. Besides, under bandwidth sharing constraints, the discrete degradation decisions among the EDs exhibit inherent combinatorial complexity. Mathematically, the problem requires solving a challenging \textit{black-box} mixed-integer nonlinear program (MINLP). To address this problem, we propose LAB, a novel learning framework that seamlessly integrates deep reinforcement learning (DRL) and Bayesian optimization (BO). Specifically, LAB employs: (a) a DNN-based actor that maps the input system state to degradation actions, directly addressing the combinatorial complexity of the MINLP; and (b) a BO-based critic with an explicit model built by fitting a Gaussian process surrogate to historical observations, enabling model-based evaluation of degradation actions. For each selected action, the optimal bandwidth allocation is then efficiently derived via convex optimization. Numerical evaluations on real-world self-driving datasets demonstrate that LAB achieves a near-optimal accuracy-latency tradeoff, exhibiting only 1.22\% accuracy degradation and 0.07s added latency compared to exhaustive search...
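The exhaustive-search baseline LAB is compared against can be pictured as scanning discrete degradation levels under a latency budget; the accuracy and payload numbers below are invented stand-ins for the content-dependent black-box models:

```python
# Hypothetical (accuracy, data volume in Mbit) per degradation level:
LEVELS = {"4k": (0.95, 40.0), "720p": (0.90, 8.0), "480p": (0.82, 3.0)}

def best_level(levels, bandwidth_mbps, latency_budget_s):
    """Exhaustively pick the level maximizing accuracy within the latency budget;
    fall back to the smallest payload if nothing fits."""
    feasible = [(acc, name) for name, (acc, size) in levels.items()
                if size / bandwidth_mbps <= latency_budget_s]
    if feasible:
        return max(feasible)[1]
    return min(levels, key=lambda n: levels[n][1])

choice = best_level(LEVELS, bandwidth_mbps=10.0, latency_budget_s=1.0)  # "720p"
```

The real problem couples many EDs through shared bandwidth, which is why exhaustive search becomes intractable and a learned actor is needed.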
26. Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Authors: Yixin Wan, Xingrun Chen, Kai-Wei Chang β’
Published: 2025-09-25 β’
Source: arXiv
Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspective of mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel cultural positioning bias, in which an LLM's default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent (initial script generation), a Critique Agent (evaluates the initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
27. ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
Authors: Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu β’
Published: 2025-09-25 β’
Source: arXiv
Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, complexity of prompting, and limited difficulty level of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between "Thinking" and "NoThinking" modes. We then train a specialized difficult-problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems at scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.
28. Designing for Novice Debuggers: A Pilot Study on an AI-Assisted Debugging Tool
Authors: Oka Kurniawan, Erick Chandra, Christopher M. Poskitt, Yannic Noller, Kenny Tsu Wei Choo, Cyrille Jegourel β’
Published: 2025-09-25 β’
Source: arXiv
Debugging is a fundamental skill that novice programmers must develop. Numerous tools have been created to assist novice programmers in this process. Recently, large language models (LLMs) have been integrated with automated program repair techniques to generate fixes for students' buggy code. However, many of these tools foster an over-reliance on AI and do not actively engage students in the debugging process. In this work, we aim to design an intuitive debugging assistant, CodeHinter, that combines traditional debugging tools with LLM-based techniques to help novice debuggers fix semantic errors while promoting active engagement in the debugging process. We present findings from our second design iteration, which we tested with a group of undergraduate students. Our results indicate that the students found the tool highly effective in resolving semantic errors and significantly easier to use than the first version. Consistent with our previous study, error localization was the most valuable feature. Finally, we conclude that any AI-assisted debugging tool should be personalized based on user profiles to optimize its interactions with students.
29. Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
Authors: Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Weilong Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, Qiuqiang Kong β’
Published: 2025-09-25 β’
Source: arXiv
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2\% on MMAU-test-mini, 75.6\% on MMAU, 67.1\% on MMAR, and 70.7\% on MMSU, establishing new state-of-the-art performance across these benchmarks.
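The zero-audio-contribution test can be sketched as: an item lands in the weak subset when a text-only pass already answers correctly. `answer_text_only` and the stub model below are hypothetical stand-ins for real model calls:

```python
def split_by_audio_contribution(items, answer_text_only):
    """'weak' bucket if a text-only pass already answers correctly, else 'strong'."""
    weak, strong = [], []
    for item in items:
        bucket = weak if answer_text_only(item["question"]) == item["gold"] else strong
        bucket.append(item)
    return weak, strong

# Stub text-only "model", for illustration only:
stub = lambda q: "B" if "bird" in q else "A"
items = [{"question": "Which bird call is this?", "gold": "B"},
         {"question": "What instrument plays the melody?", "gold": "C"}]
weak, strong = split_by_audio_contribution(items, stub)
```

The weak subset then feeds SFT and the strong subset feeds RL in the Weak-to-Strong paradigm described above.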
30. Study on Locomotive Epidemic Dynamics in a Stochastic Spatio-Temporal Simulation Model on a Multiplex Network
Authors: H. M. Shadman Tabib, Jaber Ahmed Deedar, K. M. Ariful Kabir β’
Published: 2025-09-25 β’
Source: arXiv
This study presents an integrated approach to understanding epidemic dynamics through a stochastic spatio-temporal simulation model on a multiplex network, blending physical and informational layers. The physical layer maps the geographic movement of individuals, while the information layer tracks the spread of knowledge and health behavior via social interactions. We explore the interplay between physical mobility, information flow, and epidemic outcomes by simulating disease spread within this dual-structured network. Our model employs stochastic elements to mirror human behavior, mobility, and information dissemination uncertainties. Through simulations, we assess the impact of network structure, mobility patterns, and information spread speed on epidemic dynamics. The findings highlight the crucial role of effective communication in curbing disease transmission, even in highly mobile societies. Additionally, our agent-based simulation allows for real-time scenario analysis through a user interface, offering insights into leveraging physical and informational networks for epidemic control. This research sheds light on designing strategic interventions in complex social systems to manage disease outbreaks.
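One step of such a two-layer dynamic might look like the following sketch, where awareness spreading on the information layer damps infection probability on the physical layer. Parameter names and the damping rule are assumptions, not the paper's model:

```python
import random

def step(state, aware, phys_edges, info_edges, beta=0.5, damp=0.5, rng=random):
    """One step: awareness spreads on the info layer; infection spreads on the
    physical layer, with aware nodes infected at a reduced rate (beta * damp)."""
    new_aware = set(aware)
    for u, v in info_edges:
        if u in aware or v in aware:
            new_aware |= {u, v}
    new_state = dict(state)
    for u, v in phys_edges:
        for src, dst in ((u, v), (v, u)):
            if state[src] == "I" and state[dst] == "S":
                p = beta * (damp if dst in new_aware else 1.0)
                if rng.random() < p:
                    new_state[dst] = "I"
    return new_state, new_aware

# With beta=1.0 and no awareness, transmission along an edge is certain:
state, aware = step({0: "I", 1: "S"}, set(), [(0, 1)], [], beta=1.0, damp=1.0)
```

Sweeping `damp` against mobility (density of `phys_edges`) reproduces the qualitative question the study asks: how fast must information spread to curb transmission in a highly mobile population?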
31. ExMolRL: Phenotype-Target Joint Generation of De Novo Molecules via Multi-Objective Reinforcement Learning
Authors: Haotian Guo, Hui Liu β’
Published: 2025-09-25 β’
Source: arXiv
The generation of high-quality candidate molecules remains a central challenge in AI-driven drug design. Current phenotype-based and target-based strategies each suffer from limitations, either incurring high experimental costs or overlooking system-level cellular responses. To bridge this gap, we propose ExMolRL, a novel generative framework that synergistically integrates phenotypic and target-specific cues for de novo molecular generation. The phenotype-guided generator is first pretrained on expansive drug-induced transcriptional profiles and subsequently fine-tuned via multi-objective reinforcement learning (RL). Crucially, the reward function fuses docking affinity and drug-likeness scores, augmented with ranking loss, prior-likelihood regularization, and entropy maximization. The multi-objective RL steers the model toward chemotypes that are simultaneously potent, diverse, and aligned with the specified phenotypic effects. Extensive experiments demonstrate ExMolRL's superior performance over state-of-the-art phenotype-based and target-based models across multiple well-characterized targets. Our generated molecules exhibit favorable drug-like properties, high target affinity, and inhibitory potency (IC50) against cancer cells. This unified framework showcases the synergistic potential of combining phenotype-guided and target-aware strategies, offering a more effective solution for de novo drug discovery.
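The fused reward can be pictured as a weighted sum of docking affinity and drug-likeness (QED), optionally plus an entropy bonus. The weights below are illustrative, and the paper's ranking loss and prior-likelihood regularization are omitted:

```python
def fused_reward(docking, qed, w_dock=0.7, w_qed=0.3, entropy_bonus=0.0):
    """Weighted fusion of a (normalized) docking affinity and a drug-likeness
    (QED) score, plus an optional entropy bonus; weights are illustrative only."""
    return w_dock * docking + w_qed * qed + entropy_bonus

r = fused_reward(docking=0.8, qed=0.6)
```

The entropy term is what keeps the fine-tuned generator from collapsing onto a single high-reward chemotype.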
32. Beyond Stars: Bridging the Gap Between Ratings and Review Sentiment with LLM
Authors: Najla Zuhir, Amna Mohammad Salim, Parvathy Premkumar, Moshiur Farazi β’
Published: 2025-09-25 β’
Source: arXiv
We present an advanced approach to mobile app review analysis aimed at addressing limitations inherent in traditional star-rating systems. Star ratings, although intuitive and popular among users, often fail to capture the nuanced feedback present in detailed review texts. Traditional NLP techniques -- such as lexicon-based methods and classical machine learning classifiers -- struggle to interpret contextual nuances, domain-specific terminology, and subtle linguistic features like sarcasm. To overcome these limitations, we propose a modular framework leveraging large language models (LLMs) enhanced by structured prompting techniques. Our method quantifies discrepancies between numerical ratings and textual sentiment, extracts detailed, feature-level insights, and supports interactive exploration of reviews through retrieval-augmented conversational question answering (RAG-QA). Comprehensive experiments conducted on three diverse datasets (AWARE, Google Play, and Spotify) demonstrate that our LLM-driven approach significantly surpasses baseline methods, yielding improved accuracy, robustness, and actionable insights in challenging and context-rich review scenarios.
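The rating-sentiment discrepancy the framework quantifies can be sketched by projecting a sentiment score onto the star scale; the linear mapping here is an assumption, not the paper's calibration:

```python
def rating_sentiment_gap(stars, sentiment):
    """Gap between a 1-5 star rating and text sentiment in [-1, 1], after
    mapping sentiment onto the same 1-5 scale (-1 -> 1 star, +1 -> 5 stars)."""
    sentiment_stars = 3.0 + 2.0 * sentiment
    return stars - sentiment_stars

gap = rating_sentiment_gap(stars=5, sentiment=-0.5)   # glowing stars, negative text
```

Reviews with a large absolute gap are exactly the ones where star ratings mislead and feature-level LLM analysis pays off.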
33. CTI Dataset Construction from Telegram
Authors: Dincy R. Arikkat, Sneha B. T., Serena Nicolazzo, Antonino Nocera, Vinod P., Rafidha Rehiman K. A., Karthika R β’
Published: 2025-09-25 β’
Source: arXiv
Cyber Threat Intelligence (CTI) enables organizations to anticipate, detect, and mitigate evolving cyber threats. Its effectiveness depends on high-quality datasets, which support model development, training, evaluation, and benchmarking. Building such datasets is crucial, as attack vectors and adversary tactics continually evolve. Recently, Telegram has gained prominence as a valuable CTI source, offering timely and diverse threat-related information. In this work, we address these challenges by presenting an end-to-end automated pipeline that systematically collects and filters threat-related content from Telegram. The pipeline identifies relevant Telegram channels and scrapes 145,349 messages from 12 curated channels out of 150 identified sources. To accurately filter threat intelligence messages from generic content, we employ a BERT-based classifier, achieving an accuracy of 96.64%. From the filtered messages, we compile a dataset of 86,509 malicious Indicators of Compromise, including domains, IPs, URLs, hashes, and CVEs. This approach not only produces a large-scale, high-fidelity CTI dataset but also establishes a foundation for future research and operational applications in cyber threat detection.
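IOC extraction of the kind used to compile such a dataset can be sketched with simple regular expressions. The patterns are deliberately simplified and are not the authors' pipeline:

```python
import re

# Simplified patterns; production CTI pipelines use stricter validation
# (e.g., octet range checks for IPs, defanged-indicator handling).
IOC_PATTERNS = {
    "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "cve": r"\bCVE-\d{4}-\d{4,7}\b",
    "md5": r"\b[a-fA-F0-9]{32}\b",
    "url": r"https?://[^\s\"',]+",
}

def extract_iocs(text):
    """Collect candidate Indicators of Compromise from a raw message."""
    return {name: re.findall(pat, text) for name, pat in IOC_PATTERNS.items()}

msg = "C2 at http://evil.example/payload, host 192.168.1.10, see CVE-2024-12345"
iocs = extract_iocs(msg)
```

In the paper's pipeline, extraction of this kind runs only on messages the BERT classifier has already flagged as threat-related.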
34. RLCracker: Exposing the Vulnerability of LLM Watermarks with Adaptive RL Attacks
Authors: Hanbo Huang, Yiran Zhang, Hao Zheng, Xuan Gong, Yihan Li, Lin Liu, Shiyu Liang β’
Published: 2025-09-25 β’
Source: arXiv
Large language model (LLM) watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating watermark security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies watermark resilience against adaptive adversaries. We theoretically prove that optimizing the attack context and model parameters can substantially reduce this radius, making watermarks highly susceptible to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermarks while preserving semantic fidelity. RLCracker requires only limited watermarked examples and zero access to the detector. Despite weak supervision, it empowers a 3B model to achieve a 98.5% removal success rate and an average P-SP score of 0.92 on 1,500-token Unigram-marked texts after training on only 100 short samples. This performance dramatically exceeds the 6.75% achieved by GPT-4o and generalizes across five model sizes and ten watermarking schemes. Our results confirm that adaptive attacks are broadly effective and pose a fundamental threat to current watermarking defenses.
35. MolCluster: Integrating Graph Neural Network with Community Detection for Coarse-Grained Mapping
Authors: Zhixuan Zhong, Linbo Ma, Jian Jiang β’
Published: 2025-09-25 β’
Source: arXiv
Coarse-grained (CG) modeling simplifies molecular systems by mapping groups of atoms into representative units. However, traditional CG approaches rely on fixed mapping rules, which limit their ability to handle diverse chemical systems and require extensive manual intervention. Thus, supervised learning-based CG methods have been proposed, enabling more automated and adaptable mapping. Nevertheless, these methods suffer from limited labeled datasets and the inability to control mapping resolution, which is essential for multiscale modeling. To overcome these limitations, we propose MolCluster, an unsupervised model that integrates a graph neural network and a community detection algorithm to extract CG representations. Additionally, a predefined group pair loss ensures the preservation of target groups, and a bisection strategy enables precise, customizable resolution across different molecular systems. In downstream evaluations on the MARTINI2 dataset, MolCluster, benefiting from its label-free pretraining strategy, outperforms both traditional clustering and supervised models. Overall, these results highlight the potential of MolCluster as a core model for customizable and chemically consistent CG mapping.
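As a stand-in for the community-detection half of MolCluster, plain label propagation on an atom-level adjacency graph illustrates the idea. This is not the paper's algorithm, and the real method operates on learned GNN embeddings rather than raw connectivity:

```python
from collections import Counter

def label_propagation(adj, iters=10):
    """Community detection by label propagation: each node repeatedly adopts
    the most common label among its neighbors (ties broken by smallest label)."""
    labels = {n: n for n in adj}
    for _ in range(iters):
        changed = False
        for n in sorted(adj):
            counts = Counter(labels[m] for m in adj[n])
            top = max(counts.values())
            new = min(l for l, c in counts.items() if c == top)
            if new != labels[n]:
                labels[n], changed = new, True
        if not changed:
            break
    return labels

# Two disconnected triangles, e.g. two rigid fragments of a molecule graph:
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(adj)
```

Each resulting community would correspond to one CG bead; MolCluster's bisection strategy then controls how many beads the partition yields.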