1. FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing
Authors: Mohammed Talha Alam, Fahad Shamshad, Fakhri Karray, Karthik Nandakumar
Published: 2025-08-07
Source: arXiv
Advancements in face recognition (FR) technologies have amplified privacy concerns, necessitating methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer is capable of generating high-quality cancelable faces that can be directly matched using existing FR systems without requiring any modifications. Extensive experiments on benchmark datasets demonstrate that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving over an 11% gain on a commercial API compared to recent cancelable biometric methods. Code is available at: https://github.com/talha-alam/faceanonymixer.
2. KuaiLive: A Real-time Interactive Dataset for Live Streaming Recommendation
Authors: Changle Qu, Sunhao Dai, Ke Guo, Liqin Zhao, Yanan Niu, Xiao Zhang, Jun Xu
Published: 2025-08-07
Source: arXiv
Live streaming platforms have become a dominant form of online content consumption, offering dynamically evolving content, real-time interactions, and highly engaging user experiences. These unique characteristics introduce new challenges that differentiate live streaming recommendation from traditional recommendation settings and have garnered increasing attention from industry in recent years. However, research progress in academia has been hindered by the lack of publicly available datasets that accurately reflect the dynamic nature of live streaming environments. To address this gap, we introduce KuaiLive, the first real-time, interactive dataset collected from Kuaishou, a leading live streaming platform in China with over 400 million daily active users. The dataset records the interaction logs of 23,772 users and 452,621 streamers over a 21-day period. Compared to existing datasets, KuaiLive offers several advantages: it includes precise live room start and end timestamps, multiple types of real-time user interactions (click, comment, like, gift), and rich side information features for both users and streamers. These features enable more realistic simulation of dynamic candidate items and better modeling of user and streamer behaviors. We conduct a thorough analysis of KuaiLive from multiple perspectives and evaluate several representative recommendation methods on it, establishing a strong benchmark for future research. KuaiLive can support a wide range of tasks in the live streaming domain, such as top-K recommendation, click-through rate prediction, watch time prediction, and gift price prediction. Moreover, its fine-grained behavioral data also enables research on multi-behavior modeling, multi-task learning, and fairness-aware recommendation. The dataset and related resources are publicly available at https://imgkkk574.github.io/KuaiLive.
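The role of the live room start and end timestamps can be illustrated with a minimal sketch. It assumes hypothetical room records with `room_id`, `start`, and `end` fields (not the actual KuaiLive schema) and shows how the candidate set at a given request time is restricted to rooms that are live at that moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveRoom:
    room_id: str  # hypothetical identifier, not a KuaiLive field name
    start: int    # room start time, seconds since some reference point
    end: int      # room end time, same unit as start


def candidates_at(rooms, t):
    """Return the ids of rooms that are live at time t (start <= t < end)."""
    return [r.room_id for r in rooms if r.start <= t < r.end]


rooms = [LiveRoom("a", 0, 100), LiveRoom("b", 50, 150), LiveRoom("c", 200, 300)]
print(candidates_at(rooms, 60))   # rooms "a" and "b" overlap this time
print(candidates_at(rooms, 120))  # only room "b" is still live
```

A recommender evaluated this way would rank only the rooms returned for the request time, rather than a static item catalogue.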
3. MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
Authors: Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai
Published: 2025-08-07
Source: arXiv
Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
4. H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
Authors: Mehrdad Zakershahrak, Samira Ghodratnama
Published: 2025-08-07
Source: arXiv
Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency.
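For readers unfamiliar with the bits-per-byte (BPB) metric quoted above, the following sketch shows the standard conversion from a summed negative log-likelihood in nats to BPB, and how an absolute reduction relates to a relative one. The baseline value of 1.325 BPB is hypothetical, chosen only because it makes the two quoted figures (0.159 BPB and 12%) agree; the abstract does not state the baseline.

```python
import math


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (n_bytes * math.log(2))


def relative_reduction(baseline_bpb, reduction_bpb):
    """Fraction of the baseline BPB removed by an absolute reduction."""
    return reduction_bpb / baseline_bpb


# A model paying exactly ln(2) nats per byte scores 1 bit per byte.
print(bits_per_byte(1000 * math.log(2), 1000))

# Hypothetical baseline (assumption): 0.159 / 1.325 is about 0.12.
print(round(relative_reduction(1.325, 0.159), 3))
```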
5. LiDO: Discovery of a 10:1 Resonator with a Novel Libration State
Authors: Rosemary E. Pike, Ruth Murray-Clay, Kathryn Volk, Mike Alexandersen, Mark Comte, Samantha M. Lawler, Ying-Tung Chen, Arcelia Hermosillo Ruiz, Cameron Semenchuck, Cameron Collyer, J. J. Kavelaars, Lowell Peltier
Published: 2025-08-07
Source: arXiv
The Large inclination Distant Objects (LiDO) survey has discovered the first securely classified object in the 10:1 mean motion resonance of Neptune. This object, 2020 VN40, is short-term stable in the 10:1 resonance, but not stable on Gyr timescales. 2020 VN40 is likely part of the scattering sticking population, and temporarily resides in the 10:1 resonance at ~139.5 au. This discovery confirms that this distant resonance is populated, as a single detection is likely to be indicative of a large population that is difficult to detect due to observational biases. This object has an inclination of 33.4 degrees, and n-body integrations of orbital clones of 2020 VN40 have revealed some unexpected evolutions. While clones of 2020 VN40 show resonant libration around the expected resonance centers of approximately 90, 180, and 270 degrees, for a restricted range of inclination and eccentricity values some clones librate around a resonant argument of 0 degrees. As this occurs for the slightly lower-eccentricity portions of the evolution, this behavior can also be quite stable. Our initial exploration suggests that this libration around a center of 0 degrees is a generic effect for highly inclined objects in n:1 resonances because the nature of their resonant interaction with Neptune becomes a strong function of their argument of pericenter, omega. At large inclination, the resonant islands shift as omega precesses, switching the center of symmetric libration to 0 degrees for omega=90 degrees and omega=270 degrees. 2020 VN40 provides interesting insight into the evolution of the large-inclination resonators, which become more common at increasing semi-major axis.
6. How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
Authors: Brandon Jaipersaud, David Krueger, Ekdeep Singh Lubana
Published: 2025-08-07
Source: arXiv
Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient.
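As a rough illustration of what a linear probe is, the sketch below trains a logistic-regression classifier on toy two-dimensional "activations" and then applies it turn by turn to locate the first turn scored as persuaded. The data, the function names, and the 0.5 threshold are all assumptions made for illustration; real probes would be trained on high-dimensional hidden states extracted from an LLM, and the authors' setup may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_score(w, b, x):
    """Probability the linear probe assigns to the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(xs, ys, lr=0.5, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = probe_score(w, b, x) - y  # gradient of the log loss
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def first_persuaded_turn(w, b, turn_activations, threshold=0.5):
    """Index of the first turn whose probe score exceeds the threshold."""
    for turn, x in enumerate(turn_activations):
        if probe_score(w, b, x) > threshold:
            return turn
    return None


# Toy activations (hypothetical): label 1 = persuaded, 0 = not persuaded.
xs = [(1.0, 0.2), (0.8, -0.1), (1.2, 0.3), (-1.0, 0.1), (-0.9, -0.2), (-1.1, 0.4)]
ys = [1, 1, 1, 0, 0, 0]
w, b = train_probe(xs, ys)

conversation = [(-1.0, 0.0), (-0.5, 0.0), (0.9, 0.2)]  # one vector per turn
print(first_persuaded_turn(w, b, conversation))
```

Because the probe is a single linear layer, scoring every turn of every conversation is cheap, which is the efficiency argument the abstract makes against prompting-based analysis.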
7. Latent Space Diffusion for Topology Optimization
Authors: Aaron Lutheran, Srijan Das, Alireza Tabarraei
Published: 2025-08-07
Source: arXiv
Topology optimization enables the automated design of efficient structures by optimally distributing material within a defined domain. However, traditional gradient-based methods often scale poorly with increasing resolution and dimensionality due to the need for repeated finite element analyses and sensitivity evaluations. In this work, we propose a novel framework that combines latent diffusion models (LDMs) with variational autoencoders (VAEs) to enable fast, conditional generation of optimized topologies. Unlike prior approaches, our method conditions the generative process on physically meaningful fields, specifically von Mises stress, strain energy density, volume fraction, and loading information, embedded as dense input channels. To further guide the generation process, we introduce auxiliary loss functions that penalize floating material, load imbalance, and volume fraction deviation, thereby encouraging physically realistic and manufacturable designs. Numerical experiments on a large synthetic dataset demonstrate that our VAE-LDM framework outperforms existing diffusion-based methods in compliance accuracy, volume control, and structural connectivity, providing a robust and scalable alternative to conventional
8. TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution
Authors: Zhikai Zhao, Chuanbo Hua, Federico Berto, Kanghoon Lee, Zihan Ma, Jiachen Li, Jinkyoo Park
Published: 2025-08-07
Source: arXiv
Trajectory prediction is a critical task in modeling human behavior, especially in safety-critical domains such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy and generalizability. Although deep learning approaches offer improved performance, they typically suffer from high computational cost, limited explainability, and, importantly, poor generalization to out-of-distribution (OOD) scenarios. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We propose two key innovations: Cross-Generation Elite Sampling to encourage population diversity, and a Statistics Feedback Loop that enables the LLM to analyze and improve alternative predictions. Our evaluations demonstrate that TrajEvo outperforms existing heuristic methods across multiple real-world datasets, and notably surpasses both heuristic and deep learning methods in generalizing to an unseen OOD real-world dataset. TrajEvo marks a promising step toward the automated design of fast, explainable, and generalizable trajectory prediction heuristics. We release our source code to facilitate future research at https://github.com/ai4co/trajevo.
9. OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
Authors: Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Published: 2025-08-07
Source: arXiv
Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.
10. Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
Authors: Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao
Published: 2025-08-07
Source: arXiv
Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs to continue training the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.
11. Mind the Gap: From Resolving Theoretical Foundations of Chiral(ity)-Induced Spin Selectivity to Pioneering Implementations in Quantum Sensing
Authors: Yan Xi Foo, Aisha Kermiche, Farhan T. Chowdhury, Clarice D. Aiello, Luke D. Smith
Published: 2025-08-07
Source: arXiv
The chiral(ity)-induced spin selectivity (CISS) effect, where electrons passing through a chiral medium acquire significant spin-polarization at ambient temperatures, has been widely observed experimentally, yet its theoretical foundations remain actively debated. Open questions persist regarding whether CISS originates from helical geometry or more general chirality, and whether a unified mechanism can account for phenomena across solid-state and soft-matter systems, mesoscopic films, and single molecules. Clarifying the interrelations between existing models is essential to determine if a universal picture of CISS can be found or whether system-specific models are required, and if so, where their common starting point should lie for a workable classification of CISS manifestations. Despite this theoretical fragmentation, recent studies of CISS effects in electron transfer systems, magnetic field sensitivity and coherence of radical pair reactions, polarized electroluminescence in chiral hybrid perovskites, DNA-based biosensors, and enantioselective detection, highlight its broad conceptual relevance and potential applications in spintronics, molecular sensors, and quantum information processing. In this review, we help bridge the gap between theory, experiment, and implementation, with a particular focus on prospects for quantum sensing and metrology. We outline fundamental frameworks of CISS, clarifying what constitutes the 'chiral', the 'induced', and the 'spin-selectivity' that makes up CISS, before going on to survey key model realizations and their assumptions. We examine some of the emerging quantum sensing applications and assess the model-specific implications, in particular exemplifying these in the context of spin-correlated radical pairs, which offer a promising, tunable, and biomimetic platform for emerging molecular quantum technologies.
12. Ultra-Large-Scale Compilation and Manipulation of Quantum Circuits with Pandora
Authors: Ioana Moflic, Alexandru Paler
Published: 2025-08-07
Source: arXiv
There is an enormous gap between what quantum circuit sizes can be compiled and manipulated with the current generation of quantum software and the sizes required by practical applications such as quantum chemistry or Shor's algorithm. We present Pandora, an efficient, open-source, multithreaded, high-performance-computing-enabled tool based on circuit rewrites. Pandora can be used for quantum circuit equivalence checking, full compilations of large circuits, and scalable, streaming quantum resource estimation frameworks. Pandora can easily handle billions of gates and can stream circuit partitions in resource estimation pipelines at very high rates. We utilized Pandora for full compilations of Fermi-Hubbard 100x100 and 1024-bit Shor's algorithm circuits. Compared to TKET and Qiskit, we determine a performance advantage for manipulating circuits of more than 10000 gates. For equivalence checking tasks, Pandora outperforms MQT.QCEC on specific circuits that have more than 32 qubits. The performance and versatility of Pandora open novel paths in quantum software.
13. Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Authors: Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Published: 2025-08-07
Source: arXiv
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to a limited capacity for modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on a reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/
14. Data Analysis and Modeling for Transitioning Between Laboratory Methods for Detecting SARS-CoV-2 in Wastewater
Authors: Maria M. Warns, Leah Mrowiec, Christopher Owen, Adam Horton, Chi-Yu Lin, Modou Lamin Jarju, Niall M. Mangan, Aaron Packman, Katelyn Plaisier Leisman, Abhilasha Shrestha, Rachel Poretsky
Published: 2025-08-07
Source: arXiv
Wastewater surveillance has proven to be a useful tool to monitor pathogens such as SARS-CoV-2 as it is a nonintrusive way to survey the potential disease burden of the population contributing to a sewershed. With the expansion of this field since the beginning of the COVID-19 pandemic, laboratory methods to process wastewater and quantify pathogen nucleic acid levels have improved as technologies changed, efforts expanded in size and scope, and supply chain issues were resolved. Maintaining data continuity is crucial for labs undergoing method transitions to accurately assess infectious disease levels over time and compare measured RNA concentrations to public health data. Despite the dynamic nature of laboratory methods and the necessity to ensure uninterrupted data, to our knowledge there has not been a study that unites two datasets from different lab methods for pathogen quantification from environmental samples. Here, we describe a lab transition from SARS-CoV-2 RNA quantification using a low-throughput, manual filtration-based wastewater concentration and RNA extraction followed by qPCR to a high-throughput, automated magnetic bead-based concentration and extraction followed by dPCR. During the two-month transition period, wastewater samples from across the Chicago metropolitan area were processed with both methods in parallel. We evaluated a variety of regression models to relate the RNA measurements from both methods and found a log-log model was most appropriate after removing outliers and discrepancy points to improve model performance. We also evaluated the consequences of assigning values to samples that were below the detection limit. Our study demonstrates that data continuity can be maintained throughout a transition of laboratory methods if there is a sufficient period of overlap between the methods for an appropriate model to be constructed to relate the datasets.
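The log-log model mentioned above can be sketched as an ordinary least-squares fit on log10-transformed concentrations. The sample values, variable names, and the exact power-law relationship used to generate the toy data are assumptions for illustration only; they are not the study's measurements or its fitted coefficients.

```python
import math


def fit_loglog(x, y):
    """Least-squares fit of log10(y) = intercept + slope * log10(x)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def convert(x, slope, intercept):
    """Map a measurement from the old method onto the new method's scale."""
    return 10 ** (intercept + slope * math.log10(x))


# Hypothetical paired measurements from the overlap period (arbitrary units).
old_method = [1e2, 1e3, 1e4, 1e5]
new_method = [2.0 * v ** 0.9 for v in old_method]  # synthetic power law

slope, intercept = fit_loglog(old_method, new_method)
print(slope, 10 ** intercept)
print(convert(5e3, slope, intercept))
```

The logarithm requires strictly positive inputs, so samples below the detection limit must be assigned a value or excluded before fitting, which is presumably why the authors examine the consequences of that choice.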
15. Modular Architecture for High-Performance and Low Overhead Data Transfers
Authors: Rasman Mubtasim Swargo, Engin Arslan, Md Arifuzzaman
Published: 2025-08-07
Source: arXiv
High-performance applications necessitate rapid and dependable transfer of massive datasets across geographically dispersed locations. Traditional file transfer tools often suffer from resource underutilization and instability because of fixed configurations or monolithic optimization methods. We propose AutoMDT, a novel modular data transfer architecture that employs a deep reinforcement learning based agent to simultaneously optimize concurrency levels for read, network, and write operations. Our solution incorporates a lightweight network-system simulator, enabling offline training of a Proximal Policy Optimization (PPO) agent in approximately 45 minutes on average, thereby overcoming the impracticality of lengthy online training in production networks. AutoMDT's modular design decouples I/O and network tasks, allowing the agent to capture complex buffer dynamics precisely and to adapt quickly to changing system and network conditions. Evaluations on production-grade testbeds show that AutoMDT achieves up to 8x faster convergence and a 68% reduction in transfer completion times compared with state-of-the-art solutions.
16. Model-based framework for automated quantification of error sources in quantum state tomography
Authors: Junpei Oba, Hsin-Pin Lo, Yasuhiro Yamada, Takayuki Matsui, Takuya Ikuta, Yuya Yonezu, Toshimori Honjo, Seiji Kajita, Hiroki Takesue
Published: 2025-08-07
Source: arXiv
High-quality quantum state generation is essential for advanced quantum information processing, including quantum communication, quantum sensing, and quantum computing. In practice, various error sources degrade the quality of quantum states, and quantum state tomography (QST) is a standard diagnostic tool. However, in QST, multiple error sources gather in a single density matrix, making it difficult to identify individual error sources. To address this problem, we propose an automated method for quantifying error sources by combining simulation and parameter optimization to reproduce the experimental density matrix. We focus on the experimental generation of time-bin entangled photon pairs, for which we model the relevant error sources and simulate the density matrix with adjustable model parameters, thereby optimizing the parameters and minimizing the trace distance to the experimental data. Optimization of the parameters reduced the trace distance from 0.177 to 0.024, indicating that our modeled error sources explain 86% of the errors. Reducing the predicted error sources improves the state quality, consistent with our predictions and thus validating the proposed method. In addition, the modular structure of this framework makes it applicable to other quantum platforms, such as superconducting qubits, atoms, and solid-state spins.
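The quoted figures can be checked with a short sketch. The 86% value appears consistent with the relative drop in trace distance, (0.177 - 0.024) / 0.177; that reading is an assumption, since the abstract does not give the formula. The trace distance itself is shown only for a single-qubit toy case, where it has a closed form; the experiment concerns two-photon states, which would need a general eigenvalue routine.

```python
import math


def trace_distance_qubit(rho, sigma):
    """Trace distance between two single-qubit density matrices.

    The difference of two unit-trace Hermitian 2x2 matrices is traceless,
    with eigenvalues +/- sqrt(a^2 + |b|^2), where a is its top-left entry
    and b its top-right entry. Half the sum of their absolute values is
    therefore sqrt(a^2 + |b|^2).
    """
    a = complex(rho[0][0] - sigma[0][0]).real
    b = rho[0][1] - sigma[0][1]
    return math.sqrt(a * a + abs(b) ** 2)


def explained_fraction(distance_before, distance_after):
    """Relative reduction in trace distance after fitting (assumed reading)."""
    return (distance_before - distance_after) / distance_before


pure_zero = [[1.0, 0.0], [0.0, 0.0]]         # the state |0><0|
maximally_mixed = [[0.5, 0.0], [0.0, 0.5]]   # the state I/2
print(trace_distance_qubit(pure_zero, maximally_mixed))

print(round(100 * explained_fraction(0.177, 0.024)))
```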
17. Streamlining Admission with LOR Insights: AI-Based Leadership Assessment in Online Master's Program
Authors: Meryem Yilmaz Soylu, Adrian Gallard, Jeonghyun Lee, Gayane Grigoryan, Rushil Desai, Stephen Harmon
Published: 2025-08-07
Source: arXiv
Letters of recommendation (LORs) provide valuable insights into candidates' capabilities and experiences beyond standardized test scores. However, reviewing these text-heavy materials is time-consuming and labor-intensive. To address this challenge and support the admission committee in providing feedback for students' professional growth, our study introduces LORI: LOR Insights, a novel AI-based detection tool for assessing leadership skills in LORs submitted by online master's program applicants. By employing natural language processing and leveraging large language models using RoBERTa and LLAMA, we seek to identify leadership attributes such as teamwork, communication, and innovation. Our latest RoBERTa model achieves a weighted F1 score of 91.6%, a precision of 92.4%, and a recall of 91.6%, showing a strong level of consistency in our test data. With the growing importance of leadership skills in the STEM sector, integrating LORI into the graduate admissions process is crucial for accurately assessing applicants' leadership capabilities. This approach not only streamlines the admissions process but also automates and ensures a more comprehensive evaluation of candidates' capabilities.
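A note on the metric: a weighted F1 score is the support-weighted mean of per-class F1 scores, so it need not equal the harmonic mean of the reported precision and recall (which would be about 92.0% for 92.4% and 91.6%). The sketch below computes it from scratch on toy labels; the label names and predictions are hypothetical and unrelated to the LORI data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += n * f1  # weight each class by its number of true examples
    return total / len(y_true)


# Toy example with two hypothetical leadership attributes.
y_true = ["teamwork"] * 4 + ["communication"] * 2
y_pred = ["teamwork", "teamwork", "teamwork", "communication",
          "communication", "teamwork"]
print(weighted_f1(y_true, y_pred))
```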
18. RankArena: A Unified Platform for Evaluating Retrieval, Reranking and RAG with Human and LLM Feedback
Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Published: 2025-08-07
Source: arXiv
Evaluating the quality of retrieval-augmented generation (RAG) and document reranking systems remains challenging due to the lack of scalable, user-centric, and multi-perspective evaluation tools. We introduce RankArena, a unified platform for comparing and analysing the performance of retrieval pipelines, rerankers, and RAG systems using structured human and LLM-based feedback as well as for collecting such feedback. RankArena supports multiple evaluation modes: direct reranking visualisation, blind pairwise comparisons with human or LLM voting, supervised manual document annotation, and end-to-end RAG answer quality assessment. It captures fine-grained relevance feedback through both pairwise preferences and full-list annotations, along with auxiliary metadata such as movement metrics, annotation time, and quality ratings. The platform also integrates LLM-as-a-judge evaluation, enabling comparison between model-generated rankings and human ground truth annotations. All interactions are stored as structured evaluation datasets that can be used to train rerankers, reward models, judgment agents, or retrieval strategy selectors. Our platform is publicly available at https://rankarena.ngrok.io/, and a demo video is provided at https://youtu.be/jIYAP4PaSSI.
19. LAG: Logic-Augmented Generation from a Cartesian Perspective
Authors: Yilin Xiao, Chuang Zhou, Qinggang Zhang, Su Dong, Shengyuan Chen, Xiao Huang
Published: 2025-08-07
Source: arXiv
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in a logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.
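The control flow described in this abstract (dependency-ordered sub-questions, sequential resolution that feeds earlier answers into later retrieval, and early termination on an unanswerable sub-question) can be sketched as follows. Everything here is a hypothetical stand-in: the dictionary-backed `toy_answer` function replaces retrieval plus an LLM, the sub-question templates and knowledge entries are invented, and the final synthesis step is omitted. It is not the authors' implementation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def resolve(sub_questions, depends_on, answer_fn):
    """Resolve sub-questions in dependency order.

    Returns (answers, halted_at). halted_at is the id of the first
    sub-question that could not be answered, or None if all were resolved.
    """
    graph = {qid: depends_on.get(qid, ()) for qid in sub_questions}
    answers = {}
    for qid in TopologicalSorter(graph).static_order():
        # Earlier answers become the context for this sub-question.
        context = {dep: answers[dep] for dep in graph[qid]}
        answer = answer_fn(sub_questions[qid], context)
        if answer is None:
            return answers, qid  # stop instead of propagating an error
        answers[qid] = answer
    return answers, None


# Toy knowledge base and answer function (assumptions for illustration).
KNOWLEDGE = {
    "director of Film X": "Ada Example",
    "birthplace of Ada Example": "Springfield",
}


def toy_answer(template, context):
    query = template.format(**context)  # fill in answers to prerequisites
    return KNOWLEDGE.get(query)


sub_questions = {"q1": "director of Film X", "q2": "birthplace of {q1}"}
depends_on = {"q2": ["q1"]}

print(resolve(sub_questions, depends_on, toy_answer))
```

Swapping `toy_answer` for a function that returns `None` when retrieval finds no supporting evidence shows the termination path: resolution stops at the first unanswerable sub-question and later ones are never attempted.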
20. GRAIL: Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning
Authors: Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu
Published: 2025-08-07
Source: arXiv
Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture holistic graph structures while simultaneously facing precision control challenges that manifest as either critical information gaps or excessive redundant connections, collectively undermining reasoning performance. To address this challenge, we propose GRAIL: Graph-Retrieval Augmented Interactive Learning, a framework designed to interact with large-scale graphs for retrieval-augmented reasoning. Specifically, GRAIL integrates LLM-guided random exploration with path filtering to establish a data synthesis pipeline, where a fine-grained reasoning trajectory is automatically generated for each task. Based on the synthesized data, we then employ a two-stage training process to learn a policy that dynamically decides the optimal actions at each reasoning step. The overall objective of precision-conciseness balance in graph retrieval is decoupled into fine-grained process-supervised rewards to enhance data efficiency and training stability. In practical deployment, GRAIL adopts an interactive retrieval paradigm, enabling the model to autonomously explore graph paths while dynamically balancing retrieval breadth and precision. Extensive experiments have shown that GRAIL achieves an average accuracy improvement of 21.01% and F1 improvement of 22.43% on three knowledge graph question-answering datasets. Our source code and datasets are available at https://github.com/Changgeww/GRAIL.
21. InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities
Authors: Shuo Cai, Su Lu, Qi Zhou, Kejing Yang, Zhijie Sang, Congkai Xie, Hongxia Yang
Published: 2025-08-07
Source: arXiv
Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
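For readers unfamiliar with the DPO stage mentioned in this abstract, the standard per-pair DPO objective can be written in a few lines. This is a sketch of the generic published loss, not of InfiAlign's training code; the abstract gives no hyperparameters, so the default `beta=0.1` and the argument names are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model. beta is an assumed default.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy matches the reference, the margin is zero and the loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
```

The loss falls below log(2) once the policy assigns relatively more probability to the preferred response than the reference does, and rises above it in the opposite case.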
22. Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes
Authors: Zachary Robertson, Sanmi Koyejo
Published: 2025-08-07
Source: arXiv
We develop mechanisms for evaluating AI systems without ground truth by exploiting a connection between gaming resistance and output quality. The data processing inequality ensures post-hoc attempts to game a metric degrade both information content and task performance. We prove that f-mutual information measures are the unique gaming resistant mechanisms under natural conditions, with the overseer acting as an agent. While Shannon mutual information faces exponential sample complexity, bounded measures like total variation distance remain tractable. Empirically, across ten domains from translation to peer review, all information-theoretic mechanisms achieve perfect discrimination (d > 0.5) between faithful and strategic agents. In contrast, LLM judges exhibit systematic evaluation inversion, preferring fabricated content over accurate summaries. Our mechanisms show 10-100x better robustness to adversarial manipulation than current practices. We also find performance follows an inverted-U curve with compression ratio, peaking at 10:1 where agent responses exhibit optimal information diversity (3 effective dimensions), giving a bias-variance perspective on when our approach is expected to be most effective.
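A bounded f-mutual information of the kind mentioned in this abstract can be illustrated with total variation distance: the dependence between two variables is measured as the TV distance between their joint distribution and the product of their marginals. The sketch below uses a simple plug-in estimate on discrete paired samples. That estimator, the function name, and the toy inputs are assumptions for illustration; the abstract does not say how the authors estimate the quantity.

```python
from collections import Counter


def tvd_mutual_information(pairs):
    """Plug-in estimate of TV distance between the empirical joint and product of marginals.

    pairs: non-empty list of (x, y) tuples of hashable discrete values.
    Returns a value in [0, 1); 0 means the empirical joint factorises exactly.
    """
    n = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    marginal_y = Counter(y for _, y in pairs)
    total = 0.0
    for x, count_x in marginal_x.items():
        for y, count_y in marginal_y.items():
            p_joint = joint[(x, y)] / n  # Counter returns 0 for unseen pairs
            p_product = (count_x / n) * (count_y / n)
            total += abs(p_joint - p_product)
    return 0.5 * total


# Two binary variables that always agree, versus two that are independent.
print(tvd_mutual_information([(0, 0), (1, 1)]))
print(tvd_mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

Because every term is a difference of probabilities, the estimate stays bounded regardless of how rare a pair is, which is the property the abstract contrasts with Shannon mutual information.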
23. Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Authors: Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti
Published: 2025-08-07
Source: arXiv
The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap." We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem is overwhelmingly focused on a narrow set of behavioral propensities, such as "Tendency to hallucinate" (53.7% of the corpus) and "Discriminatory bias" (28.9%), while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This translates to a near-total evaluation gap for systemic risks like "Loss of Control" (0.4% coverage) and "Cyber Offence" (0.8% coverage). This study provides the first comprehensive, quantitative analysis of this gap, offering critical insights for policymakers to refine the CoP and for developers to build the next generation of evaluation tools, ultimately fostering safer and more compliant AI.
24. LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Authors: Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, Yilong Wu, Zhihao Zhang, Mingqi Wu, Zhiheng Xi, Mingxu Chai, Tao Liang, Zhihui Fei, Zhen Wang, Mingyang Wan, Guojun Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Published: 2025-08-07
Source: arXiv
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
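The idea of drawing a fresh test set per evaluation run can be sketched as sampling from the question bank while excluding anything already served. The abstract does not describe how LLMEval-3 tracks or selects questions, so the per-model "already served" set, the seeded generator, and every name below are assumptions for illustration only.

```python
import random


def sample_unseen(bank_ids, already_served, k, seed):
    """Draw k question ids that have not been served before (hypothetical helper).

    bank_ids: iterable of all question ids in the bank.
    already_served: ids used in earlier runs (assumed to be tracked by the caller).
    seed: makes a given run reproducible for auditing.
    """
    pool = sorted(set(bank_ids) - set(already_served))
    if k > len(pool):
        raise ValueError("not enough unseen questions left in the bank")
    rng = random.Random(seed)
    return rng.sample(pool, k)


served = set(range(10))  # toy history: ids 0-9 were used in earlier runs
run = sample_unseen(range(100), served, k=5, seed=1)
served.update(run)  # record the new ids so the next run excludes them
print(len(run), served.isdisjoint(set(range(10, 100)) - served))
```

Sorting the pool before sampling keeps the draw deterministic for a given seed, since set iteration order is not something to rely on.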
25. Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI
Authors: Krzysztof Janowicz, Zilong Liu, Gengchen Mai, Zhangyu Wang, Ivan Majic, Alexandra Fortacz, Grant McKenzie, Song Gao
Published: 2025-08-07
Source: arXiv
AI (super) alignment describes the challenge of ensuring (future) AI systems behave in accordance with societal norms and goals. While a quickly evolving literature is addressing biases and inequalities, the geographic variability of alignment remains underexplored. Simply put, what is considered appropriate, truthful, or legal can differ widely across regions due to cultural norms, political realities, and legislation. Alignment measures applied to AI/ML workflows can sometimes produce outcomes that diverge from statistical realities, such as text-to-image models depicting balanced gender ratios in company leadership despite existing imbalances. Crucially, some model outputs are globally acceptable, while others, e.g., questions about Kashmir, depend on knowing the user's location and their context. This geographic sensitivity is not new. For instance, Google Maps renders Kashmir's borders differently based on user location. What is new is the unprecedented scale and automation with which AI now mediates knowledge, expresses opinions, and represents geographic reality to millions of users worldwide, often with little transparency about how context is managed. As we approach Agentic AI, the need for spatio-temporally aware alignment, rather than one-size-fits-all approaches, is increasingly urgent. This paper reviews key geographic research problems, suggests topics for future work, and outlines methods for assessing alignment sensitivity.
26. LLM-based Multi-Agent Copilot for Quantum Sensor
Authors: Rong Sha, Binglin Wang, Jun Yang, Xiaoxiao Ma, Chengkun Wu, Liang Yan, Chao Zhou, Jixun Liu, Guochao Wang, Shuhua Yan, Lingxiao Zhu
Published: 2025-08-07
Source: arXiv
Large language models (LLMs) exhibit broad utility but face limitations in quantum sensor development, stemming from interdisciplinary knowledge barriers and involving complex optimization processes. Here we present QCopilot, an LLM-based multi-agent framework integrating external knowledge access, active learning, and uncertainty quantification for quantum sensor design and diagnosis. Comprising commercial LLMs with few-shot prompt engineering and vector knowledge base, QCopilot employs specialized agents to adaptively select optimization methods, automate modeling analysis, and independently perform problem diagnosis. Applying QCopilot to atom cooling experiments, we generated 10^8 sub-μK atoms without any human intervention within a few hours, representing ~100× speedup over manual experimentation. Notably, by continuously accumulating prior knowledge and enabling dynamic modeling, QCopilot can autonomously identify anomalous parameters in multi-parameter experimental settings. Our work reduces barriers to large-scale quantum sensor deployment and readily extends to other quantum information systems.
27. DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
Authors: Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, Bo Zheng
Published: 2025-08-07
Source: arXiv
Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating understanding the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.
28. Artificial Intelligence-Based Classification of Spitz Tumors
Authors: Ruben T. Lucassen, Marjanna Romers, Chiel F. Ebbelaar, Aia N. Najem, Donal P. Hayes, Antien L. Mooyaart, Sara Roshani, Liliane C. D. Wynaendts, Nikolas Stathonikos, Gerben E. Breimer, Anne M. L. Jansen, Mitko Veta, Willeke A. M. Blokx
Published: 2025-08-07
Source: arXiv
Spitz tumors are diagnostically challenging due to overlap in atypical histological features with conventional melanomas. We investigated to what extent AI models, using histological and/or clinical features, can: (1) distinguish Spitz tumors from conventional melanomas; (2) predict the underlying genetic aberration of Spitz tumors; and (3) predict the diagnostic category of Spitz tumors. The AI models were developed and validated using a dataset of 393 Spitz tumors and 379 conventional melanomas. Predictive performance was measured using the AUROC and the accuracy. The performance of the AI models was compared with that of four experienced pathologists in a reader study. Moreover, a simulation experiment was conducted to investigate the impact of implementing AI-based recommendations for ancillary diagnostic testing on the workflow of the pathology department. The best AI model based on UNI features reached an AUROC of 0.95 and an accuracy of 0.86 in differentiating Spitz tumors from conventional melanomas. The genetic aberration was predicted with an accuracy of 0.55 compared to 0.25 for random guessing. The diagnostic category was predicted with an accuracy of 0.51, where random chance-level accuracy equaled 0.33. On all three tasks, the AI models performed better than the four pathologists, although differences were not statistically significant for most individual comparisons. Based on the simulation experiment, implementing AI-based recommendations for ancillary diagnostic testing could reduce material costs, turnaround times, and examinations. In conclusion, the AI models achieved a strong predictive performance in distinguishing between Spitz tumors and conventional melanomas. On the more challenging tasks of predicting the genetic aberration and the diagnostic category of Spitz tumors, the AI models performed better than random chance.
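The two metrics reported in this abstract can be computed from scratch for a binary task. The sketch below uses the pairwise (Mann-Whitney) definition of AUROC and a fixed decision threshold for accuracy. The labels and scores are made-up toy values, not data from the study, and the 0.5 threshold is an assumption.

```python
def auroc(labels, scores):
    """Probability that a random positive scores above a random negative (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == (label == 1) for label, s in zip(labels, scores))
    return correct / len(labels)


# Toy example only: 1 marks the positive class, 0 the negative class.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(auroc(toy_labels, toy_scores), accuracy(toy_labels, toy_scores))
```

AUROC depends only on how the scores rank the two classes, while accuracy also depends on where the threshold is placed, which is why the two numbers can differ noticeably for the same model.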
29. Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
Authors: Samy Ateia, Udo Kruschwitz
Published: 2025-08-07
Source: arXiv
Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and nonreasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.
30. Building Effective Safety Guardrails in AI Education Tools
Authors: Hannah-Beth Clark, Laura Benton, Emma Searle, Margaux Dowland, Matthew Gregory, Will Gayne, John Roberts
Published: 2025-08-07
Source: arXiv
There has been rapid development in generative AI tools across the education sector, which in turn is leading to increased adoption by teachers. However, this raises concerns regarding the safety and age-appropriateness of the AI-generated content that is being created for use in classrooms. This paper explores Oak National Academy's approach to addressing these concerns within the development of the UK Government's first publicly available generative AI tool - our AI-powered lesson planning assistant (Aila). Aila is intended to support teachers planning national curriculum-aligned lessons that are appropriate for pupils aged 5-16 years. To mitigate safety risks associated with AI-generated content we have implemented four key safety guardrails - (1) prompt engineering to ensure AI outputs are generated within pedagogically sound and curriculum-aligned parameters, (2) input threat detection to mitigate attacks, (3) an Independent Asynchronous Content Moderation Agent (IACMA) to assess outputs against predefined safety categories, and (4) taking a human-in-the-loop approach, to encourage teachers to review generated content before it is used in the classroom. Through our ongoing evaluation of these safety guardrails we have identified several challenges and opportunities to take into account when implementing and testing safety guardrails. This paper highlights ways to build more effective safety guardrails in generative AI education tools, including the ongoing iteration and refinement of guardrails, as well as enabling cross-sector collaboration through sharing open-source code, datasets, and learnings.
31. Implementation and Application of Multi-Format 3D Data Integration in a Cross-Device Commercial Metaverse Platform
Authors: Masanori Ibara, Yuichi Hiroi, Takushi Kamegai, Takefumi Hiraki
Published: 2025-08-07
Source: arXiv
Traditionally, specialized 3D design data, such as BIM and CAD, have been accessible only to a select group of experts, creating significant barriers that prevent general users from participating in decision-making processes. This paper provides a systematic overview of practical insights for utilizing 3D data in industrial and architectural domains by presenting implementation cases of the industrial metaverse on Cluster, a commercial cross-device metaverse platform. This paper analyzes the characteristics and constraints of major data formats in the industrial and architectural fields and organizes integration workflows for the metaverse. Through application cases utilizing 3D data across multiple domains, we present practical examples of collaborative decision-making support enabled by the fusion of metaverse and digital twin technologies. Specifically, we demonstrate that multi-device access and simultaneous multi-user participation capabilities foster democratic environments in the industrial metaverse, which are challenging to achieve with conventional, expert-dependent systems.
32. Everything You Need to Know About CS Education: Open Results from a Survey of More Than 18,000 Participants
Authors: Katsiaryna Dzialets, Aleksandra Makeeva, Ilya Vlasov, Anna Potriasaeva, Aleksei Rostovskii, Yaroslav Golubev, Anastasiia Birillo
Published: 2025-08-07
Source: arXiv
Computer science education is a dynamic field with many aspects that influence the learner's path. While these aspects are usually studied in depth separately, it is also important to carry out broader large-scale studies that touch on many topics, because they allow us to put different results into each other's perspective. Past large-scale surveys have provided valuable insights; however, the emergence of new trends (e.g., AI), new learning formats (e.g., in-IDE learning), and the increasing learner diversity highlight the need for an updated comprehensive study. To address this, we conducted a survey with 18,032 learners from 173 countries, ensuring diverse representation and exploring a wide range of topics - formal education, learning formats, AI usage, challenges, motivation, and more. This paper introduces the results of this survey as an open dataset, describes our methodology and the survey questions, and highlights, as a motivating example, three possible research directions within this data: challenges in learning, emerging formats, and insights into the in-IDE format. The dataset aims to support further research and foster advancements in computer education.
33. Understanding and Mitigating Errors of LLM-Generated RTL Code
Authors: Jiazheng Zhang, Cheng Liu, Huawei Li
Published: 2025-08-07
Source: arXiv
Despite the promising potential of large language model (LLM) based register-transfer-level (RTL) code generation, the overall success rate remains unsatisfactory. Errors arise from various factors, with limited understanding of specific failure causes hindering improvement. To address this, we conduct a comprehensive error analysis and manual categorization. Our findings reveal that most errors stem not from LLM reasoning limitations, but from insufficient RTL programming knowledge, poor understanding of circuit concepts, ambiguous design descriptions, or misinterpretation of complex multimodal inputs. Leveraging in-context learning, we propose targeted error correction techniques. Specifically, we construct a domain-specific knowledge base and employ retrieval-augmented generation (RAG) to supply necessary RTL knowledge. To mitigate ambiguity errors, we introduce design description rules and implement a rule-checking mechanism. For multimodal misinterpretation, we integrate external tools to convert inputs into LLM-compatible meta-formats. For remaining errors, we adopt an iterative debugging loop (simulation-error localization-correction). We incorporate these error correction techniques into a foundational LLM-based RTL code generation framework, resulting in significantly improved performance. Experimental results show that our enhanced framework achieves 91.0% accuracy on the VerilogEval benchmark, surpassing the baseline code generation approach by 32.7%, demonstrating the effectiveness of our methods.
34. Congestion Mitigation Path Planning for Large-Scale Multi-Agent Navigation in Dense Environments
Authors: Takuro Kato, Keisuke Okumura, Yoko Sasaki, Naoya Yokomachi
Published: 2025-08-07
Source: arXiv
In high-density environments where numerous autonomous agents move simultaneously in a distributed manner, streamlining global flows to mitigate local congestion is crucial to maintain overall navigation efficiency. This paper introduces a novel path-planning problem, congestion mitigation path planning (CMPP), which embeds congestion directly into the cost function, defined by the usage of incoming edges along agents' paths. CMPP assigns a flow-based multiplicative penalty to each vertex of a sparse graph, which grows steeply where frequently-traversed paths intersect, capturing the intuition that congestion intensifies where many agents enter the same area from different directions. Minimizing the total cost yields a set of coarse-level, time-independent routes that autonomous agents can follow while applying their own local collision avoidance. We formulate the problem and develop two solvers: (i) an exact mixed-integer nonlinear programming solver for small instances, and (ii) a scalable two-layer search algorithm, A-CMTS, which quickly finds suboptimal solutions for large-scale instances and iteratively refines them toward the optimum. Empirical studies show that augmenting state-of-the-art collision-avoidance planners with CMPP significantly reduces local congestion and enhances system throughput in both discrete- and continuous-space scenarios. These results indicate that CMPP improves the performance of multi-agent systems in real-world applications such as logistics and autonomous-vehicle operations.
35. CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL
Authors: Sijie Wang, Quanjiang Guo, Kai Zhao, Yawei Zhang, Xin Li, Xiang Li, Siqi Li, Rui She, Shangshu Yu, Wee Peng Tay
Published: 2025-08-07
Source: arXiv
Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using "human instruction-final answer" pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable and effective training pipeline.