🤖 AI Research Papers

August 07, 2025

🤖 AI-Generated Research Summary

Comprehensive Summary of 35 Recent Papers on AI, LLMs, Agents, and Workflows


1. Key Research Trends

a. Multi-Agent and Agentic AI Systems
- Increasing focus on multi-agent collaboration (Papers 2, 12, 29) for complex tasks, including document understanding, test-time scaling, and molecular optimization.
- Emergence of agentic AI maturity models (Paper 28) and frameworks for agent placement/migration in edge environments (Paper 6).

b. Large Language Models (LLMs) and Their Applications
- LLMs are being leveraged for workflow automation (Paper 10), algorithmic discovery (Paper 19), fairness and bias mitigation (Paper 21), and self-improvement (Paper 34).
- LLM distillation and cross-encoder techniques (Paper 4) are used for domain-specific tasks like ad keyphrase recommendation.

c. AI for Scientific Discovery and Optimization
- Application of AI and LLMs in scientific domains: orbital transfers (Paper 3), gravitational-wave detection (Paper 19), protein modeling (Paper 15), and satellite image enhancement (Paper 31).
- Automated algorithmic discovery and physics-aware generative models are gaining traction.

d. Fairness, Explainability, and Societal Impact
- Growing attention to fairness in NLP (Paper 21), private counterfactual explanations (Paper 11), and frameworks for societal impact assessment (Paper 35).

e. Multimodal and Multilingual AI
- Advances in multimodal AI for video generation (Paper 30), visual document understanding (Paper 2), and educational tools (Paper 32).
- Efforts to build multilingual resources for underrepresented languages (Paper 17).

f. Workflow and Process Automation
- LLMs and agents are being used to automate and standardize workflows in cybersecurity (Paper 10), software engineering (Paper 16), and robotics (Paper 26).



Conclusion

This collection of papers highlights a vibrant and rapidly evolving AI landscape, with significant advances in multi-agent systems, LLM applications, scientific discovery, fairness, and workflow automation. The field is moving towards more collaborative, adaptive, and explainable AI systems that are both resource-efficient and societally responsible. Researchers and practitioners should focus on scalable agentic architectures, self-improving models, and integrated, domain-aware solutions to address the next generation of AI challenges.

📚 arXiv (35 papers)
1. LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
Authors: Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu β€’ Published: 2025-08-05 β€’ Source: arXiv
Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.
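The contrast between independent per-clip normalization and LongVie's global control signal normalization can be sketched in a few lines. The function name and the choice of pooled mean/std statistics are illustrative assumptions, not details from the paper:

```python
from statistics import mean, pstdev

def normalize_controls(clips, global_norm=True):
    """Normalize per-frame control values (e.g., depth) either with
    statistics pooled over the whole video (the global scheme) or
    independently per clip (the failure mode it avoids)."""
    if not global_norm:
        # Per-clip stats: control magnitudes drift between clips.
        return [[(x - mean(c)) / pstdev(c) for x in c] for c in clips]
    # One mean/std for the entire video keeps the control space
    # aligned across every clip.
    flat = [x for c in clips for x in c]
    m, s = mean(flat), pstdev(flat)
    return [[(x - m) / s for x in c] for c in clips]
```

With per-clip normalization, two clips whose raw control values differ by a large offset map to identical normalized signals, so cross-clip alignment is lost; global normalization preserves the offset.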
2. Prospects of a New $L_5$ Trojan Flyby Target for the Lucy Mission
Authors: Luis E. Salazar Manzano, David W. Gerdes, Kevin J. Napier, Hsing Wen Lin, Fred C. Adams, Tessa Frincke, Simone Marchi, Keith S. Noll, John Spencer β€’ Published: 2025-08-05 β€’ Source: arXiv
NASA's Lucy spacecraft is en route to conduct the first close encounter with Jupiter's Trojans. While most scheduled flybys lie in the $L_4$ cloud, the only $L_5$ target is the Patroclus-Menoetius binary. Since each flyby offers unique insights into target and population properties unattainable from Earth, we examine the feasibility of including an additional, yet unknown, $L_5$ target while minimizing the impact on Lucy's primary mission. We use the background $L_5$ Trojans brighter than the completeness limit to model their absolute magnitude, spatial, and orbital distributions. A semi-analytical approach estimates the number of Trojans accessible to Lucy for a given $\Delta v$ budget in both pre- and post-Patroclus scenarios. Our results indicate that, while it is unlikely that any suitable Trojan lies on Lucy's nominal path, a moderate $\Delta v$ investment ($35-50\,\mathrm{m/s}$) could enable a sub-kilometer ($500-700\,\mathrm{m}$) flyby prior to the Patroclus encounter. Post-Patroclus, the likelihood of a similar flyby is $\sim60\%$ for $\Delta v\sim$ 50 m/s. Simulations with synthetic Trojans reveal that potential targets cluster near the node opposite to the encounter window, producing an optimal search period in late 2026 for both scenarios. Surveying the densest $10\%$ of this region would require under 5 nights with Subaru/HSC or under 2 nights with Rubin, using shift-and-stack techniques. A successful sub-kilometric flyby would expand Lucy's Trojan target size range and provide new constraints on collisional evolution and the long-standing asymmetry in the $L_4/L_5$ clouds. This nodal-clustering strategy could guide target searches in future Lucy extensions or other planetary flyby missions.
3. Self-Questioning Language Models
Authors: Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, Deepak Pathak β€’ Published: 2025-08-05 β€’ Source: arXiv
Can large language models improve without external data -- by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.
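The reward scheme described in the abstract can be sketched as follows. Only the majority-voting proxy and the "not too easy, not too hard" proposer objective are taken from the paper; the function names and the 0.2/0.8 difficulty band are illustrative assumptions:

```python
from collections import Counter

def solver_reward(sampled_answers, candidate):
    """Proxy reward without ground truth: the solver's candidate is
    rewarded if it matches the most common answer among samples."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if candidate == majority else 0.0

def proposer_reward(solve_rate, low=0.2, high=0.8):
    """The proposer is rewarded for questions that are neither too
    easy nor too hard, measured by the solver's empirical solve rate."""
    return 1.0 if low < solve_rate < high else 0.0
```

For coding tasks the abstract replaces majority voting with proposer-generated unit tests as the verification signal.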
4. What If, But Privately: Private Counterfactual Retrieval
Authors: Shreya Meel, Mohamed Nomeir, Pasan Dissanayake, Sanghamitra Dutta, Sennur Ulukus β€’ Published: 2025-08-05 β€’ Source: arXiv
Transparency and explainability are two important aspects to be considered when employing black-box machine learning models in high-stake applications. Providing counterfactual explanations is one way of catering to this requirement. However, this also poses a threat to the privacy of the institution that is providing the explanation, as well as the user who is requesting it. In this work, we are primarily concerned with the privacy of a user who wants to retrieve a counterfactual instance, without revealing their feature vector to the institution. Our framework retrieves the exact nearest neighbor counterfactual explanation from a database of accepted points while achieving perfect, information-theoretic, privacy for the user. First, we introduce the problem of private counterfactual retrieval (PCR) and propose a baseline PCR scheme that keeps the user's feature vector information-theoretically private from the institution. Building on this, we propose two other schemes that reduce the amount of information leaked about the institution database to the user, compared to the baseline scheme. Second, we relax the assumption of mutability of all features, and consider the setting of immutable PCR (I-PCR). Here, the user retrieves the nearest counterfactual without altering a private subset of their features, which constitutes the immutable set, while keeping their feature vector and immutable set private from the institution. For this, we propose two schemes that preserve the user's privacy information-theoretically, but ensure varying degrees of database privacy. Third, we extend our PCR and I-PCR schemes to incorporate user's preference on transforming their attributes, so that a more actionable explanation can be received. Finally, we present numerical results to support our theoretical findings, and compare the database leakage of the proposed schemes.
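Stripped of the privacy machinery, the underlying query is plain nearest-neighbor retrieval over accepted points, optionally constrained on an immutable feature set (the I-PCR setting). A minimal non-private sketch, with squared Euclidean distance as an assumed metric:

```python
def nearest_counterfactual(accepted_db, user_x, immutable=()):
    """Non-private version of the query that PCR protects: return the
    closest accepted point, optionally required to agree with the user
    on a set of immutable feature indices (the I-PCR setting)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    candidates = [p for p in accepted_db
                  if all(p[i] == user_x[i] for i in immutable)]
    return min(candidates, key=lambda p: dist2(p, user_x))
```

The paper's contribution is answering exactly this query while the institution learns nothing about `user_x` or the immutable set, in the information-theoretic sense.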
5. Streaming Generated Gaussian Process Experts for Online Learning and Control
Authors: Zewen Yang, Dongfa Zhang, Xiaobing Dai, Fengyi Yu, Chi Zhang, Bingkun Huang, Hamid Sadeghian, Sami Haddadin β€’ Published: 2025-08-05 β€’ Source: arXiv
Gaussian Processes (GPs), as a nonparametric learning method, offer flexible modeling capabilities and calibrated uncertainty quantification for function approximations. Additionally, GPs support online learning by efficiently incorporating new data with polynomial-time computation, making them well-suited for safety-critical dynamical systems that require rapid adaptation. However, the inference and online updates of exact GPs, when processing streaming data, incur cubic computation time and quadratic storage memory complexity, limiting their scalability to large datasets in real-time settings. In this paper, we propose a \underline{s}treaming \underline{k}ernel-induced progressivel\underline{y} generated expert framework of \underline{G}aussian \underline{p}rocesses (SkyGP) that addresses both computational and memory constraints by maintaining a bounded set of experts, while inheriting the learning performance guarantees from exact Gaussian processes. Furthermore, two SkyGP variants are introduced, each tailored to a specific objective, either maximizing prediction accuracy (SkyGP-Dense) or improving computational efficiency (SkyGP-Fast). The effectiveness of SkyGP is validated through extensive benchmarks and real-time control experiments demonstrating its superior performance compared to state-of-the-art approaches.
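The bounded-expert idea can be sketched as a routing rule for streaming points: assign each new point to its most correlated expert, and spawn a new expert only while the budget allows. The kernel threshold and expert cap are assumed parameters, and the actual SkyGP update of each expert's GP posterior is omitted:

```python
import math

def assign_to_expert(experts, x, kernel, threshold=0.5, max_experts=10):
    """Route a streaming point to the most correlated expert; create a
    new expert only if no expert correlates well and the bounded set
    still has room (keeping memory bounded)."""
    if experts:
        # Correlation of x with each expert = best kernel value
        # against that expert's stored points.
        scores = [max(kernel(x, c) for c in e) for e in experts]
        best = max(range(len(experts)), key=scores.__getitem__)
        if scores[best] >= threshold or len(experts) >= max_experts:
            experts[best].append(x)
            return best
    experts.append([x])
    return len(experts) - 1
```

With an RBF kernel, nearby points land in the same expert while distant points open a new one, until the cap forces reuse.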
6. FairLangProc: A Python package for fairness in NLP
Authors: Arturo PΓ©rez-Peralta, Sandra BenΓ­tez-PeΓ±a, Rosa E. Lillo β€’ Published: 2025-08-05 β€’ Source: arXiv
The rise in usage of Large Language Models to near ubiquitousness in recent years has raised societal concerns about their applications in decision-making contexts, such as organizational justice or healthcare. This, in turn, poses questions about the fairness of these models in critical settings, which leads to the development of different procedures to address bias in Natural Language Processing. Although many datasets, metrics and algorithms have been proposed to measure and mitigate harmful prejudice in Natural Language Processing, their implementation is diverse and far from centralized. As a response, this paper presents FairLangProc, a comprehensive Python package providing a common implementation of some of the more recent advances in fairness in Natural Language Processing, with an interface compatible with the popular Hugging Face transformers library, aiming to encourage the widespread use and democratization of bias mitigation techniques. The implementation can be found on https://github.com/arturo-perez-peralta/FairLangProc.
7. Inland-LOAM: Voxel-Based Structural Semantic Mapping for Inland Waterways
Authors: Zhongbi Luo, Yunjia Wang, Jan Swevers, Peter Slaets, Herman Bruyninckx β€’ Published: 2025-08-05 β€’ Source: arXiv
Accurate geospatial information is crucial for safe, autonomous Inland Waterway Transport (IWT), as existing charts (IENC) lack real-time detail and conventional LiDAR SLAM fails in waterway environments. These challenges lead to vertical drift and non-semantic maps, hindering autonomous navigation. This paper introduces Inland-LOAM, a LiDAR SLAM framework for waterways. It uses an improved feature extraction and a water surface planar constraint to mitigate vertical drift. A novel pipeline transforms 3D point clouds into structured 2D semantic maps using voxel-based geometric analysis, enabling real-time computation of navigational parameters like bridge clearances. An automated module extracts shorelines and exports them into a lightweight, IENC-compatible format. Evaluations on a real-world dataset show Inland-LOAM achieves superior localization accuracy over state-of-the-art methods. The generated semantic maps and shorelines align with real-world conditions, providing reliable data for enhanced situational awareness. The code and dataset will be publicly available.
8. Graded chain conditions and graded Jacobson radical of groupoid graded modules
Authors: Zaqueu Cristiano, Wellington Marques de Souza, Javier SΓ‘nchez β€’ Published: 2025-08-05 β€’ Source: arXiv
In this work, we continue to lay the groundwork for the theory of groupoid graded rings and modules. The main topics we address include graded chain conditions, the graded Jacobson radical, and the gr-socle for graded modules. We present several descending (ascending) chain conditions for graded modules and we refer to the most general one as $\Gamma_0$-artinian ($\Gamma_0$-noetherian). We show that $\Gamma_0$-artinian (resp. $\Gamma_0$-noetherian) modules share many properties with artinian (noetherian) modules in the classical theory. However, we present an example of a right $\Gamma_0$-artinian ring that is not right $\Gamma_0$-noetherian. Following the pattern of the classical case, we examine the basic properties of the graded Jacobson radical and the gr-socle for groupoid graded modules. We also establish some fundamental properties of the graded Jacobson radical of groupoid graded rings. Finally, we introduce the notion of gr-semilocal ring, which simultaneously generalizes the concepts of semilocal ring and semilocal (small) category.
9. Beyond risk: A proto-framework for assessing the societal impact of AI systems
Authors: Willem Fourie β€’ Published: 2025-08-05 β€’ Source: arXiv
In the discourse on AI regulation, 'responsible AI' is the dominant paradigm, with the focus on mitigating the risks related to AI systems. While this focus is important and necessary, it has limited use for a systematic consideration of AI's societal impact. This paper proposes a proto-framework for assessing the societal impact of AI systems by operationalising the concept of freedom. This proto-framework is intended as a step towards a fully operationalised framework to be used in policymaking contexts. By drawing on Kantian philosophy and related contemporary interpretations, freedom is developed as the counterpart to the concept of responsibility. Two dimensions of freedom are developed in further detail: freedom as capability and freedom as opportunity. These two dimensions of freedom are then applied in a proto-framework that systematically considers AI's impact on society using the Sustainable Development Goals. This proto-framework aims to complement current risk-based approaches and thereby offers a first step towards operationalising the concept of freedom in AI regulation.
10. Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search
Authors: He Wang, Liang Zeng β€’ Published: 2025-08-05 β€’ Source: arXiv
Computational scientific discovery increasingly relies on algorithms to process complex data and identify meaningful patterns - yet faces persistent challenges in gravitational-wave signal identification. While existing algorithmic approaches like matched filtering (MF) and deep neural networks (DNNs) have achieved partial success, their shortcomings stem directly from fundamental limitations: MF's excessive computational demands arise from its reliance on predefined theoretical waveform templates, while DNNs' black-box architectures obscure decision logic and introduce hidden biases. We propose Evolutionary Monte Carlo Tree Search (Evo-MCTS), a framework that addresses these limitations through systematic algorithm space exploration guided by domain-aware physical constraints. Our approach combines tree-structured search with evolutionary optimization and large language model heuristics to create interpretable algorithmic solutions. Our Evo-MCTS framework demonstrates substantial improvements, achieving a 20.2\% improvement over state-of-the-art gravitational wave detection algorithms on the MLGWSC-1 benchmark dataset. High-performing algorithm variants consistently exceed performance thresholds. The framework generates human-interpretable algorithmic pathways that reveal distinct performance patterns. Beyond performance improvements, our framework discovers novel algorithmic combinations, thereby establishing a transferable methodology for automated algorithmic discovery across computational science domains.
11. Efficient Morphology-Aware Policy Transfer to New Embodiments
Authors: Michael Przystupa, Hongyao Tang, Martin Jagersand, Santiago Miret, Mariano Phielipp, Matthew E. Taylor, Glen Berseth β€’ Published: 2025-08-05 β€’ Source: arXiv
Morphology-aware policy learning is a means of enhancing policy sample efficiency by aggregating data from multiple agents. These types of policies have previously been shown to help generalize over dynamic, kinematic, and limb configuration variations between agent morphologies. Unfortunately, these policies still have sub-optimal zero-shot performance compared to end-to-end finetuning on morphologies at deployment. This limitation has ramifications in practical applications such as robotics because further data collection to perform end-to-end finetuning can be computationally expensive. In this work, we investigate combining morphology-aware pretraining with parameter efficient finetuning (PEFT) techniques to help reduce the learnable parameters necessary to specialize a morphology-aware policy to a target embodiment. We compare directly tuning sub-sets of model weights, input learnable adapters, and prefix tuning techniques for online finetuning. Our analysis reveals that PEFT techniques in conjunction with policy pre-training generally help reduce the number of samples necessary to improve a policy compared to training models end-to-end from scratch. We further find that tuning less than 1% of total parameters improves policy performance compared to the zero-shot performance of the base pretrained policy.
12. Intent Preserving Generation of Diverse and Idiomatic (Code-)Artifacts
Authors: Oliver Westphal β€’ Published: 2025-08-05 β€’ Source: arXiv
When automatically generating programming exercise tasks one often also needs to automatically generate programs. At the very least when providing sample solutions is part of automated feedback. But programs can also be used as part of the exercise task description to communicate a task's requirements. Writing good program generators that produce varied yet idiomatic code while being easily adaptable for new tasks is challenging. The challenges are intensified if task generation requires additional artifacts, like a more general behavior specification for testing or additional textual descriptions. Manually writing generators for multiple different but strongly related artifacts gets complicated quickly. We present an approach where instead of writing monolithic generators for multiple connected artifacts one specifies a small set of abstract building blocks and for each such building block defines sets of concrete realizations for various kinds of artifacts. Then the intended structure of the resulting artifacts is specified as a composition of the small abstract building blocks. This abstract description then serves as the common source from which related artifacts can be derived automatically. The approach is generic in the kind of artifacts it can produce and is therefore adaptable to a wide range of contexts.
13. Likelihood Matching for Diffusion Models
Authors: Lei Qian, Wu Su, Yanqi Huang, Song Xi Chen β€’ Published: 2025-08-05 β€’ Source: arXiv
We propose a Likelihood Matching approach for training diffusion models by first establishing an equivalence between the likelihood of the target data distribution and a likelihood along the sample path of the reverse diffusion. To efficiently compute the reverse sample likelihood, a quasi-likelihood is considered to approximate each reverse transition density by a Gaussian distribution with matched conditional mean and covariance, respectively. The score and Hessian functions for the diffusion generation are estimated by maximizing the quasi-likelihood, ensuring a consistent matching of both the first two transitional moments between every two time points. A stochastic sampler is introduced to facilitate computation that leverages on both the estimated score and Hessian information. We establish consistency of the quasi-maximum likelihood estimation, and provide non-asymptotic convergence guarantees for the proposed sampler, quantifying the rates of the approximation errors due to the score and Hessian estimation, dimensionality, and the number of diffusion steps. Empirical and simulation evaluations demonstrate the effectiveness of the proposed Likelihood Matching and validate the theoretical results.
14. LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations at eBay
Authors: Soumik Dey, Benjamin Braun, Naveen Ravipati, Hansi Wu, Binbin Li β€’ Published: 2025-08-05 β€’ Source: arXiv
Sellers at eBay are recommended keyphrases to bid on to enhance the performance of their advertising campaigns. The relevance of these keyphrases is crucial in avoiding the overcrowding of search systems with irrelevant items and maintaining a positive seller perception. It is essential that keyphrase recommendations align with both seller and Search judgments regarding auctions. Due to the difficulty in procuring negative human judgment at scale, employing LLM-as-a-judge to mimic seller judgment has been established as the norm in several studies. This study introduces a novel two-step LLM distillation process from an LLM-judge used to debias our Embedding Based Retrieval (EBR) model from the various biases that exist in click-data. We distill from an LLM teacher via a cross-encoder assistant into a bi-encoder student using a multi-task training approach, ultimately employing the student bi-encoder to retrieve relevant advertiser keyphrases. We show that integrating a knowledge distillation process from LLMs in a multi-task training setup enhances bi-encoder performance in retrieving relevant advertiser keyphrases at eBay.
15. FlowBack-Adjoint: Physics-Aware and Energy-Guided Conditional Flow-Matching for All-Atom Protein Backmapping
Authors: Alex Berlaga, Michael S. Jones, Andrew L. Ferguson β€’ Published: 2025-08-05 β€’ Source: arXiv
Coarse-grained (CG) molecular models of proteins can substantially increase the time and length scales accessible to molecular dynamics simulations of proteins, but recovery of accurate all-atom (AA) ensembles from CG simulation trajectories can be essential for exposing molecular mechanisms of folding and docking and for calculation of physical properties requiring atomistic detail. The recently reported deep generative model FlowBack restores AA detail to protein C-alpha traces using a flow-matching architecture and demonstrates state-of-the-art performance in generation of AA structural ensembles. Training, however, is performed exclusively on structural data and the absence of any awareness of interatomic energies or forces within training results in small fractions of incorrect bond lengths, atomic clashes, and otherwise high-energy structures. In this work, we introduce FlowBack-Adjoint as a lightweight enhancement that upgrades the pre-trained FlowBack model through a one-time, physics-aware post-training pass. Auxiliary contributions to the flow introduce physical awareness of bond lengths and Lennard-Jones interactions and gradients of a molecular mechanics force field energy are incorporated via adjoint matching to steer the FlowBack-Adjoint vector field to produce lower-energy configurations. In benchmark tests against FlowBack, FlowBack-Adjoint lowers single-point energies by a median of ~78 kcal/mol per residue, reduces errors in bond lengths by >92%, eliminates >98% of molecular clashes, maintains excellent diversity of the AA configurational ensemble, and produces configurations capable of initializing stable all-atom molecular dynamics simulations without requiring energy relaxation. We propose FlowBack-Adjoint as an accurate and efficient physics-aware deep generative model for AA backmapping from C-alpha traces.
16. FPG-NAS: FLOPs-Aware Gated Differentiable Neural Architecture Search for Efficient 6DoF Pose Estimation
Authors: Nassim Ali Ousalah, Peyman Rostami, Anis Kacem, Enjie Ghorbel, Emmanuel Koumandakis, Djamila Aouada β€’ Published: 2025-08-05 β€’ Source: arXiv
We introduce FPG-NAS, a FLOPs-aware Gated Differentiable Neural Architecture Search framework for efficient 6DoF object pose estimation. Estimating 3D rotation and translation from a single image has been widely investigated yet remains computationally demanding, limiting applicability in resource-constrained scenarios. FPG-NAS addresses this by proposing a specialized differentiable NAS approach for 6DoF pose estimation, featuring a task-specific search space and a differentiable gating mechanism that enables discrete multi-candidate operator selection, thus improving architectural diversity. Additionally, a FLOPs regularization term ensures a balanced trade-off between accuracy and efficiency. The framework explores a vast search space of approximately 10^92 possible architectures. Experiments on the LINEMOD and SPEED+ datasets demonstrate that FPG-NAS-derived models outperform previous methods under strict FLOPs constraints. To the best of our knowledge, FPG-NAS is the first differentiable NAS framework specifically designed for 6DoF object pose estimation.
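A FLOPs regularization term of this kind can be sketched as an expected-cost penalty under the gate's softmax distribution over candidate operators. The objective form and the lambda value below are illustrative assumptions, not FPG-NAS's exact formulation:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def gated_objective(task_loss, gate_logits, op_flops, lam=1e-9):
    """Differentiable-NAS style objective: task loss plus a penalty on
    the expected FLOPs, weighting each candidate operator's cost by
    its gate probability."""
    probs = softmax(gate_logits)
    expected_flops = sum(p * f for p, f in zip(probs, op_flops))
    return task_loss + lam * expected_flops
```

Because the penalty is a smooth function of the gate logits, gradient descent can trade accuracy against compute during the architecture search.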
17. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
Authors: Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, Chi Jin β€’ Published: 2025-08-05 β€’ Source: arXiv
We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.
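Innovation (3), model averaging, reduces in its simplest form to a uniform average of parameter values across checkpoints. The dict-of-lists representation below is an illustrative stand-in for real weight tensors:

```python
def average_checkpoints(checkpoints):
    """Merge model checkpoints (dicts mapping parameter name to a flat
    list of weights) by uniform averaging, the kind of checkpoint
    merging used to counteract shrinking output diversity late in
    training."""
    n = len(checkpoints)
    return {key: [sum(ws) / n
                  for ws in zip(*(ckpt[key] for ckpt in checkpoints))]
            for key in checkpoints[0]}
```

In practice the same elementwise average is taken over full tensors with a library such as PyTorch, but the operation is identical.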
18. Towards a classification of topological defects in $K3$ sigma models
Authors: Roberta Angius, Stefano Giaccari β€’ Published: 2025-08-05 β€’ Source: arXiv
Given a $K3$ surface, a supersymmetric non-linear K3 sigma model is the internal superconformal field theory (SCFT) in a six dimensional compactification of type IIA superstring on $\mathbb{R}^{1,5} \times K3$. These models have attracted attention due to the discovery of Mathieu moonshine phenomena for the elliptic genera of K3 surfaces, and have played a pivotal role in extending Mukai's theorem on classification of symplectic automorphisms of $K3$ surfaces. We report on recent progress (arXiv:2402.08719 [hep-th]) in characterizing topological defects in $K3$ models, generalizing the notion of symmetries to categories of topological operators supported on arbitrary codimension submanifolds with possibly non-invertible fusion rules. Taking advantage of the interpretation of Mukai lattice as the D-brane charge lattice, we present a number of general results for the category of topological defect lines preserving the superconformal algebra and spectral flow, obtained by studying their fusion with boundary states. While for certain K3 models infinitely many simple defects, and even a continuum, can occur, at generic points in the moduli space the category is actually trivial, i.e. it is generated by the identity defect. Furthermore, if a K3 model is at the attractor point for some BPS configuration of D-branes, then all topological defects have integral quantum dimension. We also introduce a conjecture that a continuum of topological defects arises if and only if the K3 model is a (possibly generalized) orbifold of a torus model. These general results are confirmed by the analysis of significant examples. We also point out the connection to recent studies of topological defects in the Conway moonshine module theory (arXiv:2412.21141 [hep-th],arXiv:2504.18619 [hep-th]).
19. CloudBreaker: Breaking the Cloud Covers of Sentinel-2 Images using Multi-Stage Trained Conditional Flow Matching on Sentinel-1
Authors: Saleh Sakib Ahmed, Sara Nowreen, M. Sohel Rahman β€’ Published: 2025-08-05 β€’ Source: arXiv
Cloud cover and nighttime conditions remain significant limitations in satellite-based remote sensing, often restricting the availability and usability of multi-spectral imagery. In contrast, Sentinel-1 radar images are unaffected by cloud cover and can provide consistent data regardless of weather or lighting conditions. To address the challenges of limited satellite imagery, we propose CloudBreaker, a novel framework that generates high-quality multi-spectral Sentinel-2 signals from Sentinel-1 data. This includes the reconstruction of optical (RGB) images as well as critical vegetation and water indices such as NDVI and NDWI. We employed a novel multi-stage training approach based on conditional latent flow matching and, to the best of our knowledge, are the first to integrate cosine scheduling with flow matching. CloudBreaker demonstrates strong performance, achieving a Frechet Inception Distance (FID) score of 0.7432, indicating high fidelity and realism in the generated optical imagery. The model also achieved Structural Similarity Index Measure (SSIM) of 0.6156 for NDWI and 0.6874 for NDVI, indicating a high degree of structural similarity. This establishes CloudBreaker as a promising solution for a wide range of remote sensing applications where multi-spectral data is typically unavailable or unreliable.
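The abstract highlights combining cosine scheduling with flow matching. The standard cosine form is sketched below, with no claim that it matches CloudBreaker's exact parameterization:

```python
import math

def cosine_schedule(step, total_steps):
    """Cosine-shaped interpolation weight in [0, 1]: it changes slowly
    at the start and end of the trajectory and fastest in the middle,
    unlike a linear schedule with constant rate."""
    return 0.5 * (1.0 - math.cos(math.pi * step / total_steps))
```

In flow-matching samplers such a schedule reallocates integration effort, taking finer effective steps near the endpoints of the noise-to-data path.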
20. DyCAF-Net: Dynamic Class-Aware Fusion Network
Authors: Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen β€’ Published: 2025-08-05 β€’ Source: arXiv
Recent advancements in object detection rely on modular architectures with multi-scale fusion and attention mechanisms. However, static fusion heuristics and class-agnostic attention limit performance in dynamic scenes with occlusions, clutter, and class imbalance. We introduce Dynamic Class-Aware Fusion Network (DyCAF-Net) that addresses these challenges through three innovations: (1) an input-conditioned equilibrium-based neck that iteratively refines multi-scale features via implicit fixed-point modeling, (2) a dual dynamic attention mechanism that adaptively recalibrates channel and spatial responses using input- and class-dependent cues, and (3) class-aware feature adaptation that modulates features to prioritize discriminative regions for rare classes. Through comprehensive ablation studies with YOLOv8 and related architectures, alongside benchmarking against nine state-of-the-art baselines, DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks, including occlusion-heavy and long-tailed datasets. The framework maintains computational efficiency ($\sim$11.1M parameters) and competitive inference speeds, while its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks in medical imaging, surveillance, and autonomous systems.
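The equilibrium-based neck refines features via implicit fixed-point modeling, i.e., iterating a refinement map until it converges to a fixed point rather than stacking a fixed number of layers. A toy illustration of the idea (the blend map below is a stand-in contraction, not the paper's actual layer):

```python
import numpy as np

def fixed_point(f, z0, tol=1e-8, max_iter=200):
    """Iterate z <- f(z) until the update norm drops below tol (implicit depth)."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Toy "neck": blend the current state with an input-conditioned feature x.
# The 0.5 coefficient keeps the map a contraction, so iteration converges.
x = np.array([1.0, -2.0, 3.0])
f = lambda z: 0.5 * np.tanh(z) + 0.5 * x
z_star = fixed_point(f, np.zeros_like(x))
# At equilibrium, z_star satisfies z_star == f(z_star) up to tolerance.
```

In deep-equilibrium networks the same idea is applied to learned layers, with gradients taken implicitly through the fixed point instead of through the unrolled iterations.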
21. The problem of sharp notch in microstructured solids governed by dipolar gradient elasticity
Authors: P. A. Gourgiotis, M. D. Sifnaiou, H. G. Georgiadis β€’ Published: 2025-08-05 β€’ Source: arXiv
In this paper, we deal with the asymptotic problem of a body of infinite extent with a notch (re-entrant corner) under remotely applied plane-strain or anti-plane shear loadings. The problem is formulated within the framework of the Toupin-Mindlin theory of dipolar gradient elasticity. This generalized continuum theory is appropriate to model the response of materials with microstructure. A linear version of the theory results by considering a linear isotropic expression for the strain-energy density that depends on strain-gradient terms, in addition to the standard strain terms appearing in classical elasticity. Through this formulation, a microstructural material length is introduced, in addition to the standard LamΓ© constants. The faces of the notch are considered to be traction-free and a boundary-layer approach is followed. The boundary value problem is attacked with the asymptotic Knein-Williams technique. Our analysis leads to an eigenvalue problem, which, along with the restriction of a bounded strain energy, provides the asymptotic fields. The cases of a crack and a half-space are analyzed in detail as limit cases of the general notch (infinite wedge) problem. The results show significant departure from the predictions of the standard fracture mechanics.
22. Marito: Structuring and Building Open Multilingual Terminologies for South African NLP
Authors: Vukosi Marivate, Isheanesu Dzingirai, Fiskani Banda, Richard Lastrucci, Thapelo Sindane, Keabetswe Madumo, Kayode Olaleye, Abiodun Modupe, Unarine Netshifhefhe, Herkulaas Combrink, Mohlatlego Nakeng, Matome Ledwaba β€’ Published: 2025-08-05 β€’ Source: arXiv
The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. *Marito* addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational *Marito* dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. *Marito* provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.
23. Understanding Demand for Shared Autonomous Micro-Mobility
Authors: Naroa Coretti Sanchez, Kent Larson β€’ Published: 2025-08-05 β€’ Source: arXiv
This study examines the behavioral and environmental implications of shared autonomous micro-mobility systems, focusing on autonomous bicycles and their integration with transit in the U.S. While prior research has addressed operational and lifecycle aspects, a critical gap remains in understanding which modes these services are likely to substitute, who is most inclined to adopt them, and how service attributes influence user decisions. We design a context-aware stated preference survey grounded in real-world trips and estimate discrete choice models, including a hybrid model incorporating latent attitudes. Findings indicate that adoption, mode shift, and environmental impacts are highly sensitive to service design. Scenarios with minimal wait and cost yield high adoption but increase emissions, while moderate waits are more likely to reduce impacts. Adoption likelihood varies with demographic characteristics, and outcomes depend on city type, context, and infrastructure assumptions. These insights can inform the development of more sustainable and equitable mobility systems.
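The discrete choice models estimated here build on the multinomial logit form, where the probability of choosing an alternative is the softmax of its systematic utility. A minimal sketch with made-up utilities (the paper's hybrid model adds latent-attitude terms on top of this):

```python
import numpy as np

def logit_probs(utilities):
    """Multinomial logit choice probabilities P_i = exp(V_i) / sum_j exp(V_j)."""
    v = np.asarray(utilities, dtype=float)
    e = np.exp(v - v.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical systematic utilities for car, transit, and autonomous bike
probs = logit_probs([-0.5, -1.0, -0.8])
```

Service attributes such as wait time and cost enter through the utilities, which is why the abstract's adoption shares are so sensitive to service design.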
24. Theatre in the Loop: A Rehearsal-Based, Collaborative Workflow for Expressive Robotic Behaviours
Authors: Pavlos Panagiotidis, Victor Zhi Heung Ngo, Sean Myatt, Roma Patel, Rachel Ramchurn, Alan Chamberlain, Ayse Kucukyilmaz β€’ Published: 2025-08-05 β€’ Source: arXiv
In this paper, we propose theatre-in-the-loop, a framework for developing expressive robot behaviours tailored to artistic performance through a director-guided puppeteering workflow. Leveraging theatrical methods, we use narrative objectives to direct a puppeteer in generating improvised robotic gestures that convey specific emotions. These improvisations are captured and curated to build a dataset of reusable movement templates for standalone playback in future autonomous performances. Initial trials demonstrate the feasibility of this approach, illustrating how the workflow enables precise sculpting of robotic gestures into coherent emotional arcs while revealing challenges posed by the robot's mechanical constraints. We argue that this practice-led framework provides a model for interdisciplinary teams creating socially expressive robot behaviours, contributing to (1) theatre as an interactive training ground for human-robot interaction and (2) co-creation methodologies between humans and machines.
25. VQA support to Arabic Language Learning Educational Tool
Authors: Khaled Bachir Delassi, Lakhdar Zeggane, Hadda Cherroun, Abdelhamid Haouhat, Kaoutar Bouzouad β€’ Published: 2025-08-05 β€’ Source: arXiv
We address the scarcity of educational Arabic Language Learning tools that advocate modern pedagogical models such as active learning, which ensures language proficiency. In fact, we investigate the design and evaluation of an AI-powered educational tool designed to enhance Arabic language learning for non-native speakers with beginner-to-intermediate proficiency levels. The tool leverages advanced AI models to generate interactive visual quizzes, deploying Visual Question Answering as the primary activity. Adopting a constructivist learning approach, the system encourages active learning through real-life visual quizzes and image-based questions that focus on improving vocabulary, grammar, and comprehension. The system integrates Vision-Language Pretraining models to generate contextually relevant image descriptions, from which a Large Language Model generates assignments based on customized Arabic language learning quizzes via prompting. The effectiveness of the tool is evaluated through a manually annotated benchmark consisting of 1266 real-life visual quizzes, with human participants providing feedback. The results show suitable accuracy rates, validating the tool's potential to bridge the gap in Arabic language education and highlighting its promise as a reliable, AI-powered resource for Arabic learners, offering personalized and interactive learning experiences.
26. A Genetic Algorithm Framework for Optimizing Three-Impulse Orbital Transfers with Poliastro Simulation
Authors: Phuc Hao Do, Tran Duc Le β€’ Published: 2025-08-05 β€’ Source: arXiv
Orbital maneuver planning is a critical aspect of mission design, aimed at minimizing propellant consumption, which is directly correlated with the total velocity change ($\Delta V$). While analytical solutions like the Hohmann and Bi-elliptic transfers offer optimal strategies for specific cases, they lack the flexibility for more general optimization problems. This paper presents a computational framework that couples a Genetic Algorithm (GA) with the Poliastro orbital mechanics library to autonomously discover fuel-optimal, three-impulse transfer trajectories between coplanar circular orbits. We validate this framework across two distinct scenarios: a low-energy transfer from Low Earth Orbit (LEO) to a Geostationary Orbit (GEO), and a high-energy transfer to a distant orbit with a radius 20 times that of LEO. Our results demonstrate the framework's remarkable adaptability. For the LEO-to-GEO transfer, the GA precisely converges to the classical Hohmann transfer, achieving an identical $\Delta V$ of 3853.96 m/s and validating the method's accuracy. Conversely, for the high-energy transfer, the GA identifies a superior Bi-elliptic trajectory that yields a significant $\Delta V$ saving of 213.47 m/s compared to the Hohmann transfer. This fuel efficiency, however, necessitates a trade-off, extending the mission duration from approximately 1 day to over 140 years. This work demonstrates an accessible and powerful toolchain for the rapid prototyping of optimal trajectories, showcasing how combining evolutionary algorithms with open-source libraries provides a robust method for solving complex astrodynamics problems and quantifying their critical design trade-offs.
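The reported LEO-to-GEO figure can be checked against the closed-form Hohmann solution from the vis-viva equation, independent of the GA or Poliastro. A quick sketch, assuming a 400 km circular LEO (r β‰ˆ 6778 km) and GEO at 42164 km:

```python
from math import sqrt

MU = 398600.4418  # Earth's gravitational parameter, km^3/s^2

def hohmann_dv(r1, r2):
    """Total delta-v (m/s) for a two-impulse Hohmann transfer between
    coplanar circular orbits of radii r1 and r2 (km), via vis-viva."""
    a = 0.5 * (r1 + r2)                                 # transfer semi-major axis
    dv1 = sqrt(MU * (2 / r1 - 1 / a)) - sqrt(MU / r1)   # burn at departure orbit
    dv2 = sqrt(MU / r2) - sqrt(MU * (2 / r2 - 1 / a))   # circularization burn
    return 1000.0 * (dv1 + dv2)

dv = hohmann_dv(6778.0, 42164.0)   # ~3854 m/s, in line with the paper's 3853.96
```

That the GA recovers exactly this value for LEO-to-GEO, while beating it with a bi-elliptic trajectory at a radius ratio of 20, matches classical theory: bi-elliptic transfers only pay off for large radius ratios.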
27. Learning to Incentivize: LLM-Empowered Contract for AIGC Offloading in Teleoperation
Authors: Zijun Zhan, Yaxian Dong, Daniel Mawunyo Doe, Yuqing Hu, Shuai Li, Shaohua Cao, Zhu Han β€’ Published: 2025-08-05 β€’ Source: arXiv
With the rapid growth in demand for AI-generated content (AIGC), edge AIGC service providers (ASPs) have become indispensable. However, designing incentive mechanisms that motivate ASPs to deliver high-quality AIGC services remains a challenge, especially in the presence of information asymmetry. In this paper, we address bonus design between a teleoperator and an edge ASP when the teleoperator cannot observe the ASP's private settings and chosen actions (diffusion steps). We formulate this as an online learning contract design problem and decompose it into two subproblems: ASP's settings inference and contract derivation. To tackle the NP-hard setting-inference subproblem with unknown variable sizes, we introduce a large language model (LLM)-empowered framework that iteratively refines a naive seed solver using the LLM's domain expertise. Upon obtaining the solution from the LLM-evolved solver, we directly address the contract derivation problem using convex optimization techniques and obtain a near-optimal contract. Simulation results on our Unity-based teleoperation platform show that our method boosts the teleoperator's utility by $5 \sim 40\%$ compared to benchmarks, while preserving positive incentives for the ASP. The code is available at https://github.com/Zijun0819/llm4contract.
28. An Auditable Agent Platform For Automated Molecular Optimisation
Authors: Atabey ÜnlΓΌ, Phil Rohr, Ahmet Celebi β€’ Published: 2025-08-05 β€’ Source: arXiv
Drug discovery frequently loses momentum when data, expertise, and tools are scattered, slowing design cycles. To shorten this loop we built a hierarchical, tool-using agent framework that automates molecular optimisation. A Principal Researcher defines each objective, a Database agent retrieves target information, an AI Expert generates de novo scaffolds with a sequence-to-molecule deep learning model, a Medicinal Chemist edits them while invoking a docking tool, a Ranking agent scores the candidates, and a Scientific Critic polices the logic. Each tool call is summarised and stored, so the full reasoning path remains inspectable. The agents communicate through concise provenance records that capture molecular lineage, to build auditable, molecule-centered reasoning trajectories and reuse successful transformations via in-context learning. Three-cycle research loops were run against the AKT1 protein using five large language models. After ranking the models by mean docking score, we ran 20 independent scale-ups on the two top performers. We then compared the leading LLMs' binding-affinity results across three configurations: LLM-only, single-agent, and multi-agent. Our results reveal an architectural trade-off: the multi-agent setting excelled at focused binding optimization, improving average predicted binding affinity by 31%. In contrast, single-agent runs generated molecules with superior drug-like properties at the cost of less potent binding scores. Unguided LLM runs finished fastest, yet their lack of transparent tool signals left the validity of their reasoning paths unverified. These results show that test-time scaling, focused feedback loops, and provenance convert general-purpose LLMs into auditable systems for molecular design, and suggest that extending the toolset to ADMET and selectivity predictors could push research workflows further along the discovery pipeline.
29. Multi-Objective Infeasibility Diagnosis for Routing Problems Using Large Language Models
Authors: Kai Li, Ruihao Zheng, Xinye Hao, Zhenkun Wang β€’ Published: 2025-08-05 β€’ Source: arXiv
In real-world routing problems, users often propose conflicting or unreasonable requirements, which result in infeasible optimization models due to overly restrictive or contradictory constraints, leading to an empty feasible solution set. Existing Large Language Model (LLM)-based methods attempt to diagnose infeasible models, but modifying such models often involves multiple potential adjustments that these methods do not consider. To fill this gap, we introduce Multi-Objective Infeasibility Diagnosis (MOID), which combines LLM agents and multi-objective optimization within an automatic routing solver, to provide a set of representative actionable suggestions. Specifically, MOID employs multi-objective optimization to consider both path cost and constraint violation, generating a set of trade-off solutions, each encompassing varying degrees of model adjustments. To extract practical insights from these solutions, MOID utilizes LLM agents to generate a solution analysis function for the infeasible model. This function analyzes these distinct solutions to diagnose the original infeasible model, providing users with diverse diagnostic insights and suggestions. Finally, we compare MOID with several LLM-based methods on 50 types of infeasible routing problems. The results indicate that MOID automatically generates multiple diagnostic suggestions in a single run, providing more practical insights for restoring model feasibility and decision-making compared to existing methods.
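MOID's trade-off solutions form a Pareto front over path cost and total constraint violation. A minimal non-dominated filter conveys the idea (the objective values below are invented for illustration; MOID's actual solver is a multi-objective optimization routine inside an automatic routing solver):

```python
def pareto_front(points):
    """Return the non-dominated subset of (cost, violation) pairs,
    minimizing both objectives: keep p unless some other point is
    at least as good in both objectives."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

# Candidate route plans: (path cost, total constraint violation)
candidates = [(100, 5.0), (120, 2.0), (150, 0.0), (130, 3.0), (160, 1.0)]
front = pareto_front(candidates)
```

Each surviving point corresponds to a different degree of model adjustment, which is exactly what the LLM agents then analyze to produce diagnostic suggestions.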
30. Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
Authors: Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan β€’ Published: 2025-08-05 β€’ Source: arXiv
Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing performance on general and mathematical tasks. In particular, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
31. Agentic AI in 6G Software Businesses: A Layered Maturity Model
Authors: Muhammad Zohaib, Muhammad Azeem Akbar, Sami Hyrynsalmi, Arif Ali Khan β€’ Published: 2025-08-05 β€’ Source: arXiv
The emergence of agentic AI systems in 6G software businesses presents both strategic opportunities and significant challenges. While such systems promise increased autonomy, scalability, and intelligent decision-making across distributed environments, their adoption raises concerns regarding technical immaturity, integration complexity, organizational readiness, and performance-cost trade-offs. In this study, we conducted a preliminary thematic mapping to identify factors influencing the adoption of agentic software within the context of 6G. Drawing on a multivocal literature review and targeted scanning, we identified 29 motivators and 27 demotivators, which were further categorized into five high-level themes in each group. This thematic mapping offers a structured overview of the enabling and inhibiting forces shaping organizational readiness for agentic transformation. Positioned as a feasibility assessment, the study represents an early phase of a broader research initiative aimed at developing and validating a layered maturity model grounded in the CMMI model and structured along three software architecture dimensions, tentatively Data, Business Logic, and Presentation. Ultimately, this work seeks to provide a practical framework to help software-driven organizations assess, structure, and advance their agent-first capabilities in alignment with the demands of 6G.
32. Adaptive AI Agent Placement and Migration in Edge Intelligence Systems
Authors: Xingdan Wang, Jiayi He, Zhiqing Tang, Jianxiong Guo, Jiong Lou, Liping Qian, Tian Wang, Weijia Jia β€’ Published: 2025-08-05 β€’ Source: arXiv
The rise of LLMs such as ChatGPT and Claude fuels the need for AI agents capable of real-time task handling. However, migrating data-intensive, multi-modal edge workloads to cloud data centers, traditionally used for agent deployment, introduces significant latency. Deploying AI agents at the edge improves efficiency and reduces latency. However, edge environments present challenges due to limited and heterogeneous resources. Maintaining QoS for mobile users necessitates agent migration, which is complicated by the complexity of AI agents coordinating LLMs, task planning, memory, and external tools. This paper presents the first systematic deployment and management solution for LLM-based AI agents in dynamic edge environments. We propose a novel adaptive framework for AI agent placement and migration in edge intelligence systems. Our approach models resource constraints and latency/cost, leveraging ant colony algorithms and LLM-based optimization for efficient decision-making. It autonomously places agents to optimize resource utilization and QoS and enables lightweight agent migration by transferring only essential state. Implemented on a distributed system using AgentScope and validated across globally distributed edge servers, our solution significantly reduces deployment latency and migration costs.
33. From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format
Authors: Mehdi Akbari Gurabi, Lasse Nitz, Radu-Mihai Castravet, Roman Matzutt, Avikarsha Mandal, Stefan Decker β€’ Published: 2025-08-05 β€’ Source: arXiv
Existing cybersecurity playbooks are often written in heterogeneous, non-machine-readable formats, which limits their automation and interoperability across Security Orchestration, Automation, and Response platforms. This paper explores the suitability of Large Language Models, combined with Prompt Engineering, to automatically translate legacy incident response playbooks into the standardized, machine-readable CACAO format. We systematically examine various Prompt Engineering techniques and carefully design prompts aimed at maximizing syntactic accuracy and semantic fidelity for control flow preservation. Our modular transformation pipeline integrates a syntax checker to ensure syntactic correctness and features an iterative refinement mechanism that progressively reduces syntactic errors. We evaluate the proposed approach on a custom-generated dataset comprising diverse legacy playbooks paired with manually created CACAO references. The results demonstrate that our method significantly improves the accuracy of playbook transformation over baseline models, effectively captures complex workflow structures, and substantially reduces errors. It highlights the potential for practical deployment in automated cybersecurity playbook transformation tasks.
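The iterative refinement mechanism can be pictured as a generate-validate-reprompt loop. The sketch below stubs the LLM call and uses `json.loads` as a stand-in for a CACAO schema validator; both are placeholder assumptions, not the paper's pipeline:

```python
import json

def refine_playbook(generate, source_playbook, max_rounds=3):
    """Iteratively ask `generate(prompt)` for CACAO JSON, re-prompting with
    the syntax error until the output parses or the rounds are exhausted."""
    prompt = f"Translate this playbook to CACAO JSON:\n{source_playbook}"
    for _ in range(max_rounds):
        draft = generate(prompt)
        try:
            return json.loads(draft)   # syntax check passed
        except json.JSONDecodeError as err:
            prompt += f"\nPrevious output was invalid ({err}); please fix it."
    raise ValueError("could not produce syntactically valid CACAO output")

# Stub LLM: fails once, then returns valid JSON (illustrative only).
attempts = iter(['{"type": "playbook",', '{"type": "playbook", "name": "IR-1"}'])
playbook = refine_playbook(lambda p: next(attempts), "legacy IR steps...")
```

A real deployment would swap in a CACAO JSON Schema validator and an actual model call, but the progressive error-feedback structure is the same.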
34. Key-Augmented Neural Triggers for Knowledge Sharing
Authors: Alex Wolf, Marco Edoardo Palma, Pooja Rani, Harald C. Gall β€’ Published: 2025-08-05 β€’ Source: arXiv
Repository-level code comprehension and knowledge sharing remain core challenges in software engineering. Large language models (LLMs) have shown promise by generating explanations of program structure and logic. However, these approaches still face limitations: First, relevant knowledge is distributed across multiple files within a repository, i.e., semantic fragmentation. Second, retrieval inefficiency and attention saturation degrade performance in RAG pipelines, where long, unaligned contexts overwhelm attention. Third, repository-specific training data is scarce and often outdated. Finally, proprietary LLMs hinder industrial adoption due to privacy and deployment constraints. To address these issues, we propose Key-Augmented Neural Triggers (KANT), a novel approach that embeds knowledge anchors into both training and inference. Unlike prior methods, KANT enables internal access to repository-specific knowledge, reducing fragmentation and grounding inference in localized context. Moreover, we synthesize specialized data directly from code. At inference, knowledge anchors replace verbose context, reducing token overhead and latency while supporting efficient, on-premise deployment. We evaluate KANT via a qualitative human evaluation of the synthesized dataset's intent coverage and quality across five dimensions; a comparison against SOTA baselines on five qualitative dimensions and inference speed; and replication across different LLMs to assess generalizability. Results show that the synthetic training data aligned with information-seeking needs. KANT achieved over 60% preference from human annotators and a LocalStack expert (who preferred it in 79% of cases). Also, KANT reduced inference latency by up to 85% across all models. Overall, it is well-suited for scalable, low-latency, on-premise deployments, providing a strong foundation for code comprehension.
35. CTTS: Collective Test-Time Scaling
Authors: Zhende Song, Shengji Tang, Peng Ye, Jiayuan Fan, Tao Chen β€’ Published: 2025-08-05 β€’ Source: arXiv
Test-time scaling (TTS) has emerged as a promising research field for enhancing the effectiveness of large language models (LLMs) without extra training. However, most existing approaches, e.g., Best-of-N and Self-Consistency, rely on a single agent interacting with a reward model (SA-SR), constrained by the limited capabilities of a single test-time scaling (STTS) paradigm. On the other hand, recent works demonstrate that collective-agent methods can break through the upper bound of single-agent systems by orchestrating diverse models. Thus, in this paper, we take a first step towards exploring Collective Test-Time Scaling (CTTS). Considering the different interaction types of single and multiple models, we design three primary paradigms to investigate the optimal paradigm of CTTS: (1) single agent to multiple reward models (SA-MR); (2) multiple agents to single reward model (MA-SR); and (3) multiple agents to multiple reward models (MA-MR). Extensive experiments demonstrate that MA-MR consistently achieves the best performance. Based on this, we propose a novel framework named CTTS-MM that effectively leverages both multi-agent and multi-reward-model collaboration for enhanced inference. Specifically, for multi-agent collaboration, we propose an Agent Collaboration Search (ACS), which searches for the most effective combination of LLM agents from a large candidate pool; for multi-reward-model collaboration, we propose Mixture of Reward Models (MoR), which consists of a curated question pool and a Prior Reward model Ensemble Selection (PRES) to select the optimal combinations of reward models via a Pair-wise Reward Ranking (PRR) metric. Experiments across seven mainstream benchmarks demonstrate that the proposed CTTS-MM consistently obtains superior performance. Code will be released at https://github.com/magent4aci/CTTS-MM.
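Best-of-N, the single-agent/single-reward baseline the paper builds on, simply scores N sampled answers with a reward model and keeps the argmax. A minimal sketch with a stub reward function (ACS and MoR then replace the single agent and single scorer with searched ensembles):

```python
def best_of_n(candidates, reward_model):
    """Best-of-N test-time scaling: return the candidate with the highest reward."""
    return max(candidates, key=reward_model)

# Stub reward model: prefer longer, more explanatory answers (illustrative only;
# a real reward model would be a learned scorer, not string length).
answers = [
    "42",
    "The answer is 42.",
    "Because 6 * 7 = 42, the answer is 42.",
]
best = best_of_n(answers, reward_model=len)
```

MA-MR generalizes this by drawing candidates from multiple agents and aggregating scores from multiple reward models before selecting.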