🤖 AI Research Papers

October 05, 2025

🤖 AI-Generated Research Summary

Comprehensive Summary of Recent Research in AI, LLMs, Agents, and Workflows

This summary synthesizes key insights from a collection of 35 recent research papers spanning large language models (LLMs), agents, generative models, workflow automation, and their applications. The analysis is structured to highlight key research trends, breakthrough findings, methodological approaches, applications and use cases, and future directions.


1. Key Research Trends

a. Advances in Generative Models

b. LLMs and Reasoning

c. Agents and Workflow Automation

d. Multimodal and Cross-Modal Learning

e. Scalability and Efficiency


2. Breakthrough Findings


3. Methodological Approaches


4. Applications and Use Cases


5. Future Directions


Conclusion

This collection of papers reflects a vibrant and rapidly evolving AI landscape, with significant progress in generative modeling, agent-based automation, multimodal learning, and robust evaluation. The field is moving toward more controllable, interpretable, and generalizable AI systems that can operate reliably in complex, real-world settings. Researchers and practitioners are encouraged to build on these advances, with particular attention to scalability, safety, and human-AI collaboration as foundational pillars for the next generation of AI technologies.

📚 arXiv (35 papers)
1. Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity
Authors: Eric Tillmann Bill, Enis Simsar, Thomas Hofmann • Published: 2025-10-02 • Source: arXiv
Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
2. KaVa: Latent Reasoning via Compressed KV-Cache Distillation
Authors: Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi • Published: 2025-10-02 • Source: arXiv
Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work, we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.
3. Inferring Dynamic Physical Properties from Video Foundation Models
Authors: Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman • Published: 2025-10-02 • Source: arXiv
We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read out mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
4. NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation
Authors: Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez • Published: 2025-10-02 • Source: arXiv
Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.
5. Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
Authors: Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp, Yao Lu • Published: 2025-10-02 • Source: arXiv
In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more for queries rated as very easy and those as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend future rating systems to reconsider existing draw semantics and to account for query properties in rating updates.
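The draw-handling idea in this abstract can be illustrated with a minimal Elo sketch. This is not the authors' code: the function names, the K-factor of 32, the starting rating of 1000, and the `skip_draws` flag are all assumptions made for illustration. The only point it shows is that a draw can be excluded from the rating update instead of pulling the two ratings toward each other.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(
    ratings: Dict[str, float],
    model_a: str,
    model_b: str,
    outcome: float,
    k: float = 32.0,          # assumed K-factor, not taken from the paper
    skip_draws: bool = True,  # hypothetical flag: ignore draws entirely
) -> None:
    """Update ratings in place. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = draw."""
    # Register both models with an assumed starting rating of 1000.
    ra = ratings.setdefault(model_a, 1000.0)
    rb = ratings.setdefault(model_b, 1000.0)
    if skip_draws and outcome == 0.5:
        return  # a draw leaves both ratings untouched
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (outcome - ea)
    ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - ea))


if __name__ == "__main__":
    ratings: Dict[str, float] = {}
    update(ratings, "model_a", "model_b", 1.0)  # A wins
    update(ratings, "model_a", "model_b", 0.5)  # draw, skipped
    print(ratings)
```

With `skip_draws=False` the same draw would move the higher-rated model down and the lower-rated one up, which is the conventional behaviour the abstract questions.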
6. Diffusion Models and the Manifold Hypothesis: Log-Domain Smoothing is Geometry Adaptive
Authors: Tyler Farghly, Peter Potaptchik, Samuel Howard, George Deligiannidis, Jakiw Pidstrigach • Published: 2025-10-02 • Source: arXiv
Diffusion models have achieved state-of-the-art performance, demonstrating remarkable generalisation capabilities across diverse domains. However, the mechanisms underpinning these strong capabilities remain only partially understood. A leading conjecture, based on the manifold hypothesis, attributes this success to their ability to adapt to low-dimensional geometric structure within the data. This work provides evidence for this conjecture, focusing on how such phenomena could result from the formulation of the learning problem through score matching. We inspect the role of implicit regularisation by investigating the effect of smoothing minimisers of the empirical score matching objective. Our theoretical and empirical results confirm that smoothing the score function -- or equivalently, smoothing in the log-density domain -- produces smoothing tangential to the data manifold. In addition, we show that the manifold along which the diffusion model generalises can be controlled by choosing an appropriate smoothing.
7. ALMA Deep Field in SSA22: Reconstructed [CII] Luminosity Function at z = 6
Authors: Natsuki H. Hayatsu, Rob J. Ivison, Paola Andreani, Fabrizia Guglielmetti, Zhi-Yu Zhang, Andy Biggs, Hideki Umehata, Yuichi Matsuda, Naoki Yoshida, Mark A. Swinbank, Kotaro Kohno, Yoichi Tamura, Bunyo Hatsukade, Kouichiro Nakanishi, Yiping Ao, Tohru Nagao, Mariko Kubo, Tsutomu T. Takeuchi, Minju Lee, Takuma Izumi, Soh Ikarashi, Tohru Yamada • Published: 2025-10-02 • Source: arXiv
The ADF22 line survey reported detections of two high-$z$ line-emitting source candidates above 6-$\sigma$, both of which were shown to be spurious after follow-up observations. We investigate the detectability of far-infrared emitters in ALMA deep fields using mock observations by injecting artificial line-emitting sources into the visibility planes. We also discuss our investigation, conducted together with the ALMA operations team, of a possible technical problem in the original observations. Finally, we devise a method to estimate the [CII] luminosity function (LF) at $z \sim 6$, including a full analysis of signal contamination and sample completeness. The comparison of pixel distributions between the real and mock datacubes does not show significant differences, confirming that the effect of non-Gaussian noise is negligible for the ADF22 datacube. Using 100 blank mock-mosaic datasets, we show 0.43 $\pm$ 0.67 false detections per datacube with the previous source-finding method. We argue that the underestimation of the contamination rate in the previous work is caused by the smaller number of datacubes, using only 4 real ADF22 datacubes. We compare the results of clump-finding between the time division mode and frequency division mode correlator datacubes and confirm that the velocity widths of the clumps in the TDM case are up to 3 times wider than in the FDM case. The LF estimation using our model shows that a correction for the number count is required, up to one order of magnitude, in the luminosity range of $\geq 5 \times 10^8 L_\odot$. Our reconstruction method for the line LF can be applied to future blind line surveys.
8. Relativistic Jets and Winds in Radio-Identified Supermassive Black Hole Binary Candidates
Authors: Andrew G. Sullivan, Roger D. Blandford, Anna Synani, Philipe V. de la Parra, Noémie Globus, Mitchell C. Begelman, Anthony C. S. Readhead • Published: 2025-10-02 • Source: arXiv
Supermassive black hole binary systems (SMBHBs) are thought to emit the recently discovered nHz gravitational wave background; however, not a single individual nHz source has been confirmed to date. Long-term radio-monitoring at the Owens Valley Radio Observatory has revealed two potential SMBHB candidates: blazars PKS 2131-021 and PKS J0805-0111. These sources show periodic flux density variations across the electromagnetic spectrum, signaling the presence of a good clock. To explain the emission, we propose a generalizable jet model, where a mildly relativistic wind creates an outward-moving helical channel, along which the ultra-relativistic jet propagates. The observed flux variation from the jet is mostly due to aberration. The emission at lower frequency arises at larger radius and its variation is consequently delayed, as observed. Our model reproduces the main observable features of both sources and can be applied to other sources as they are discovered. We make predictions for radio polarization, direct imaging, and emission line variation, which can be tested with forthcoming observations. Our results motivate future numerical simulations of jetted SMBHB systems and have implications for the fueling, structure, and evolution of blazar jets.
9. Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
Authors: Runqian Wang, Yilun Du • Published: 2025-10-02 • Source: arXiv
We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.
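The optimization-based sampling described in this abstract can be sketched on a toy problem. Everything here is an assumption made for illustration: `toy_gradient` is a hypothetical stand-in for a learned gradient field (the gradient of a simple quadratic energy), and the step size, tolerance, and stopping rule are arbitrary choices, not the paper's settings. The sketch only shows the general pattern of plain gradient descent with an adjustable step size and an early exit once the gradient is small, which is one simple form of adaptive compute.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def toy_gradient(x: Vector) -> Vector:
    """Hypothetical stand-in for a learned gradient field.

    Gradient of the energy 0.5 * ||x - target||^2, whose minimum is at target.
    """
    target = [1.0, -2.0]
    return [xi - ti for xi, ti in zip(x, target)]


def sample_by_descent(
    grad_fn: Callable[[Vector], Vector],
    x0: Vector,
    step_size: float = 0.1,  # adjustable step size (assumed value)
    max_steps: int = 1000,
    tol: float = 1e-6,       # stop once the gradient norm falls below this
) -> Tuple[Vector, int]:
    """Run gradient descent from x0; return the final point and steps used."""
    x = list(x0)
    for step in range(max_steps):
        g = grad_fn(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return x, step  # converged early, so less compute was spent
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x, max_steps


if __name__ == "__main__":
    sample, steps = sample_by_descent(toy_gradient, [0.0, 0.0])
    print(sample, steps)
```

Because the update rule is ordinary descent, the fixed step could be swapped for an adaptive optimizer without changing the interface of `sample_by_descent`.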
10. ARMADA: Autonomous Online Failure Detection and Human Shared Control Empower Scalable Real-world Deployment and Adaptation
Authors: Wenye Yu, Jun Lv, Zixi Ying, Yang Jin, Chuan Wen, Cewu Lu • Published: 2025-10-02 • Source: arXiv
Imitation learning has shown promise in learning from large-scale real-world datasets. However, pretrained policies usually perform poorly without sufficient in-domain data. Besides, human-collected demonstrations entail substantial labour and tend to encompass mixed-quality data and redundant information. As a workaround, human-in-the-loop systems gather domain-specific data for policy post-training, and exploit closed-loop policy feedback to offer informative guidance, but usually require full-time human surveillance during policy rollout. In this work, we devise ARMADA, a multi-robot deployment and adaptation system with human-in-the-loop shared control, featuring an autonomous online failure detection method named FLOAT. Thanks to FLOAT, ARMADA enables paralleled policy rollout and requests human intervention only when necessary, significantly reducing reliance on human supervision. Hence, ARMADA enables efficient acquisition of in-domain data, and leads to more scalable deployment and faster adaptation to new scenarios. We evaluate the performance of ARMADA on four real-world tasks. FLOAT achieves nearly 95% accuracy on average, surpassing prior state-of-the-art failure detection approaches by over 20%. Besides, ARMADA manifests more than 4$\times$ increase in success rate and greater than 2$\times$ reduction in human intervention rate over multiple rounds of policy rollout and post-training, compared to previous human-in-the-loop learning methods.
11. Interactive Training: Feedback-Driven Neural Network Optimization
Authors: Wentao Zhang, Yang Young Lu, Yuntian Deng • Published: 2025-10-02 • Source: arXiv
Traditional neural network training typically follows fixed, predefined optimization recipes, lacking the flexibility to dynamically respond to instabilities or emerging training issues. In this paper, we introduce Interactive Training, an open-source framework that enables real-time, feedback-driven intervention during neural network training by human experts or automated AI agents. At its core, Interactive Training uses a control server to mediate communication between users or agents and the ongoing training process, allowing users to dynamically adjust optimizer hyperparameters, training data, and model checkpoints. Through three case studies, we demonstrate that Interactive Training achieves superior training stability, reduced sensitivity to initial hyperparameters, and improved adaptability to evolving user needs, paving the way toward a future training paradigm where AI agents autonomously monitor training logs, proactively resolve instabilities, and optimize training dynamics.
12. Test-Time Anchoring for Discrete Diffusion Posterior Sampling
Authors: Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman • Published: 2025-10-02 • Source: arXiv
We study the problem of posterior sampling using pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. While diffusion models have achieved remarkable success in generative modeling, most advances rely on continuous Gaussian diffusion. In contrast, discrete diffusion offers a unified framework for jointly modeling categorical data such as text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free Bayesian inference, making it particularly well-suited for posterior sampling. However, existing approaches to discrete diffusion posterior sampling face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS) for masked diffusion foundation models, built on two key innovations -- quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. Our approach achieves state-of-the-art performance among discrete diffusion samplers across linear and nonlinear inverse problems on the standard benchmarks. We further demonstrate the benefits of our approach in training-free stylization and text-guided editing.
13. Beyond Belief Propagation: Cluster-Corrected Tensor Network Contraction with Exponential Convergence
Authors: Siddhant Midha, Yifan F. Zhang • Published: 2025-10-02 • Source: arXiv
Tensor network contraction on arbitrary graphs is a fundamental computational challenge with applications ranging from quantum simulation to error correction. While belief propagation (BP) provides a powerful approximation algorithm for this task, its accuracy limitations are poorly understood and systematic improvements remain elusive. Here, we develop a rigorous theoretical framework for BP in tensor networks, leveraging insights from statistical mechanics to devise a \emph{cluster expansion} that systematically improves the BP approximation. We prove that the cluster expansion converges exponentially fast if an object called the \emph{loop contribution} decays sufficiently fast with the loop size, giving a rigorous error bound on BP. We also provide a simple and efficient algorithm to compute the cluster expansion to arbitrary order. We demonstrate the efficacy of our method on the two-dimensional Ising model, where we find that our method significantly improves upon BP and existing corrective algorithms such as loop series expansion. Our work opens the door to a systematic theory of BP for tensor networks and its applications in decoding classical and quantum error-correcting codes and simulating quantum systems.
14. Charge order through crystallization of Frenkel excitons: realization in kagome metals
Authors: Ruoshi Jiang, Bartomeu Monserrat, Wei Ku • Published: 2025-10-02 • Source: arXiv
Charge order is a widely observed and representative example of spontaneous broken symmetries in quantum states of matter. Owing to the large intra-atomic Coulomb energy, the charge redistribution in such an order typically implies significant alteration of the electronic and lattice properties of materials. While the standard description of charge order, namely a "charge density wave" instability of the Fermi surface, has been broadly and successfully applied to good metals, its applicability to correlated ionic materials has been rather limited. Here, we propose an alternative general scenario of charge order - crystallization of long-lived Frenkel excitons - suitable for these ionic materials. We demonstrate this scenario on the recently discovered kagome superconductors and successfully reproduce all the characteristics of experimental observations on both local charge correlations and long-range ordering. The proposed generic scenario offers a long-sought understanding of charge order applicable to modern correlated functional materials.
15. MultiModal Action Conditioned Video Generation
Authors: Yichen Li, Antonio Torralba • Published: 2025-10-02 • Source: arXiv
Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
16. Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Authors: Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, Dan Roth • Published: 2025-10-02 • Source: arXiv
Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
17. Learning to Generate Object Interactions with Physics-Guided Video Diffusion
Authors: David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, Ivan Laptev • Published: 2025-10-02 • Source: arXiv
Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.
18. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh • Published: 2025-10-02 • Source: arXiv
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/
19. VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL
Authors: Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, Lili Qiu • Published: 2025-10-02 • Source: arXiv
With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at https://VidGuard-R1.github.io.
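The abstract names group relative policy optimization (GRPO) without spelling out its core step. As a rough sketch of the commonly described formulation, each sampled response to a prompt receives an advantage equal to its reward minus the group mean, divided by the group standard deviation. The function name, the `eps` guard, and the example reward values below are assumptions for illustration. How VidGuard-R1 combines its two reward models into one scalar is not stated in the abstract and is not modelled here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    is an assumed guard against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up scalar rewards for four sampled responses to one video prompt.
    rewards = [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(rewards))
```

Responses that score above the group average get a positive advantage and those below get a negative one, so no separately learned value function is needed to form the baseline.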
20. Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation
Authors: Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter • Published: 2025-10-02 • Source: arXiv
Hallucinations are a common issue that undermines the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise due to predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, commonly used approximate correctness functions have substantial disagreement between each other and, consequently, in the ranking of the uncertainty estimation methods. This allows one to inflate the apparent performance of uncertainty estimation methods. We propose using several alternative risk indicators for risk correlation experiments that improve robustness of empirical assessment of UE algorithms for NLG. For QA tasks, we show that marginalizing over multiple LLM-as-a-judge variants leads to reducing the evaluation biases. Furthermore, we explore structured tasks as well as out of distribution and perturbation detection tasks which provide robust and controllable risk indicators. Finally, we propose to use an Elo rating of uncertainty estimation methods to give an objective summarization over extensive evaluation settings.
21. Fine-Grained Urban Traffic Forecasting on Metropolis-Scale Road Networks
Authors: Fedor Velikonivtsev, Oleg Platonov, Gleb Bazhenov, Liudmila Prokhorenkova • Published: 2025-10-02 • Source: arXiv
Traffic forecasting on road networks is a complex task of significant practical importance that has recently attracted considerable attention from the machine learning community, with spatiotemporal graph neural networks (GNNs) becoming the most popular approach. The proper evaluation of traffic forecasting methods requires realistic datasets, but current publicly available benchmarks have significant drawbacks, including the absence of information about road connectivity for road graph construction, limited information about road properties, and a relatively small number of road segments that falls short of real-world applications. Further, current datasets mostly contain information about intercity highways with sparsely located sensors, while city road networks arguably present a more challenging forecasting task due to much denser roads and more complex urban traffic patterns. In this work, we provide a more complete, realistic, and challenging benchmark for traffic forecasting by releasing datasets representing the road networks of two major cities, with the largest containing almost 100,000 road segments (more than a 10-fold increase relative to existing datasets). Our datasets contain rich road features and provide fine-grained data about both traffic volume and traffic speed, allowing for building more holistic traffic forecasting systems. We show that most current implementations of neural spatiotemporal models for traffic forecasting have problems scaling to datasets of our size. To overcome this issue, we propose an alternative approach to neural traffic forecasting that uses a GNN without a dedicated module for temporal sequence processing, thus achieving much better scalability, while also demonstrating stronger forecasting performance. We hope our datasets and modeling insights will serve as a valuable resource for research in traffic forecasting.
22. BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
Authors: Chenqi Li, Yu Liu, Timothy Denison, Tingting Zhu • Published: 2025-10-02 • Source: arXiv
Biosignals offer valuable insights into the physiological states of the human body. Although biosignal modalities differ in functionality, signal fidelity, sensor comfort, and cost, they are often intercorrelated, reflecting the holistic and interconnected nature of human physiology. This opens up the possibility of performing the same tasks using alternative biosignal modalities, thereby improving the accessibility, usability, and adaptability of health monitoring systems. However, the limited availability of large labeled datasets presents challenges for training models tailored to specific tasks and modalities of interest. Unsupervised cross-modal knowledge transfer offers a promising solution by leveraging knowledge from an existing modality to support model training for a new modality. Existing methods are typically based on knowledge distillation, which requires running a teacher model alongside student model training, resulting in high computational and memory overhead. This challenge is further exacerbated by the recent development of foundation models that demonstrate superior performance and generalization across tasks at the cost of large model sizes. To this end, we explore a new framework for unsupervised cross-modal knowledge transfer of biosignals by training a lightweight bridge network to align the intermediate representations and enable information flow between foundation models and across modalities. Specifically, we introduce an efficient strategy for selecting alignment positions where the bridge should be constructed, along with a flexible prototype network as the bridge architecture. Extensive experiments across multiple biosignal modalities, tasks, and datasets show that BioX-Bridge reduces the number of trainable parameters by 88-99% while maintaining or even improving transfer performance compared to state-of-the-art methods.
23. Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps
Authors: Kyoungjun Park, Yifan Yang, Changhan Ge, Lili Qiu, Shiqi Jiang • Published: 2025-10-02 • Source: arXiv
Modeling radio frequency (RF) signal propagation is essential for understanding the environment, as RF signals offer valuable insights beyond the capabilities of RGB cameras, which are limited by the visible-light spectrum, lens coverage, and occlusions. It is also useful for supporting wireless diagnosis, deployment, and optimization. However, accurately predicting RF signals in complex environments remains a challenge due to interactions with obstacles, such as absorption and reflection. We introduce Diffusion^2, a diffusion-based approach that uses 3D point clouds to model the propagation of RF signals across a wide range of frequencies, from Wi-Fi to millimeter waves. To effectively capture RF-related features from 3D data, we present the RF-3D Encoder, which encapsulates the complexities of 3D geometry along with signal-specific details. These features undergo multi-scale embedding to simulate the actual RF signal dissemination process. Our evaluation, based on synthetic and real-world measurements, demonstrates that Diffusion^2 accurately estimates the behavior of RF signals in various frequency bands and environmental conditions, with an error margin of just 1.9 dB, while running 27x faster than existing methods. Refer to https://rfvision-project.github.io/ for more information.
24. Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective
Authors: Wen Yang, Junhong Wu, Chong Li, Chengqing Zong, Jiajun Zhang • Published: 2025-10-02 • Source: arXiv
Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: Does the reasoning capability achieved from English RPT effectively transfer to other languages? We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: First-Parallel Leap, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable Parallel Scaling Law, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages. Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as Monolingual Generalization Gap, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
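As a rough illustration of what fitting such a scaling law involves, the sketch below fits a power law of the assumed form f(n) = a * n^b by ordinary least squares in log-log space. The functional form, the function name, and the data points are assumptions for illustration only; the abstract does not give the paper's exact parameterization or any measured values.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by linear regression on (log x, log y).

    Requires all xs and ys to be strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # exponent = slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # prefactor = exp(intercept)
    return a, b


# Synthetic data: number of parallel languages vs. a made-up transfer score.
num_languages = [1, 2, 4, 8]
transfer = [2.0 * n ** 0.5 for n in num_languages]
a, b = fit_power_law(num_languages, transfer)
print(f"a={a:.3f}, b={b:.3f}")
```

Under this reading, a gap of the kind the abstract calls the Monolingual Generalization Gap would correspond to the difference between an observed monolingual score and the value the fitted curve predicts at that point.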
25. InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
Authors: Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang, Gongyi Zou, Xianghe Pang, Wenhao Wang, Menglan Chen, Shuo Tang, Zhiyu Li, Siheng Chen • Published: 2025-10-02 • Source: arXiv
Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools -- and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.
26. microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
Authors: Sathira Silva, Eman Ali, Chetan Arora, Muhammad Haris Khan • Published: 2025-10-02 • Source: arXiv
Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP [CLS] token; however, this approach overlooks spatial precision. We propose microCLIP, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided [FG] token from patch embeddings and fuses it with the global [CLS] token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent 2.90% average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
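The convex combination at the heart of Dynamic Knowledge Aggregation can be illustrated in a few lines. This is a toy sketch under stated assumptions: a single scalar mixing weight `alpha`, plain Python lists for logits, and argmax pseudo-labeling. The function names, the weight, and the numbers are hypothetical and are not taken from the microCLIP code base.

```python
def aggregate_logits(prior_logits, evolving_logits, alpha):
    """Convex combination: alpha * prior + (1 - alpha) * evolving."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    return [alpha * p + (1.0 - alpha) * e
            for p, e in zip(prior_logits, evolving_logits)]


def pseudo_label(logits):
    """Index of the highest logit (first index on ties)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Made-up logits over three classes for one image.
prior = [2.0, 0.0, 1.0]     # fixed prior (e.g. from a frozen classifier)
evolving = [0.0, 3.0, 1.0]  # logits from the head being adapted
for alpha in (1.0, 0.5, 0.0):
    mixed = aggregate_logits(prior, evolving, alpha)
    print(alpha, mixed, pseudo_label(mixed))
```

Keeping the weights convex means the mixed logits always stay between the two sources, so the fixed prior can anchor pseudo-labels early on while the adapted head gains influence as `alpha` decreases.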
27. Game-theoretic Social Distancing in Competitive Bi-Virus SIS Epidemics
Authors: Benjamin Catalano, Keith Paarporn, Sebin Gracy • Published: 2025-10-02 • Source: arXiv
Numerous elements drive the spread of infectious diseases in complex real-world networks. Of particular interest are social behaviors that evolve in tandem with the spread of disease. Moreover, recent studies highlight the importance of understanding how multiple strains spread simultaneously through a population (e.g., Delta and Omicron variants of SARS-CoV-2). In this paper, we propose a bi-virus SIS epidemic model coupled with a game-theoretic social distancing behavior model. The behaviors are governed by replicator equations from evolutionary game theory. The prevalence of each strain impacts the choice of an individual to social distance, and, in turn, their behavior affects the spread of each virus in the SIS model. Our analysis identifies equilibria of the system and their local stability properties, which reveal several isolated fixed points with varying levels of social distancing. We find that endemic co-existence is possible only when the reproduction numbers of both strains are equal. Assuming the reproduction number for each virus is the same, we identify suitable parameter regimes that give rise to lines of coexistence equilibria. Moreover, we also identify conditions for local exponential stability of said lines of equilibria. We illustrate our findings with several numerical simulations.
28. NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes
Authors: Shiyi Zhang, Dong Liang, Yihang Zhou • Published: 2025-10-02 • Source: arXiv
Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090s), and it outperforms existing methods.
29. Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities
Authors: Mario Medrano-Paredes, Carmen Fernández-González, Francisco-Javier Díaz-Pernas, Hichem Saoudi, Javier González-Alonso, Mario Martínez-Zarzuela • Published: 2025-10-02 • Source: arXiv
Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE (9.27° ± 4.80°) and MAE (7.86° ± 4.18°), as well as the highest Pearson correlation (0.86 ± 0.15) and the highest coefficient of determination R² (0.67 ± 0.28). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
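For readers unfamiliar with the four agreement metrics quoted above, the sketch below computes them for one joint-angle trace using their textbook definitions, treating the IMU-derived angles as the reference. The function names and sample angles are hypothetical, and the abstract does not describe how the authors aggregate values across joints, activities, or subjects.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between predicted and reference angles."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def mae(pred, ref):
    """Mean absolute error between predicted and reference angles."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def pearson_r(pred, ref):
    """Pearson correlation coefficient (undefined for constant traces)."""
    n = len(ref)
    mp, mr = sum(pred) / n, sum(ref) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_r = sum((r - mr) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


def r_squared(pred, ref):
    """Coefficient of determination, 1 - SS_res / SS_tot, w.r.t. ref."""
    mr = sum(ref) / len(ref)
    ss_res = sum((r - p) ** 2 for p, r in zip(pred, ref))
    ss_tot = sum((r - mr) ** 2 for r in ref)
    return 1.0 - ss_res / ss_tot


# Made-up knee-flexion angles in degrees: IMU reference vs. video estimate.
imu = [0.0, 10.0, 20.0, 30.0, 40.0]
video = [2.0, 11.0, 23.0, 29.0, 43.0]
print(rmse(video, imu), mae(video, imu),
      pearson_r(video, imu), r_squared(video, imu))
```

Note that a constant offset between the two traces leaves the Pearson correlation at 1 while lowering R² and raising RMSE and MAE, which is why the abstract's metrics are best read together rather than individually.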
30. RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
Authors: Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov, Chelsea Finn, Aviral Kumar • Published: 2025-10-02 • Source: arXiv
Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement "algorithmic procedures" that can be used to deduce answers to hard problems. Doing so requires realizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To enable more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to be capable of proposing multiple abstractions given a problem, followed by RL that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that allocating more test-time compute to generating abstractions is more beneficial for performance than generating more solutions at large test budgets, illustrating the role of abstractions in guiding meaningful exploration.
31. From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding
Authors: Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler • Published: 2025-10-02 • Source: arXiv
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on the Video-MME, LongVideoBench, and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
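The trade-off between clip length and spatial resolution under a fixed token budget comes down to simple arithmetic, sketched below. The sketch assumes a ViT-style encoder in which a square frame of side `side` pixels yields `(side // patch) ** 2` tokens; the patch size, budget, and function names are illustrative assumptions, and the actual F2C allocation rule may differ.

```python
import math


def tokens_per_frame(side, patch=14):
    """Visual tokens for a square frame, assuming non-overlapping patches."""
    return (side // patch) ** 2


def max_side_for_budget(token_budget, num_clips, clip_len, patch=14):
    """Largest square frame side (a multiple of patch) that keeps
    num_clips * clip_len frames within token_budget tokens."""
    frames = num_clips * clip_len
    per_frame = token_budget // frames
    if per_frame < 1:
        raise ValueError("token budget too small for this many frames")
    grid = math.isqrt(per_frame)  # patches per side
    return grid * patch


# Hypothetical budget of 8192 visual tokens spread over 8 selections.
budget, clips = 8192, 8
for clip_len in (1, 4):
    side = max_side_for_budget(budget, clips, clip_len)
    total = clips * clip_len * tokens_per_frame(side)
    print(clip_len, side, total)
```

Under these assumptions, quadrupling the clip length halves the frame side, so temporal context is bought with spatial detail while the total token count stays within the same budget.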
32. Transformers Discover Molecular Structure Without Graph Priors
Authors: Tobias Kreiman, Yutong Bai, Fadi Atieh, Elizabeth Weaver, Eric Qu, Aditi S. Krishnapriyan • Published: 2025-10-02 • Source: arXiv
Graph Neural Networks (GNNs) are the dominant architecture for molecular machine learning, particularly for molecular property prediction and machine learning interatomic potentials (MLIPs). GNNs perform message passing on predefined graphs often induced by a fixed radius cutoff or k-nearest neighbor scheme. While this design aligns with the locality present in many molecular tasks, a hard-coded graph can limit expressivity due to the fixed receptive field and slows down inference with sparse graph operations. In this work, we investigate whether pure, unmodified Transformers trained directly on Cartesian coordinates (without predefined graphs or physical priors) can approximate molecular energies and forces. As a starting point for our analysis, we demonstrate how to train a Transformer to competitive energy and force mean absolute errors under a matched training compute budget, relative to a state-of-the-art equivariant GNN on the OMol25 dataset. We discover that the Transformer learns physically consistent patterns, such as attention weights that decay inversely with interatomic distance, and flexibly adapts them across different molecular environments due to the absence of hard-coded biases. The use of a standard Transformer also unlocks predictable improvements with respect to scaling training resources, consistent with empirical scaling laws observed in other domains. Our results demonstrate that many favorable properties of GNNs can emerge adaptively in Transformers, challenging the necessity of hard-coded graph inductive biases and pointing toward standardized, scalable architectures for molecular modeling.
33. Rotation Periods for Stars in Open Cluster NGC 6819 From Kepler IRIS Light Curves
Authors: Sabina Sagynbayeva, Isabel L. Colman, Will M. Farr • Published: 2025-10-02 • Source: arXiv
We present an updated catalog of stellar rotation periods for the 2.5 Gyr open cluster NGC 6819 using the Kepler IRIS light curves from superstamp data. Our analysis uses Gaussian Process modeling to extract robust rotation signals from image subtraction light curves, allowing us unprecedented data access and measurement precision in the crowded cluster field. After applying stringent quality and contamination cuts, we identify 271 reliable rotation periods, representing by far the largest sample of rotators measured in a single intermediate-age cluster. Compared to previous work, which relied on only ~30 stars, our catalog extends the gyrochronological sequence of NGC 6819 with an order of magnitude more measurements and improved precision. The expanded dataset reveals both the expected temperature-dependent spin-down trend and substantial scatter at fixed effective temperature, including a bimodal distribution of fast and slow rotators. We also identify a distinct "pile-up" sequence consistent with predictions of weakened magnetic braking at critical Rossby numbers. These results strengthen this cluster's role as a benchmark for stellar spin evolution, while also highlighting the limitations of traditional gyrochronology at older ages. The final catalog and the model implementations are all available on Zenodo.
34. DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
Authors: Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong • Published: 2025-10-02 • Source: arXiv
Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, such as Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
35. The Unreasonable Effectiveness of Scaling Agents for Computer Use
Authors: Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang • Published: 2025-10-02 • Source: arXiv
Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents' rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
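The control flow described above (generate several rollouts, summarize each as a behavior narrative, then pick one) can be sketched generically. This is a schematic under stated assumptions: `run_rollout`, `narrate`, and `select` are hypothetical callables standing in for the agent, the narrative generator, and the judge, and nothing here reflects the authors' actual prompts, models, or selection criteria.

```python
from typing import Callable, List, Sequence, TypeVar

Rollout = TypeVar("Rollout")


def behavior_best_of_n(
    run_rollout: Callable[[int], Rollout],
    narrate: Callable[[Rollout], str],
    select: Callable[[Sequence[str]], int],
    n: int,
) -> Rollout:
    """Run n independent rollouts, describe each one in text, and return
    the rollout whose narrative the selector prefers."""
    rollouts: List[Rollout] = [run_rollout(seed) for seed in range(n)]
    narratives = [narrate(r) for r in rollouts]
    best_index = select(narratives)  # judge sees narratives, not raw traces
    return rollouts[best_index]


# Toy stand-ins: rollouts are lists of action strings, and the "judge"
# simply prefers the narrative that ends with a completed task.
toy_rollouts = {0: ["open app"], 1: ["open app", "save file", "done"], 2: []}
chosen = behavior_best_of_n(
    run_rollout=lambda seed: toy_rollouts[seed],
    narrate=lambda actions: " -> ".join(actions),
    select=lambda texts: next(
        (i for i, t in enumerate(texts) if t.endswith("done")), 0
    ),
    n=3,
)
print(chosen)
```

Separating narration from selection means the selector compares compact descriptions of what each agent did rather than raw screenshots or action logs, which is the structural idea the abstract attributes to bBoN.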