🤖 AI Research Papers

August 18, 2025

🤖 AI-Generated Research Summary

Comprehensive Summary of 35 Research Papers on AI, LLMs, Agents, and Workflows

This summary synthesizes key insights from a diverse set of recent research papers, with a focus on AI, large language models (LLMs), agents, and workflow automation. The analysis is structured to highlight major research trends, breakthrough findings, methodological approaches, practical applications, and future research directions.


1. Key Research Trends

a. Advancements in Large Language Models (LLMs) and Multimodal AI

b. AI for Scientific and Technical Domains

c. AI for Security, Privacy, and Safety

d. Synthetic Data and Dataset Creation

e. AI in Robotics and Perception


2. Breakthrough Findings


3. Methodological Approaches


4. Applications and Use Cases


5. Future Directions


Conclusion

This collection of papers reflects a vibrant and rapidly evolving landscape in AI research, with significant progress in LLMs, multimodal models, workflow automation, and domain-specific applications. Key trends include a focus on efficiency, adaptability, security, and the creation of high-quality synthetic data. Methodologically, there is a shift toward modular, train-free, and reward-guided approaches, as well as privacy-preserving computation. The practical impact spans healthcare, security, robotics, and scientific discovery, with future research poised to further bridge the gap between advanced AI capabilities and real-world needs.

📚 arXiv (35 papers)
1. Thyme: Think Beyond Images
Authors: Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou • Published: 2025-08-15 • Source: arXiv
Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
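The idea of sampling text and code at distinct temperatures can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the temperature values, the choice of a lower temperature for code, and the `in_code_block` flag are hypothetical, since the abstract only states that the two temperatures differ.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, in_code_block, rng, text_temp=1.0, code_temp=0.1):
    """Sample a token index, using a separate temperature inside code spans.

    text_temp and code_temp are illustrative values, not taken from the paper.
    """
    temperature = code_temp if in_code_block else text_temp
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.5, 0.5]
text_probs = softmax_with_temperature(logits, 1.0)  # exploratory
code_probs = softmax_with_temperature(logits, 0.1)  # near-greedy
token = sample_token(logits, in_code_block=True, rng=random.Random(0))
```

With these toy logits the low-temperature distribution puts almost all of its mass on the top token, which is the intuition behind keeping generated code precise while leaving the textual reasoning free to explore.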
2. Is ChatGPT-5 Ready for Mammogram VQA?
Authors: Qiang Li, Shansong Wang, Mingzhe Hu, Mojtaba Safari, Zachary Eidex, Xiaofeng Yang • Published: 2025-08-15 • Source: arXiv
Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has potential to support breast cancer screening. We systematically evaluated the GPT-5 family and the GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimations, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the tremendous improvements in performance from GPT-4o to GPT-5 show a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks.
3. A string based model with Hagedorn temperature of $T_H\sim 300~$MeV describes the spectrum of mesons and glueballs
Authors: Michał Marczenko, Győző Kovács, Larry McLerran, Krzysztof Redlich • Published: 2025-08-15 • Source: arXiv
We consider the thermodynamics of a color-confined phase of quantum chromodynamics (QCD) and pure gauge theory within a string-inspired model, corresponding to a physical spatial dimension, d = 3. We show that the physical mass spectrum of massive mesons--in both the strange and non-strange sectors separately--is reasonably well described and extended by the exponential mass spectrum of open strings, $\rho(m)$, characterized by a unique Hagedorn temperature, $T_H = \sqrt{3\sigma/2\pi}$, expressed by the string tension, $\sigma$. This $T_H$ is the value appropriate for d = 3 spatial dimensions, and is of order $T_H \sim 300~\rm MeV$ for typical values of the string tension. It is much larger than the values of $T_H$, which have been phenomenologically extracted so far to describe the meson spectrum. Glueball states in pure gauge theory, modeled by closed strings, exhibit a similarly large Hagedorn temperature, highlighting a universal feature of the exponential spectrum. We further analyze the thermodynamic properties of the equation of state at finite temperature and demonstrate that, in the confined phase, the string models agree with lattice QCD results. This lends further support to the recent interpretation of the QCD phase diagram that incorporates strings as relevant degrees of freedom.
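As a quick arithmetic check of the quoted scale, the formula $T_H = \sqrt{3\sigma/2\pi}$ can be evaluated directly. The string-tension inputs below ($\sqrt{\sigma}$ of 420 and 440 MeV) are commonly quoted ballpark values assumed here for illustration; they are not taken from the abstract.

```python
import math

def hagedorn_temperature_mev(sqrt_sigma_mev):
    """Evaluate T_H = sqrt(3 * sigma / (2 * pi)) with sigma = sqrt_sigma_mev**2 (MeV^2)."""
    sigma = sqrt_sigma_mev ** 2
    return math.sqrt(3.0 * sigma / (2.0 * math.pi))

# Assumed ballpark values of sqrt(sigma) in MeV (illustrative, not from the paper).
t_low = hagedorn_temperature_mev(420.0)
t_high = hagedorn_temperature_mev(440.0)
print(round(t_low), round(t_high))
```

Both inputs give a temperature of roughly 290 to 305 MeV, consistent with the "of order 300 MeV" statement in the abstract.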
4. LoRAtorio: An intrinsic approach to LoRA Skill Composition
Authors: Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki • Published: 2025-08-15 • Source: arXiv
Low-Rank Adaptation (LoRA) has become a widely adopted technique in text-to-image diffusion models, enabling the personalisation of visual concepts such as characters, styles, and objects. However, existing approaches struggle to effectively compose multiple LoRA adapters, particularly in open-ended settings where the number and nature of required skills are not known in advance. In this work, we present LoRAtorio, a novel train-free framework for multi-LoRA composition that leverages intrinsic model behaviour. Our method is motivated by two key observations: (1) LoRA adapters trained on narrow domains produce denoised outputs that diverge from the base model, and (2) when operating out-of-distribution, LoRA outputs show behaviour closer to the base model than when conditioned in distribution. The balance between these two observations allows for exceptional performance in the single LoRA scenario, which nevertheless deteriorates when multiple LoRAs are loaded. Our method operates in the latent space by dividing it into spatial patches and computing cosine similarity between each patch's predicted noise and that of the base model. These similarities are used to construct a spatially-aware weight matrix, which guides a weighted aggregation of LoRA outputs. To address domain drift, we further propose a modification to classifier-free guidance that incorporates the base model's unconditional score into the composition. We extend this formulation to a dynamic module selection setting, enabling inference-time selection of relevant LoRA adapters from a large pool. LoRAtorio achieves state-of-the-art performance, showing up to a 1.3% improvement in ClipScore and a 72.43% win rate in GPT-4V pairwise evaluations, and generalises effectively to multiple latent diffusion models.
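The patch-wise weighting described above can be sketched on toy data. This is only an illustration under stated assumptions: patches are short lists of floats, and the weight for each LoRA is a softmax over the negative cosine similarity to the base model, so that LoRAs diverging more from the base in a patch get more weight there. That weighting direction, the temperature `tau`, and all function names are hypothetical; the abstract says only that the similarities are used to build a spatially-aware weight matrix.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def patch_weights(base_patches, lora_patches_list, tau=1.0):
    """Per-patch weights over LoRAs: softmax of -similarity / tau (ASSUMED direction)."""
    weights = []
    for p, base_patch in enumerate(base_patches):
        sims = [cosine_similarity(lora[p], base_patch) for lora in lora_patches_list]
        scores = [-s / tau for s in sims]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def aggregate(lora_patches_list, weights):
    """Weighted sum of the LoRA noise predictions, patch by patch."""
    combined = []
    for p, w in enumerate(weights):
        dim = len(lora_patches_list[0][p])
        combined.append([
            sum(w[i] * lora_patches_list[i][p][d] for i in range(len(w)))
            for d in range(dim)
        ])
    return combined

# Two patches, two LoRAs: each LoRA departs from the base in a different patch.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 0.0], [1.0, 0.0]]  # matches base in patch 0, diverges in patch 1
lora_b = [[0.0, 1.0], [0.0, 1.0]]  # diverges in patch 0, matches base in patch 1
weights = patch_weights(base, [lora_a, lora_b])
combined = aggregate([lora_a, lora_b], weights)
```

In this toy setup `lora_b` dominates patch 0 and `lora_a` dominates patch 1, which shows how a spatial weight map can let different adapters control different image regions.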
5. Robust Topology and the Hausdorff-Smyth Monad on Metric Spaces over Continuous Quantales
Authors: Francesco Dagnino, Amin Farjudian, Eugenio Moggi • Published: 2025-08-15 • Source: arXiv
We define a (preorder-enriched) category $\mathsf{Met}$ of quantale-valued metric spaces and uniformly continuous maps, with the essential requirement that the quantales are continuous. For each object $(X,d,Q)$ in this category, where $X$ is the carrier set, $Q$ is a continuous quantale, and $d: X \times X \to Q$ is the metric, we consider a topology $\tau_d$ on $X$, which generalizes the open ball topology, and a topology $\tau_{d,R}$ on the powerset $\mathsf{P}(X)$, called the robust topology, which captures robustness with respect to small perturbations of parameters. We define a (preorder-enriched) monad $\mathsf{P}_S$ on $\mathsf{Met}$, called the Hausdorff-Smyth monad, which captures the robust topology, in the sense that the open ball topology of the object $\mathsf{P}_S(X,d,Q)$ coincides with the robust topology $\tau_{d,R}$ for the object $(X,d,Q)$. We prove that every topology arises from a quantale-valued metric. As such, our framework provides a foundation for quantitative reasoning about imprecision and robustness in a wide range of computational and physical systems.
6. Deconfounding via Profiled Transfer Learning
Authors: Ziyuan Chen, Yifan Jiang, Jingyuan Liu, Fang Yao • Published: 2025-08-15 • Source: arXiv
Unmeasured confounders are a major source of bias in regression-based effect estimation and causal inference. In this paper, we advocate a new profiled transfer learning framework, ProTrans, to address confounding effects in the target dataset, when additional source datasets that possess similar confounding structures are available. We introduce the concept of profiled residuals to characterize the shared confounding patterns between source and target datasets. By incorporating these profiled residuals into the target debiasing step, we effectively mitigate the latent confounding effects. We also propose a source selection strategy to enhance robustness of ProTrans against noninformative sources. As a byproduct, ProTrans can also be utilized to estimate treatment effects when potential confounders exist, without the use of auxiliary features such as instrumental or proxy variables, which are often challenging to select in practice. Theoretically, we prove that the resulting estimated model shift from sources to target is confounding-free without any assumptions imposed on the true confounding structure, and that the target parameter estimation achieves the minimax optimal rate under mild conditions. Simulated and real-world experiments validate the effectiveness of ProTrans and support the theoretical findings.
7. Higher Zariski Geometry
Authors: Ko Aoki, Tobias Barthel, Anish Chedalavada, Tomer Schlank, Greg Stevenson • Published: 2025-08-15 • Source: arXiv
We revisit the classical constructions of tensor-triangular geometry in the setting of stably symmetric monoidal idempotent-complete $\infty$-categories, henceforth referred to as 2-rings. In this setting, we produce a Zariski topology, a Zariski spectrum, a category of locally 2-ringed spaces (more generally $\infty$-topoi), and an affine spectrum-global sections adjunction, based on the framework of ``$\infty$-topoi with geometric structure'' as developed by Lurie in \cite{LurieDAG5}. Using work of Kock and Pitsch, we compute that the underlying space of the Zariski spectrum of a 2-ring recovers the Balmer spectrum of its homotopy category. These constructions mirror the analogous structures in the classical Zariski geometry of commutative rings (and commutative ring spectra), and we also demonstrate additional compatibility between classical Zariski and higher Zariski geometry. For rigid 2-rings, we show that the descent results of Balmer and Favi admit coherent enhancements. As a corollary, we obtain that the Zariski spectrum fully faithfully embeds rigid 2-rings into locally 2-ringed $\infty$-topoi. In an appendix, we prove a ``stalk-locality principle'' for the telescope conjecture in the rigid setting, extending earlier work of Hrbek.
8. Approximate Factor Model with S-vine Copula Structure
Authors: Jialing Han, Yu-Ning Li • Published: 2025-08-15 • Source: arXiv
We propose a novel framework for approximate factor models that integrates an S-vine copula structure to capture complex dependencies among common factors. Our estimation procedure proceeds in two steps: first, we apply principal component analysis (PCA) to extract the factors; second, we employ maximum likelihood estimation that combines kernel density estimation for the margins with an S-vine copula to model the dependence structure. Jointly fitting the S-vine copula with the margins yields an oblique factor rotation without resorting to ad hoc restrictions or traditional projection pursuit methods. Our theoretical contributions include establishing the consistency of the rotation and copula parameter estimators, developing asymptotic theory for the factor-projected empirical process under dependent data, and proving the uniform consistency of the projected entropy estimators. Simulation studies demonstrate convergence with respect to both the dimensionality and the sample size. We further assess model performance through Value-at-Risk (VaR) estimation via Monte Carlo methods and apply our methodology to the daily returns of S&P 500 Index constituents to forecast the VaR of S&P 500 index.
9. Optimal CO2 storage management considering safety constraints in multi-stakeholder multi-site CCS projects: a game theoretic perspective
Authors: Jungang Chen, Seyyed A. Hosseini • Published: 2025-08-15 • Source: arXiv
Carbon capture and storage (CCS) projects typically involve a diverse array of stakeholders or players from public, private, and regulatory sectors, each with different objectives and responsibilities. Given the complexity, scale, and long-term nature of CCS operations, determining whether individual stakeholders can independently maximize their interests or whether collaborative coalition agreements are needed remains a central question for effective CCS project planning and management. CCS projects are often implemented in geologically connected sites, where shared geological features such as pressure space and reservoir pore capacity can lead to competitive behavior among stakeholders. Furthermore, CO2 storage sites are often located in geologically mature basins that previously served as sites for hydrocarbon extraction or wastewater disposal in order to leverage existing infrastructures, which makes unilateral optimization even more complicated and unrealistic. In this work, we propose a paradigm based on Markov games to quantitatively investigate how different coalition structures affect the goals of stakeholders. We frame this multi-stakeholder multi-site problem as a multi-agent reinforcement learning problem with safety constraints. Our approach enables agents to learn optimal strategies while compliant with safety regulations. We present an example where multiple operators are injecting CO2 into their respective project areas in a geologically connected basin. To address the high computational cost of repeated simulations of high-fidelity models, a previously developed surrogate model based on the Embed-to-Control (E2C) framework is employed. Our results demonstrate the effectiveness of the proposed framework in addressing optimal management of CO2 storage when multiple stakeholders with various objectives and goals are involved.
10. Coherent Structure Dynamics of Heat Transfer in Wakes of an Inclined Elliptical Cylinder: A Novel Lagrangian Framework
Authors: Pratham Singh, Raghav Singhal, Jiten C. Kalita • Published: 2025-08-15 • Source: arXiv
This work introduces a novel Lagrangian-based framework to analyze forced convective heat transfer in the unsteady wake of a heated elliptical cylinder inclined at angles ranging from $\theta = 0^\circ$ to $90^\circ$, in $15^\circ$ increments with $Pr = 0.71$ at a fixed Reynolds number of $Re = 100$. The framework correlates the temporal evolution of the surface-averaged Nusselt number with the dynamic behavior of Lagrangian saddle points, formed at the intersection of repelling and attracting Lagrangian Coherent Structures (LCSs) extracted via Finite-Time Lyapunov Exponent (FTLE) fields. The study is carried out within a precisely constructed observational domain, a previously unreported influential region in the near-wake, where the trajectory analysis of the newly defined key saddle points (active saddle points) consistently aligns with the trends in surface heat transfer. This domain enables predictive identification of key transitional events in the Nusselt number profile, including local extrema and slope inflections, across all inclination angles. The analysis reveals that oblique displacement of active saddle points enhances heat transfer by promoting the shedding of repelling LCSs, while parallel displacement leads to weakened heat transfer due to the delayed detachment of repelling coherent structures. The proposed framework enables the construction of a temporal function that closely replicates the monotonicity and transitional features of the Nusselt number evolution. Furthermore, threshold displacement metrics are defined for dominant repelling LCSs to correspond with peak heat transfer efficiency. The proposed methodology not only generalizes across a wide range of inclination angles but also provides a physically interpretable framework for predicting heat transfer enhancement based on coherent structure evolution in unsteady flows.
11. Controlling Multimodal LLMs via Reward-guided Decoding
Authors: Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal • Published: 2025-08-15 • Source: arXiv
As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
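The two control knobs described above (the relative weight of the two rewards and the breadth of the search) can be illustrated with a toy candidate-selection step. This is a sketch under stated assumptions, not the authors' method: the paper's reward models are learned, whereas the stand-ins below score object lists against a known ground-truth set, and the names `pick_candidate`, `weight`, and `breadth` are hypothetical.

```python
def precision_reward(caption_objects, truth):
    """Toy stand-in: fraction of mentioned objects that are actually present."""
    if not caption_objects:
        return 0.0
    return sum(1 for o in caption_objects if o in truth) / len(caption_objects)

def recall_reward(caption_objects, truth):
    """Toy stand-in: fraction of present objects that are mentioned."""
    return sum(1 for o in truth if o in caption_objects) / len(truth)

def pick_candidate(candidates, truth, weight, breadth):
    """Score the first `breadth` candidates with a weighted reward mix and keep the best.

    weight=1.0 uses only the precision reward; weight=0.0 uses only the recall reward.
    """
    pool = candidates[:breadth]

    def score(candidate):
        return (weight * precision_reward(candidate, truth)
                + (1.0 - weight) * recall_reward(candidate, truth))

    return max(pool, key=score)

truth = {"dog", "ball"}
candidates = [("dog",), ("dog", "ball", "frisbee"), ("cat",)]
precise = pick_candidate(candidates, truth, weight=1.0, breadth=3)
thorough = pick_candidate(candidates, truth, weight=0.0, breadth=3)
narrow = pick_candidate(candidates, truth, weight=0.0, breadth=1)
```

Shifting `weight` moves the choice from the short, fully correct caption to the longer one that covers every object but adds a hallucinated item, while shrinking `breadth` reduces compute at the cost of seeing fewer candidates.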
12. Bulk viscous cosmological models with cosmological constant: Observational constraints
Authors: R. Noemí Villalobos, Yerko Vásquez, Norman Cruz, Carlos H. López-Caraballo • Published: 2025-08-15 • Source: arXiv
We investigate whether viscous cold dark matter (vCDM) in a $\Lambda$-dominated FLRW universe can alleviate the Hubble tension while satisfying thermodynamic constraints, examining both flat and curved geometries. We model vCDM with bulk viscosity $\zeta = \zeta_0\,(\Omega_{vc}/\Omega_{vc0})^m$, where $m$ determines the viscosity evolution and $\Omega_{vc}$ is the density parameter of vCDM. We explore two particular scenarios: constant viscosity ($m=0$), and variable viscosity ($m$ free). Using Bayesian inference, we constrain these models with the latest datasets: the Pantheon+ SN Ia sample (both with SH0ES calibration, PPS, and without it, PP), $H(z)$ measurements from CC and BAO as separate datasets, and a Gaussian prior on $H_0$ from 2022 SH0ES baseline, $H_0=73.04 \pm 1.04$ km/s/Mpc (R22 prior). We compare the models via information criteria such as AIC, BIC, DIC, and Bayesian evidence. Our results reveal that the Hubble tension persists, although it shows partial alleviation ($\sim 1\sigma$ tension) in all investigated scenarios when local measurements are included. For the flat $m=0$ case, the joint analysis yields $H_0 = 71.05^{+0.62}_{-0.60}$ km/s/Mpc. Curved model initially favors $\Omega_{K0} > 0$ (at more than $2\sigma$), but this preference shifts toward flatness once the PPS+R22 prior are included. Notably, the current viscosity is constrained to $\zeta_0 \sim 10^6$ Pa s in all scenarios, in agreement with the thermodynamic requirements. Although model selection via BIC and Bayesian evidence favors $\Lambda$CDM, AIC and DIC show mild support for viscous models in some datasets. Bulk viscous models moderately improve fits but neither resolve the Hubble tension nor outperform the $\Lambda$CDM model. To achieve more robust constraints, future analyses should incorporate CMB observations, which are expected to break parameter degeneracies involving $m$ and $\tilde{\zeta}_0$.
13. Two-Impulse Trajectory Design in Two-Body Systems With Riemannian Geometry
Authors: Samuel G. Gessow, James Tseng, Eden Zafran, Brett T. Lopez • Published: 2025-08-15 • Source: arXiv
This work presents a new method for generating impulsive trajectories in restricted two-body systems by leveraging Riemannian geometry. The proposed method transforms the standard trajectory optimization problem into a purely geometric one that involves computing a set of geodesics for a suitable Riemannian metric. This transformation is achieved by defining a metric, specifically the Jacobi metric, that embeds the dynamics directly into the metric, so any geodesic of the metric is also a dynamically feasible trajectory. The method finds the fuel-optimal transfer trajectory by sampling candidate energy ($\Delta V$) changes for different points on the current and desired orbit, and efficiently computing and evaluating each candidate geodesic, which are equivalent to candidate orbit transfer trajectories via the Jacobi metric. The method bypasses the known issues of optimization-based methods, e.g., sensitivity to the initial guess, and can be applied to more complex two-body systems. The approach is demonstrated on the minimum-$\Delta V$ two-impulse phase-free orbit transfer problem, first on a Keplerian system and second on a system with a modeled $J_2$ perturbation. The proposed method is shown to meet or exceed the state-of-the-art methods in the minimum-$\Delta V$ problem in the Keplerian system. The generality and versatility of the approach are demonstrated by seamlessly including the $J_2$ perturbation, a case that many existing methods cannot handle. Numerical simulations and performance comparisons showcase the effectiveness of the approach.
14. Quantum Simulation of Collective Neutrino Oscillations in Dense Neutrino Environment
Authors: Shvetaank Tripathi, Sandeep Joshi, Garima Rajpoot, Prashant Shukla • Published: 2025-08-15 • Source: arXiv
Inside dense neutrino gases, such as neutron star mergers or core-collapse supernovae, collective neutrino effects cause the transformation of one neutrino flavour into another. Due to strong neutrino self-interactions in these environments, flavour swapping is prevalent. Considering these environments to be isotropic and homogeneous, we present a study of collective neutrino oscillations by simulating such a system on a noisy quantum simulator (Qiskit AerSimulator) and a quantum processor (ibm_brisbane). We model the effective Hamiltonian governing neutrino interactions and, by applying the Trotter-Suzuki approximation, decompose it into a tractable form suitable for quantum circuit implementation of the time-evolution propagator. Encoding the neutrino state for systems of two and three neutrinos onto qubits, we compute the time evolution of the inversion probability relative to the initial product state. Furthermore, we present quantum circuits to evaluate the concurrence as a measure of entanglement between the neutrinos.
15. Dependence of the recoherence times and recoherence increments on the state of phonon bath in a single qubit dephasing model
Authors: V. V. Ignatyuk, Ch. Samorodov • Published: 2025-08-15 • Source: arXiv
The recoherence times $t^*$ and the maximum values of the recoherence increments $\gamma_{\rm extr}$ are studied as functions of the bath parameters for a single qubit dephasing model, initially prepared by a special kind of non-selective measurement. The recoherence/decoherence events (RDE), occurring at the initial stage of the system evolution, are found to be both similar to and different from the system dynamics at large times. For instance, in contrast to the RDE observed on large time scales, the sub-Ohmic and Ohmic coupling regimes are more favourable for the short-time recoherence than the super-Ohmic one. On the other hand, the short-time behaviour of the recoherence and the long-time dynamics of the decoherence are closely related: the domain of the ohmicity indexes, where the decoherence changes its type (from the complete to incomplete one), is, simultaneously, that of the weakest recoherence. The obtained results give us some hints about the basic characteristics of the environment, which might provide optimal values of $t^*$ and $\gamma_{\rm extr}$ in some sense.
16. Dataset Creation for Visual Entailment using Generative AI
Authors: Rob Reijtenbach, Suzan Verberne, Gijs Wijnholds • Published: 2025-08-15 • Source: arXiv
In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data only leads to a slight drop in quality on SNLI-VE, with an F-score of 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to original training data on another dataset: SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.
17. BRIDGES Lectures: Flows of geometric structures, especially $\mathrm{G}_2$-structures
Authors: Spiro Karigiannis • Published: 2025-08-15 • Source: arXiv
The BRIDGES meeting in gauge theory, extremal structures, and stability was held June 2024 at l'Institut d'Études Scientifiques de Cargèse in Corsica, organized by Daniele Faenzi, Eveline Legendre, Eric Loubeau, and Henrique Sá Earp. The first week was a summer school consisting of four independent but related lecture series by Oscar García Prada, Spiro Karigiannis, Laurent Manivel, and Ruxandra Moraru. The present document consists of notes for the lecture series by Spiro Karigiannis on "Flows of geometric structures, especially $\mathrm{G}_2$-structures". Some assistance in the preparation of these notes by the author was provided by several participants of the summer school. See the Comments field for more information. The main theme is short time existence (STE) and uniqueness for geometric flows. We first introduce geometric structures on manifolds and geometric flows of such structures. We discuss some qualitative features of geometric flows, and consider the notions of strong and weak parabolicity. We focus on the Ricci flow, explaining carefully the DeTurck trick to establish short-time existence and uniqueness, an argument which we then extend to a general class of geometric flows of Riemannian metrics, previewing similar ideas for flows of $\mathrm{G}_2$-structures. Finally, we consider geometric flows of $\mathrm{G}_2$-structures. We review the basics of $\mathrm{G}_2$-geometry and survey several different geometric flows of $\mathrm{G}_2$-structures. In particular, we clarify in what sense STE results for the $\mathrm{G}_2$ Laplacian flow differ from STE results for other geometric flows. We conclude with a summary of some recent results by the author with Dwivedi and Gianniotis, including a classification of all possible heat-type flows of $\mathrm{G}_2$-structures, and a sufficient condition for such a flow to admit STE and uniqueness by a modified DeTurck trick.
18. CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion
Authors: Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei • Published: 2025-08-15 • Source: arXiv
Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.
19. A non-Hermitian Su-Schrieffer-Heeger model with the energy levels of free parafermions
Authors: Edward McCann • Published: 2025-08-15 • Source: arXiv
Using a parent Hermitian tight-binding model on a bipartite lattice with chiral symmetry, we theoretically generate non-Hermitian models for free fermions with $p$ orbitals per unit cell satisfying a complex generalization of chiral symmetry. The $p$ complex energy bands in $k$ space are given by a common $k$-dependent real factor, determined by the bands of the parent model, multiplied by the $p$th roots of unity. When the parent model is the Su-Schrieffer-Heeger (SSH) model, the single-particle energy levels are the same as those of free parafermion solutions to Baxter's non-Hermitian clock model. This construction relies on fully unidirectional hopping to create Bloch Hamiltonians with the form of generalized permutation matrices, but we also describe the effect of partial unidirectional hopping. For fully bidirectional hopping, the Bloch Hamiltonians are Hermitian and may be separated into even and odd parity blocks with respect to inversion of the orbitals within the unit cell. Partially unidirectional hopping breaks the inversion symmetry and mixes the even and odd blocks, and the real energy spectrum evolves into a complex one as the degree of unidirectionality increases, with details determined by the topology of the parent model and the number of orbitals per unit cell, $p$. We describe this process in detail for $p=3$ and $p=4$ with the SSH model. We also apply our approach to graphene, and show that $AA$-stacked bilayer graphene evolves into a square root Hamiltonian of monolayer graphene with the introduction of unidirectional hopping. We show that higher-order exceptional points occur at edge states and solitons in the non-Hermitian SSH model, and at the Dirac point of non-Hermitian graphene.
20. CryptoScope: Utilizing Large Language Models for Automated Cryptographic Logic Vulnerability Detection
Authors: Zhihao Li, Zimo Ji, Tao Zheng, Hao Ren, Xiao Lan • Published: 2025-08-15 • Source: arXiv
Cryptographic algorithms are fundamental to modern security, yet their implementations frequently harbor subtle logic flaws that are hard to detect. We introduce CryptoScope, a novel framework for automated cryptographic vulnerability detection powered by Large Language Models (LLMs). CryptoScope combines Chain-of-Thought (CoT) prompting with Retrieval-Augmented Generation (RAG), guided by a curated cryptographic knowledge base containing over 12,000 entries. We evaluate CryptoScope on LLM-CLVA, a benchmark of 92 cases primarily derived from real-world CVE vulnerabilities, complemented by cryptographic challenges from major Capture The Flag (CTF) competitions and synthetic examples across 11 programming languages. CryptoScope consistently improves performance over strong LLM baselines, boosting DeepSeek-V3 by 11.62%, GPT-4o-mini by 20.28%, and GLM-4-Flash by 28.69%. Additionally, it identifies 9 previously undisclosed flaws in widely used open-source cryptographic projects.
21. Representing Speech Through Autoregressive Prediction of Cochlear Tokens
Authors: Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel L. K. Yamins • Published: 2025-08-15 • Source: arXiv
We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, and state-of-the-art lexical semantics. AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. Complementing AuriStream's strong representational capabilities, it generates continuations of audio which can be visualized in a spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
22. Nonparametric learning of stochastic differential equations from sparse and noisy data
Authors: Arnab Ganguly, Riten Mitra, Jinpu Zhou • Published: 2025-08-15 • Source: arXiv
The paper proposes a systematic framework for building data-driven stochastic differential equation (SDE) models from sparse, noisy observations. Unlike traditional parametric approaches, which assume a known functional form for the drift, our goal here is to learn the entire drift function directly from data without strong structural assumptions, making it especially relevant in scientific disciplines where system dynamics are partially understood or highly complex. We cast the estimation problem as minimization of the penalized negative log-likelihood functional over a reproducing kernel Hilbert space (RKHS). In the sparse observation regime, the presence of unobserved trajectory segments makes the SDE likelihood intractable. To address this, we develop an Expectation-Maximization (EM) algorithm that employs a novel Sequential Monte Carlo (SMC) method to approximate the filtering distribution and generate Monte Carlo estimates of the E-step objective. The M-step then reduces to a penalized empirical risk minimization problem in the RKHS, whose minimizer is given by a finite linear combination of kernel functions via a generalized representer theorem. To control model complexity across EM iterations, we also develop a hybrid Bayesian variant of the algorithm that uses shrinkage priors to identify significant coefficients in the kernel expansion. We establish important theoretical convergence results for both the exact and approximate EM sequences. The resulting EM-SMC-RKHS procedure enables accurate estimation of the drift function of stochastic dynamical systems in low-data regimes and is broadly applicable across domains requiring continuous-time modeling under observational constraints. We demonstrate the effectiveness of our method through a series of numerical experiments.
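The representer-theorem step can be illustrated in isolation. The sketch below is only a stand-in for the M-step: it uses plain kernel ridge regression with an assumed Gaussian kernel, and it fits hypothetical, fully observed drift targets $b(x) = -x$ rather than Monte Carlo estimates from an SMC filter. It contains no EM loop and no shrinkage prior. What it does show is that the fitted drift is a finite linear combination of kernel functions centred at the training points.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=1.0):
    """Gram matrix k(a_i, b_j) for a Gaussian (RBF) kernel on 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def fit_kernel_ridge(x, y, lam=1e-4, length_scale=1.0):
    """Solve (K + lam * n * I) alpha = y for the kernel expansion coefficients."""
    n = len(x)
    k = gaussian_kernel(x, x, length_scale)
    return np.linalg.solve(k + lam * n * np.eye(n), y)

def predict(x_train, alpha, x_new, length_scale=1.0):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x) at the new points."""
    return gaussian_kernel(x_new, x_train, length_scale) @ alpha

# Hypothetical training data: drift of an Ornstein-Uhlenbeck-like process.
x_train = np.linspace(-2.0, 2.0, 41)
y_train = -x_train
alpha = fit_kernel_ridge(x_train, y_train)
drift_hat = predict(x_train, alpha, np.array([-1.0, 0.0, 1.0]))
```

The penalty `lam` and the kernel `length_scale` are arbitrary choices for the illustration, not values from the paper.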
23. It's not a FAD: first results in using Flows for unsupervised Anomaly Detection at 40 MHz at the Large Hadron Collider
Authors: Francesco Vaselli, Maurizio Pierini, Maciej Mikolaj Glowacki, Thea Aarrestad, Katya Govorkova, Vladimir Loncar, Dimitrios Danopoulos, Felice Pantaleo • Published: 2025-08-15 • Source: arXiv
We present the first implementation of a Continuous Normalizing Flow (CNF) model for unsupervised anomaly detection within the realistic, high-rate environment of the Large Hadron Collider's L1 trigger systems. While CNFs typically define an anomaly score via a probabilistic likelihood, calculating this score requires solving an Ordinary Differential Equation, a procedure too complex for FPGA deployment. To overcome this, we propose a novel, hardware-friendly anomaly score defined as the squared norm of the model's vector field output. This score is based on the intuition that anomalous events require a larger transformation by the flow. Our model, trained via Flow Matching on Standard Model-like data, is synthesized for an FPGA using the hls4ml library. We demonstrate that our approach effectively identifies a variety of beyond-the-Standard-Model signatures with performance comparable to existing machine learning-based triggers. The algorithm achieves a latency of a few hundred nanoseconds and requires minimal FPGA resources, establishing CNFs as a viable new tool for real-time, data-driven discovery at 40 MHz.
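The proposed score is simple enough to sketch in a few lines. In the example below, `toy_vector_field` and `TYPICAL_EVENT` are hypothetical stand-ins for the trained flow-matching network, and any dependence of the field on the flow time is ignored. The point is only that the score needs a single forward evaluation and a squared norm, with no ODE solve.

```python
import numpy as np

# Hypothetical stand-in for the trained network: a field that pushes every
# event toward a fixed "typical" point. A real model would be learned and
# may also depend on a flow time t, which this sketch ignores.
TYPICAL_EVENT = np.array([1.0, -0.5, 0.2])

def toy_vector_field(x):
    return TYPICAL_EVENT - x

def anomaly_score(x, vector_field=toy_vector_field):
    """Squared Euclidean norm of the vector field evaluated at x."""
    v = vector_field(np.asarray(x, dtype=float))
    return float(np.sum(v ** 2))

# An event near the typical point needs a small transformation,
# an event far from it needs a large one and gets a higher score.
typical_score = anomaly_score([1.1, -0.4, 0.1])
outlier_score = anomaly_score([6.0, 4.0, -3.0])
```

A trigger would then threshold the score. How that threshold is chosen is not stated in the abstract, so it is omitted here.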
24. Low barrier ZrO$_x$-based Josephson junctions
Authors: Jaehong Choi, Maciej Olszewski, Luojia Zhang, Zhaslan Baraissov, Tathagata Banerjee, Kushagra Aggarwal, Sarvesh Chaudhari, Tomás A. Arias, David A. Muller, Valla Fatemi, Gregory D. Fuchs • Published: 2025-08-15 • Source: arXiv
The Josephson junction is a crucial element in superconducting devices, and niobium is a promising candidate for the superconducting material due to its large energy gap relative to aluminum. AlO$_x$ has long been regarded as the highest quality oxide tunnel barrier and is often used in niobium-based junctions. Here we propose ZrO$_x$ as an alternative tunnel barrier material for Nb electrodes. We theoretically estimate that zirconium oxide has excellent oxygen retention properties and experimentally verify that there is no significant oxygen diffusion leading to NbO$_x$ formation in the adjacent Nb electrode. We develop a top-down, subtractive fabrication process for Nb/Zr-ZrO$_x$/Nb Josephson junctions, which enables scalability and large-scale production of superconducting electronics. Using cross sectional scanning transmission electron microscopy, we experimentally find that depending on the Zr thickness, ZrO$_x$ tunnel barriers can be fully crystalline with chemically abrupt interfaces with niobium. Further analysis using electron energy loss spectroscopy reveals that ZrO$_x$ corresponds to tetragonal ZrO$_2$. Room temperature characterization of fabricated junctions using Simmons' model shows that ZrO$_2$ exhibits a low tunnel barrier height, which is promising in merged-element transmon applications. Low temperature transport measurements reveal sub-gap structure, while the low-voltage sub-gap resistance remains in the megaohm range.
25. DashCam Video: A complementary low-cost data stream for on-demand forest-infrastructure system monitoring
Authors: Durga Joshi, Chandi Witharana, Robert Fahey, Thomas Worthley, Zhe Zhu, Diego Cerrai • Published: 2025-08-15 • Source: arXiv
Our study introduces a novel, low-cost, and reproducible framework for real-time, object-level structural assessment and geolocation of roadside vegetation and infrastructure with commonly available but underutilized dashboard camera (dashcam) video data. We developed an end-to-end pipeline that combines monocular depth estimation, depth error correction, and geometric triangulation to generate accurate spatial and structural data from street-level video streams from vehicle-mounted dashcams. Depth maps were first estimated using a state-of-the-art monocular depth model, then refined via a gradient-boosted regression framework to correct underestimations, particularly for distant objects. The depth correction model achieved strong predictive performance ($R^2$ = 0.92, MAE = 0.31 on transformed scale), significantly reducing bias beyond 15 m. Further, object locations were estimated using GPS-based triangulation, while object heights were calculated using pinhole camera geometry. Our method was evaluated under varying conditions of camera placement and vehicle speed. A low-speed vehicle with an inside camera gave the highest accuracy, with a mean geolocation error of 2.83 m and a mean absolute error (MAE) in height estimation of 2.09 m for trees and 0.88 m for poles. To the best of our knowledge, it is the first framework to combine monocular depth modeling, triangulated GPS-based geolocation, and real-time structural assessment for urban vegetation and infrastructure using consumer-grade video data. Our approach complements conventional remote sensing (RS) methods, such as LiDAR and imagery, by offering a fast, real-time, and cost-effective solution for object-level monitoring of vegetation risks and infrastructure exposure, making it especially valuable for utility companies and urban planners aiming for scalable and frequent assessments in dynamic urban environments.
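The height calculation follows from similar triangles in the pinhole camera model. The sketch below assumes the object is roughly upright and facing the camera, that depth is measured along the optical axis, and that the focal length is expressed in pixels. The function name and all numbers are hypothetical, not taken from the study.

```python
def object_height_m(pixel_height, depth_m, focal_length_px):
    """Pinhole relation: real height = pixel height * depth / focal length."""
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    return pixel_height * depth_m / focal_length_px

# Hypothetical numbers: a pole spanning 400 px, 10 m away, f = 1000 px.
pole_height = object_height_m(400, 10.0, 1000.0)
```

Because the estimate scales linearly with depth, any residual depth bias carries over to the height in the same proportion, which is why the depth correction step matters for distant objects.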
26. Inscription, twistors, and $p$-adic periods
Authors: Sean Howe • Published: 2025-08-15 • Source: arXiv
We introduce the theory of inscribed $v$-sheaves, a differentiable extension of the theory of diamonds and $v$-sheaves with internal tangent bundles that are often relative inscribed Banach-Colmez spaces, then apply this theory to the study of $p$-adic periods. In particular, we construct natural inscribed versions of the Hodge and Hodge-Tate period maps and their lattice refinements for de Rham torsors, then compute the derivatives of these period maps in terms of classical structures in $p$-adic Hodge theory. These torsors include infinite level global Shimura varieties and infinite level local Shimura varieties, and for these spaces we also give another moduli-theoretic construction of the inscribed structure; the construction in the local Shimura case applies more generally to the non-minuscule moduli of mixed characteristic local shtukas with one leg. The key new ingredients in our study of inscribed structures on $p$-adic Lie group torsors over smooth rigid varieties over a $p$-adic field are the Liu-Zhu period map, a refinement of the Hodge period map whose derivative is the geometric Sen morphism/canonical Higgs field, and a closely related exact tensor functor from $\mathbb{Q}_p$-local systems to a category of twistor bundles on the relative thickened Fargues-Fontaine curve. These new structures are only visible after passing to the inscribed setting. We also discuss some possible implications of our computations in the vein of ``differential topology for diamonds."
27. Investigating Sensors and Methods in Grasp State Classification in Agricultural Manipulation
Authors: Benjamin Walt, Jordan Westphal, Girish Krishnan • Published: 2025-08-15 • Source: arXiv
Effective and efficient agricultural manipulation and harvesting depend on accurately understanding the current state of the grasp. The agricultural environment presents unique challenges due to its complexity, clutter, and occlusion. Additionally, fruit is physically attached to the plant, requiring precise separation during harvesting. Selecting appropriate sensors and modeling techniques is critical for obtaining reliable feedback and correctly identifying grasp states. This work investigates a set of key sensors, namely inertial measurement units (IMUs), infrared (IR) reflectance, tension, tactile sensors, and RGB cameras, integrated into a compliant gripper to classify grasp states. We evaluate the individual contribution of each sensor and compare the performance of two widely used classification models: Random Forest and Long Short-Term Memory (LSTM) networks. Our results demonstrate that a Random Forest classifier, trained in a controlled lab environment and tested on real cherry tomato plants, achieved 100% accuracy in identifying slip, grasp failure, and successful picks, marking a substantial improvement over baseline performance. Furthermore, we identify a minimal viable sensor combination, namely IMU and tension sensors, that effectively classifies grasp states. This classifier enables the planning of corrective actions based on real-time feedback, thereby enhancing the efficiency and reliability of fruit harvesting operations.
28. Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks
Authors: Jakub Łucki, Jonathan Becktor, Georgios Georgakis, Robert Royce, Shehryar Khattak • Published: 2025-08-15 • Source: arXiv
Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU-CPU memory transfers, across multiple specialized task-specific model heads running in parallel. This design eliminates the computational redundancy inherent in the feature extraction component when deploying traditional sequential models while enabling dynamic task prioritization based on application demands. We demonstrate our framework's capabilities through an example implementation using DINOv2 as the foundation model with multiple task heads (depth, object detection, and semantic segmentation), achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi-Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per-task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end-to-end real-time performance at $\geq$50 Hz on NVIDIA Jetson Orin AGX for TensorRT optimized models.
29. Anomaly cancellation for a $U(1)$ factor
Authors: Ben Gripaios, Khoi Le Nguyen Nguyen • Published: 2025-08-15 • Source: arXiv
We use methods of arithmetic geometry to solve the abelian local anomaly cancellation conditions for a four-dimensional gauge theory whose Lie algebra has a single $\mathfrak{u}_1$ summand, assuming that a solution exists. The resulting polynomial equations in the integer $\mathfrak{u}_1$ charges define a projective cubic hypersurface over the field of rational numbers. Generically, such a hypersurface is (by a theorem of Kollár) a unirational variety, making it possible to find a finitely-many-to-one parametrization of all solutions (of which there are necessarily infinitely many). (Otherwise, such a hypersurface is either reducible or is a cone over an elliptic curve and all solutions can again be found in practice.) As an example, for the Standard Model Lie algebra with its three generations of quarks and leptons (or even with just a single generation and two $\mathfrak{su}_3 \oplus \mathfrak{su}_2$-singlet right-handed neutrinos), it follows that there are infinitely many anomaly-free possibilities for the $\mathfrak{u}_1$ hypercharges.
30. Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models
Authors: Qiguang Chen, Dengyun Peng, Jinhao Liu, HuiKang Su, Jiannan Guan, Libo Qin, Wanxiang Che • Published: 2025-08-15 • Source: arXiv
Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-perceived difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.
31. Intergenerational Support for Deepfake Scams Targeting Older Adults
Authors: Karina LaRubbio, Alyssa Lanter, Seihyun Lee, Mahima Ramesh, Diana Freed • Published: 2025-08-15 • Source: arXiv
AI-enhanced scams now employ deepfake technology to produce convincing audio and visual impersonations of trusted family members, often grandchildren, in real time. These attacks fabricate urgent scenarios, such as legal or medical emergencies, to socially engineer older adults into transferring money. The realism of these AI-generated impersonations undermines traditional cues used to detect fraud, making them a powerful tool for financial exploitation. In this study, we explore older adults' perceptions of these emerging threats and their responses, with a particular focus on the role of youth, who may also be impacted by having their identities exploited, in supporting older family members' online safety. We conducted focus groups with 37 older adults (ages 65+) to examine their understanding of deepfake impersonation scams and the value of intergenerational technology support. Findings suggest that older adults frequently rely on trusted relationships to detect scams and develop protective practices. Based on this, we identify opportunities to engage youth as active partners in enhancing resilience across generations.
32. Propagation of Precessing Jet in Envelope of Tidal Disruption Events
Authors: Hao-Yu Yuan, Hong-Zhou Wu, Wei-Hua Lei • Published: 2025-08-15 • Source: arXiv
It is likely that the disk of a tidal disruption event (TDE) is misaligned with respect to the equatorial plane of the spinning supermassive black hole (SMBH), since the initial stellar orbit before disruption most likely has an inclined orbital plane. Such a misaligned disk undergoes Lense-Thirring precession around the SMBH spin axis, leading to a precessing jet if launched in the vicinity of the SMBH and aligned with the disk angular momentum. The bound debris can also build a thick envelope which powers optical emission. In this work, we study the propagation of the precessing jet in the TDE envelope. We adopt a ``zero-Bernoulli accretion'' (ZEBRA) envelope model. An episodic jet will be observed if the line of sight is just at the envelope pole direction and $\theta_{\rm LT}=\theta_{\rm env}$, since the jet can freely escape from this low density rotation funnel, where $\theta_{\rm LT}$ and $\theta_{\rm env}$ are the jet precessing angle and the angle between the envelope polar axis and the SMBH spin axis, respectively. The jet will be choked in other directions. For $\theta_{\rm LT} < \theta_{\rm env}$, the jets can also break out of the envelope for a very small precession angle $\theta_{\rm LT}$ or if the jet is aligned with SMBH spin. If the jet is choked within the envelope, the radiation produced during cocoon shock breakout will imprint characteris
33. Causality Matters: How Temporal Information Emerges in Video Language Models
Authors: Yumeng Shi, Quanyu Long, Yin Wu, Wenya Wang • Published: 2025-08-15 • Source: arXiv
Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, still remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in the performance of temporal understanding. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated into the query tokens. This emergent mechanism shows that temporal reasoning emerges from inter-visual token interactions under the constraints of causal attention, which implicitly encodes temporal structure. Based on these insights, we propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation. Experiments on two benchmarks validate the effectiveness of both approaches. To the best of our knowledge, this is the first work to systematically investigate video temporal understanding in VideoLMs, offering insights for future model improvement.
34. Activate Me!: Designing Efficient Activation Functions for Privacy-Preserving Machine Learning with Fully Homomorphic Encryption
Authors: Nges Brian Njungle, Michel A. Kinsy • Published: 2025-08-15 • Source: arXiv
The growing adoption of machine learning in sensitive areas such as healthcare and defense introduces significant privacy and security challenges. These domains demand robust data protection, as models depend on large volumes of sensitive information for both training and inference. Fully Homomorphic Encryption (FHE) presents a compelling solution by enabling computations directly on encrypted data, maintaining confidentiality across the entire machine learning workflow. However, FHE inherently supports only linear operations, making it difficult to implement non-linear activation functions, essential components of modern neural networks. This work focuses on designing, implementing, and evaluating activation functions tailored for FHE-based machine learning. We investigate two commonly used functions: the Square function and Rectified Linear Unit (ReLU), using LeNet-5 and ResNet-20 architectures with the CKKS scheme from the OpenFHE library. For ReLU, we assess two methods: a conventional low-degree polynomial approximation and a novel scheme-switching technique that securely evaluates ReLU under FHE constraints. Our findings show that the Square function performs well in shallow networks like LeNet-5, achieving 99.4% accuracy with 128 seconds per image. In contrast, deeper models like ResNet-20 benefit more from ReLU. The polynomial approximation yields 83.8% accuracy with 1,145 seconds per image, while our scheme-switching method improves accuracy to 89.8%, albeit with a longer inference time of 1,697 seconds. These results underscore a critical trade-off in FHE-based ML: faster activation functions often reduce accuracy, whereas those preserving accuracy demand greater computational resources.
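CKKS evaluates additions and multiplications, so an activation has to be expressed as a polynomial before it can run on ciphertexts. The cleartext sketch below illustrates the two polynomial options the abstract mentions: the Square function, and a low-degree approximation of ReLU. It uses plain NumPy and makes no OpenFHE calls. The degree (2), the fitting interval $[-1, 1]$, and the least-squares fit are assumptions made for illustration and may differ from the paper's approximation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def square(x):
    """Square activation: a single multiplication, already a polynomial."""
    return x * x

def fit_poly_relu(degree=2, lo=-1.0, hi=1.0, samples=201):
    """Least-squares polynomial fit to ReLU on [lo, hi].

    Returns coefficients ordered from highest degree to constant term,
    as produced by numpy.polyfit.
    """
    xs = np.linspace(lo, hi, samples)
    return np.polyfit(xs, relu(xs), degree)

coeffs = fit_poly_relu()
xs = np.linspace(-1.0, 1.0, 201)
# Worst-case gap between the polynomial and the true ReLU on the interval.
max_err = float(np.max(np.abs(np.polyval(coeffs, xs) - relu(xs))))
```

The approximation only holds inside the fitting interval, so inputs would need to be scaled into that range. Raising the degree reduces `max_err` but adds multiplications, which is the cost side of the trade-off the abstract describes.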
35. TrajSV: A Trajectory-based Model for Sports Video Representations and Applications
Authors: Zheng Wang, Shihao Xu, Wei Shi • Published: 2025-08-15 • Source: arXiv
Sports analytics has received significant attention from both academia and industry in recent years. Despite the growing interest and efforts in this field, several issues remain unresolved, including (1) data unavailability, (2) lack of an effective trajectory-based framework, and (3) requirement for sufficient supervision labels. In this paper, we present TrajSV, a trajectory-based framework that addresses various issues in existing studies. TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). The data preprocessing module extracts player and ball trajectories from sports broadcast videos. CRNet utilizes a trajectory-enhanced Transformer module to learn clip representations based on these trajectories. Additionally, VRNet learns video representations by aggregating clip representations and visual features with an encoder-decoder architecture. Finally, a triple contrastive loss is introduced to optimize both video and clip representations in an unsupervised manner. The experiments are conducted on three broadcast video datasets to verify the effectiveness of TrajSV for three types of sports (i.e., soccer, basketball, and volleyball) with three downstream applications (i.e., sports video retrieval, action spotting, and video captioning). The results demonstrate that TrajSV achieves state-of-the-art performance in sports video retrieval, showcasing a nearly 70% improvement. It outperforms baselines in action spotting, achieving state-of-the-art results in 9 out of 17 action categories, and demonstrates a nearly 20% improvement in video captioning. Additionally, we introduce a deployed system along with the three applications based on TrajSV.