🤖 AI Research Papers

August 12, 2025

🤖 AI-Generated Research Summary

Comprehensive Summary of Recent Research in AI, LLMs, Agents, and Workflows

This summary synthesizes key insights from a collection of 35 recent research papers spanning AI, large language models (LLMs), agents, and workflow systems. The analysis is structured to highlight major research trends, breakthrough findings, methodological approaches, practical applications, and future research directions.


1. Key Research Trends

a. LLM Optimization and Efficiency: accelerating inference and deployment, e.g., SlimInfer's dynamic token pruning and post-training quantization of deep neural receivers.

b. Multimodal and Multisensory AI: extending models beyond vision and text, e.g., HapticLLaMA for haptic captioning, CLIPin for multimodal semantic alignment, and synthetic charts for MLLM chart understanding.

c. AI Agents and Workflow Automation: agentic systems operating in closed loops, e.g., Chain-of-Alpha's dual-chain alpha mining and Bayesian-optimized laser-driven ion acceleration.

d. Fairness, Security, and Trust in AI: e.g., the "Fair Game" auditing-and-debiasing loop, SecMCP's detection of conversation hijacking in MCP, and blockchain-enabled federated learning.

e. Synthetic Data and Data Augmentation: e.g., the ECD chart dataset, clinically guided synthesis for laryngeal lesion detection, and LLM-generated text for German intent recognition.

f. Retrieval-Augmented and Contextual AI: e.g., hate speech classification as a RAG problem, a RAG system for drug contraindications, and comparisons of knowledge injection methods.


2. Breakthrough Findings

a. SlimInfer reports up to 2.53× time-to-first-token speedup for LLaMA3.1-8B-Instruct without sacrificing LongBench performance.

b. A RAG pipeline lifts drug-contraindication accuracy from a 0.49-0.57 baseline to 0.87-0.94 across three question categories.

c. LeoLM, a 13B domain-specific model, surpasses the far larger ChatGPT in synthetic-dataset quality for German intent recognition.

d. Adding only 10% synthetic data improves laryngeal lesion detection by 9% in-domain and 22.1% on out-of-domain data.


3. Methodological Approaches

a. Fine-tuned diffusion models, including material-guided relighting (LightSwitch) and ControlNet-guided clinical image synthesis.

b. Two-stage training pipelines combining supervised fine-tuning with RLHF (HapticLLaMA).

c. Retrieval-augmented generation with hybrid retrieval and re-ranking for classification and question answering.

d. Closed-loop Bayesian optimization and reinforcement learning for experimental and network control.


4. Applications and Use Cases

a. Healthcare: mammography (MammoFormer), laryngeal lesion detection, and drug contraindication lookup.

b. Finance: LLM-driven alpha factor mining (Chain-of-Alpha) and portfolio-consumption optimization.

c. Networking: slice admission control, quantized neural receivers, and 6G vehicular resource management (SCAR).

d. Content moderation and agent security: policy-grounded hate speech classification and MCP conversation-drift detection (SecMCP).


5. Future Directions

a. Self-improving model updates via self-generated synthetic training data.

b. Sustaining the digital commons and building public "AI commons" infrastructure.

c. Ultra-low bit-width quantization for resource-constrained edge deployment.

d. Temporal consistency and 3D modeling for video-based generative applications.


Conclusion

This collection of papers reflects a vibrant and rapidly evolving AI research landscape. Key advances are being made in LLM efficiency, multimodal integration, workflow automation, fairness, and security. The field is moving toward more robust, trustworthy, and context-aware AI systems, with a strong emphasis on practical deployment in diverse and challenging real-world scenarios. Researchers and practitioners are encouraged to build on these trends, particularly in the areas of efficient LLM deployment, secure and fair AI workflows, and the integration of novel data modalities.

📚 arXiv (35 papers)
1. LightSwitch: Multi-view Relighting with Material-guided Diffusion
Authors: Yehonathan Litman, Fernando De la Torre, Shubham Tulsiani β€’ Published: 2025-08-08 β€’ Source: arXiv
Recent approaches for 3D relighting have shown promise in integrating 2D image relighting generative priors to alter the appearance of a 3D representation while preserving the underlying structure. Nevertheless, generative priors used for 2D relighting that directly relight from an input image do not take advantage of intrinsic properties of the subject that can be inferred, or cannot consider multi-view data at scale, leading to subpar relighting. In this paper, we propose LightSwitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. By using multi-view and material information cues together with a scalable denoising scheme, our method consistently and efficiently relights dense multi-view data of objects with diverse material compositions. We show that our 2D relighting prediction quality exceeds previous state-of-the-art relighting priors that directly relight from images. We further demonstrate that LightSwitch matches or outperforms state-of-the-art diffusion inverse rendering methods in relighting synthetic and real objects in as little as 2 minutes.
2. Effective Training Data Synthesis for Improving MLLM Chart Understanding
Authors: Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng β€’ Published: 2025-08-08 β€’ Source: arXiv
Being able to effectively read scientific plots, or chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind, with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.
3. Computational Methods and Verification Theorem for Portfolio-Consumption Optimization under Exponential O-U Dynamics
Authors: Zhaoxiang Zhong, Haiming Song β€’ Published: 2025-08-08 β€’ Source: arXiv
In this paper, we focus on the problem of optimal portfolio-consumption policies in a multi-asset financial market, where the n risky assets follow Exponential Ornstein-Uhlenbeck processes, along with one risk-free bond. The investor's preferences are modeled using Constant Relative Risk Aversion utility with state-dependent stochastic discounting. The problem can be formulated as a high-dimensional stochastic optimal control problem, wherein the associated value function satisfies a Hamilton-Jacobi-Bellman (HJB) equation, which constitutes a necessary condition for optimality. We apply a variable separation technique to transform the HJB equation into a system of ordinary differential equations (ODEs). Then a class of hybrid numerical approaches that integrate exponential Rosenbrock-type methods with Runge-Kutta methods is proposed to solve the ODE system. More importantly, we establish a rigorous verification theorem that provides sufficient conditions for the existence of the value function and an admissible optimal control, which can be verified numerically. A series of experiments are performed, demonstrating that our proposed method outperforms the conventional grid-based method in both accuracy and computational cost. Furthermore, the numerically derived optimal policy achieves superior performance over all other considered admissible policies.
4. Voting-Based Semi-Parallel Proof-of-Work Protocol
Authors: Mustafa Doger, Sennur Ulukus β€’ Published: 2025-08-08 β€’ Source: arXiv
Parallel Proof-of-Work (PoW) protocols are suggested to improve the safety guarantees, transaction throughput and confirmation latencies of Nakamoto consensus. In this work, we first consider the existing parallel PoW protocols and develop hard-coded incentive attack structures. Our theoretical results and simulations show that the existing parallel PoW protocols are more vulnerable to incentive attacks than the Nakamoto consensus, e.g., attacks have a smaller profitability threshold and result in higher relative rewards. Next, we introduce a voting-based semi-parallel PoW protocol that outperforms both Nakamoto consensus and the existing parallel PoW protocols from most practical perspectives such as communication overheads, throughput, transaction conflicts, incentive compatibility of the protocol as well as a fair distribution of transaction fees among the voters and the leaders. We use state-of-the-art analysis to evaluate the consistency of the protocol and consider Markov decision process (MDP) models to substantiate our claims about the resilience of our protocol against incentive attacks.
5. WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion
Authors: Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai β€’ Published: 2025-08-08 β€’ Source: arXiv
Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced effects on LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at https://github.com/Sofianebouaziz1/WGAST.git.
6. On the Parallel Complexity of Identifying Groups and Quasigroups via Decompositions
Authors: Dan Johnson, Michael Levet, Petr VojtΔ›chovskΓ½, Brett Widholm β€’ Published: 2025-08-08 β€’ Source: arXiv
In this paper, we investigate the computational complexity of isomorphism testing for finite groups and quasigroups, given by their multiplication tables. We crucially take advantage of their various decompositions to show the following:
- We first consider the class $\mathcal{C}$ of groups that admit direct product decompositions, where each indecomposable factor is $O(1)$-generated, and either perfect or centerless. We show any group in $\mathcal{C}$ is identified by the $O(1)$-dimensional count-free Weisfeiler--Leman (WL) algorithm with $O(\log \log n)$ rounds, and the $O(1)$-dimensional counting WL algorithm with $O(1)$ rounds. Consequently, the isomorphism problem for $\mathcal{C}$ is in $\textsf{L}$. The previous upper bound for this class was $\textsf{TC}^{1}$, using $O(\log n)$ rounds of the $O(1)$-dimensional counting WL (Grochow and Levet, FCT 2023).
- We next consider, more generally, the class of groups where each indecomposable factor is $O(1)$-generated. We exhibit an $\textsf{AC}^{3}$ canonical labeling procedure for this class. Here, we accomplish this by showing that in the multiplication table model, the direct product decomposition can be computed in $\textsf{AC}^{3}$, parallelizing the work of Kayal and Nezhmetdinov (ICALP 2009).
- Isomorphism testing between a central quasigroup $G$ and an arbitrary quasigroup $H$ is in $\textsf{NC}$. Here, we take advantage of the fact that central quasigroups admit an affine decomposition in terms of an underlying Abelian group. Only the trivial bound of $n^{\log(n)+O(1)}$-time was previously known for isomorphism testing of central quasigroups.
7. HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
Authors: Guimin Hu, Daniel Hershcovich, Hasti Seifi β€’ Published: 2025-08-08 β€’ Source: arXiv
Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA's captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06, respectively. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
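
The abstract names a frequency-based tokenizer without giving details; below is a minimal sketch of one plausible scheme, mapping each signal frame to the index of its dominant frequency bin. All parameters and the quantization rule are assumptions, not HapticLLaMA's actual tokenizer.

```python
import numpy as np

def tokenize_vibration(signal, frame=256, hop=128, n_bins=64):
    """Map a 1-D vibration signal to a sequence of discrete token IDs.

    Each frame is assigned the index of its dominant FFT frequency bin,
    quantized into an n_bins vocabulary (a simplified stand-in for the
    paper's frequency-based tokenizer).
    """
    tokens = []
    for start in range(0, len(signal) - frame, hop):
        window = signal[start:start + frame] * np.hanning(frame)
        spectrum = np.abs(np.fft.rfft(window))
        dominant = int(np.argmax(spectrum))                # strongest frequency bin
        tokens.append(dominant * n_bins // len(spectrum))  # quantize to vocab size
    return tokens

# Example: a 440 Hz burst becomes a near-constant token sequence.
t = np.linspace(0, 0.5, 4000, endpoint=False)
print(tokenize_vibration(np.sin(2 * np.pi * 440 * t))[:8])
```

Such discrete IDs can then be treated like extra vocabulary entries when interfacing with a language model.
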
8. Revisiting the Gas Dynamics of Henize 2-10: Possible Drivers of the Starburst
Authors: Josephine M. Dalsin, Allison H. Costa, Remy Indebetouw, Kelsey E. Johnson, Natalie O. Johnson, Sabrina Stierwalt β€’ Published: 2025-08-08 β€’ Source: arXiv
The triggers of starburst episodes are a key component to our understanding of the baryon cycle in galaxies. Galaxy mergers are a commonly suggested catalyst for starbursts, but once the galaxies coalesce into a single kinematically disturbed system, their merger history can be difficult to assess. This is particularly true for dwarf galaxies, which are expected to dominate the merger rate at all redshifts due to their large numbers. One such dwarf galaxy undergoing an enigmatic starburst episode is Henize 2-10, which appears to be isolated. Possible scenarios that might have caused the starburst episode include a previous merger or stochastic processes within the galaxy itself, such as self-regulation via feedback processes. We present new VLA 21-cm observations and unpublished archival CARMA CO data to investigate the dynamical state and star formation activity in the galaxy. We do not detect an HI tail consistent with the structure reported by Kobulnicky et al. (1995), which was suggested as evidence for a merger or interaction, but rather these new observations indicate an extended HI distribution. We also find that the HI appears dynamically decoupled from an extended CO feature (inferred to be a tidal tail in previous work), suggesting large-scale dynamical processes of some type are affecting the gas in this system. We provide a meta-analysis of available results to enhance our understanding of what might be triggering the starburst episode in Henize 2-10, and speculate that the large CO feature could be falling into the galaxy and potentially trigger starburst activity.
9. Generative AI and the Future of the Digital Commons: Five Open Questions and Knowledge Gaps
Authors: Arman Noroozian, Lorena Aldana, Marta Arisi, Hadi Asghari, Renata Avila, Pietro Giovanni Bizzaro, Ramya Chandrasekhar, Cristian Consonni, Deborah De Angelis, Francesca De Chiara, Maria del Rio-Chanona, Melanie Dulong de Rosnay, Maria Eriksson, Frederic Font, Emilia Gomez, ValΓ©rian Guillier, Lisa Gutermuth, David Hartmann, Lucie-AimΓ©e Kaffee, Paul Keller, Felix Stalder, Joao Vinagre, Denny VrandečiΔ‡, Amanda Wasielewski β€’ Published: 2025-08-08 β€’ Source: arXiv
The rapid advancement of Generative AI (GenAI) relies heavily on the digital commons, a vast collection of free and open online content that is created, shared, and maintained by communities. However, this relationship is becoming increasingly strained due to financial burdens, decreased contributions, and misalignment between AI models and community norms. As we move deeper into the GenAI era, it is essential to examine the interdependent relationship between GenAI, the long-term sustainability of the digital commons, and the equity of current AI development practices. We highlight five critical questions that require urgent attention:
1. How can we prevent the digital commons from being threatened by undersupply as individuals cease contributing to the commons and turn to Generative AI for information?
2. How can we mitigate the risk of the open web closing due to restrictions on access to curb AI crawlers?
3. How can technical standards and legal frameworks be updated to reflect the evolving needs of organizations hosting common content?
4. What are the effects of increased synthetic content in open knowledge databases, and how can we ensure their integrity?
5. How can we account for and distribute the infrastructural and environmental costs of providing data for AI training?
We emphasize the need for more responsible practices in AI development, recognizing the digital commons not only as content but as a collaborative and decentralized form of knowledge governance, which relies on the practice of "commoning" - making, maintaining, and protecting shared and open resources. Ultimately, our goal is to stimulate discussion and research on the intersection of Generative AI and the digital commons, with the aim of developing an "AI commons" and public infrastructures for AI development that support the long-term health of the digital commons.
10. An Online Multi-dimensional Knapsack Approach for Slice Admission Control
Authors: Jesutofunmi Ajayi, Antonio Di Maio, Torsten Braun, Dimitrios Xenakis β€’ Published: 2025-08-08 β€’ Source: arXiv
Network Slicing has emerged as a powerful technique to enable cost-effective, multi-tenant communications and services over a shared physical mobile network infrastructure. One major challenge of service provisioning in slice-enabled networks is the uncertainty in the demand for the limited network resources that must be shared among existing slices and potentially new Network Slice Requests. In this paper, we consider admission control of Network Slice Requests in an online setting, with the goal of maximizing the long-term revenue received from admitted requests. We model the Slice Admission Control problem as an Online Multidimensional Knapsack Problem and present two reservation-based policies and their algorithms, which have a competitive performance for Online Multidimensional Knapsack Problems. Through Monte Carlo simulations, we evaluate the performance of our online admission control method in terms of average revenue gained by the Infrastructure Provider, system resource utilization, and the ratio of accepted slice requests. We compare our approach with that of the online First Come First Serve greedy policy. The simulation results show that our proposed online policies increase revenues for Infrastructure Providers by up to 12.9% while reducing the average resource consumption by up to 1.7%. In particular, when the tenants' economic inequality increases, an Infrastructure Provider who adopts our proposed online admission policies gains higher revenues compared to an Infrastructure Provider who adopts First Come First Serve.
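
The abstract doesn't specify the reservation policies, so here is a minimal sketch of the general reservation idea for online multidimensional knapsacks: the admission threshold rises with utilization, so remaining capacity is effectively reserved for high-revenue requests. The exponential threshold shape and revenue-density rule are assumptions, not the paper's exact algorithm.

```python
class ReservationAdmission:
    """Online admission control over multidimensional resources.

    Accepts a request only if its revenue per unit of bottleneck demand
    exceeds a threshold that grows exponentially with utilization.
    (Illustrative policy shape, not the paper's exact algorithm.)
    """
    def __init__(self, capacity, price_lo=1.0, price_hi=10.0):
        self.capacity = list(capacity)
        self.used = [0.0] * len(capacity)
        self.lo, self.hi = price_lo, price_hi

    def threshold(self):
        util = max(u / c for u, c in zip(self.used, self.capacity))
        return self.lo * (self.hi / self.lo) ** util   # exponential pricing curve

    def offer(self, revenue, demand):
        fits = all(u + d <= c for u, d, c in zip(self.used, demand, self.capacity))
        value_density = revenue / max(demand)          # revenue per bottleneck unit
        if fits and value_density >= self.threshold():
            self.used = [u + d for u, d in zip(self.used, demand)]
            return True
        return False

adm = ReservationAdmission(capacity=[100, 100, 100])
print(adm.offer(revenue=50, demand=[10, 5, 8]))        # early, cheap: admitted
```
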
11. Characterization and automated optimization of laser-driven proton beams from converging liquid sheet jet targets
Authors: G. D. Glenn, F. Treffert, H. Ahmed, S. Astbury, M. Borghesi, N. Bourgeois, C. B. Curry, S. J. D. Dann, S. DiIorio, N. P. Dover, T. Dzelzainis, O. Ettlinger, M. Gauthier, L. Giuffrida, R. J. Gray, J. S. Green, G. S. Hicks, C. Hyland, V. Istokskaia, M. King, B. Loughran, D. Margarone, O. McCusker, P. McKenna, Z. Najmudin, C. ParisuaΓ±a, P. Parsons, C. Spindloe, M. J. V. Streeter, D. R. Symes, A. G. R. Thomas, N. Xu, S. H. Glenzer, C. A. J. Palmer β€’ Published: 2025-08-08 β€’ Source: arXiv
Compact, stable, and versatile laser-driven ion sources hold great promise for applications ranging from medicine to materials science and fundamental physics. While single-shot sources have demonstrated favorable beam properties, including the peak fluxes necessary for several applications, high repetition rate operation will be necessary to generate and sustain the high average flux needed for many of the most exciting applications of laser-driven ion sources. Further, to navigate through the high-dimensional space of laser and target parameters towards experimental optima, it is essential to develop ion acceleration platforms compatible with machine learning techniques and capable of autonomous real-time optimization. Here we present a multi-Hz ion acceleration platform employing a liquid sheet jet target. We characterize the laser-plasma interaction and the laser-driven proton beam across a variety of key parameters governing the interaction using an extensive suite of online diagnostics. We also demonstrate real-time, closed-loop optimization of the ion beam maximum energy by tuning the laser wavefront using a Bayesian optimization scheme. This approach increased the maximum proton energy by 11% compared to a manually-optimized wavefront by enhancing the energy concentration within the laser focal spot, demonstrating the potential for closed-loop optimization schemes to tune future ion accelerators for robust high repetition rate operation.
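
A minimal sketch of the closed-loop optimization pattern, using scikit-optimize's gp_minimize as a stand-in for the experiment's optimizer. The objective below is a synthetic placeholder for the real shot-and-measure loop, and the coefficient bounds are assumptions.

```python
from skopt import gp_minimize

def neg_proton_energy(coeffs):
    """Placeholder for one laser shot plus spectrometer readout.

    In the experiment this would set deformable-mirror wavefront
    coefficients, fire, and return the (negated) measured maximum proton
    energy. Here: a synthetic bowl whose 'energy' peaks at 0.3 everywhere.
    """
    return sum((c - 0.3) ** 2 for c in coeffs)

# Bayesian optimization over 4 wavefront coefficients, 30 "shots".
result = gp_minimize(
    neg_proton_energy,
    dimensions=[(-1.0, 1.0)] * 4,   # coefficient bounds (assumed)
    n_calls=30,
    random_state=0,
)
print("best coefficients:", result.x, "best objective:", result.fun)
```

The Gaussian-process surrogate trades off exploring untested wavefronts against exploiting known good ones, which is what makes the approach shot-efficient on a multi-Hz system.
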
12. A literature-derived dataset of migration barriers for quantifying ionic transport in battery materials
Authors: Reshma Devi, Avaneesh Balasubramanian, Keith T. Butler, Gopalakrishnan Sai Gautam β€’ Published: 2025-08-08 β€’ Source: arXiv
The rate performance of any electrode or solid electrolyte material used in a battery is critically dependent on the migration barrier ($E_m$) governing the motion of the intercalant ion, which is a difficult-to-estimate quantity both experimentally and computationally. The foundation for constructing and validating accurate machine learning (ML) models that are capable of predicting $E_m$, and hence accelerating the discovery of novel electrodes and solid electrolytes, lies in the availability of high-quality dataset(s) containing $E_m$. Addressing this critical requirement, we present a comprehensive dataset comprising 619 distinct literature-reported $E_m$ values calculated using density functional theory based nudged elastic band computations, across 443 compositions and 27 structural groups consisting of various compounds that have been explored as electrodes or solid electrolytes in batteries. Our dataset includes compositions that correspond to fully charged and/or discharged states of electrode materials, with intermediate compositions incorporated in select instances. Crucially, for each compound, our dataset provides structural information, including the initial and final positions of the migrating ion, along with its corresponding $E_m$ in easy-to-use .xlsx and JSON formats. We envision our dataset to be a highly useful resource for the scientific community, facilitating the development of advanced ML models that can predict $E_m$ precisely and accelerate materials discovery.
13. Round Aztec windows, a dual of the Aztec diamond theorem and a curious symmetry of the correlation of diagonal slits
Authors: Mihai Ciucu β€’ Published: 2025-08-08 β€’ Source: arXiv
Fairly shortly after the publication of the Aztec diamond theorem of Elkies, Kuperberg, Larsen and Propp in 1992, interest arose in finding the number of domino tilings of an Aztec diamond with an "Aztec window," i.e. a hole in the shape of a smaller Aztec diamond at its center. Several intriguing patterns were discovered for the number of tilings of such regions, but the numbers themselves were not "round" -- they didn't seem to be given by a simple product formula. In this paper we consider a very closely related shape of holes (namely, odd Aztec rectangles), and prove that a large variety of regions obtained from Aztec rectangles by making such holes in them possess the sought-after property that the number of their domino tilings is given by a simple product formula. We find the same to be true for certain symmetric cruciform regions. We also consider graphs obtained from a toroidal Aztec diamond by making such holes in them, and prove a simple formula that governs the way the number of their perfect matchings changes under a natural evolution of the holes. This yields in particular a natural dual of the Aztec diamond theorem. Some implications for the correlation of such holes are also presented, including an unexpected symmetry for the correlation of diagonal slits on the square grid.
14. eSASRec: Enhancing Transformer-based Recommendations in a Modular Fashion
Authors: Daria Tikhonovich, Nikita Zelinskiy, Aleksandr V. Petrov, Mayya Spirina, Andrei Semenov, Andrey V. Savchenko, Sergei Kuliev β€’ Published: 2025-08-08 β€’ Source: arXiv
Since their introduction, Transformer-based models, such as SASRec and BERT4Rec, have become common baselines for sequential recommendations, surpassing earlier neural and non-neural methods. A number of subsequent publications have shown that the effectiveness of these models can be improved by, for example, slightly updating the architecture of the Transformer layers, using better training objectives, and employing improved loss functions. However, the additivity of these modular improvements has not been systematically benchmarked - this is the gap we aim to close in this paper. Through our experiments, we identify a very strong model that uses SASRec's training objective, LiGR Transformer layers, and Sampled Softmax Loss. We call this combination eSASRec (Enhanced SASRec). While we primarily focus on realistic, production-like evaluation, in our preliminary study we find that common academic benchmarks show eSASRec to be 23% more effective compared to the most recent state-of-the-art models, such as ActionPiece. In our main production-like benchmark, eSASRec resides on the Pareto frontier in terms of the accuracy-coverage tradeoff (alongside the recent industrial models HSTU and FuXi). As the modifications compared to the original SASRec are relatively straightforward and no extra features are needed (such as timestamps in HSTU), we believe that eSASRec can be easily integrated into existing recommendation pipelines and can serve as a strong yet very simple baseline for emerging complicated algorithms. To facilitate this, we provide open-source implementations of our models and benchmarks in the repository https://github.com/blondered/transformer_benchmark
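
Sampled softmax is the most reusable ingredient named here; below is a minimal PyTorch sketch with uniform negatives. eSASRec's exact sampling scheme and any logit correction terms may differ, and collisions between negatives and the positive are ignored for brevity.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_emb, item_emb, pos_items, n_neg=128):
    """Cross-entropy over the positive item plus uniformly sampled negatives.

    user_emb:  (B, d) sequence-model outputs at prediction positions
    item_emb:  (N, d) full item embedding table
    pos_items: (B,)   ground-truth next-item indices
    """
    B, N = pos_items.size(0), item_emb.size(0)
    neg_items = torch.randint(0, N, (B, n_neg), device=user_emb.device)
    candidates = torch.cat([pos_items.unsqueeze(1), neg_items], dim=1)   # (B, 1+n_neg)
    logits = torch.einsum("bd,bkd->bk", user_emb, item_emb[candidates])  # dot products
    labels = torch.zeros(B, dtype=torch.long, device=user_emb.device)    # positive at index 0
    return F.cross_entropy(logits, labels)
```

Scoring only 1+n_neg candidates per example instead of the full catalog is what keeps training tractable when the item vocabulary is large.
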
15. SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
Authors: Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang β€’ Published: 2025-08-08 β€’ Source: arXiv
Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excess tokens, even including critical ones, are pruned from the hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant hidden-state tokens at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to 2.53× time-to-first-token (TTFT) speedup and 1.88× end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.
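
A minimal sketch of the layer-wise idea: between layers, drop the least-attended hidden-state tokens. The attention-received importance score and the protected tail are assumptions for illustration, not SlimInfer's exact criterion.

```python
import torch

def prune_hidden_states(hidden, attn_weights, keep_ratio=0.5, protect_last=16):
    """Drop the least-attended prompt tokens from a layer's hidden states.

    hidden:       (B, T, d) hidden states entering the next layer
    attn_weights: (B, heads, T, T) attention from the current layer
    Keeps the top keep_ratio fraction of tokens by total attention received,
    always protecting the most recent protect_last tokens (needed to decode).
    """
    B, T, _ = hidden.shape
    importance = attn_weights.sum(dim=(1, 2))        # (B, T): attention each token receives
    importance[:, -protect_last:] = float("inf")     # never prune the tail
    k = min(T, max(protect_last, int(T * keep_ratio)))
    keep = importance.topk(k, dim=1).indices.sort(dim=1).values   # preserve token order
    return hidden.gather(1, keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
```

Because later layers then operate on a shorter sequence, both attention cost and the KV cache footprint shrink for the remainder of the forward pass.
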
16. The Fair Game: Auditing & Debiasing AI Algorithms Over Time
Authors: Debabrota Basu, Udvas Das β€’ Published: 2025-08-08 β€’ Source: arXiv
An emerging field of AI, namely Fair Machine Learning (ML), aims to quantify different types of bias (also known as unfairness) exhibited in the predictions of ML algorithms, and to design new algorithms to mitigate them. Often, the definitions of bias used in the literature are observational, i.e., they use the input and output of a pre-trained algorithm to quantify a bias under concern. In reality, these definitions are often conflicting in nature and can only be deployed if either the ground truth is known or only in retrospect after deploying the algorithm. Thus, there is a gap between what we want Fair ML to achieve and what it does in a dynamic social environment. Hence, we propose an alternative dynamic mechanism, "Fair Game", to assure fairness in the predictions of an ML algorithm and to adapt its predictions as society interacts with the algorithm over time. "Fair Game" puts an Auditor and a Debiasing algorithm in a loop around an ML algorithm, leveraging Reinforcement Learning (RL). RL algorithms interact with an environment to take decisions, which yields new observations (also known as data/feedback) from the environment and, in turn, adapts future decisions. RL is already used in algorithms with pre-fixed long-term fairness goals. "Fair Game" provides a unique framework where the fairness goals can be adapted over time by only modifying the auditor and the different biases it quantifies. Thus, "Fair Game" aims to simulate the evolution of ethical and legal frameworks in society by creating an auditor which sends feedback to a debiasing algorithm deployed around an ML system. This allows us to develop a flexible and adaptive-over-time framework to build Fair ML systems pre- and post-deployment.
17. Accelerating Quantum Monte Carlo Calculations with Set-Equivariant Architectures and Transfer Learning
Authors: Manuel Gallego, SebastiΓ‘n Roca-Jerat, David Zueco, JesΓΊs Carrete β€’ Published: 2025-08-08 β€’ Source: arXiv
Machine-learning (ML) ansätze have greatly expanded the accuracy and reach of variational quantum Monte Carlo (QMC) calculations, in particular when exploring the manifold quantum phenomena exhibited by spin systems. However, the scalability of QMC is still compromised by several other bottlenecks, and specifically those related to the actual evaluation of observables based on random deviates that lies at the core of the approach. Here we show how the set-transformer architecture can be used to dramatically accelerate or even bypass that step, especially for time-consuming operators such as powers of the magnetization. We illustrate the procedure with a range of examples of increasing complexity, from the classical Ising model to quantum systems with long-range interactions, and comprising both regressions (to predict observables) and classifications (to detect phase transitions). Moreover, we show how transfer learning can be leveraged to reduce the training cost by reusing knowledge from different systems and smaller system sizes.
18. Multiorbital character of the density wave instability in La$_4$Ni$_3$O$_{10}$
Authors: A. Suthar, V. Sundaramurthy, M. Bejas, Congcong Le, P. Puphal, P. Sosa-Lizama, A. Schulz, J. Nuss, M. Isobe, P. A. van Aken, Y. E. Suyolcu, M. Minola, A. P. Schnyder, Xianxin Wu, B. Keimer, G. Khaliullin, A. Greco, M. Hepting β€’ Published: 2025-08-08 β€’ Source: arXiv
Ruddlesden-Popper nickelates exhibit high-temperature superconductivity closely intertwined with charge and spin density wave order. However, fundamental questions persist regarding the interplay between the associated density wave (DW) fluctuations and superconductivity, as well as the orbital character and symmetry underlying the DW instabilities. Here we utilize polarized Raman scattering to investigate the phononic and electronic Raman responses of the trilayer nickelate La$_4$Ni$_3$O$_{10}$ across its concomitant charge and spin density wave transitions. In addition to distinct phonon anomalies occurring below the transition temperature, we observe a depletion of continuum spectral weight up to 114 meV and a pronounced peak centered at this energy. By combining momentum-selective information from polarized electronic Raman scattering with model calculations involving both Ni-3$d_{x^2 - y^2}$ and Ni-3$d_{z^2}$ orbitals, we identify 114 meV as the energy scale $2\Delta_\mathrm{DW}$ of the DW gap, characterized by incoherent opening and non-mean-field behavior. Furthermore, the model calculations reveal that the corresponding $2\Delta_\mathrm{DW}$ peak exhibits a multiorbital origin, thus shedding light on the nature of the DW instabilities in La$_4$Ni$_3$O$_{10}$.
19. Collective heat engines via different interactions: Minimal models, thermodynamics and phase transitions
Authors: Iago N. Mamede, VitΓ³ria T. Henkes, Carlos E. Fiore β€’ Published: 2025-08-08 β€’ Source: arXiv
We investigate the dynamics and thermodynamics of a framework composed of interacting units in which parameters (temperatures and energies) assume distinct values due to contact with distinct (cold and hot) thermal reservoirs. We examine the influence of different ingredients, such as the contact with thermal baths (simultaneous versus non-simultaneous contact), the coupling between them (equal or different couplings), and the topology of interactions (all-to-all and local interactions). Closed expressions for transition lines have been obtained, expressed by a linear combination of interaction energies times reciprocal temperatures for simultaneous contact with the thermal baths, and deviating from this form when the contact is not simultaneous. The interplay between performance and dissipation is investigated under different conditions, giving rise to a rich set of operation regimes, such as heat engine and heat pump. The relationship between thermodynamic quantities (power, efficiency and dissipation) allows a careful choice of parameters to ensure the desired compromise between them. Finally, the influence of different interaction energies (Ising and Potts versus Blume-Emery-Griffiths (BEG)-like) is investigated, revealing that Potts interactions in general present superior performance to BEG ones.
20. CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
Authors: Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li β€’ Published: 2025-08-08 β€’ Source: arXiv
Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
21. MotionSwap
Authors: Om Patil, Jinesh Modi, Suryabha Mukhopadhyay, Meghaditya Giri, Chhavi Malhotra β€’ Published: 2025-08-08 β€’ Source: arXiv
Face swapping technology has gained significant attention in both academic research and commercial applications. This paper presents our implementation and enhancement of SimSwap, an efficient framework for high fidelity face swapping. We introduce several improvements to the original model, including the integration of self and cross-attention mechanisms in the generator architecture, dynamic loss weighting, and cosine annealing learning rate scheduling. These enhancements lead to significant improvements in identity preservation, attribute consistency, and overall visual quality. Our experimental results, spanning 400,000 training iterations, demonstrate progressive improvements in generator and discriminator performance. The enhanced model achieves better identity similarity, lower FID scores, and visibly superior qualitative results compared to the baseline. Ablation studies confirm the importance of each architectural and training improvement. We conclude by identifying key future directions, such as integrating StyleGAN3, improving lip synchronization, incorporating 3D facial modeling, and introducing temporal consistency for video-based applications.
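
Cosine annealing is a standard scheduler, so the training-loop wiring is easy to make concrete; a minimal PyTorch sketch follows, with the model, learning rates, and loss as placeholder assumptions rather than MotionSwap's actual configuration.

```python
import torch

generator = torch.nn.Linear(512, 512)   # stand-in for the SimSwap-style generator
opt = torch.optim.Adam(generator.parameters(), lr=4e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=400_000, eta_min=1e-6)

for step in range(5):                    # the paper trains for 400k iterations
    loss = generator(torch.randn(8, 512)).pow(2).mean()   # placeholder loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                         # LR decays along a cosine curve to eta_min
    print(f"step {step}: lr={sched.get_last_lr()[0]:.6f}")
```

The cosine curve keeps the learning rate high early for fast progress and tapers it smoothly late in training, which tends to stabilize GAN fine-tuning.
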
22. Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation
Authors: Youguang Xing, Xu Luo, Junlin Xie, Lianli Gao, Hengtao Shen, Jingkuan Song β€’ Published: 2025-08-08 β€’ Source: arXiv
Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. In this paper, we investigate the underlying cause of this limited generalization capability. We identify shortcut learning -- the reliance on task-irrelevant features -- as a key impediment to generalization. Through comprehensive theoretical and empirical analysis, we uncover two primary contributors to shortcut learning: (1) limited diversity within individual sub-datasets, and (2) significant distributional disparities across sub-datasets, leading to dataset fragmentation. These issues arise from the inherent structure of large-scale datasets like OXE, which are typically composed of multiple sub-datasets collected independently across varied environments and embodiments. Our findings provide critical insights into dataset collection strategies that can reduce shortcut learning and enhance the generalization ability of generalist robot policies. Moreover, in scenarios where acquiring new large-scale data is impractical, we demonstrate that carefully selected robotic data augmentation strategies can effectively reduce shortcut learning in existing offline datasets, thereby improving generalization capabilities of generalist robot policies, e.g., $\pi_0$, in both simulation and real-world environments. More information at https://lucky-light-sun.github.io/proj/shortcut-learning-in-grps/.
23. Quantifying Conversation Drift in MCP via Latent Polytope
Authors: Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang β€’ Published: 2025-08-08 β€’ Source: arXiv
The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule-based filters or LLM-driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. We evaluate SecMCP on three state-of-the-art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope-based methodology for quantifying conversation drift, and empirical validation of SecMCP's efficacy.
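
The latent polytope construction is specific to the paper; as a rough illustration of the underlying idea of scoring how far a turn's activations drift from a benign baseline, here is a sketch that deliberately substitutes a simpler Mahalanobis-distance test.

```python
import numpy as np

def drift_score(activation, baseline):
    """Score how far one turn's activation vector drifts from benign turns.

    Simplified stand-in for SecMCP's latent-polytope test: drift here is
    the Mahalanobis distance from the baseline activation distribution.
    """
    mu = baseline.mean(axis=0)
    cov = np.cov(baseline, rowvar=False) + 1e-6 * np.eye(baseline.shape[1])
    d = activation - mu
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

baseline = np.random.randn(200, 32)     # activations from benign conversation turns
suspect = np.random.randn(32) + 3.0     # a turn shifted by injected content
print(drift_score(suspect, baseline))   # a large score flags potential hijacking
```

A threshold on this score would then trigger review of the retrieved tool content, without needing static signatures of known attacks.
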
24. A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery
Authors: Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus β€’ Published: 2025-08-08 β€’ Source: arXiv
High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.
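
The key mechanism is a composite objective; a minimal PyTorch sketch coupling pixel fidelity with a classifier's cross-entropy on the super-resolved image follows. The L1 term and the weighting lam are assumptions, not the paper's exact loss functions.

```python
import torch
import torch.nn.functional as F

def sr_with_classification_loss(sr_net, classifier, lr_img, hr_img, label, lam=0.1):
    """Composite objective: pixel fidelity plus downstream classification.

    Gradients from the (typically frozen) classifier flow back into the SR
    network, steering reconstructions toward class-discriminative detail.
    """
    sr_img = sr_net(lr_img)
    pixel_loss = F.l1_loss(sr_img, hr_img)                  # image-quality term
    cls_loss = F.cross_entropy(classifier(sr_img), label)   # task-performance term
    return pixel_loss + lam * cls_loss
```

The single scalar lam is the knob that trades pure reconstruction quality against downstream accuracy on the ship-classification task.
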
25. Blockchain-Enabled Federated Learning
Authors: Murtaza Rangwala, Venugopal K R, Rajkumar Buyya β€’ Published: 2025-08-08 β€’ Source: arXiv
Blockchain-enabled federated learning (BCFL) addresses fundamental challenges of trust, privacy, and coordination in collaborative AI systems. This chapter provides comprehensive architectural analysis of BCFL systems through a systematic four-dimensional taxonomy examining coordination structures, consensus mechanisms, storage architectures, and trust models. We analyze design patterns from blockchain-verified centralized coordination to fully decentralized peer-to-peer networks, evaluating trade-offs in scalability, security, and performance. Through detailed examination of consensus mechanisms designed for federated learning contexts, including Proof of Quality and Proof of Federated Learning, we demonstrate how computational work can be repurposed from arbitrary cryptographic puzzles to productive machine learning tasks. The chapter addresses critical storage challenges by examining multi-tier architectures that balance blockchain's transaction constraints with neural networks' large parameter requirements while maintaining cryptographic integrity. A technical case study of the TrustMesh framework illustrates practical implementation considerations in BCFL systems through distributed image classification training, demonstrating effective collaborative learning across IoT devices with highly non-IID data distributions while maintaining complete transparency and fault tolerance. Analysis of real-world deployments across healthcare consortiums, financial services, and IoT security applications validates the practical viability of BCFL systems, achieving performance comparable to centralized approaches while providing enhanced security guarantees and enabling new models of trustless collaborative intelligence.
26. Chain-of-Alpha: Unleashing the Power of Large Language Models for Alpha Mining in Quantitative Trading
Authors: Lang Cao, Zekun Xi, Long Liao, Ziwei Yang, Zheng Cao β€’ Published: 2025-08-08 β€’ Source: arXiv
Alpha factor mining is a fundamental task in quantitative trading, aimed at discovering interpretable signals that can predict asset returns beyond systematic market risk. While traditional methods rely on manual formula design or heuristic search with machine learning, recent advances have leveraged Large Language Models (LLMs) for automated factor discovery. However, existing LLM-based alpha mining approaches remain limited in terms of automation, generality, and efficiency. In this paper, we propose Chain-of-Alpha, a novel, simple, yet effective and efficient LLM-based framework for fully automated formulaic alpha mining. Our method features a dual-chain architecture, consisting of a Factor Generation Chain and a Factor Optimization Chain, which iteratively generate, evaluate, and refine candidate alpha factors using only market data, while leveraging backtest feedback and prior optimization knowledge. The two chains work synergistically to enable high-quality alpha discovery without human intervention and offer strong scalability. Extensive experiments on real-world A-share benchmarks demonstrate that Chain-of-Alpha outperforms existing baselines across multiple metrics, presenting a promising direction for LLM-driven quantitative research.
27. Automatic Semantic Alignment of Flow Pattern Representations for Exploration with Large Language Models
Authors: Weihan Zhang, Jun Tao β€’ Published: 2025-08-08 β€’ Source: arXiv
Explorative flow visualization allows domain experts to analyze complex flow structures by interactively investigating flow patterns. However, traditional visual interfaces often rely on specialized graphical representations and interactions, which require additional effort to learn and use. Natural language interaction offers a more intuitive alternative, but teaching machines to recognize diverse scientific concepts and extract corresponding structures from flow data poses a significant challenge. In this paper, we introduce an automated framework that aligns flow pattern representations with the semantic space of large language models (LLMs), eliminating the need for manual labeling. Our approach encodes streamline segments using a denoising autoencoder and maps the generated flow pattern representations to LLM embeddings via a projector layer. This alignment empowers semantic matching between textual embeddings and flow representations through an attention mechanism, enabling the extraction of corresponding flow patterns based on textual descriptions. To enhance accessibility, we develop an interactive interface that allows users to query and visualize flow structures using natural language. Through case studies, we demonstrate the effectiveness of our framework in enabling intuitive and intelligent flow exploration.
28. Large Language Model Data Generation for Enhanced Intent Recognition in German Speech
Authors: Theresa Pekarek Rosin, Burak Can Kaplan, Stefan Wermter β€’ Published: 2025-08-08 β€’ Source: arXiv
Intent recognition (IR) for speech commands is essential for artificial intelligence (AI) assistant systems; however, most existing approaches are limited to short commands and are predominantly developed for English. This paper addresses these limitations by focusing on IR from speech by elderly German speakers. We propose a novel approach that combines an adapted Whisper ASR model, fine-tuned on elderly German speech (SVC-de), with Transformer-based language models trained on synthetic text datasets generated by three well-known large language models (LLMs): LeoLM, Llama3, and ChatGPT. To evaluate the robustness of our approach, we generate synthetic speech with a text-to-speech model and conduct extensive cross-dataset testing. Our results show that synthetic LLM-generated data significantly boosts classification performance and robustness to different speaking styles and unseen vocabulary. Notably, we find that LeoLM, a smaller, domain-specific 13B LLM, surpasses the much larger ChatGPT (175B) in dataset quality for German intent recognition. Our approach demonstrates that generative AI can effectively bridge data gaps in low-resource domains. We provide detailed documentation of our data generation and training process to ensure transparency and reproducibility.
29. Efficient Deep Neural Receiver with Post-Training Quantization
Authors: SaiKrishna Saketh Yellapragada, Esa Ollila, Mario Costa β€’ Published: 2025-08-08 β€’ Source: arXiv
Deep learning has recently garnered significant interest in wireless communications due to its superior performance compared to traditional model-based algorithms. Deep convolutional neural networks (CNNs) have demonstrated notable improvements in block error rate (BLER) under various channel models and mobility scenarios. However, the high computational complexity and resource demands of deep CNNs pose challenges for deployment in resource-constrained edge systems. The 3rd Generation Partnership Project (3GPP) Release 20 highlights the pivotal role of artificial intelligence (AI) integration in enabling advanced radio-access networks for 6G systems. The hard real-time processing demands of 5G and 6G require efficient techniques such as post-training quantization (PTQ), quantization-aware training (QAT), pruning, and hybrid approaches to meet latency requirements. In this paper, we focus on PTQ to reduce model complexity by lowering the bit-width of weights, thereby enhancing computational efficiency. Our analysis employs symmetric uniform quantization, applying both per-tensor and per-channel PTQ to a neural receiver achieving performance comparable to full-precision models. Specifically, 8-bit per-channel quantization maintains BLER performance with minimal degradation, while 4-bit quantization shows great promise but requires further optimization to achieve target BLER levels. These results highlight the potential of ultra-low bitwidth PTQ for efficient neural receiver deployment in 6G systems.
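
Symmetric uniform PTQ as described is standard; a minimal NumPy sketch of per-channel weight quantization follows, though the receiver's actual quantizer configuration may differ.

```python
import numpy as np

def quantize_per_channel(weights, bits=8):
    """Symmetric uniform post-training quantization, one scale per output channel.

    weights: (out_channels, ...) float array. Returns integer weights and the
    per-channel scales needed to dequantize (w is approximately q * scale).
    """
    qmax = 2 ** (bits - 1) - 1                       # e.g. 127 for 8-bit
    flat = weights.reshape(weights.shape[0], -1)
    scale = np.abs(flat).max(axis=1) / qmax          # per-channel scale factor
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero channels
    q = np.clip(np.round(flat / scale[:, None]), -qmax - 1, qmax).astype(np.int8)
    return q.reshape(weights.shape), scale

w = np.random.randn(16, 3, 3, 3).astype(np.float32)  # conv-layer weights
q, s = quantize_per_channel(w, bits=8)
print("max error:", np.abs(w.reshape(16, -1) - q.reshape(16, -1) * s[:, None]).max())
```

Per-channel scales adapt to each filter's dynamic range, which is why the paper finds 8-bit per-channel quantization nearly lossless while per-tensor and 4-bit settings degrade more.
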
30. SCAR: State-Space Compression for AI-Driven Resource Management in 6G-Enabled Vehicular Infotainment Systems
Authors: Ioan-Sorin Comsa, Purav Shah, Karthik Vaidhyanathan, Deepak Gangadharan, Christof Imhof, Per Bergamin, Aryan Kaushik, Gabriel-Miro Muntean, Ramona Trestian β€’ Published: 2025-08-08 β€’ Source: arXiv
The advent of 6G networks opens new possibilities for connected infotainment services in vehicular environments. However, traditional Radio Resource Management (RRM) techniques struggle with the increasing volume and complexity of data such as Channel Quality Indicators (CQI) from autonomous vehicles. To address this, we propose SCAR (State-Space Compression for AI-Driven Resource Management), an Edge AI-assisted framework that optimizes scheduling and fairness in vehicular infotainment. SCAR employs ML-based compression techniques (e.g., clustering and RBF networks) to reduce CQI data size while preserving essential features. These compressed states are used to train 6G-enabled Reinforcement Learning policies that maximize throughput while meeting fairness objectives defined by the NGMN. Simulations show that SCAR increases time in feasible scheduling regions by 14% and reduces unfair scheduling time by 15% compared to RL baselines without CQI compression. Furthermore, Simulated Annealing with Stochastic Tunneling (SAST)-based clustering reduces CQI clustering distortion by 10%, confirming its efficiency. These results demonstrate SCAR's scalability and fairness benefits for dynamic vehicular networks.
31. Classification is a RAG problem: A case study on hate speech detection
Authors: Richard Willats, Josh Pennington, Aravind Mohan, Bertie Vidgen β€’ Published: 2025-08-08 β€’ Source: arXiv
Robust content moderation requires classification systems that can quickly adapt to evolving policies without costly retraining. We present classification using Retrieval-Augmented Generation (RAG), which shifts traditional classification tasks from determining the correct category in accordance with pre-trained parameters to evaluating content in relation to contextual knowledge retrieved at inference. In hate speech detection, this transforms the task from "is this hate speech?" to "does this violate the hate speech policy?" Our Contextual Policy Engine (CPE) - an agentic RAG system - demonstrates this approach and offers three key advantages: (1) robust classification accuracy comparable to leading commercial systems, (2) inherent explainability via retrieved policy segments, and (3) dynamic policy updates without model retraining. Through three experiments, we demonstrate strong baseline performance and show that the system can apply fine-grained policy control by correctly adjusting protection for specific identity groups without requiring retraining or compromising overall performance. These findings establish that RAG can transform classification into a more flexible, transparent, and adaptable process for content moderation and wider classification problems.
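
To make the reframing concrete, here is a minimal sketch of policy-grounded classification. The embed and llm callables and the prompt wording are placeholders, not the CPE's actual components.

```python
import numpy as np

def classify_against_policy(text, policy_clauses, embed, llm, k=3):
    """RAG-style classification: retrieve the most relevant policy clauses,
    then ask the model whether the content violates them.

    embed: text -> vector; llm: prompt -> str. Both are stand-ins for
    whatever embedding model and LLM a deployment uses.
    """
    q = embed(text)
    scores = []
    for clause in policy_clauses:
        c = embed(clause)
        scores.append(float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c))))
    top = sorted(zip(scores, policy_clauses), reverse=True)[:k]   # cosine top-k
    context = "\n".join(clause for _, clause in top)
    prompt = (f"Policy:\n{context}\n\nContent: {text}\n"
              "Does the content violate the policy above? "
              "Answer VIOLATES or ALLOWED, citing the clause.")
    return llm(prompt)   # label plus retrieved clauses as the explanation
```

Updating moderation behavior then amounts to editing policy_clauses, which is exactly the no-retraining property the paper highlights.
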
32. Clinically-guided Data Synthesis for Laryngeal Lesion Detection
Authors: Chiara Baldini, Kaisar Kushibar, Richard Osuala, Simone Balocco, Oliver Diaz, Karim Lekadir, Leonardo S. Mattos β€’ Published: 2025-08-08 β€’ Source: arXiv
Although computer-aided diagnosis (CADx) and detection (CADe) systems have made significant progress in various medical domains, their application is still limited in specialized fields such as otorhinolaryngology. In the latter, current assessment methods heavily depend on operator expertise, and the high heterogeneity of lesions complicates diagnosis, with biopsy persisting as the gold standard despite its substantial costs and risks. A critical bottleneck for specialized endoscopic CADx/e systems is the lack of well-annotated datasets with sufficient variability for real-world generalization. This study introduces a novel approach that exploits a Latent Diffusion Model (LDM) coupled with a ControlNet adapter to generate laryngeal endoscopic image-annotation pairs, guided by clinical observations. The method addresses data scarcity by conditioning the diffusion process to produce realistic, high-quality, and clinically relevant image features that capture diverse anatomical conditions. The proposed approach can be leveraged to expand training datasets for CADx/e models, empowering the assessment process in laryngology. Indeed, in a downstream detection task, the addition of only 10% synthetic data improved the detection rate of laryngeal lesions by 9% when the model was tested internally and by 22.1% on out-of-domain external data. Additionally, the realism of the generated images was evaluated by asking 5 expert otorhinolaryngologists with varying expertise to rate their confidence in distinguishing synthetic from real images. This work has the potential to accelerate the development of automated tools for laryngeal disease diagnosis, offering a solution to data scarcity and demonstrating the applicability of synthetic data in real-world scenarios.
33. Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime
Authors: Hugo Abonizio, Thales Almeida, Roberto Lotufo, Rodrigo Nogueira β€’ Published: 2025-08-08 β€’ Source: arXiv
Large language models (LLMs) often require vast amounts of text to effectively acquire new knowledge. While continuing pre-training on large corpora or employing retrieval-augmented generation (RAG) has proven successful, updating an LLM with only a few thousand or million tokens remains challenging. In this work, we investigate the task of injecting small, unstructured information into LLMs and its relation to the catastrophic forgetting phenomenon. We use a dataset of recent news -- ensuring no overlap with the model's pre-training data -- to evaluate the knowledge acquisition by probing the model with question-answer pairs related to the learned information. Starting from a continued pre-training baseline, we explored different augmentation algorithms to generate synthetic data to improve the knowledge acquisition capabilities. Our experiments show that simply continuing pre-training on limited data yields modest improvements, whereas exposing the model to diverse textual variations significantly improves the learning of new facts -- particularly with methods that induce greater variability through diverse prompting. Furthermore, we shed light on the forgetting phenomenon in small-data regimes, illustrating the delicate balance between learning new content and retaining existing capabilities. We also confirm the sensitivity of RAG-based approaches for knowledge injection, which often lead to greater degradation on control datasets compared to parametric methods. Finally, we demonstrate that models can generate effective synthetic training data themselves, suggesting a pathway toward self-improving model updates. All code and generated data used in our experiments are publicly available, providing a resource for studying efficient knowledge injection in LLMs with limited data at https://github.com/hugoabonizio/knowledge-injection-methods.
34. Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications
Authors: Byeonghun Bang, Jongsuk Yoon, Dong-Jin Chang, Seho Park, Yong Oh Lee β€’ Published: 2025-08-08 β€’ Source: arXiv
The versatility of large language models (LLMs) has been explored across various sectors, but their application in healthcare poses challenges, particularly in the domain of pharmaceutical contraindications where accurate and reliable information is required. This study enhances the capability of LLMs to address contraindications effectively by implementing a Retrieval Augmented Generation (RAG) pipeline. Utilizing OpenAI's GPT-4o-mini as the base model, and the text-embedding-3-small model for embeddings, our approach integrates Langchain to orchestrate a hybrid retrieval system with re-ranking. This system leverages Drug Utilization Review (DUR) data from public databases, focusing on contraindications for specific age groups, pregnancy, and concomitant drug use. The dataset includes 300 question-answer pairs across three categories, with baseline model accuracy ranging from 0.49 to 0.57. Post-integration of the RAG pipeline, we observed a significant improvement in model accuracy, achieving rates of 0.94, 0.87, and 0.89 for contraindications related to age groups, pregnancy, and concomitant drug use, respectively. The results indicate that augmenting LLMs with a RAG framework can substantially reduce uncertainty in prescription and drug intake decisions by providing more precise and reliable drug contraindication information.
35. Transformer-Based Explainable Deep Learning for Breast Cancer Detection in Mammography: The MammoFormer Framework
Authors: Ojonugwa Oluwafemi Ejiga Peter, Daniel Emakporuena, Bamidele Dayo Tunde, Maryam Abdulkarim, Abdullahi Bn Umar β€’ Published: 2025-08-08 β€’ Source: arXiv
Breast cancer detection through mammography interpretation remains difficult because of the minimal nature of abnormalities that experts need to identify alongside the variable interpretations between readers. The potential of CNNs for medical image analysis faces two limitations: they fail to process both local information and wide contextual data adequately, and do not provide explainable AI (XAI) operations that doctors need to accept them in clinics. The researchers developed the MammoFormer framework, which unites transformer-based architecture with multi-feature enhancement components and XAI functionalities within one framework. Seven different architectures consisting of CNNs, Vision Transformer, Swin Transformer, and ConvNext were tested alongside four enhancement techniques, including original images, negative transformation, adaptive histogram equalization, and histogram of oriented gradients. The MammoFormer framework addresses critical clinical adoption barriers of AI mammography systems through: (1) systematic optimization of transformer architectures via architecture-specific feature enhancement, achieving up to 13% performance improvement, (2) comprehensive explainable AI integration providing multi-perspective diagnostic interpretability, and (3) a clinically deployable ensemble system combining CNN reliability with transformer global context modeling. The combination of transformer models with suitable feature enhancements enables them to achieve equal or better results than CNN approaches. ViT achieves 98.3% accuracy alongside AHE, while Swin Transformer gains a 13.0% advantage through HOG enhancements.
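
The enhancement techniques named are standard image operations; here is a minimal scikit-image sketch of producing the four input variants fed to the backbones, with the HOG cell size and other parameters assumed rather than taken from the paper.

```python
import numpy as np
from skimage import exposure, feature

def enhance(mammogram):
    """Produce the four input variants from a grayscale mammogram in [0, 1]."""
    negative = 1.0 - mammogram                         # negative transformation
    ahe = exposure.equalize_adapthist(mammogram)       # adaptive histogram equalization
    hog_img = feature.hog(mammogram, pixels_per_cell=(16, 16),
                          visualize=True)[1]           # HOG rendered as an image
    return {"original": mammogram, "negative": negative,
            "ahe": ahe, "hog": hog_img}

views = enhance(np.random.rand(224, 224))              # stand-in for a real scan
print({name: img.shape for name, img in views.items()})
```

Matching each backbone to the enhancement it benefits from most (e.g., ViT with AHE, Swin with HOG per the reported results) is the architecture-specific pairing the framework exploits.
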