1. UniAPL: A Unified Adversarial Preference Learning Framework for Instruct-Following
Authors: FaQiang Qian, WeiKun Zhang, Ziliang Wang, Kang An, Xuhui Zheng, Liangjian Wen, Mengya Gao, Yong Dai, Yichao Wu
Published: 2025-09-29
Source: arXiv
Shaping powerful LLMs to be beneficial and safe is central to AI alignment. We argue that post-training alignment is fundamentally a unified Preference Learning problem, involving two modalities: demonstrated preferences (e.g., Supervised Fine-Tuning, SFT) and comparative preferences (e.g., Reinforcement Learning, RL). The standard sequential pipeline (SFT followed by RL) is flawed due to a critical distributional mismatch: SFT uses static expert data, but as the policy evolves, its generation distribution drifts, making SFT knowledge brittle. Subsequent RL then explores without direct access to the rich, ground-truth knowledge in expert demonstrations, leading to inefficient, ungrounded updates. This separation prevents mutual regularization between data sources. To address this, we reframe alignment as a constrained optimization problem and propose Unified Adversarial Preference Learning (UniAPL), a novel framework that dynamically aligns the policy's distribution with the expert's. UniAPL implements a single-stage unified training objective, jointly learning from mixed batches of SFT and preference data. In every gradient step, dense expert demonstrations directly ground and regularize online exploration, inherently resolving distributional mismatch and maximizing data synergy. We evaluate UniAPL on instruction-following tasks using Qwen3-235B-Instruct-2507 as the teacher. Our models match or exceed strong GRPO baselines: +5.77% on Qwen3-0.6B (matching a 32B model) and +3.75% on Qwen3-4B, even outperforming the teacher. Analyses of response length and log-probability distributions confirm that UniAPL outputs closely mimic expert demonstrations, achieving both stronger performance and better behavioral alignment.
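To make the mixed-batch objective concrete, here is a minimal PyTorch sketch of a single unified training step that combines an SFT likelihood term on expert demonstrations with a DPO-style comparative preference term; the specific loss form, the weighting, and all tensor names are illustrative assumptions rather than the authors' adversarial implementation.

    # Illustrative sketch (not the UniAPL code): one gradient step that mixes a
    # supervised fine-tuning (SFT) loss on expert demonstrations with a DPO-style
    # comparative preference loss, so both data modalities regularize each other.
    import torch
    import torch.nn.functional as F

    def unified_step(demo_logps, chosen_logps, rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1, sft_weight=1.0):
        """All inputs are per-sequence log-probabilities (summed over tokens)."""
        # Demonstrated preferences: maximize the likelihood of expert responses.
        sft_loss = -demo_logps.mean()
        # Comparative preferences: pairwise margin against a frozen reference model.
        margin = beta * ((chosen_logps - ref_chosen_logps)
                         - (rejected_logps - ref_rejected_logps))
        pref_loss = -F.logsigmoid(margin).mean()
        return sft_weight * sft_loss + pref_loss

    # Dummy mixed batch: 4 demonstrations and 4 preference pairs in the same step.
    demo = torch.randn(4, requires_grad=True)
    chosen = torch.randn(4, requires_grad=True)
    rejected = torch.randn(4, requires_grad=True)
    loss = unified_step(demo, chosen, rejected, torch.randn(4), torch.randn(4))
    loss.backward()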
2. Momentum-resolved two-dimensional spectroscopy as a probe of nonlinear quantum field dynamics
Authors: Duilio De Santis, Alex Gómez Salvador, Nataliia Bazhan, Sebastian Erne, Maximilian Prüfer, Claudio Guarcello, Davide Valenti, Jörg Schmiedmayer, Eugene Demler
Published: 2025-09-29
Source: arXiv
Emergent collective excitations constitute a hallmark of interacting quantum many-body systems, yet in solid-state platforms their study has been largely limited by the constraints of linear-response probes and by finite momentum resolution. We propose to overcome these limitations by combining the spatial resolution of ultracold atomic systems with the nonlinear probing capabilities of two-dimensional spectroscopy (2DS). As a concrete illustration, we analyze momentum-resolved 2DS of the quantum sine-Gordon model describing the low energy dynamics of two weakly coupled one-dimensional Bose-Einstein condensates. This approach reveals distinctive many-body signatures, most notably asymmetric cross-peaks reflecting the interplay between isolated ($B_2$ breather) and continuum ($B_1$ pair) modes. The protocol further enables direct characterization of anharmonicity and disorder, establishing momentum-resolved 2DS as both a powerful diagnostic for quantum simulators and a versatile probe of correlated quantum matter.
3. Investigating Language and Retrieval Bias in Multilingual Previously Fact-Checked Claim Detection
Authors: Ivan Vykopal, Antonia Karamolegkou, Jaroslav Kopčan, Qiwei Peng, Tomáš Javůrek, Michal Gregor, Marián Šimko
Published: 2025-09-29
Source: arXiv
Multilingual Large Language Models (LLMs) offer powerful capabilities for cross-lingual fact-checking. However, these models often exhibit language bias, performing disproportionately better on high-resource languages such as English than on low-resource counterparts. We also present and inspect a novel concept, retrieval bias, which arises when information retrieval systems favor certain information over other information, skewing the retrieval process. In this paper, we study language and retrieval bias in the context of Previously Fact-Checked Claim Detection (PFCD). We evaluate six open-source multilingual LLMs across 20 languages using a fully multilingual prompting strategy, leveraging the AMC-16K dataset. By translating task prompts into each language, we uncover disparities in monolingual and cross-lingual performance and identify key trends based on model family, size, and prompting strategy. Our findings highlight persistent bias in LLM behavior and offer recommendations for improving equity in multilingual fact-checking. To investigate retrieval bias, we employ multilingual embedding models and examine the frequency of retrieved claims. Our analysis reveals that certain claims are retrieved disproportionately across different posts, leading to inflated retrieval performance for popular claims while under-representing less common ones.
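As a concrete illustration of the retrieval-bias analysis, the following small Python sketch (ours, not the paper's code) counts how often each previously fact-checked claim appears in the retrieved lists across posts and summarizes the skew with a Gini coefficient; the claim IDs and retrieval results are hypothetical.

    # Illustrative sketch: quantify retrieval bias as the skew of how often each
    # fact-checked claim appears in the top-k retrieved lists across input posts.
    from collections import Counter
    import numpy as np

    def gini(counts):
        """Gini coefficient of a frequency vector (0 = uniform, 1 = maximally skewed)."""
        x = np.sort(np.asarray(counts, dtype=float))
        n = x.size
        cum = np.cumsum(x)
        return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

    # retrieved[i] is the list of claim IDs returned for post i (hypothetical data).
    retrieved = [["c1", "c2"], ["c1", "c3"], ["c1", "c2"], ["c4", "c1"]]
    freq = Counter(cid for hits in retrieved for cid in hits)
    print(freq.most_common(3), "Gini:", round(gini(list(freq.values())), 3))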
4. Loop-Level Double Copy Relations from Forward Limits
Authors: Qu Cao, Song He, Yong Zhang, Fan Zhu
Published: 2025-09-29
Source: arXiv
We study double copy relations for loop integrands in gauge theories and gravity based on their constructions from single cuts, which are in turn obtained from forward limits of lower-loop cases. While such a construction from forward limits has been realized for loop integrands in gauge theories, we demonstrate its extension to gravity by reconstructing one-loop gravity integrands from forward limits of trees. Under mild symmetry assumptions on tree-level kinematic numerators (and their forward limits), our method directly leads to double copy relations for one-loop integrands: these include the field-theoretic Kawai-Lewellen-Tye (KLT) relations, whose kernel is the inverse of a matrix with rank $(n{-}1)!$ formed by those in bi-adjoint $\phi^3$ theory, and the Bern-Carrasco-Johansson (BCJ) double copy relations with crossing-symmetric kinematic numerators (we provide local and crossing-symmetric Yang-Mills BCJ numerators for $n=3,4,5$ explicitly). By exploiting the "universal expansion" for one-loop integrands in generic gauge theories, we also obtain an analogous expansion for gravity (including supergravity theories).
5. From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones
Authors: Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng
Published: 2025-09-29
Source: arXiv
Does RL teach LLMs genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them h(x)=g(f(x)). Further, this compositional ability generalizes to more difficult problems such as compositions of >2 functions unseen during RL training. Surprisingly, our experiments show that compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target's atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training with the same data yields none of these findings. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.
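The following self-contained Python sketch shows the flavor of such a synthetic setup: atomic string transformations act as skills, and a task asks for an unseen composition g(f(x)). The particular functions and the prompt format are our illustrative assumptions, not the paper's exact framework.

    # Illustrative sketch of a skill-composition setup: atomic string transformations
    # are "skills", and a task asks the model for an unseen composition g(f(x)).
    import random

    ATOMIC = {
        "reverse": lambda s: s[::-1],
        "upper": lambda s: s.upper(),
        "rot1": lambda s: "".join(
            chr((ord(c) - 97 + 1) % 26 + 97) if c.islower() else c for c in s
        ),
    }

    def make_task(inner, outer, x):
        """Return a prompt/answer pair for the composition outer(inner(x))."""
        y = ATOMIC[outer](ATOMIC[inner](x))
        prompt = f"Apply {inner}, then {outer}, to the string '{x}'. What is the result?"
        return prompt, y

    random.seed(0)
    x = "".join(random.choices("abcdefghij", k=6))
    print(make_task("reverse", "rot1", x))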
6. gCAMB: A GPU-accelerated Boltzmann solver for next-generation cosmological surveys
Authors: L. Storchi, P. Campeti, M. Lattanzi, N. Antonini, E. Calore, P. Lubrano
Published: 2025-09-29
Source: arXiv
Inferring cosmological parameters from Cosmic Microwave Background (CMB) data requires repeated and computationally expensive calculations of theoretical angular power spectra using Boltzmann solvers like CAMB. This creates a significant bottleneck, particularly for non-standard cosmological models and the high-accuracy demands of future surveys. While emulators based on deep neural networks can accelerate this process by several orders of magnitude, they first require large, pre-computed training datasets, which are costly to generate and model-specific. To address this challenge, we introduce gCAMB, a version of the CAMB code ported to GPUs, which preserves all the features of the original CPU-only code. By offloading the most computationally intensive modules to the GPU, gCAMB significantly accelerates the generation of power spectra, saving substantial computational time, halving the power consumption in high-accuracy settings, and, among other benefits, facilitating the creation of the extensive training sets needed for robust cosmological analyses. We make the gCAMB software available to the community at https://github.com/lstorchi/CAMB/tree/gpuport.
7. Towards generalizable deep ptychography neural networks
Authors: Albert Vong, Steven Henke, Oliver Hoidn, Hanna Ruth, Junjing Deng, Alexander Hexemer, Apurva Mehta, Arianna Gleason, Levi Hancock, Nicholas Schwarz
Published: 2025-09-29
Source: arXiv
X-ray ptychography is a data-intensive imaging technique expected to become ubiquitous at next-generation light sources delivering many-fold increases in coherent flux. The need for real-time feedback under accelerated acquisition rates motivates surrogate reconstruction models like deep neural networks, which offer orders-of-magnitude speedup over conventional methods. However, existing deep learning approaches lack robustness across diverse experimental conditions. We propose an unsupervised training workflow emphasizing probe learning by combining experimentally measured probes with synthetic, procedurally generated objects. This probe-centric approach enables a single physics-informed neural network to reconstruct unseen experiments across multiple beamlines, which is among the first demonstrations of multi-probe generalization. We find that probe learning is as important as in-distribution learning; models trained using this synthetic workflow achieve reconstruction fidelity comparable to those trained exclusively on experimental data, even when changing the type of synthetic training object. The proposed approach enables training of experiment-steering models that provide real-time feedback under dynamic experimental conditions.
8. Equilibrium states for non relativistic Bose gases with condensation
Authors: Stefano Galanda, Nicola Pinamonti
Published: 2025-09-29
Source: arXiv
In this paper we present the construction of equilibrium states at positive temperature in the presence of a condensation phase for a gas of non-relativistic Bose particles on an infinite space interacting through a localised two-body interaction. We use methods of quantum field theory in the algebraic formulation to obtain this result, and, in order to prove convergence of the partition function and of the generating function of the correlation functions, we introduce an auxiliary stochastic Gaussian field which mediates the interaction of the Bose particles (Hubbard-Stratonovich transformation). The construction of the equilibrium state and of the partition function in the presence of the condensate, treating the auxiliary stochastic field as an external potential, can be achieved by using and adapting ideas and methods of Araki. Explicit formulas for the relative entropy of the equilibrium state with the external potential with respect to the equilibrium state of the free theory are obtained by adapting known Feynman-Kac formulas for the propagators of the theory. If the two-body interaction is sufficiently weak, the proof of the convergence of the partition function after evaluation of the external stochastic field on a suitable Gaussian state can be given using the properties of the relative entropy mentioned above. Limits where the localisation of the two-body interaction is removed are eventually discussed in combination with the limits of vanishing temperature and/or the weakly interacting regime.
9. Benchmarking ECG Foundational Models: A Reality Check Across Clinical Tasks
Authors: M A Al-Masud, Juan Miguel Lopez Alcaraz, Nils Strodthoff
Published: 2025-09-29
Source: arXiv
The 12-lead electrocardiogram (ECG) is a long-standing diagnostic tool. Yet machine learning for ECG interpretation remains fragmented, often limited to narrow tasks or datasets. Foundation models promise broader adaptability, but their generalization across diverse ECG tasks is not well understood. We benchmarked eight ECG foundation models on 26 clinically relevant tasks using 12 public datasets comprising 1,650 regression and classification targets. Models were evaluated under fine-tuning and frozen settings, with scaling analyses across dataset sizes. Results show heterogeneous performance across domains: in the most widely studied domain, adult ECG interpretation, three foundation models consistently outperformed strong supervised baselines. In contrast, ECG-CPC, a compact structured state-space model pretrained on HEEDB, dominated other categories where most foundation models failed to surpass supervised learning. Foundation models also displayed distinct scaling behaviors with dataset size, which are critical for small-scale clinical applications. Overall, while foundation models show promise for adult ECG analysis, substantial gaps remain in cardiac structure, outcome prediction, and patient characterization. Notably, ECG-CPC's strong performance despite being orders of magnitude smaller and consuming minimal computational resources highlights untapped opportunities for advancing ECG foundation models.
10. Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives
Authors: AmirHossein Zamani, Bruno Roy, Arianna Rampini
Published: 2025-09-29
Source: arXiv
Recent 3D generative models produce high-quality textures for 3D mesh objects. However, they commonly rely on the strong assumption that input 3D meshes are accompanied by a mesh parameterization (UV mapping), a manual task that requires both technical precision and artistic judgment. Industry surveys show that this process often accounts for a significant share of asset creation, creating a major bottleneck for 3D content creators. Moreover, existing automatic methods often ignore two perceptually important criteria: (1) semantic awareness (UV charts should align semantically similar 3D parts across shapes) and (2) visibility awareness (cutting seams should lie in regions unlikely to be seen). To overcome these shortcomings and to automate the mesh parameterization process, we present an unsupervised differentiable framework that augments standard geometry-preserving UV learning with semantic- and visibility-aware objectives. For semantic awareness, our pipeline (i) segments the mesh into semantic 3D parts, (ii) applies an unsupervised learned per-part UV-parameterization backbone, and (iii) aggregates per-part charts into a unified UV atlas. For visibility awareness, we use ambient occlusion (AO) as an exposure proxy and back-propagate a soft differentiable AO-weighted seam objective to steer cutting seams toward occluded regions. By conducting qualitative and quantitative evaluations against state-of-the-art methods, we show that the proposed method produces UV atlases that better support texture generation and reduce perceptible seam artifacts compared to recent baselines. Our implementation code is publicly available at: https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.
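A minimal PyTorch sketch of an AO-weighted seam penalty in the spirit described above follows; how the per-edge seam probabilities are produced, and the tensor layout, are assumptions made only for illustration.

    # Illustrative sketch: discourage UV cutting seams on exposed (visible) edges by
    # weighting a soft per-edge seam probability with (1 - ambient occlusion).
    import torch

    def ao_weighted_seam_loss(seam_logits, edge_ao):
        """seam_logits: per-edge scores from the UV model; edge_ao: ambient occlusion
        in [0, 1], where 1 means fully occluded (a safe place for a seam)."""
        seam_prob = torch.sigmoid(seam_logits)   # soft, differentiable seam indicator
        visibility = 1.0 - edge_ao               # exposed edges incur a higher cost
        return (seam_prob * visibility).mean()

    edges = 128
    loss = ao_weighted_seam_loss(torch.randn(edges, requires_grad=True), torch.rand(edges))
    loss.backward()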
11. From Dark Radiation to Dark Energy: Unified Cosmological Evolution in K-essence Models
Authors: Eladio Moreno, Josue De-Santiago
Published: 2025-09-29
Source: arXiv
We study a class of Unified Dark Matter (UDM) models based on generalized K-essence, where a single scalar field with non-canonical kinetic terms accounts for dark radiation, dark matter, and dark energy. Starting from the purely kinetic Lagrangian proposed by Scherrer (2004), we extend the analysis to quadratic and exponential scalar potentials and explore their phenomenology. All models are implemented in a modified version of \texttt{Hi\_CLASS} and confronted with data from \textit{Planck} 2018, DESI DR1, and Big Bang Nucleosynthesis. The scenarios reproduce the full sequence of cosmic epochs: an early radiation-like phase, a matter-dominated era, and late-time accelerated expansion. The new models predict slightly higher values of the Hubble constant compared to $\Lambda$CDM, thereby partially alleviating the respective tensions from $\sim 4.4 \sigma$ to $\sim 3.4 \sigma$. The quadratic potential requires an ultralight mass that makes it effectively indistinguishable from the Scherrer solution. Overall, generalized K-essence provides a minimal and observationally viable realization of UDM, offering a unified description of the dark sector with distinctive signatures in both early- and late-time cosmology.
12. Cause-and-effect approach to turbulence forecasting
Authors: Álvaro Martínez-Sánchez, Adrián Lozano-Durán
Published: 2025-09-29
Source: arXiv
Traditional approaches to turbulence forecasting often rely on correlation-based criteria for input selection. These methods may select variables that correlate with the target without truly driving its dynamics, which limits interpretability, generalization, and efficiency. In this work, we introduce a causality-based approach for input selection in turbulence forecasting based on the Synergistic-Unique-Redundant Decomposition (SURD) of causality. This method decomposes the information from candidate inputs into unique, redundant, and synergistic causal contributions and links them to the fundamental limits of predictive accuracy achievable by any model. In practice, we implement the approach using neural mutual-information estimators and demonstrate its application to wall-shear-stress forecasting from direct numerical simulation data of turbulent channel flow. Our findings show that input variables with strong unique or synergistic causal contributions enable compact forecasting models with high predictive power, whereas redundant variables can be excluded without degrading accuracy. We first validate these capabilities in two benchmark cases involving collider effects, and then apply the methodology to three turbulent flow configurations with different interaction types. In each case, we demonstrate how SURD causalities guide optimal input selection by constructing forecasting models based on various input combinations. We also compare the results with standard space-time correlation analysis and show that SURD provides a more reliable basis for input selection, as it captures nonlinear dependencies, distinguishes redundant, unique, and synergistic interactions, and remains invariant under invertible transformations of the variables. Overall, we believe this enables more interpretable and compact models by reducing input dimensionality without sacrificing performance.
13. Learning from Convenience Samples: A Case Study on Fine-Tuning LLMs for Survey Non-response in the German Longitudinal Election Study
Authors: Tobias Holtdirk, Dennis Assenmacher, Arnim Bleier, Claudia Wagner
Published: 2025-09-29
Source: arXiv
Survey researchers face two key challenges: the rising costs of probability samples and missing data (e.g., non-response or attrition), which can undermine inference and increase the use of convenience samples. Recent work explores using large language models (LLMs) to simulate respondents via persona-based prompts, often without labeled data. We study a more practical setting where partial survey responses exist: we fine-tune LLMs on available data to impute self-reported vote choice under both random and systematic nonresponse, using the German Longitudinal Election Study. We compare zero-shot prompting and supervised fine-tuning against tabular classifiers (e.g., CatBoost) and test how different convenience samples (e.g., students) used for fine-tuning affect generalization. Our results show that when data are missing completely at random, fine-tuned LLMs match tabular classifiers but outperform zero-shot approaches. When only biased convenience samples are available, fine-tuning small (3B to 8B) open-source LLMs can recover both individual-level predictions and population-level distributions more accurately than zero-shot and often better than tabular methods. This suggests fine-tuned LLMs offer a promising strategy for researchers working with non-probability samples or systematic missingness, and may enable new survey designs requiring only easily accessible subpopulations.
14. Two-dimensional THz spectroscopy in electronic systems: a many-body diagrammatic approach
Authors: Jacopo Fiore, Niccolò Sellati, Mattia Udina, Lara Benfatto
Published: 2025-09-29
Source: arXiv
The term two-dimensional coherent spectroscopy (2DCS) usually refers to experimental setups where a coherently generated electric field in a sample is recorded over many runs as a function of two time variables: the delay $\tau$ between two consecutive excitation pulses and the time $t$ over which the signal is emitted. While its implementation in the femtosecond time domain for studying vibrational molecular states has been developed for over two decades, its experimental application in the THz domain to interacting electronic systems remains in its infancy. This work provides a general theoretical framework for describing and interpreting 2DCS using a many-body language based on a perturbative diagrammatic expansion, as widely applied in linear spectroscopy. Focusing on centrosymmetric systems, we show that interpreting the 2D maps can be recast into two complementary problems. The first is the evaluation of a third-order response function to the gauge field. In the velocity gauge, this leads to semi-analytical expressions that both reduce computational complexity and assist in assigning spectral features to microscopic processes, as shown using a toy model of electrons undergoing a charge-density wave transition. The second is a careful treatment of multi-wave propagation effects, which, in bulk systems, can obscure the intrinsic nonlinear response, demonstrated here for soft superconducting Josephson plasmons. Our results provide a solid foundation for extending 2DCS to complex interacting systems and offer a flexible method to realistically model nonlinear responses across arbitrary spectral widths.
15. AlphaSAGE: Structure-Aware Alpha Mining via GFlowNets for Robust Exploration
Authors: Binqi Chen, Hongjun Ding, Ning Shen, Jinsheng Huang, Taian Guo, Luchen Liu, Ming Zhang
Published: 2025-09-29
Source: arXiv
The automated mining of predictive signals, or alphas, is a central challenge in quantitative finance. While Reinforcement Learning (RL) has emerged as a promising paradigm for generating formulaic alphas, existing frameworks are fundamentally hampered by a triad of interconnected issues. First, they suffer from reward sparsity, where meaningful feedback is only available upon the completion of a full formula, leading to inefficient and unstable exploration. Second, they rely on semantically inadequate sequential representations of mathematical expressions, failing to capture the structure that determines an alpha's behavior. Third, the standard RL objective of maximizing expected returns inherently drives policies towards a single optimal mode, directly contradicting the practical need for a diverse portfolio of non-correlated alphas. To overcome these challenges, we introduce AlphaSAGE (Structure-Aware Alpha Mining via Generative Flow Networks for Robust Exploration), a novel framework built upon three cornerstone innovations: (1) a structure-aware encoder based on a Relational Graph Convolutional Network (RGCN); (2) a new framework with Generative Flow Networks (GFlowNets); and (3) a dense, multi-faceted reward structure. Empirical results demonstrate that AlphaSAGE outperforms existing baselines in mining a more diverse, novel, and highly predictive portfolio of alphas, thereby proposing a new paradigm for automated alpha mining. Our code is available at https://github.com/BerkinChen/AlphaSAGE.
16. Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
Authors: Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma
Published: 2025-09-29
Source: arXiv
Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objectives--score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce \textbf{Advantage Weighted Matching (AWM)}, a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on GenEval, OCR, and PickScore benchmarks, AWM delivers up to a $24\times$ speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
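A schematic PyTorch sketch of the advantage-weighted idea (keep the pretraining score/flow-matching regression target, reweight each sample by its advantage) is shown below; the velocity parameterization, the advantage normalization, and all shapes are assumptions, not the released implementation.

    # Illustrative sketch: reuse the pretraining flow-matching regression loss, but
    # weight each sample's loss by its (normalized) advantage from the reward signal.
    import torch

    def awm_loss(pred_velocity, target_velocity, rewards):
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        per_sample = ((pred_velocity - target_velocity) ** 2).mean(dim=-1)
        # High-reward samples are reinforced; low-reward samples are pushed away.
        return (advantages.detach() * per_sample).mean()

    b, d = 8, 16
    loss = awm_loss(torch.randn(b, d, requires_grad=True), torch.randn(b, d), torch.randn(b))
    loss.backward()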
17. Path Diffuser: Diffusion Model for Data-Driven Traffic Simulator
Authors: Da Saem Lee, Akash Karthikeyan, Yash Vardhan Pant, Sebastian Fischmeister
Published: 2025-09-29
Source: arXiv
Simulating diverse and realistic traffic scenarios is critical for developing and testing autonomous planning. Traditional rule-based planners lack diversity and realism, while learning-based simulators often replay, forecast, or edit scenarios using historical agent trajectories. However, they struggle to generate new scenarios, limiting scalability and diversity due to their reliance on fully annotated logs and historical data. Thus, a key challenge for a learning-based simulator's performance is that it requires agents' past trajectories and pose information in addition to map data, which might not be available for all agents on the road. Without this information, generated scenarios often contain unrealistic trajectories that deviate from drivable areas, particularly under out-of-distribution (OOD) map scenes (e.g., curved roads). To address this, we propose Path Diffuser (PD): a two-stage diffusion model for generating agent pose initializations and their corresponding trajectories conditioned on the map, free of any historical context of agents' trajectories. Furthermore, PD incorporates a motion primitive-based prior, leveraging Frenet frame candidate trajectories to enhance diversity while ensuring road-compliant trajectory generation. We also explore various design choices for modeling complex multi-agent interactions. We demonstrate the effectiveness of our method through extensive experiments on the Argoverse2 Dataset and additionally evaluate the generalizability of the approach on OOD map variants. Notably, Path Diffuser outperforms the baseline methods by 1.92x on distribution metrics, 1.14x on common-sense metrics, and 1.62x on road compliance from adversarial benchmarks.
18. DiffTester: Accelerating Unit Test Generation for Diffusion LLMs via Repetitive Pattern
Authors: Lekang Yang, Yuetong Liu, Yitong Zhang, Jia Li
Published: 2025-09-29
Source: arXiv
Software development relies heavily on extensive unit testing, which makes the efficiency of automated Unit Test Generation (UTG) particularly important. However, most existing LLMs generate test cases one token at a time in each forward pass, which leads to inefficient UTG. Recently, diffusion LLMs (dLLMs) have emerged, offering promising parallel generation capabilities and showing strong potential for efficient UTG. Despite this advantage, their application to UTG is still constrained by a clear trade-off between efficiency and test quality, since increasing the number of tokens generated in each step often causes a sharp decline in the quality of test cases. To overcome this limitation, we present DiffTester, an acceleration framework specifically tailored for dLLMs in UTG. The key idea of DiffTester is that unit tests targeting the same focal method often share repetitive structural patterns. By dynamically identifying these common patterns through abstract syntax tree analysis during generation, DiffTester adaptively increases the number of tokens produced at each step without compromising the quality of the output. To enable comprehensive evaluation, we extend the original TestEval benchmark, which was limited to Python, by introducing additional programming languages including Java and C++. Extensive experiments on three benchmarks with two representative models show that DiffTester delivers significant acceleration while preserving test coverage. Moreover, DiffTester generalizes well across different dLLMs and programming languages, providing a practical and scalable solution for efficient UTG in software development. Code and data are publicly available at https://github.com/wellbeingyang/DLM4UTG-open .
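To illustrate the kind of structural repetition DiffTester exploits, here is a small Python sketch (ours, not the authors' tool) that uses the standard ast module to count how many leading statements two unit tests for the same focal method share; the example tests and the hypothetical Calculator class are made up.

    # Illustrative sketch: unit tests for the same focal method often start with the
    # same setup statements; comparing ASTs exposes that shared prefix, which a
    # diffusion decoder could then emit in larger parallel chunks.
    import ast

    def shared_prefix_len(test_a: str, test_b: str) -> int:
        body_a = ast.parse(test_a).body[0].body
        body_b = ast.parse(test_b).body[0].body
        n = 0
        for sa, sb in zip(body_a, body_b):
            if ast.dump(sa) != ast.dump(sb):
                break
            n += 1
        return n

    t1 = "def test_add_zero():\n    calc = Calculator()\n    calc.reset()\n    assert calc.add(0, 0) == 0\n"
    t2 = "def test_add_one():\n    calc = Calculator()\n    calc.reset()\n    assert calc.add(1, 2) == 3\n"
    print(shared_prefix_len(t1, t2))  # -> 2 shared setup statements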
19. The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability
Authors: Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu
Published: 2025-09-29
Source: arXiv
An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.
20. World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
Authors: Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, Qing Zhang
Published: 2025-09-29
Source: arXiv
Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose World-Env, an RL-based post-training framework that replaces physical interaction with a low-cost, world model-based virtual simulator. World-Env consists of two key components: (1) a video-based world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that World-Env effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings.
21. Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents
Authors: Jiahua Li, Kun Wei, Zhe Xu, Zibo Su, Xu Yang, Cheng Deng
Published: 2025-09-29
Source: arXiv
Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although various Large Language Model (LLM)-based approaches have advanced long video understanding, they still struggle to achieve both completeness and efficiency in capturing task-critical information. Inspired by human progressive visual cognition, we propose CogniGPT, a framework that leverages an interactive loop between Multi-Granular Perception Agent (MGPA) and Verification-Enhanced Reflection Agent (VERA) for efficient and reliable long video understanding. Specifically, MGPA mimics human visual divergent and focused attention to capture task-related information, while VERA verifies perceived key clues to mitigate hallucination and optimize subsequent perception strategies. Through this interactive process, CogniGPT explores a minimal set of informative and reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat datasets demonstrate CogniGPT's superiority in both accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.
22. When Autonomous Vehicle Meets V2X Cooperative Perception: How Far Are We?
Authors: An Guo, Shuoxiao Zhang, Enyi Tang, Xinyu Gao, Haomin Pang, Haoxiang Tian, Yanzhou Mu, Wu Wen, Chunrong Fang, Zhenyu Chen
Published: 2025-09-29
Source: arXiv
With the tremendous advancement of deep learning and communication technology, Vehicle-to-Everything (V2X) cooperative perception has the potential to address limitations in sensing distant objects and occlusion for a single-agent perception system. V2X cooperative perception systems are software systems characterized by diverse sensor types and cooperative agents, varying fusion schemes, and operation under different communication conditions. Therefore, their complex composition gives rise to numerous operational challenges. Furthermore, when cooperative perception systems produce erroneous predictions, the types of errors and their underlying causes remain insufficiently explored. To bridge this gap, we take an initial step by conducting an empirical study of V2X cooperative perception. To systematically evaluate the impact of cooperative perception on the ego vehicle's perception performance, we identify and analyze six prevalent error patterns in cooperative perception systems. We further conduct a systematic evaluation of the critical components of these systems through our large-scale study and identify the following key findings: (1) The LiDAR-based cooperation configuration exhibits the highest perception performance; (2) Vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communication exhibit distinct cooperative perception performance under different fusion schemes; (3) Increased cooperative perception errors may result in a higher frequency of driving violations; (4) Cooperative perception systems are not robust against communication interference when running online. Our results reveal potential risks and vulnerabilities in critical components of cooperative perception systems. We hope that our findings can better promote the design and repair of cooperative perception systems.
23. CineWild: Balancing Art and Robotics for Ethical Wildlife Documentary Filmmaking
Authors: Pablo Pueyo, Fernando Caballero, Ana Cristina Murillo, Eduardo Montijano
Published: 2025-09-29
Source: arXiv
Drones, or unmanned aerial vehicles (UAVs), have become powerful tools across domains, from industry to the arts. In documentary filmmaking, they offer dynamic, otherwise unreachable perspectives, transforming how stories are told. Wildlife documentaries especially benefit, yet drones also raise ethical concerns: the risk of disturbing the animals they aim to capture. This paper introduces CineWild, an autonomous UAV framework that combines robotics, cinematography, and ethics. Built on model predictive control, CineWild dynamically adjusts flight paths and camera settings to balance cinematic quality with animal welfare. Key features include adaptive zoom for filming from acoustic and visual safe distances, path-planning that avoids an animal's field of view, and smooth, low-noise maneuvers. CineWild exemplifies interdisciplinary innovation, bridging engineering, visual storytelling, and environmental ethics. We validate the system through simulation studies and will release the code upon acceptance.
24. DRCP: Diffusion on Reinforced Cooperative Perception for Perceiving Beyond Limits
Authors: Lantao Li, Kang Yang, Rui Song, Chen Sun
Published: 2025-09-29
Source: arXiv
Cooperative perception enabled by Vehicle-to-Everything communication has shown great promise in enhancing situational awareness for autonomous vehicles and other mobile robotic platforms. Despite recent advances in perception backbones and multi-agent fusion, real-world deployments remain challenged by hard detection cases, exemplified by partial detections and noise accumulation which limit downstream detection accuracy. This work presents Diffusion on Reinforced Cooperative Perception (DRCP), a real-time deployable framework designed to address aforementioned issues in dynamic driving environments. DRCP integrates two key components: (1) Precise-Pyramid-Cross-Modality-Cross-Agent, a cross-modal cooperative perception module that leverages camera-intrinsic-aware angular partitioning for attention-based fusion and adaptive convolution to better exploit external features; and (2) Mask-Diffusion-Mask-Aggregation, a novel lightweight diffusion-based refinement module that encourages robustness against feature perturbations and aligns bird's-eye-view features closer to the task-optimal manifold. The proposed system achieves real-time performance on mobile platforms while significantly improving robustness under challenging conditions. Code will be released in late 2025.
25. Accurate Cobb Angle Estimation via SVD-Based Curve Detection and Vertebral Wedging Quantification
Authors: Chang Shi, Nan Meng, Yipeng Zhuang, Moxin Zhao, Jason Pui Yin Cheung, Hua Huang, Xiuyuan Chen, Cong Nie, Wenting Zhong, Guiqiang Jiang, Yuxin Wei, Jacob Hong Man Yu, Si Chen, Xiaowen Ou, Teng Zhang
Published: 2025-09-29
Source: arXiv
Adolescent idiopathic scoliosis (AIS) is a common spinal deformity affecting approximately 2.2% of boys and 4.8% of girls worldwide. The Cobb angle serves as the gold standard for AIS severity assessment, yet traditional manual measurements suffer from significant observer variability, compromising diagnostic accuracy. Despite prior automation attempts, existing methods use simplified spinal models and predetermined curve patterns that fail to address clinical complexity. We present a novel deep learning framework for AIS assessment that simultaneously predicts both superior and inferior endplate angles with corresponding midpoint coordinates for each vertebra, preserving the anatomical reality of vertebral wedging in progressive AIS. Our approach combines an HRNet backbone with Swin-Transformer modules and biomechanically informed constraints for enhanced feature extraction. We employ Singular Value Decomposition (SVD) to analyze angle predictions directly from vertebral morphology, enabling flexible detection of diverse scoliosis patterns without predefined curve assumptions. Using 630 full-spine anteroposterior radiographs from patients aged 10-18 years with rigorous dual-rater annotation, our method achieved 83.45% diagnostic accuracy and 2.55{\deg} mean absolute error. The framework demonstrates exceptional generalization capability on out-of-distribution cases. Additionally, we introduce the Vertebral Wedging Index (VWI), a novel metric quantifying vertebral deformation. Longitudinal analysis revealed VWI's significant prognostic correlation with curve progression while traditional Cobb angles showed no correlation, providing robust support for early AIS detection, personalized treatment planning, and progression monitoring.
26. RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
Authors: Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu
Published: 2025-09-29
Source: arXiv
The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.
27. Uncertainty-Guided Expert-AI Collaboration for Efficient Soil Horizon Annotation
Authors: Teodor Chiaburu, Vipin Singh, Frank Haußer, Felix Bießmann
Published: 2025-09-29
Source: arXiv
Uncertainty quantification is essential in human-machine collaboration, as human agents tend to adjust their decisions based on the confidence of the machine counterpart. Reliably calibrated model uncertainties, hence, enable more effective collaboration, targeted expert intervention and more responsible usage of Machine Learning (ML) systems. Conformal prediction has become a well established model-agnostic framework for uncertainty calibration of ML models, offering statistically valid confidence estimates for both regression and classification tasks. In this work, we apply conformal prediction to $\textit{SoilNet}$, a multimodal multitask model for describing soil profiles. We design a simulated human-in-the-loop (HIL) annotation pipeline, where a limited budget for obtaining ground truth annotations from domain experts is available when model uncertainty is high. Our experiments show that conformalizing SoilNet leads to more efficient annotation in regression tasks and comparable performance scores in classification tasks under the same annotation budget when tested against its non-conformal counterpart. All code and experiments can be found in our repository: https://github.com/calgo-lab/BGR
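A compact numpy sketch of the split conformal recipe behind such a pipeline is given below: calibrate a residual quantile, then spend the expert budget on the test points with the widest prediction intervals. The toy model, the normalized score, and the budget are illustrative assumptions, not the SoilNet setup.

    # Illustrative sketch: split conformal regression with a normalized residual
    # score, plus a simulated human-in-the-loop rule that routes the widest
    # (least certain) prediction intervals to a fixed expert annotation budget.
    import numpy as np

    rng = np.random.default_rng(0)

    def predict(x):
        # Stand-in model: a mean prediction and a per-point difficulty estimate.
        return 2.0 * x, 0.5 + np.abs(x)

    x_cal = rng.normal(size=200)
    y_cal = 2.0 * x_cal + rng.normal(scale=0.5 + np.abs(x_cal))
    x_test = rng.normal(size=50)

    mu, sigma = predict(x_cal)
    scores = np.abs(y_cal - mu) / sigma              # normalized nonconformity scores
    alpha, n = 0.1, len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))          # conformal rank for 90% coverage
    q = np.sort(scores)[min(k, n) - 1]

    mu_t, sigma_t = predict(x_test)
    width = 2.0 * q * sigma_t                        # per-point interval width
    budget = 5
    to_expert = np.argsort(width)[-budget:]          # ask experts about these points
    print("conformal quantile:", round(float(q), 3), "expert indices:", to_expert)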
28. PhysicsMinions: Winning Gold Medals in the Latest Physics Olympiads with a Coevolutionary Multimodal Multi-Agent System
Authors: Fangchen Yu, Junchi Yao, Ziyi Wang, Haiyuan Wan, Youling Huang, Bo Zhang, Shuyue Hu, Dongzhan Zhou, Ning Ding, Ganqu Cui, Lei Bai, Wanli Ouyang, Peng Ye
Published: 2025-09-29
Source: arXiv
Physics is central to understanding and shaping the real world, and the ability to solve physics problems is a key indicator of real-world physical intelligence. Physics Olympiads, renowned as the crown of competitive physics, provide a rigorous testbed requiring complex reasoning and deep multimodal understanding, yet they remain largely underexplored in AI research. Existing approaches are predominantly single-model based, and open-source MLLMs rarely reach gold-medal-level performance. To address this gap, we propose PhysicsMinions, a coevolutionary multi-agent system for Physics Olympiad. Its architecture features three synergistic studios: a Visual Studio to interpret diagrams, a Logic Studio to formulate solutions, and a Review Studio to perform dual-stage verification. The system coevolves through an iterative refinement loop where feedback from the Review Studio continuously guides the Logic Studio, enabling the system to self-correct and converge towards the ground truth. Evaluated on the HiPhO benchmark spanning 7 latest physics Olympiads, PhysicsMinions delivers three major breakthroughs: (i) Strong generalization: it consistently improves both open-source and closed-source models of different sizes, delivering clear benefits over their single-model baselines; (ii) Historic breakthroughs: it elevates open-source models from only 1-2 to 6 gold medals across 7 Olympiads, achieving the first-ever open-source gold medal in the latest International Physics Olympiad (IPhO) under the average-score metric; and (iii) Scaling to human expert: it further advances the open-source Pass@32 score to 26.8/30 points on the latest IPhO, ranking 4th of 406 contestants and far surpassing the top single-model score of 22.7 (ranked 22nd). Generally, PhysicsMinions offers a generalizable framework for Olympiad-level problem solving, with the potential to extend across disciplines.
29. Blockchain-Driven Federation for Distributed Edge Systems: Design and Experimental Validation
Authors: Adam Zahir, Milan Groshev, Carlos J. Bernardos, Antonio de la Oliva
Published: 2025-09-29
Source: arXiv
Edge computing brings computation near end users, enabling the provisioning of novel use cases. To satisfy end-user requirements, the concept of edge federation has recently emerged as a key mechanism for dynamic resource and service sharing across edge systems managed by different administrative domains. However, existing federation solutions often rely on pre-established agreements and face significant limitations, including operational complexity, delays caused by manual operations, high overhead costs, and dependence on trusted third parties. In this context, blockchain can create dynamic federation agreements that enable service providers to securely interact and share services without prior trust. This article first describes the problem of edge federation, using the standardized ETSI multi-access edge computing framework as a reference architecture, and how it is being addressed. Then, it proposes a novel solution using blockchain and smart contracts to enable distributed MEC systems to dynamically negotiate and execute federation in a secure, automated, and scalable manner. We validate our framework's feasibility through a performance evaluation using a private Ethereum blockchain, built on the open-source Hyperledger Besu platform. The testbed includes a large number of MEC systems and compares two blockchain consensus algorithms. Experimental results demonstrate that our solution automates the entire federation lifecycle, from negotiation to deployment, with a quantifiable overhead, achieving federation in approximately 18 seconds in a baseline scenario. The framework scales efficiently in concurrent request scenarios, where multiple MEC systems initiate federation requests simultaneously. This approach provides a promising direction for addressing the complexities of dynamic, multi-domain federations across the edge-to-cloud continuum.
30. Intelligent Optimization of Wireless Access Point Deployment for Communication-Based Train Control Systems Using Deep Reinforcement Learning
Authors: Kunyu Wu, Qiushi Zhao, Zihan Feng, Yunxi Mu, Hao Qin, Xinyu Zhang, Xingqi Zhang
Published: 2025-09-29
Source: arXiv
Urban railway systems increasingly rely on communication-based train control (CBTC) systems, where optimal deployment of access points (APs) in tunnels is critical for robust wireless coverage. Traditional methods, such as empirical model-based optimization algorithms, are hindered by excessive measurement requirements and suboptimal solutions, while machine learning (ML) approaches often struggle with complex tunnel environments. This paper proposes a deep reinforcement learning (DRL) driven framework that integrates parabolic wave equation (PWE) channel modeling, conditional generative adversarial network (cGAN) based data augmentation, and a dueling deep Q network (Dueling DQN) for AP placement optimization. The PWE method generates high-fidelity path loss distributions for a subset of AP positions, which are then expanded by the cGAN to create high-resolution path loss maps for all candidate positions, significantly reducing simulation costs while maintaining physical accuracy. In the DRL framework, the state space captures AP positions and coverage, the action space defines AP adjustments, and the reward function encourages signal improvement while penalizing deployment costs. The dueling DQN enhances convergence speed and the exploration-exploitation balance, increasing the likelihood of reaching optimal configurations. Comparative experiments show that the proposed method outperforms a conventional Hooke-Jeeves optimizer and a traditional DQN, delivering AP configurations with higher average received power, better worst-case coverage, and improved computational efficiency. This work integrates high-fidelity electromagnetic simulation, generative modeling, and AI-driven optimization, offering a scalable and data-efficient solution for next-generation CBTC systems in complex tunnel environments.
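Since the dueling architecture is standard, a minimal PyTorch sketch of the value/advantage decomposition Q(s, a) = V(s) + A(s, a) - mean_a A(s, a) follows; the layer sizes and the comment about what the state and actions encode are illustrative assumptions.

    # Illustrative sketch of a dueling DQN head: a shared trunk feeds separate value
    # and advantage streams, combined as Q = V + A - mean(A).
    import torch
    import torch.nn as nn

    class DuelingQNet(nn.Module):
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.value = nn.Linear(hidden, 1)
            self.advantage = nn.Linear(hidden, n_actions)

        def forward(self, state):
            h = self.trunk(state)
            v, a = self.value(h), self.advantage(h)
            return v + a - a.mean(dim=-1, keepdim=True)

    # e.g., state = current AP positions plus a coverage summary; actions = AP moves.
    q_values = DuelingQNet(state_dim=32, n_actions=8)(torch.randn(4, 32))
    print(q_values.shape)  # torch.Size([4, 8])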
31. DyMoDreamer: World Modeling with Dynamic Modulation
Authors: Boxuan Zhang, Runqing Wang, Wei Xiao, Weipu Zhang, Jian Sun, Gao Huang, Jie Chen, Gang Wang
Published: 2025-09-29
Source: arXiv
A critical bottleneck in deep reinforcement learning (DRL) is sample inefficiency, as training high-performance agents often demands extensive environmental interactions. Model-based reinforcement learning (MBRL) mitigates this by building world models that simulate environmental dynamics and generate synthetic experience, improving sample efficiency. However, conventional world models process observations holistically, failing to decouple dynamic objects and temporal features from static backgrounds. This approach is computationally inefficient, especially for visual tasks where dynamic objects significantly influence rewards and decision-making performance. To address this, we introduce DyMoDreamer, a novel MBRL algorithm that incorporates a dynamic modulation mechanism to improve the extraction of dynamic features and enrich the temporal information. DyMoDreamer employs differential observations derived from a novel inter-frame differencing mask, explicitly encoding object-level motion cues and temporal dynamics. Dynamic modulation is modeled as stochastic categorical distributions and integrated into a recurrent state-space model (RSSM), enhancing the model's focus on reward-relevant dynamics. Experiments demonstrate that DyMoDreamer sets a new state-of-the-art on the Atari $100$k benchmark with a $156.6$\% mean human-normalized score, establishes a new record of $832$ on the DeepMind Visual Control Suite, and gains a $9.5$\% performance improvement after $1$M steps on the Crafter benchmark. Our code is released at https://github.com/Ultraman-Tiga1/DyMoDreamer.
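A toy numpy sketch of inter-frame differencing in the spirit described above: threshold the absolute difference of consecutive observations to obtain a motion mask and keep only the dynamic content. The threshold, shapes, and masking rule are assumptions for illustration, not DyMoDreamer's exact mechanism.

    # Illustrative sketch: build a motion mask from consecutive frames and keep only
    # the dynamic content as a "differential observation" for the world model.
    import numpy as np

    def differential_observation(prev_frame, frame, threshold=0.1):
        diff = np.abs(frame.astype(float) - prev_frame.astype(float))
        mask = (diff.mean(axis=-1, keepdims=True) > threshold).astype(float)
        return mask * frame, mask

    rng = np.random.default_rng(0)
    prev_f, cur_f = rng.random((64, 64, 3)), rng.random((64, 64, 3))
    dyn_obs, mask = differential_observation(prev_f, cur_f)
    print(mask.mean())  # fraction of pixels flagged as dynamic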
32. FESTIM v2.0: Upgraded framework for multi-species hydrogen transport and enhanced performance
Authors: James Dark, Rémi Delaporte-Mathurin, Jørgen S. Dokken, Huihua Yang, Chirag Khurana, Kaelyn Dunnell, Gabriele Ferrero, Vladimir Kulagin, Samuele Meschini
Published: 2025-09-29
Source: arXiv
FESTIM is an open-source finite element framework for modelling the transport of hydrogen isotopes in materials. It provides a flexible and extensible tool for simulating diffusion, trapping, surface interactions, and other processes that govern hydrogen behaviour. This paper presents FESTIM v2.0, a major release that broadens both the physical scope and the software infrastructure of the framework. On the physics side, the formulation adopts a modular structure that supports multi-species transport, advanced trapping and reaction schemes, isotope exchange, decay, and advection. Interface and boundary conditions have been generalised, and interoperability with external solvers enables multiphysics workflows, including coupling with fluid dynamics and neutron transport codes. On the software side, FESTIM v2.0 has been migrated to DOLFINx, the next-generation FEniCS platform, providing improved performance, interoperability, and long-term sustainability. Taken together, these advances position FESTIM v2.0 as a versatile platform for investigating hydrogen transport in materials across scientific and engineering applications.
33. SymBoltz.jl: a symbolic-numeric, approximation-free and differentiable linear Einstein-Boltzmann solver
Authors: Herman Sletmoen
Published: 2025-09-29
Source: arXiv
SymBoltz is a new Julia package that solves the linear Einstein-Boltzmann equations. It features a symbolic-numeric interface for specifying equations, is free of approximation switching schemes and is compatible with automatic differentiation. Cosmological models are built from replaceable physical components in a way that scales well in model space. The modeler should simply write down their equations, and SymBoltz solves them and eliminates much of the friction in the process. SymBoltz enables up to 100x shorter model definitions compared to browsing equivalent files in CLASS. Symbolic knowledge enables powerful automation of tasks, such as separating computational stages like the background and perturbations, generating the Jacobian matrix and its sparsity pattern, and interpolating arbitrary expressions from the solution. Modern implicit solvers integrate the full stiff equations at all times, reducing slowdowns by taking long time steps, reusing the Jacobian and LU-factorizing it over several time steps, and using fast linear system solvers. Automatic differentiation gives exact derivatives of any output with respect to any input, which is important for gradient-based Markov chain Monte Carlo methods in large parameter spaces, training of emulators, Fisher forecasting and sensitivity analysis. These features are useful in their own rights, but also reinforce each other in a synergy. Results agree with established codes like CLASS and CAMB. With more work, SymBoltz can grow into an integrated symbolic-numeric cosmological modeling environment with a large library of models that delivers differentiable output as fast as other codes. SymBoltz is available at https://github.com/hersle/SymBoltz.jl with single-command installation and extensive documentation, and welcomes questions, suggestions and contributions.
34. Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation
Authors: Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen
Published: 2025-09-29
Source: arXiv
Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and corresponding 2,757 full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.
35. Diamonds in the rough: Transforming SPARCs of imagination into a game concept by leveraging medium sized LLMs
Authors: Julian Geheeb, Farhan Abid Ivan, Daniel Dyrda, Miriam Anschütz, Georg Groh
Published: 2025-09-29
Source: arXiv
Recent research has demonstrated that large language models (LLMs) can support experts across various domains, including game design. In this study, we examine the utility of medium-sized LLMs, models that operate on consumer-grade hardware typically available in small studios or home environments. We began by identifying ten key aspects that contribute to a strong game concept and used ChatGPT to generate thirty sample game ideas. Three medium-sized LLMs, LLaMA 3.1, Qwen 2.5, and DeepSeek-R1, were then prompted to evaluate these ideas according to the previously identified aspects. A qualitative assessment by two researchers compared the models' outputs, revealing that DeepSeek-R1 produced the most consistently useful feedback, despite some variability in quality. To explore real-world applicability, we ran a pilot study with ten students enrolled in a storytelling course for game development. At the early stages of their own projects, students used our prompt and DeepSeek-R1 to refine their game concepts. The results indicate a positive reception: most participants rated the output as high quality and expressed interest in using such tools in their workflows. These findings suggest that current medium-sized LLMs can provide valuable feedback in early game design, though further refinement of prompting methods could improve consistency and overall effectiveness.