In a landmark demonstration of AI’s growing sophistication in high-level mathematics, Google DeepMind’s Aletheia agent successfully tackled the "FirstProof" challenge, a set of ten professional-grade research problems designed to test the limits of machine autonomy. Powered by the Gemini 3 Deep Think model, Aletheia operated without any human intervention to produce rigorous, LaTeX-formatted proofs for six of the ten problems, passing the scrutiny of panels of expert mathematicians. The study reveals a significant leap in AI reliability, as the agent utilized "self-filtering" to avoid submitting guesses for problems it couldn't solve, focusing instead on producing "publishable-quality" solutions for highly complex geometry and algebra. By documenting exactly how these solutions were generated and verified, the researchers provide a transparent roadmap for how autonomous AI "researchers" might soon become indispensable partners in expanding the frontiers of mathematical discovery.
Based on the research paper "Aletheia tackles FirstProof autonomously", the following is a summary of potential research directions, areas for future work, and key open questions, with an emphasis on actionable and innovative ideas.
These projects build directly on the methods and results presented in the paper.
Building on the agent's [FIXABLE] verdicts and autonomous revisions, a direct extension is a multi-step, iterative self-correction loop: the agent's own critique (or an independent critic module) is fed back to the generator, which refines "sloppy" or "insufficient" proofs (such as the initial attempts on P7 and P8) over several cycles until a [CORRECT] verdict is reached, mimicking how humans revise papers. Beyond that, the paper's findings and limitations suggest more forward-looking, paradigm-shifting ideas.
The paper's methodology and results implicitly expose several concrete, unsolved challenges.
The Aletheia agent architecture could also be adapted and applied to other domains.
While test-time training (TTT) has traditionally been viewed as a way for models to "memorize" new information on the fly, this study reveals a surprising "memorization paradox": stronger internal learning ability can actually degrade overall performance. By overturning the common assumption that these models behave like digital storage-and-retrieval systems, the authors show that TTT is mathematically equivalent to an advanced form of linear attention.
This finding lets researchers strip away unnecessary architectural redundancy, reducing complex models to more efficient, parallelizable versions that achieve up to a 4.0x speedup without sacrificing performance. Ultimately, the paper recasts TTT as a high-speed feature mixer rather than a form of temporary memory, a shift that also paves the way for faster, leaner, and more scalable AI architectures.
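The claimed equivalence can be seen in a minimal numpy sketch (an illustration under simplifying assumptions, not the paper's exact construction): one gradient step on a squared key-value binding loss, starting from a zero state, produces exactly the rank-one state of unnormalized linear attention, up to the learning-rate scale.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 8
K = rng.normal(size=(T, d))   # keys
V = rng.normal(size=(T, d))   # values
q = rng.normal(size=d)        # query
lr = 0.1                      # inner-loop learning rate

# TTT view: one gradient step on the key-value binding loss
#   L(S) = 0.5 * sum_t ||S k_t - v_t||^2, starting from S0 = 0.
S0 = np.zeros((d, d))
grad = sum((S0 @ k - v)[:, None] * k[None, :] for k, v in zip(K, V))
S_ttt = S0 - lr * grad

# Linear-attention view: unnormalized state S = sum_t v_t k_t^T,
# read out as S q = sum_t (k_t . q) v_t.
S_lin = lr * (V.T @ K)

assert np.allclose(S_ttt, S_lin)
out_ttt = S_ttt @ q
out_attn = lr * sum((k @ q) * v for k, v in zip(K, V))
assert np.allclose(out_ttt, out_attn)
```

Multiple inner-loop steps (or momentum) change the correspondence only by reweighting the rank-one terms, which is the sense in which TTT acts as a feature mixer rather than a memory.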
Based on the paper's core findings, the following are potential research directions and areas for future work, focusing on actionable and innovative ideas.
These directions build directly on the paper's theorems, ablation experiments, and stated limitations.
1.1. Investigate nonlinear final layers: The paper's theoretical analysis is limited to TTT models with a linear, bias-free final layer. A key extension is to analyze TTT variants whose final layer is nonlinear (for example, with a bias term, ReLU, or sigmoid activation).
1.2. Analyze end-to-end TTT (TTT-E2E): The paper focuses on TTT with a key-value binding loss (TTT-KVB). A major open question is whether the "linear attention in disguise" interpretation carries over to TTT-E2E, where gradients of the final task loss are backpropagated through the inner loop. The gradient g_t(k) would then depend on the final model output and the task loss, making it a function of the entire sequence history rather than only the local key-value pair. This could produce a more complex, history-dependent form of attention, perhaps explaining its effectiveness on long-context tasks.
1.3. Reversing the equivalence: designing new linear attention via TTT: The paper shows TTT → linear attention. The reverse path is an equally compelling design paradigm.
1.4. The role of "dynamic kernels": The paper's best-performing variant (Variant 1) updates only the final layer and freezes the feature extractor phi(·) as a "static kernel". This contradicts the intuition that a dynamic, history-dependent kernel should be more powerful. Can the expressiveness of a dynamic phi_t(·) be retained while mitigating the train-test mismatch that hurts performance? One option is to penalize abrupt changes in Θ_t, for example by adding a loss term ||Θ_t - Θ_{t-1}||² to the main training objective, encouraging the dynamic kernel to evolve smoothly and preserving its adaptive benefits without causing a catastrophic distribution shift at test time.
The ideas that follow generalize the paper's core insight, namely that an optimization process can act as a computational operator, to new settings.
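The smoothness penalty proposed in 1.4 can be sketched in closed form (a toy illustration; the quadratic binding loss and parameter shapes are placeholder assumptions, not the paper's setup). Solving argmin over Θ of ||Θk - v||² + λ||Θ - Θ_{t-1}||² gives Θ(kkᵀ + λI) = vkᵀ + λΘ_{t-1}, so larger λ keeps Θ_t closer to Θ_{t-1}:

```python
import numpy as np

def prox_update(theta_prev, k, v, lam):
    """Proximal inner-loop step:
        argmin_T ||T k - v||^2 + lam * ||T - theta_prev||_F^2.

    Setting the gradient to zero yields
        T (k k^T + lam I) = v k^T + lam * theta_prev,
    so larger lam keeps Theta_t near Theta_{t-1} (smooth evolution),
    while lam -> 0 recovers an unregularized least-squares fit to (k, v).
    """
    d = k.shape[0]
    A = np.outer(k, k) + lam * np.eye(d)
    B = np.outer(v, k) + lam * theta_prev
    return B @ np.linalg.inv(A)

rng = np.random.default_rng(1)
d = 3
theta = np.zeros((d, d))
k, v = rng.normal(size=d), rng.normal(size=d)

# A stronger penalty produces a smaller parameter step.
step_small_lam = np.linalg.norm(prox_update(theta, k, v, lam=0.1) - theta)
step_large_lam = np.linalg.norm(prox_update(theta, k, v, lam=10.0) - theta)
assert step_large_lam < step_small_lam
```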
2.1. The "optimizer as operator" paradigm: The paper analyzes SGD with momentum. A natural generalization is to study how different inner-loop optimizers compile into different computational operators. For an Adam-style optimizer, the momentum term (m_t) and variance term (v_t) might translate into learnable, per-feature decay and normalization factors in the induced linear-attention-like mechanism, yielding a new family of adaptive attention models "discovered" through the lens of optimization theory rather than hand-designed.
2.2. Unifying with standard (softmax) attention: The paper unifies TTT and linear attention. The ultimate goal is to unify linear attention and standard softmax attention under a single "optimization as computation" framework. Which inner-loop loss or update rule, when unrolled, yields standard attention (softmax(QK^T/sqrt(d_k))V) with its exp(QK^T) weighting? A successful answer would recast "attention" as a family of solutions to different inner-loop optimization problems.
2.3. Beyond gradients: "computational scaffolding" for sequence modeling: The "Gradient Ascent Anomaly" suggests that the mechanism of the update, not the minimization of the objective, is what matters, which opens the door to non-gradient update rules. The state S_t could be updated by simple, learnable rules such as a Hebbian update (S_t = S_{t-1} + f(k_t) g(v_t)^T) or an explicitly gated update (S_t = gate * S_{t-1} + (1 - gate) * update). This moves away from the "test-time training" metaphor toward a more direct "fast weight programming" or memory-editing view, with greater potential for efficiency.
The following are concrete empirical puzzles and contradictions in the paper that merit deeper study.
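The two non-gradient update rules named in 2.3 can be sketched as follows (a toy illustration; the feature maps f and g are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

def f(k):  # placeholder key feature map
    return np.tanh(k)

def g(v):  # placeholder value feature map
    return v

def hebbian_step(S, k, v):
    """Hebbian fast-weight update: S_t = S_{t-1} + f(k_t) g(v_t)^T."""
    return S + np.outer(f(k), g(v))

def gated_step(S, k, v, gate=0.9):
    """Gated update: S_t = gate * S_{t-1} + (1 - gate) * update,
    where the candidate update is the same outer product."""
    return gate * S + (1.0 - gate) * np.outer(f(k), g(v))

S_hebb = np.zeros((d, d))
S_gate = np.zeros((d, d))
for _ in range(10):
    k, v = rng.normal(size=d), rng.normal(size=d)
    S_hebb = hebbian_step(S_hebb, k, v)
    S_gate = gated_step(S_gate, k, v)

# The Hebbian state accumulates without bound, while the gated state
# behaves like an exponentially weighted running average of the updates.
print(np.linalg.norm(S_hebb), np.linalg.norm(S_gate))
```

Both rules are embarrassingly cheap per token, which is the efficiency potential the direction alludes to.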
3.1. The purpose of the Q/K distribution asymmetry: The paper shows that, in TTT, queries and keys come from different distributions, which is "pathological" for retrieval yet unremarkable for its linear-attention form. Why the model learns this asymmetry, and whether it can be controlled, remains unexplored. One could probe the information encoded in phi(q) versus phi(k): does phi(k) learn positional or structural information for building the state S_t, while phi(q) learns semantic content for reading from it? Contrastive losses could be used during training to enforce or prevent the asymmetry, measuring the effect on performance.
3.2. Probing the boundaries of the "Gradient Ascent Anomaly": The finding that gradient ascent matches or beats descent is startling, and it is essential to understand whether this is a universal property or an artifact of particular tasks and models. On tasks where the key -> value mapping must be highly precise, the structured "noise" of gradient ascent may be harmful, which would reveal the limits of the anomaly.
The directions below explore where the reinterpretation of TTT as efficient, adaptive linear attention could matter most.
4.1. Lifelong learning and streaming data: The online, adaptive nature of the TTT mechanism makes it a natural fit for settings where the data distribution keeps shifting; the state S_t acts as a compressed, adaptive summary of the stream's history.
4.2. On-the-fly personalization: Updating the model's state in context, without touching the core weights, is ideal for efficient personalization: S becomes a "session cache" or "user profile" that customizes responses without costly fine-tuning, with applications in recommender systems, personalized chatbots, and assisted code generation.
4.3. RL agents with adaptive memory: An RL agent's state representation must adapt quickly within an episode. Its (state, action) pairs can be treated as the (key, value) inputs of a TTT layer; the unrolled optimization lets the agent build an adaptive "short-term memory" of the episode, potentially improving performance in non-stationary environments or tasks requiring long-horizon credit assignment.

Training robots to perform tasks using only camera images is notoriously slow and expensive, often requiring millions of simulations that can take days to process. To bridge this gap, researchers introduced Squint, a high-speed learning method that can train a robot to master complex manipulation tasks, such as stacking blocks or placing cans, in as little as 15 minutes on a single standard gaming GPU. By "squinting" (rendering high-resolution images and then downsampling them) and optimizing how the AI reuses its past experiences, the system achieves a 91% success rate when transferred directly from the simulator to a real-world robotic arm. This breakthrough suggests a future where sophisticated robotic behaviors can be developed with minimal hardware in less time than it takes to grab a cup of coffee.
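The "squinting" step, rendering at high resolution and then downsampling, can be sketched as generic average pooling (an illustration only; the resolutions and pooling choice are assumptions, not the paper's exact settings):

```python
import numpy as np

def squint(image, factor):
    """Downsample an H x W x C image by averaging factor x factor blocks.

    Rendering at high resolution and then averaging acts as an
    anti-aliasing filter: the low-resolution observation handed to the
    policy is smoother than one rendered natively at low resolution.
    """
    h, w, c = image.shape
    assert h % factor == 0 and w % factor == 0
    return image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

# A hypothetical 256x256 RGB render squinted down to 64x64 for the policy.
hi_res = np.random.default_rng(3).uniform(size=(256, 256, 3))
lo_res = squint(hi_res, factor=4)
print(lo_res.shape)  # (64, 64, 3)
```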
As digital information shifts from simple text to a mix of images, videos, and audio, modern search engines are struggling to store the massive amounts of data required to retrieve these "multimodal" documents efficiently. To solve this, researchers developed Attention-Guided Clustering (AGC), a smart compression technique that identifies the most important parts of a document and condenses them into a tiny, high-impact storage footprint. By prioritizing the most descriptive elements of a video or image rather than saving every redundant frame, this method can shrink an index to just a fraction of its original size while actually maintaining—or even improving—search accuracy. This breakthrough makes high-performance, "any-modality" search practical for massive real-world collections like YouTube or web-scale digital archives without requiring astronomical storage costs.
Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking
Ravi Ghadia 1 Maksim Abraham 1 Sergei Vorobyov 1 Max Ryabinin 1
Abstract
Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits t
Learning from Trials and Errors:
Reflective Test-Time Planning for Embodied LLMs
Yining Hong 1 Huang Huang 1 Manling Li 2 Li Fei-Fei 1 Jiajun Wu 1 Yejin Choi 1
Website: https://reflective-test-time-planning.github.io
§ Code: https://github.com/Reflective-Test-Time-Planning/Reflective-Test-Time-Planning
[Figure: (a) Task "Put the toy car in the green box", with scored candidate placements, e.g. the green box is a bad choice because the teddy bear is already in it (score 22), and the orange box is too small for the toy car.]
Statistical Query Lower Bounds for Smoothed Agnostic Learning
Ilias Diakonikolas∗
University of Wisconsin-Madison
ilias@cs.wisc.edu
Daniel M. Kane†
University of California, San Diego
dakane@cs.ucsd.edu
February 25, 2026
Abstract
We study the complexity of smoothed agnostic learning, recently introduced by [CKK+24], in which the learner competes with the best classifier in a target class under slight Gaussian perturbations of the inputs. Specifically, we focus on the prototypical task of agnostica
2026-2-25
On Data Engineering for Scaling LLM Terminal
Capabilities
Renjie Pi∗, Grace Lam*, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping†
Abstract
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a li
XMorph: Explainable Brain Tumor Analysis
Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi1, M. Moein Esfahani2, Raj Sunderraman1, Vince Calhoun2, Mohammed Alser1
1Department of Computer Science, Georgia State University, Atlanta, GA, USA
2TReNDS Center, Georgia State University, Atlanta, GA, USA
Corresponding authors: ssalemghahfarokhi1@gsu.edu, malser@gsu.edu
Abstract—Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limite
Published as a conference paper at ICLR 2026
THE DIFFUSION DUALITY, CHAPTER II:
Ψ-SAMPLERS AND EFFICIENT CURRICULUM
Justin Deschenaux1∗
Caglar Gulcehre1,2
Subham Sekhar Sahoo3∗
1EPFL, Lausanne, Switzerland
2Microsoft AI
3Cornell Tech, NY
ABSTRACT
Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or masked diffusion models in these settings. However, their sampling quality plateaus with
2026-02-25
Why Pass@k Optimization Can Degrade Pass@1:
Prompt Interference in LLM Post-training
Anas Barakat1, Souradip Chakraborty2, Khushbu Pahwa*, Amrit Singh Bedi3
1Singapore University of Technology and Design
2University of Maryland, College Park
3University of Central Florida
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It defines success if any of k independently sample
Efficient Hierarchical Any-Angle Path Planning
on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart and Lionel Ott
Autonomous Systems Lab, ETH Zürich, Switzerland
Email: vreijgwart@rai-inst.com, [cesarc | rolandsi | lioott]@ethz.ch
Abstract—Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information. Yet widely used path planning metho
NORD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Ishaan Rawal1,2*
Shubh Gupta1
Yihan Hu1
Wei Zhan1,3†
1Applied Intuition
2Texas A&M University
3UC Berkeley
Abstract
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both cha
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang1, Xiaoou Liu2, Lu Cheng3, Yaqing Wang4, Kenton Murray1,
and Hua Wei2
1 Johns Hopkins University, Baltimore MD, USA {dzhang98,kenton}@jhu.edu
2 Arizona State University, Tempe AZ, USA {xiaoouli,hua.wei}@asu.edu
3 University of Illinois Chicago, Chicago IL, USA lucheng@uic.edu
4 Purdue University, West Lafayette IN, USA wang5075@purdue.edu
Abstract. Large language models (LLMs) are increasingly deployed as multi-step decis
CG-DMER: HYBRID CONTRASTIVE-GENERATIVE FRAMEWORK FOR DISENTANGLED MULTIMODAL ECG REPRESENTATION LEARNING
Ziwei Niu1,3
Hao Sun4
Shujun Bian1
Xihong Yang2
Lanfen Lin3
Yuxin Liu1
Yueming Jin1,2
1 Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore
2 Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore
3 College of Computer Science and Technology, Zhejiang University, Hangzhou, China
4 College of informat
Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions
Mame Diarra Toure1
David A. Stephens1
1Department of Mathematics and Statistics, McGill University
Abstract
In safety-critical classification, the cost of failure is often asymmetric. Yet Bayesian deep learning summarises epistemic uncertainty with a single scalar, mutual information (MI), which cannot distinguish whether a model's ignorance involves a benign or safety-critical class. We decompose MI
Scaling State-Space Models on Multiple GPUs with
Tensor Parallelism
Anurag Dutt
Stony Brook University
adutt@cs.stonybrook.edu
Nimit Shah
Stony Brook University
nimishah@cs.stonybrook.edu
Hazem Masarani
Stony Brook University
hazem.masarani@stonybrook.edu
Anshul Gandhi
Stony Brook University
anshul@cs.stonybrook.edu
Abstract—Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their infe
When physicians use AI to predict patient health risks, they often pose hypothetical "what if" questions, for example, "What if this patient did not have diabetes?", to search for ways to improve outcomes. This paper, however, identifies a "Time Traveler Dilemma": conventional AI methods often propose biologically impossible scenarios, such as "removing" a chronic condition the patient has lived with for years.
To address this, the researchers developed the Sequential Counterfactual Framework. The new method respects the arrow of time and medical reality by distinguishing modifiable factors (such as lab results) from non-modifiable ones (such as chronic disease diagnoses). Testing on data from thousands of COVID-19 patients, the team shows how to move past impossible "what ifs" and generate actionable, realistic medical insights that pinpoint how early intervention can interrupt dangerous health cascades before they begin.
, 2022, pp. 1–18
PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Samah Fodeh,1,2∗ Linhai Ma,1 Yan Wang,1 Srivani Talakokkul,1 Ganesh Puthiaraju,1 Afshan Khan,1 Ashley Hagaman,3 Sarah Lowe3 and Aimee Roundtree4
1Department Of Emergency Medicine, Yale School of Medicine, 464 Congress Ave, 06519, CT, USA, 2Department of Biomedical Informatics
& Data Science, Yale School of Medicine, 100
Published as a conference paper at ICLR 2026
A BENCHMARK FOR DEEP INFORMATION SYNTHESIS
Debjit Paul1, Daniel Murphy2, Milan Gritta1, Ronald Cardenas1,
Victor Prokhorov1, Jun Wang3, Gerasimos Lampouras1
Dataset Contributors:
Lena Sophia Bolliger4, Aysim Toker1, Roy Miles1, Andreea-Maria Oncescu1, Jasivan Alex Sivakumar5, Philipp Borchert1, Ismail Elezi1, Meiru Zhang6, Ka Yiu Lee1, Guchun Zhang1
1Huawei Noah’s Ark Lab, UK
2Imperial College London
3UCL Centre for Artificial Intelligence
4University