In a landmark demonstration of AI’s growing sophistication in high-level mathematics, Google DeepMind’s Aletheia agent successfully tackled the "FirstProof" challenge, a set of ten professional-grade research problems designed to test the limits of machine autonomy. Powered by the Gemini 3 Deep Think model, Aletheia operated without any human intervention to produce rigorous, LaTeX-formatted proofs for six of the ten problems, passing the scrutiny of panels of expert mathematicians. The study reveals a significant leap in AI reliability, as the agent utilized "self-filtering" to avoid submitting guesses for problems it couldn't solve, focusing instead on producing "publishable-quality" solutions for highly complex geometry and algebra. By documenting exactly how these solutions were generated and verified, the researchers provide a transparent roadmap for how autonomous AI "researchers" might soon become indispensable partners in expanding the frontiers of mathematical discovery.
Based on the research paper "Aletheia tackles FirstProof autonomously," here are potential research directions, areas for future work, and highlighted problems, focusing on actionable and innovative ideas.
These are research projects that build directly on the methods and results presented in the paper.
The paper demonstrates a single round of self-correction: a [FIXABLE] verdict followed by an autonomous revision. A direct extension would be to develop a multi-step, iterative self-correction loop. The agent's own critique (or a separate critique module) could be fed back to the generator, allowing it to refine "sketchy" or "inadequate" proofs (like the initial attempts for P7 and P8) through several cycles until a [CORRECT] verdict is reached. This would mimic the human process of redrafting a paper.

These are more speculative, paradigm-shifting ideas sparked by the paper's findings and limitations.
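The multi-step revision loop described above can be sketched in a few lines. This is a hypothetical illustration only: `generate_proof` and `critique_proof` are toy stand-ins for the agent's generator and verifier modules, not Aletheia's actual interfaces, and the toy critic simply accepts the second revision.

```python
# Hypothetical critique-and-revise loop (toy stand-ins, not Aletheia's API).

def generate_proof(problem, feedback=None):
    # A real generator would condition the model on the problem statement
    # plus any prior critique; here we just track the revision count.
    revision = 0 if feedback is None else feedback["revision"] + 1
    return {"problem": problem, "revision": revision,
            "text": f"proof of {problem}, draft {revision}"}

def critique_proof(proof):
    # A real critic would check the argument line by line; this toy one
    # returns [CORRECT] once the proof has been revised twice.
    verdict = "[CORRECT]" if proof["revision"] >= 2 else "[FIXABLE]"
    notes = "tighten the key lemma" if verdict == "[FIXABLE]" else ""
    return {"verdict": verdict, "revision": proof["revision"], "notes": notes}

def revise_until_correct(problem, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        proof = generate_proof(problem, feedback)
        feedback = critique_proof(proof)   # critique fed back to the generator
        if feedback["verdict"] == "[CORRECT]":
            break
    return proof, feedback["verdict"]

proof, verdict = revise_until_correct("P7")
```

The `max_rounds` cap matters in practice: without it, an agent stuck on an unfixable proof would loop forever instead of self-filtering the problem out.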
The paper's methodology and results implicitly reveal several concrete, unsolved challenges.
The Aletheia agent architecture could be adapted and applied to other fields.
While Test-Time Training (TTT) is traditionally celebrated as a way for models to "memorize" new information on the fly, this research reveals a surprising "memorization paradox" where better internal learning actually leads to worse overall performance. By dismantling the common assumption that these models act like a digital storage-and-retrieval system, the authors prove that TTT is mathematically equivalent to a sophisticated version of linear attention. This discovery allows the researchers to strip away unnecessary architectural bloat, simplifying complex models into more efficient, parallelized versions that achieve up to a 4.0× speedup without losing power. Ultimately, the paper reframes TTT not as a form of temporary memory, but as a high-speed feature mixer that paves the way for faster, leaner, and more scalable AI architectures.
Based on the paper's core findings, here are potential research directions and areas for future work, focusing on actionable and innovative ideas.
These directions build directly on the paper's theorems, ablations, and stated limitations.
1.1. Investigating Non-Linear Final Layers: The paper's theoretical analysis is restricted to TTT models with a linear, bias-free final layer. A critical extension is to analyze TTT variants where the final layer is non-linear (e.g., includes a bias term, a ReLU, or a sigmoid activation).
1.2. Analyzing End-to-End TTT (TTT-E2E): The paper exclusively focuses on TTT with a key-value binding loss (TTT-KVB). A major open question is whether the same "secretly linear attention" interpretation holds for TTT-E2E, where gradients from the final task loss are backpropagated through the inner loop.
In the E2E setting, g_t(k) will depend on the final model output and task loss, making it a function of the entire sequence history, not just the local key-value pair. This could lead to a more complex, history-dependent form of attention that might explain its effectiveness in long-context tasks.

1.3. Reversing the Equivalence: Designing Novel Linear Attention via TTT: The paper shows TTT → Linear Attention. The reverse direction is a compelling design paradigm.
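The forward direction (TTT → Linear Attention) can be checked numerically in a few lines. This is a minimal sketch under stated assumptions, not the paper's actual implementation: a linear inner model, the squared key-value binding loss, and one SGD step per token; the shapes and learning rate are illustrative. The per-token gradient step is algebraically identical to a decayed linear-attention state update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 4, 6, 0.1
K = rng.normal(size=(T, d))   # keys phi(k_t)
V = rng.normal(size=(T, d))   # values v_t
Q = rng.normal(size=(T, d))   # queries phi(q_t)

# (a) TTT view: one SGD step per token on the key-value binding loss
#     L_t(S) = 0.5 * ||S k_t - v_t||^2
S = np.zeros((d, d))
out_ttt = []
for t in range(T):
    grad = np.outer(S @ K[t] - V[t], K[t])   # dL_t/dS
    S = S - eta * grad
    out_ttt.append(S @ Q[t])

# (b) Linear-attention view: the identical computation written as a
#     gated state recurrence S_t = S_{t-1}(I - eta k_t k_t^T) + eta v_t k_t^T
S2 = np.zeros((d, d))
out_lin = []
for t in range(T):
    S2 = S2 @ (np.eye(d) - eta * np.outer(K[t], K[t])) + eta * np.outer(V[t], K[t])
    out_lin.append(S2 @ Q[t])

assert np.allclose(out_ttt, out_lin)
```

Reading the recurrence in (b) as a design recipe is exactly the reverse direction proposed in 1.3: pick a different inner loss or optimizer and read off the linear-attention variant it compiles to.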
1.4. The Role of the "Dynamic Kernel": The paper's best-performing variant (Variant 1) only updates the last layer, freezing the feature extractor phi(·) into a "static kernel". This contradicts the intuition that a dynamic, history-dependent kernel should be more powerful.
The open question is how to retain the expressivity of a dynamic kernel phi_t(·) while mitigating the train-test mismatch that degrades performance. One option is to penalize large changes in the inner-loop weights Θ_t between successive steps: for example, add a loss term ||Θ_t - Θ_{t-1}||² to the main training objective. This would encourage the dynamic kernel to evolve smoothly, potentially retaining its adaptive benefits without causing catastrophic distribution shift at test time.

These ideas generalize the paper's core insight, that an optimization process can be a computational operator, to new territories.
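The smoothness penalty on Θ_t proposed in 1.4 above is easy to prototype. This is a toy sketch under assumptions: the inner loop is one SGD step per token on a squared key-value binding loss, the shapes and λ are illustrative, and the outer task loss (which the penalty would be added to) is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, eta, lam = 4, 8, 0.05, 0.1
K = rng.normal(size=(T, d))   # keys (illustrative features)
V = rng.normal(size=(T, d))   # values

Theta = np.zeros((d, d))      # inner-loop (last-layer) weights Theta_t
smoothness_penalty = 0.0
for t in range(T):
    Theta_prev = Theta
    grad = np.outer(Theta @ K[t] - V[t], K[t])   # gradient of the binding loss
    Theta = Theta - eta * grad                   # inner-loop SGD step
    smoothness_penalty += np.sum((Theta - Theta_prev) ** 2)  # ||Θ_t - Θ_{t-1}||²

# In the full objective this would be added to the outer task loss:
regularizer = lam * smoothness_penalty
```

Because the penalty is a differentiable function of the unrolled inner loop, an autodiff framework could backpropagate it into the feature extractor, directly shaping how fast the dynamic kernel is allowed to drift.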
2.1. The "Optimizer as Operator" Paradigm: The paper analyzes SGD with momentum. This can be generalized to explore how different inner-loop optimizers compile down to different computational operators.
Replacing inner-loop SGD with Adam is a natural first step: the momentum (m_t) and variance (v_t) terms of Adam will likely translate into learnable, per-feature decay and normalization factors within the resulting linear attention-like mechanism. This could lead to a new class of adaptive attention models that are "discovered" through the lens of optimization theory, rather than designed by hand.

2.2. Unifying Standard (Softmax) Attention: The paper unifies TTT and linear attention. The ultimate goal would be to unify both linear and standard softmax attention under a single "optimization-as-computation" framework.
The central question: what inner-loop objective and optimizer yield softmax attention (softmax(QK^T/sqrt(d_k))V) as their solution, i.e., what update rule produces the exp(QK^T) term? A successful result would reframe "attention" as a family of solutions to different inner-loop optimization problems.

2.3. Beyond Gradients: "Computational Scaffolding" for Sequence Modeling: The "Gradient Ascent Anomaly" suggests that the mechanics of the update, not the objective's minimization, are what matter. This opens the door to non-gradient-based update rules.
One could design models in which the state S_t is updated via a simple, learnable, non-gradient update rule, such as a Hebbian update (S_t = S_{t-1} + f(k_t) g(v_t)^T) or a gated update (S_t = gate * S_{t-1} + (1 - gate) * update). This moves away from the "training at test-time" analogy and toward a more direct "fast weight programming" or memory-editing perspective, with potential for even greater efficiency.

These are specific empirical puzzles and contradictions from the paper that warrant deeper investigation.
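The Hebbian and gated state updates proposed in 2.3 above take only a few lines to sketch. Everything here is illustrative: f, g, and the gate stand in for small learnable components, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 4, 5
K = rng.normal(size=(T, d))   # keys k_t
V = rng.normal(size=(T, d))   # values v_t
gate = 0.9                    # fixed here; learnable (even input-dependent) in a real model

f = np.tanh                   # stand-in learnable map for keys
g = lambda v: v               # stand-in learnable map for values

S_hebb = np.zeros((d, d))
S_gated = np.zeros((d, d))
for t in range(T):
    update = np.outer(f(K[t]), g(V[t]))             # fast-weight outer product f(k_t) g(v_t)^T
    S_hebb = S_hebb + update                        # Hebbian: pure accumulation
    S_gated = gate * S_gated + (1 - gate) * update  # gated: leaky accumulation

# Reading from either memory with a query is a single vector-matrix product:
q = rng.normal(size=d)
read_hebb = f(q) @ S_hebb
read_gated = f(q) @ S_gated
```

Note that neither update computes a gradient: the "scaffolding" (outer product plus accumulation) is retained while the optimization interpretation is dropped, which is precisely the perspective shift 2.3 argues for.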
3.1. The Purpose of Q/K Distributional Asymmetry: The paper shows that for TTT, queries and keys come from different distributions, which is "pathological" for retrieval but normal for its linear attention form. The unexplored question is why the model learns this asymmetry and whether it can be controlled.
A first step is to probe what is encoded by phi(q) vs. phi(k). For instance, does phi(k) learn to encode positional or structural information for building the state S_t, while phi(q) learns to encode semantic content for reading from it? One could try to enforce or prevent this asymmetry with contrastive losses during training and measure the impact on performance.

3.2. Exploring the Boundaries of the "Gradient Ascent Anomaly": The finding that gradient ascent works as well as, or better than, descent is striking. It's crucial to understand if this is a universal property or an artifact of the specific tasks and models tested.
A plausible hypothesis: on tasks where the key -> value mapping must be very precise, the structured "noise" of gradient ascent is detrimental, revealing the limits of this anomaly.

These directions explore where the re-framed understanding of TTT as an efficient, adaptive linear attention mechanism could be most impactful.
4.1. Lifelong Learning and Streaming Data Processing: The online, adaptive nature of the TTT mechanism makes it a prime candidate for scenarios with continuously shifting data distributions.
Here, S_t acts as a compressed, adaptive summary of the stream's history.

4.2. On-the-Fly Personalization: The ability to update a model's state in-context without changing its core weights is ideal for efficient personalization.
The state S becomes a "session cache" or "user profile," tailoring responses without expensive fine-tuning. This could be applied to recommender systems, personalized chatbots, or assistive code generation.

4.3. Reinforcement Learning Agents with Adaptive Memory: An RL agent's state representation needs to adapt quickly to changes within an episode.
The (state, action) pairs from the trajectory can be treated as the (key, value) inputs to the TTT layer. The unrolled optimization would allow the agent to build an adaptive "short-term memory" of the episode, potentially improving performance in non-stationary environments or tasks requiring long-term credit assignment.

Training robots to perform tasks using only camera images is notoriously slow and expensive, often requiring millions of simulations that can take days to process. To bridge this gap, researchers introduced Squint, a high-speed learning method that can train a robot to master complex manipulation tasks—like stacking blocks or placing cans—in as little as 15 minutes on a single standard gaming GPU. By "squinting" (rendering high-resolution images and then downsampling them) and optimizing how the AI reuses its past experiences, the system achieves a 91% success rate when transferred directly from the simulator to a real-world robotic arm. This breakthrough suggests a future where sophisticated robotic behaviors can be developed with minimal hardware in less time than it takes to grab a cup of coffee.
As digital information shifts from simple text to a mix of images, videos, and audio, modern search engines are struggling to store the massive amounts of data required to retrieve these "multimodal" documents efficiently. To solve this, researchers developed Attention-Guided Clustering (AGC), a smart compression technique that identifies the most important parts of a document and condenses them into a tiny, high-impact storage footprint. By prioritizing the most descriptive elements of a video or image rather than saving every redundant frame, this method can shrink an index to just a fraction of its original size while actually maintaining—or even improving—search accuracy. This breakthrough makes high-performance, "any-modality" search practical for massive real-world collections like YouTube or web-scale digital archives without requiring astronomical storage costs.
Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking
Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin
Abstract
Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits t
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
Website: https://reflective-test-time-planning.github.io
§ Code: https://github.com/Reflective-Test-Time-Planning/Reflective-Test-Time-Planning
[Figure: (a) an example task, "Put the toy car in the green box", with candidate plans scored via reflections such as "Bad choice: the teddy bear is already in the green box" (score 22) and "The orange box is too small. The toy car doesn't fit into the orange box."]
Statistical Query Lower Bounds for Smoothed Agnostic Learning
Ilias Diakonikolas∗
University of Wisconsin-Madison
ilias@cs.wisc.edu
Daniel M. Kane†
University of California, San Diego
dakane@cs.ucsd.edu
February 25, 2026
Abstract
We study the complexity of smoothed agnostic learning, recently introduced by [CKK+24], in which the learner competes with the best classifier in a target class under slight Gaussian perturbations of the inputs. Specifically, we focus on the prototypical task of agnostica
2026-2-25
On Data Engineering for Scaling LLM Terminal Capabilities
Renjie Pi∗, Grace Lam*, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping†
Abstract
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a li
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi1, M. Moein Esfahani2, Raj Sunderraman1, Vince Calhoun2, Mohammed Alser1
1Department of Computer Science, Georgia State University, Atlanta, GA, USA
2TReNDS Center, Georgia State University, Atlanta, GA, USA
Corresponding authors: ssalemghahfarokhi1@gsu.edu, malser@gsu.edu
Abstract—Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limite
Published as a conference paper at ICLR 2026
THE DIFFUSION DUALITY, CHAPTER II: Ψ-SAMPLERS AND EFFICIENT CURRICULUM
Justin Deschenaux1∗
Caglar Gulcehre1,2
Subham Sekhar Sahoo3∗
1EPFL, Lausanne, Switzerland
2Microsoft AI
3Cornell Tech, NY
ABSTRACT
Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or masked diffusion models in these settings. However, their sampling quality plateaus with
2026-02-25
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat1, Souradip Chakraborty2, Khushbu Pahwa*, Amrit Singh Bedi3
1Singapore University of Technology and Design
2University of Maryland, College Park
3University of Central Florida
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It defines success if any of k independently sample
Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart and Lionel Ott
Autonomous Systems Lab, ETH Zürich, Switzerland
Email: vreijgwart@rai-inst.com, [cesarc | rolandsi | lioott]@ethz.ch
Abstract—Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information. Yet widely used path planning metho
NORD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Ishaan Rawal1,2*
Shubh Gupta1
Yihan Hu1
Wei Zhan1,3†
1Applied Intuition
2Texas A&M University
3UC Berkeley
Abstract
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both cha
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang1, Xiaoou Liu2, Lu Cheng3, Yaqing Wang4, Kenton Murray1,
and Hua Wei2
1 Johns Hopkins University, Baltimore MD, USA {dzhang98,kenton}@jhu.edu
2 Arizona State University, Tempe AZ, USA {xiaoouli,hua.wei}@asu.edu
3 University of Illinois Chicago, Chicago IL, USA lucheng@uic.edu
4 Purdue University, West Lafayette IN, USA wang5075@purdue.edu
Abstract. Large language models (LLMs) are increasingly deployed as multi-step decis
CG-DMER: HYBRID CONTRASTIVE-GENERATIVE FRAMEWORK FOR DISENTANGLED MULTIMODAL ECG REPRESENTATION LEARNING
Ziwei Niu1,3
Hao Sun4
Shujun Bian1
Xihong Yang2
Lanfen Lin3
Yuxin Liu1
Yueming Jin1,2
1 Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore
2 Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore
3 College of Computer Science and Technology, Zhejiang University, Hangzhou, China
4 College of informat
Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions
Mame Diarra Toure1
David A. Stephens1
1Department of Mathematics and Statistics, McGill University
Abstract
In safety-critical classification, the cost of failure is often asymmetric. Yet Bayesian deep learning summarises epistemic uncertainty with a single scalar, mutual information (MI), which cannot distinguish whether a model’s ignorance involves a benign or safety-critical class. We decompose MI
Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
Anurag Dutt
Stony Brook University
adutt@cs.stonybrook.edu
Nimit Shah
Stony Brook University
nimishah@cs.stonybrook.edu
Hazem Masarani
Stony Brook University
hazem.masarani@stonybrook.edu
Anshul Gandhi
Stony Brook University
anshul@cs.stonybrook.edu
Abstract—Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their infe
When doctors use AI to predict a patient’s health risks, they often ask "what if" questions—like "what if this patient didn’t have diabetes?"—to understand how to improve outcomes. However, this paper reveals a "Time Traveler Dilemma," where standard AI methods propose biologically impossible scenarios, such as "removing" a chronic disease that a patient has actually lived with for years. To fix this, researchers developed the Sequential Counterfactual Framework, a new approach that respects the flow of time and medical reality by distinguishing between what we can change (like lab results) and what we cannot (like chronic diagnoses). By testing this on thousands of COVID-19 patients, the team demonstrated how we can move past impossible "what ifs" to generate realistic, actionable medical insights that show exactly how early interventions can stop dangerous health cascades before they start.
2022, pp. 1–18
PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Samah Fodeh,1,2∗ Linhai Ma,1 Yan Wang,1 Srivani Talakokkul,1 Ganesh Puthiaraju,1 Afshan Khan,1 Ashley Hagaman,3 Sarah Lowe3 and Aimee Roundtree4
1Department Of Emergency Medicine, Yale School of Medicine, 464 Congress Ave, 06519, CT, USA, 2Department of Biomedical Informatics
& Data Science, Yale School of Medicine, 100
Published as a conference paper at ICLR 2026
A BENCHMARK FOR DEEP INFORMATION SYNTHESIS
Debjit Paul1, Daniel Murphy2, Milan Gritta1, Ronald Cardenas1,
Victor Prokhorov1, Jun Wang3, Gerasimos Lampouras1
Dataset Contributors:
Lena Sophia Bolliger4, Aysim Toker1, Roy Miles1, Andreea-Maria Oncescu1, Jasivan Alex Sivakumar5, Philipp Borchert1, Ismail Elezi1, Meiru Zhang6, Ka Yiu Lee1, Guchun Zhang1
1Huawei Noah’s Ark Lab, UK
2Imperial College London
3UCL Centre for Artificial Intelligence
4University