Researchers at Anthropic introduce an automated pipeline for extracting "persona vectors" from large language models' activation spaces, enabling both monitoring and causal control of character traits. They show that steering with these vectors can prevent undesirable persona shifts during finetuning, and that the vectors can be used to screen training data, predicting and mitigating the induction of negative traits such as evil, sycophancy, and hallucination.
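As a rough illustration of the mechanism, the sketch below (not the authors' code; all names, the toy data, and the steering coefficient are illustrative assumptions) estimates a trait direction as the difference of mean activations between trait-eliciting and neutral responses, then uses it to monitor and steer a hidden state:

```python
import numpy as np

def persona_vector(trait_acts, baseline_acts):
    """Estimate a trait direction as the difference of mean activations.

    trait_acts / baseline_acts: arrays of shape (n_samples, hidden_dim),
    e.g. residual-stream activations from trait-exhibiting vs. neutral
    responses (hypothetical stand-in data here)."""
    v = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def trait_score(hidden_state, v):
    """Monitoring: project a hidden state onto the persona vector."""
    return float(hidden_state @ v)

def steer(hidden_state, v, alpha=-4.0):
    """Causal control: add a scaled persona vector to the hidden state
    (negative alpha suppresses the trait, positive alpha amplifies it)."""
    return hidden_state + alpha * v

# Toy usage with random stand-in activations.
rng = np.random.default_rng(0)
trait_acts = rng.normal(0.5, 1.0, size=(32, 64))
baseline_acts = rng.normal(0.0, 1.0, size=(32, 64))
v = persona_vector(trait_acts, baseline_acts)
h = rng.normal(size=64)
print(trait_score(h, v), trait_score(steer(h, v), v))
```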
This survey from a large collaborative group, including Princeton AI Lab, provides the first systematic and comprehensive review of self-evolving agents, proposing a unified theoretical framework that categorizes their evolution mechanisms and the timescales on which evolution occurs. It distinguishes self-evolving agents from related AI paradigms and outlines critical research challenges and evaluation methodologies necessary for their advancement toward artificial superintelligence.
CoT-Self-Instruct, developed by FAIR at Meta, introduces a method for generating high-quality synthetic data for Large Language Models by combining Chain-of-Thought reasoning during instruction creation with automated quality filtering. Models trained on the resulting synthetic data achieve superior performance on both reasoning and general instruction-following benchmarks, often surpassing both existing synthetic-data methods and human-annotated datasets.
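A minimal sketch of the two-stage idea, assuming a generic LLM sampling function (`generate` is a stand-in, and the consistency threshold is illustrative): reason step by step over seed tasks to produce a new instruction, then keep it only if repeated solutions agree with its reference answer:

```python
from collections import Counter

def cot_self_instruct(seed_tasks, generate):
    """Generate one new instruction by having the model reason step by step
    over seed tasks before writing a new task of similar quality/difficulty.
    `generate(prompt, n, temperature)` stands in for any LLM sampling API."""
    prompt = (
        "Example tasks:\n" + "\n".join(f"- {t}" for t in seed_tasks) +
        "\nThink step by step about what makes these tasks useful, "
        "then write ONE new task and its reference answer."
    )
    return generate(prompt, n=1, temperature=1.0)[0]

def answer_consistency_keep(question, reference_answer, generate, k=8):
    """Quality-filter sketch: sample k solutions and keep the synthetic item
    only when the majority-vote answer matches the generated reference."""
    answers = generate(f"Solve step by step, final answer only:\n{question}",
                       n=k, temperature=1.0)
    majority, count = Counter(answers).most_common(1)[0]
    return majority == reference_answer and count >= k // 2

# Toy stand-in sampler so the sketch runs end to end.
toy_generate = lambda prompt, n, temperature: ["42"] * n
print(answer_consistency_keep("What is 6 * 7?", "42", toy_generate))  # True
```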
ByteDance Seed AI4Math's Seed-Prover and Seed-Geometry are AI systems that proved 5 of the 6 problems at IMO 2025, establishing new state-of-the-art results on formal mathematical benchmarks including MiniF2F and PutnamBench. The systems achieve this through lemma-style proving, multi-tiered inference strategies that integrate iterative refinement with broad conjecture generation, and a fast, specialized geometry engine.
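A toy Lean 4 illustration of what lemma-style proving means in general (this is not from Seed-Prover): a helper lemma is proved first and then reused to close the main goal, so the prover can accumulate and compose intermediate results rather than attacking the statement in one shot:

```lean
-- Helper lemma proved first, then composed to close the main goal.
theorem drop_zero (n : Nat) : n + 0 = n := Nat.add_zero n

theorem main_goal (a b : Nat) : (a + 0) + (b + 0) = a + b := by
  rw [drop_zero a, drop_zero b]
```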
RLVMR, developed by Tencent, trains Large Language Model agents to perform complex, long-horizon tasks by providing dense, verifiable meta-reasoning rewards during reinforcement learning. This approach leads to enhanced task success and generalization while significantly reducing inefficient exploration, such as repetitive and invalid actions, on benchmarks like ALFWorld and ScienceWorld.
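A sketch of what a dense, verifiable meta-reasoning reward could look like (the tags, checks, and coefficients below are illustrative assumptions, not RLVMR's exact rules): each step is tagged with a reasoning type and scored with rule-based bonuses and penalties on top of the sparse task outcome:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    succeeded: bool

def step_reward(tag, action, history, action_is_valid, task_success):
    """Dense reward sketch: verifiable bonuses for sound meta-reasoning,
    penalties for repetitive or invalid actions, plus the sparse outcome."""
    recent_actions = [s.action for s in history[-5:]]
    r = 0.0
    if not action_is_valid:
        r -= 0.1                      # invalid action penalty
    if action in recent_actions:
        r -= 0.1                      # repeated-action penalty
    if tag == "plan" and not history:
        r += 0.05                     # planning before acting
    if tag == "reflect" and history and not history[-1].succeeded:
        r += 0.05                     # reflecting after a failed step
    if task_success:
        r += 1.0                      # sparse task-success reward
    return r

# Toy usage: a reflection step after a failed action, ending the episode.
history = [Step("open drawer", False)]
print(step_reward("reflect", "open cabinet", history, True, False))  # 0.05
```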
The Falcon LLM Team at the Technology Innovation Institute introduces Falcon-H1, a series of hybrid-head language models that integrate Transformer attention with Mamba-2 SSMs, achieving strong performance across a range of tasks with improved parameter and training efficiency. The models perform especially well in reasoning-intensive domains and long-context processing, often matching or exceeding larger models.
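A minimal sketch of a parallel hybrid block, using a simple diagonal linear recurrence as a stand-in for the actual Mamba-2 mixer (module names, dimensions, and the fusion by summation are illustrative assumptions, not Falcon-H1's architecture):

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """An attention path and a stand-in state-space path process the same
    input in parallel; their outputs are summed into the residual stream."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm_in = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))
        self.ssm_out = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def ssm_path(self, x):
        # x: (batch, seq, d_model); h_t = a * h_{t-1} + u_t, a learned per channel
        u = self.ssm_in(x)
        h = torch.zeros_like(u[:, 0])
        a = torch.sigmoid(self.decay)          # keep the recurrence stable
        outs = []
        for t in range(u.size(1)):
            h = a * h + u[:, t]
            outs.append(h)
        return self.ssm_out(torch.stack(outs, dim=1))

    def forward(self, x):
        y_attn, _ = self.attn(x, x, x, need_weights=False)
        y = x + y_attn + self.ssm_path(x)      # parallel fusion of both paths
        return self.norm(y)

block = HybridBlock(d_model=32)
print(block(torch.randn(2, 16, 32)).shape)     # torch.Size([2, 16, 32])
```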
Meta CLIP 2 from FAIR (Meta AI Research) presents a comprehensive recipe for training Contrastive Language-Image Pretraining (CLIP) models using worldwide, web-scale image-text data. This approach effectively breaks the 'curse of multilinguality,' demonstrating that non-English data can enhance English performance (e.g., ViT-H/14 ImageNet accuracy improved from 80.5% to 81.3%) while setting new state-of-the-art results on numerous multilingual benchmarks.
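For grounding, the symmetric contrastive objective underlying CLIP-style training looks roughly like the sketch below; Meta CLIP 2's multilingual contribution lies in data curation, tokenization, and scaling rather than in this loss:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style training: matched
    image-text pairs lie on the diagonal of the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```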
SWE-Exp introduces an experience-enhanced framework that enables Large Language Model agents to learn from past software issue resolution attempts, achieving a Pass@1 score of 41.6% on the SWE-bench-Verified dataset. It systematically captures and reuses knowledge via a multi-faceted experience bank and a dual-agent architecture, transforming agents from memoryless explorers into strategic, experience-driven problem solvers.
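A sketch of the experience-bank idea, with a toy embedding and entry schema that are illustrative rather than SWE-Exp's actual implementation: lessons distilled from past issue-resolution attempts are stored alongside an issue embedding and retrieved by similarity when a new issue arrives:

```python
import numpy as np

class ExperienceBank:
    """Store (issue embedding, distilled experience) pairs and retrieve the
    most similar past experiences for a new issue by cosine similarity."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.entries = []          # list of (embedding, experience dict)

    def add(self, issue_text, experience):
        self.entries.append((self.embed_fn(issue_text), experience))

    def retrieve(self, issue_text, k=3):
        q = self.embed_fn(issue_text)
        scored = [
            (float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9)), exp)
            for e, exp in self.entries
        ]
        return [exp for _, exp in sorted(scored, key=lambda s: -s[0])[:k]]

# Toy embedding: hash words into a fixed-size bag-of-words vector.
def toy_embed(text, dim=64):
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v

bank = ExperienceBank(toy_embed)
bank.add("KeyError when config file is missing",
         {"lesson": "check the default-config fallback before patching callers"})
print(bank.retrieve("crash with missing config file"))
```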
UniLIP adapts CLIP to serve as a unified image tokenizer, enabling state-of-the-art performance across multimodal understanding, image reconstruction, generation, and editing tasks. The approach retains CLIP's strong semantic comprehension while recovering the pixel-level detail required for generative tasks.
Researchers at the University of Maryland, College Park, empirically demonstrate a "Demos' Position in Prompt" (DPP) bias in large language models, showing that merely repositioning in-context demonstrations within a prompt can cause accuracy to fluctuate by up to 20 percentage points and flip nearly half of a model's predictions. The study reveals a consistent advantage for placing demonstrations early in the prompt and finds that placing them at the very end of the user message often degrades performance.
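A small sketch of the manipulation being measured (position names and prompt templates are illustrative, not the paper's exact taxonomy): the same demonstrations are placed at different positions around a fixed query, and comparing the model's answers across variants quantifies how many predictions flip purely due to demo position:

```python
def build_prompt(demos, query, position):
    """Assemble a prompt with the demo block at a chosen position."""
    demo_block = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    if position == "start_of_user":
        return f"{demo_block}\n\nQ: {query}\nA:"
    if position == "end_of_user":
        return f"Q: {query}\n\nHere are some examples:\n{demo_block}\nA:"
    raise ValueError(position)

demos = [("2+2?", "4"), ("3+5?", "8")]
for pos in ("start_of_user", "end_of_user"):
    print(f"--- {pos} ---")
    print(build_prompt(demos, "7+6?", pos))
# Sending both variants to the same model and comparing the answers
# measures accuracy shifts and prediction flips across demo positions.
```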