Papers (1262)

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Youtu-LLM is a small (1.96B-parameter) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.

#lightweight LLM #agentic mid-training #trajectory data

Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space

Intermediate

Xingwei Qu, Shaowen Wang et al. · Dec 31 · arXiv

Language is lumpy: easy stretches and tricky jumps are mixed together, but old models spend the same effort on every word.

#Dynamic Large Concept Models #semantic boundaries #latent reasoning

Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Intermediate

Yuchen Shi, Yuzheng Cai et al. · Dec 31 · arXiv

Youtu-Agent is a build-and-grow factory for AI agents that cuts manual setup and keeps agents improving over time.

#LLM agents #automated agent generation #modular architecture

Recursive Language Models

Beginner

Alex L. Zhang, Tim Kraska et al. · Dec 31 · arXiv

Recursive Language Models (RLMs) let an AI read and work with prompts that are much longer than its normal memory by treating the prompt like a big external document it can open, search, and study with code.

#Recursive Language Models #RLM #Long-context reasoning

PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Intermediate

Yuanhao Cai, Kunpeng Li et al. · Dec 31 · arXiv

This paper teaches text-to-video models to follow real-world physics, so people, balls, water, glass, and fire act the way they should.

#text-to-video generation #physical consistency #direct preference optimization

Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Beginner

Song Wang, Lingdong Kong et al. · Dec 30 · arXiv

Autonomous systems such as self-driving cars and drones see the world through many different sensors (cameras, LiDAR, radar, and even event cameras), and this paper lays out a roadmap for teaching them to understand space by learning from all of these together.

#Spatial Intelligence #Multi-Modal Pre-Training #Self-Supervised Learning

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Intermediate

Yong Xien Chng, Tao Hu et al. · Dec 30 · arXiv

SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.

#multimodal agent #vision-language model #reinforcement learning

Figure It Out: Improve the Frontier of Reasoning with Executable Visual States

Intermediate

Meiqi Chen, Fandong Meng et al. · Dec 30 · arXiv

FIGR is a new way for AI to ‘think by drawing,’ using code to build clean, editable diagrams while it reasons.

#executable visual states #diagrammatic reasoning #reinforcement learning for reasoning

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Intermediate

Zhe Huang, Hao Wen et al. · Dec 30 · arXiv

Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.

#multimodal large language model #video understanding #visual hallucination

GR-Dexter Technical Report

Intermediate

Ruoshi Wen, Guangzeng Chen et al. · Dec 30 · arXiv

GR-Dexter is a full package—new robot hands, a smart AI brain, and lots of carefully mixed data—that lets a two-handed robot follow language instructions to do long, tricky tasks.

#vision-language-action #dexterous manipulation #bimanual robotics

Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Beginner

Xingyu Zhou, Qifan Li et al. · Dec 30 · arXiv

This paper shows a simple way to make image-generating AIs (diffusion Transformers) produce clearer, more accurate pictures by letting the model guide itself from the inside.

#Internal Guidance #Diffusion Transformer #Intermediate Supervision

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Beginner

Zefeng He, Xiaoye Qu et al. · Dec 30 · arXiv

DiffThinker turns hard picture-based puzzles into an image-to-image drawing task instead of a long text-generation task.

#DiffThinker #Generative Multimodal Reasoning #Diffusion Models
