Papers791

DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

Shidong Cao, Hongzhan Lin et al.Jan 7arXiv

DiffCoT treats a model’s step-by-step thinking (Chain-of-Thought) like a messy draft that can be cleaned up over time, not something fixed forever.

#Chain-of-Thought#Diffusion models#Autoregressive decoding

Evolving Programmatic Skill Networks

Intermediate

Haochen Shi, Xingdi Yuan et al.Jan 7arXiv

This paper teaches a computer agent to grow a toolbox of skills that are real, runnable programs, not just text ideas.

#Programmatic Skill Network#continual learning#symbolic programs

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Intermediate

Mingyang Wei, Dehai Min et al.Jan 6arXiv

EpiQAL is a new benchmark that tests how well AI models answer population-level disease questions using real research papers.

#Epidemiological reasoning#Question answering#Benchmarking LLMs

The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

Intermediate

Yan Wang, Yitao Xu et al.Jan 6arXiv

Mixture-of-Experts (MoE) language models don’t split cleanly into domain specialists; instead, a small, stable group of experts gets chosen again and again across many subjects.

#Mixture-of-Experts#Standing Committee#Sparse routing

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

Intermediate

Hao Yu, Haotong Lin et al.Jan 6arXiv

InfiniDepth is a new way to predict depth that treats every image location as a smooth, continuous place you can ask for depth, not just the fixed pixels of a grid.

#monocular depth estimation#neural implicit fields#arbitrary resolution depth

LTX-2: Efficient Joint Audio-Visual Foundation Model

Intermediate

Yoav HaCohen, Benny Brazowski et al.Jan 6arXiv

LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.

#text-to-video#text-to-audio#audiovisual generation

Unified Thinker: A General Reasoning Modular Core for Image Generation

Intermediate

Sashuai Zhou, Qiang Zhou et al.Jan 6arXiv

Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.

#reasoning-aware image generation#structured planning#edit-only prompt

SOP: A Scalable Online Post-Training System for Vision-Language-Action Models

Intermediate

Mingjie Pan, Siyuan Feng et al.Jan 6arXiv

This paper introduces SOP, a system that lets many real robots learn new skills online at the same time while keeping one shared brain (policy).

#Scalable Online Post-Training#Vision-Language-Action Models#Actor–Learner Architecture

MMFormalizer: Multimodal Autoformalization in the Wild

Intermediate

Jing Xiong, Qi Han et al.Jan 6arXiv

MMFormalizer is a new system that turns problems with pictures and words (like physics scenes or geometry diagrams) into strict, checkable math statements and proofs.

#multimodal autoformalization#Lean theorem prover#dimensional grounding

Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts

Intermediate

Dhruv Trehan, Paras ChopraJan 6arXiv

The authors built a simple six-agent system to see if today’s AI models could plan, run, and write a research paper mostly on their own.

#autonomous research pipeline#implementation drift#training data bias

Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy

Intermediate

Hosein Hasani, Mohammadali Banayeeanzade et al.Jan 6arXiv

Large language models (LLMs) are good at many math problems but often mess up simple counting when the list gets long.

#mechanistic interpretability#counting in LLMs#System-2 prompting

DreamStyle: A Unified Framework for Video Stylization

Intermediate

Mengtian Li, Jinshu Chen et al.Jan 6arXiv

DreamStyle is a single video-stylizing model that can follow text, copy a style image, or continue from a stylized first frame—without switching tools.

#video stylization#image-to-video (I2V)#token-specific LoRA

36 37 38 39 40