How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (1,055)


Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Intermediate
Yonggan Fu, Lexington Whalen et al. · Dec 16 · arXiv

Autoregressive (AR) models write one word at a time, which is accurate but slow, especially when your computer or GPU can’t keep many tasks in memory at once.

#diffusion language models · #autoregressive models · #AR-to-dLM conversion
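The sequential bottleneck the summary describes can be seen in a minimal sketch of autoregressive decoding: each new token depends on all previously generated tokens, so generation takes one model call per token. The `toy_next_token` function below is a stand-in (not the paper's model) that echoes a fixed continuation, just to show the loop shape.

```python
def toy_next_token(context):
    # Stand-in for a real language model's next-token prediction.
    continuation = ["models", "write", "one", "token", "at", "a", "time", "<eos>"]
    return continuation[min(len(context) - 1, len(continuation) - 1)]

def generate(prompt_tokens, max_new_tokens=8):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):  # one forward pass per generated token
        nxt = toy_next_token(tokens)
        tokens.append(nxt)
        if nxt == "<eos>":           # stop at the end-of-sequence marker
            break
    return tokens

print(generate(["AR"]))
```

Diffusion language models aim to relax exactly this one-token-per-step constraint by refining many positions in parallel.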

HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

Intermediate
HyperAI Team, Yuchen Liu et al. · Dec 16 · arXiv

HyperVL is a small but smart model that understands images and text, designed to run fast on phones and tablets.

#HyperVL · #on-device multimodal · #edge AI

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

Intermediate
Mengzhang Cai, Xin Gao et al. · Dec 16 · arXiv

OpenDataArena (ODA) is a fair, open platform that measures how valuable different post‑training datasets are for large language models by holding everything else constant.

#OpenDataArena · #post-training datasets · #data-centric AI
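"Holding everything else constant" is the core of a controlled comparison, which can be sketched in a few lines. This is an illustrative toy, not ODA's actual protocol: the base model, training recipe, and benchmark are fixed, and only the post-training dataset varies, so score differences are attributable to the data.

```python
def train(base_model, dataset):
    # Toy stand-in for post-training: just records which dataset was used.
    return {"base": base_model, "data": dataset["name"], "quality": dataset["quality"]}

def evaluate(model, benchmark):
    # Toy stand-in for evaluation: the score depends only on the dataset,
    # since model, recipe, and benchmark are identical across runs.
    return round(model["quality"] * benchmark["weight"], 3)

base = "shared-base-model"                         # fixed across all runs
benchmark = {"name": "shared-eval", "weight": 0.9} # fixed across all runs

datasets = [
    {"name": "dataset-A", "quality": 0.70},
    {"name": "dataset-B", "quality": 0.85},
]

scores = {d["name"]: evaluate(train(base, d), benchmark) for d in datasets}
print(scores)  # higher score => more valuable dataset under this fixed recipe
```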

FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Intermediate
Jonas Golde, Patrick Haller et al. · Dec 15 · arXiv

FiNERweb is a new, carefully built dataset pipeline that teaches computers to spot names of people, places, and more across 91 languages and 25 writing systems.

#multilingual NER · #named entity recognition · #LLM supervision

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Intermediate
Jitesh Jain, Jialuo Li et al. · Dec 15 · arXiv

SAGE is a smart video-watching agent that decides when to answer quickly and when to take multiple steps, just like how people skim or rewind long videos.

#any-horizon reasoning · #video agents · #temporal grounding

LitePT: Lighter Yet Stronger Point Transformer

Intermediate
Yuanwen Yue, Damien Robert et al. · Dec 15 · arXiv

LitePT is a new AI backbone for 3D point clouds that uses convolutions in early layers and attention in later layers to be both fast and accurate.

#LitePT · #Point Transformer · #3D point cloud

Towards Scalable Pre-training of Visual Tokenizers for Generation

Intermediate
Jingfeng Yao, Yuda Song et al. · Dec 15 · arXiv

The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.

#visual tokenizer · #latent space · #Vision Transformer

Feedforward 3D Editing via Text-Steerable Image-to-3D

Intermediate
Ziqi Ma, Hongqiao Chen et al. · Dec 15 · arXiv

Steer3D lets you change a 3D object just by typing what you want, like “add a roof rack,” and it does it in one quick pass.

#3D editing · #image-to-3D · #ControlNet

Towards Interactive Intelligence for Digital Humans

Intermediate
Yiyi Cai, Xuangeng Chu et al. · Dec 15 · arXiv

Digital humans used to just copy motions; this paper makes them think, speak, and move in sync like real people.

#interactive intelligence · #digital human · #multimodal avatar

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

Intermediate
Enshen Zhou, Cheng Chi et al. · Dec 15 · arXiv

RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.

#RoboTracer · #spatial trace · #3D spatial referring

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Intermediate
Boxin Wang, Chankyu Lee et al. · Dec 15 · arXiv

The paper introduces Nemotron-Cascade, a step-by-step (cascaded) reinforcement learning recipe that trains an AI across domains like alignment, instructions, math, coding, and software engineering—one at a time.

#Cascaded Reinforcement Learning · #RLHF · #Instruction-Following RL
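The "one domain at a time" recipe can be sketched as a sequential pipeline where each stage starts from the previous stage's checkpoint. This is an illustrative toy, not the paper's training code; the stage names follow the summary above, and `rl_stage` merely records the training order.

```python
# Domains trained sequentially (cascaded), per the summary above.
STAGES = ["alignment", "instructions", "math", "coding", "software engineering"]

def rl_stage(checkpoint, domain):
    # Toy stand-in for one RL stage: append the domain to the
    # checkpoint's training history.
    return checkpoint + [domain]

def cascade(initial_checkpoint):
    ckpt = initial_checkpoint
    for domain in STAGES:  # one stage at a time, each building on the last
        ckpt = rl_stage(ckpt, domain)
    return ckpt

history = cascade(["pretrained"])
print(history)
```

The design point the cascade illustrates: later stages inherit everything earlier stages learned, rather than mixing all domains into one training run.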

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Intermediate
Jianxiong Gao, Zhaoxi Chen et al. · Dec 15 · arXiv

LongVie 2 is a video world model that can generate controllable videos for 3–5 minutes while keeping the look and motion steady over time.

#long video generation · #world model · #multimodal control
