Papers (4)

#Direct Preference Optimization (DPO)

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

The paper introduces Rubric-ARM, a system that trains two AI helpers (a rubric maker and a judge) together using alternating reinforcement learning, so they can better decide which answers people would prefer.

#Rubric-based reward modeling #LLM-as-a-judge #Alternating reinforcement learning

GameTalk: Training LLMs for Strategic Conversation

Intermediate

Victor Conchello Vendrell, Max Ruiz Luyten et al. · Jan 22 · arXiv

Large language models usually get judged one message at a time, but many real tasks need smart planning across a whole conversation.

#strategic conversation #reinforcement learning for LLMs #multi-turn dialogue

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Intermediate

Jiawei Liu, Junqiao Li et al. · Dec 24 · arXiv

DreaMontage is a new AI method that makes long, single-shot videos that feel smooth and connected, even when the images or short clips you give it are scattered across the middle of the video.

#arbitrary frame conditioning #one-shot video generation #Diffusion Transformer

T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

Intermediate

Dmitrii Stoianov, Danil Taranets et al. · Dec 11 · arXiv

T-pro 2.0 is an open Russian language model that can answer quickly or think step by step, so you can choose between speed and accuracy as needed.

#T-pro 2.0 #Russian LLM #Hybrid reasoning
