Papers (1262)

Large Multimodal Models as General In-Context Classifiers

Marco Garosi, Matteo Farina et al. | Feb 26 | arXiv

People often pick CLIP-like models for image labeling, but this paper shows that large multimodal models (LMMs) can match or even outperform them when given a few examples in the prompt (in-context learning).

#in-context learning #multimodal models #open-world classification

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Intermediate

Wenjia Wang, Liang Pan et al. | Feb 26 | arXiv

EmbodMocap is a low-cost, portable way to capture people moving inside real places using just two iPhones, so computers and robots can learn from real life instead of studios.

#Embodied AI #4D human-scene reconstruction #dual-view RGB-D

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Intermediate

Zhaochen Su, Jincheng Gao et al. | Feb 26 | arXiv

AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.

#AgentVista #multimodal agents #visual grounding

The Trinity of Consistency as a Defining Principle for General World Models

Intermediate

Jingxuan Wei, Siyuan Li et al. | Feb 26 | arXiv

The paper argues that to build an AI that truly understands and simulates the real world, it must be consistent in three ways at once: across different senses (modal), across 3D space (spatial), and across time (temporal).

#world model #trinity of consistency #modal consistency

GeoWorld: Geometric World Models

Intermediate

Zeyu Zhang, Danning Li et al. | Feb 26 | arXiv

GeoWorld is a new way for AI to plan several steps into the future by thinking in shapes (geometry) instead of only numbers.

#geometric world model #hyperbolic JEPA #Poincaré ball

SkillNet: Create, Evaluate, and Connect AI Skills

Intermediate

Yuan Liang, Ruobin Zhong et al. | Feb 26 | arXiv

Before SkillNet, AI agents kept solving the same kinds of problems over and over without saving what they learned in a clean, reusable way.

#AI skills #Skill ontology #Skill taxonomy

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Intermediate

Zeyuan Liu, Jeonghye Kim et al. | Feb 26 | arXiv

This paper teaches a language-model agent to explore smarter by combining two ways of learning (on-policy and off-policy) with a simple, self-written memory.

#EMPO #memory-augmented agents #on-policy learning

General Agent Evaluation

Intermediate

Elron Bandel, Asaf Yehudai et al. | Feb 26 | arXiv

This paper shows how to fairly test "general-purpose" AI agents that should work in many places without special tweaks.

#general-purpose agents #agent evaluation #unified protocol

OmniGAIA: Towards Native Omni-Modal AI Agents

Intermediate

Xiaoxi Li, Wenxiang Jiao et al. | Feb 26 | arXiv

OmniGAIA is a new test that checks if AI can watch videos, look at images, listen to audio, and use web and code tools in several steps to find a verified answer.

#OmniGAIA #OmniAtlas #Tool-Integrated Reasoning

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Intermediate

Hongrui Jia, Chaoya Jiang et al. | Feb 26 | arXiv

Large multimodal models (LMMs) can look at pictures and read text, but they still miss tricky cases, like tiny chart labels or multi-step math.

#Large Multimodal Models #Diagnostic-driven Progressive Evolution #Reinforcement Learning

Imagination Helps Visual Reasoning, But Not Yet in Latent Space

Beginner

You Li, Chi Chen et al. | Feb 26 | arXiv

The paper asks a simple question: do the model’s invisible “imagination tokens” actually help it reason about images?

#multimodal large language model #visual reasoning #latent visual reasoning

Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Intermediate

Nils Schwager, Simon Münker et al. | Feb 26 | arXiv

This paper tests whether AI can realistically guess what a specific social media user would comment when they see a new post.

#Conditioned Comment Prediction #LLM user simulation #implicit conditioning
