Papers (5)

#multimodal learning

DREAM: Where Visual Understanding Meets Text-to-Image Generation

Beginner

Chao Li, Tianhong Li et al., Mar 3, arXiv

DREAM is a single model that both understands images (like CLIP) and generates images from text (like leading text-to-image models).

#DREAM #contrastive learning #masked autoregressive modeling

Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

Intermediate

Aryan Das, Tanishq Rachamalla et al., Feb 16, arXiv

This paper builds a medical image segmentation system that uses both pictures (like X-rays) and words (short clinical text) at the same time.

#medical image segmentation #vision-language segmentation #uncertainty estimation

Kimi K2.5: Visual Agentic Intelligence

Beginner

Kimi Team, Tongtong Bai et al., Feb 2, arXiv

Kimi K2.5 is a new open-source AI model that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.

#multimodal learning #vision-language models #joint optimization

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Intermediate

Shijie Lian, Bin Yu et al., Jan 21, arXiv

Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.

#Vision-Language-Action #Bayesian decomposition #Latent Action Queries

A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers

Intermediate

Mohammad Nasirzadeh, Jafar Tahmoresnezhad et al., Dec 29, arXiv

CoLog is a new AI system that reads computer logs like a story and spots both single strange events (point anomalies) and strange patterns over time (collective anomalies).

#log anomaly detection #multimodal learning #collaborative transformer