Papers (4)

#Q-Former

World Guidance: World Modeling in Condition Space for Action Generation

WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those bits to pick better actions.

#Vision-Language-Action #world modeling #condition space


Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Intermediate

Yuxuan Yang, Zhonghao Yan et al. · Feb 23 · arXiv

Hepato-LLaVA is an expert multimodal LLM that reads whole-slide microscope images of liver tissue and answers medical questions about liver cancer.

#Hepato-LLaVA #Hepatocellular Carcinoma #Whole Slide Images


Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Intermediate

Hai Zhang, Siqi Liang et al. · Feb 5 · arXiv

Robots usually need very detailed, step-by-step directions, but real life often gives only short, simple goals like ‘find the red bench.’

#Beyond-the-View Navigation #Sparse Video Generation #Vision-Language Navigation


Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

Intermediate

Shengchao Zhou, Yuxin Chen et al. · Dec 23 · arXiv

The paper tackles a big blind spot in vision-language models: understanding how objects move and relate in 3D over time (dynamic spatial reasoning, or DSR).

#dynamic spatial reasoning #vision-language models #4D understanding
