How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (29)

Tag: #vision-language model

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Intermediate
Said Taghadouini, Adrien Cavaillès et al. · Jan 20 · arXiv

LightOnOCR-2-1B is a single, compact AI model that reads PDF pages and scanned documents and turns them into clean, well-ordered text, without relying on fragile multi-step OCR pipelines.

#LightOnOCR-2-1B #end-to-end OCR #vision-language model
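
A minimal sketch of how such an end-to-end OCR model might be called through the Hugging Face transformers API; the hub id, prompt, and processor behavior here are assumptions for illustration, not LightOn's published interface:

```python
# Minimal sketch: one forward pass from page image to ordered text.
# The model id and prompt format are hypothetical placeholders.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "lightonai/LightOnOCR-2-1B"  # hypothetical hub id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

page = Image.open("scanned_page.png")  # one PDF page rendered as an image
inputs = processor(images=page, text="Transcribe this page.", return_tensors="pt")

# Single end-to-end pipeline: no separate detect/crop/recognize stages.
out_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(out_ids, skip_special_tokens=True)[0])
```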

Future Optical Flow Prediction Improves Robot Control & Video Generation

Intermediate
Kanchana Ranasinghe, Honglu Zhou et al. · Jan 15 · arXiv

FOFPred is a new AI that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move in the next moments.

#optical flow #future optical flow prediction #vision-language model
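
A shapes-only sketch of the interface such a predictor exposes: frames plus an instruction embedding in, a dense per-pixel (dx, dy) flow map out. The toy network below is a placeholder, not FOFPred's architecture:

```python
# Interface sketch for language-conditioned future optical flow prediction.
import torch
import torch.nn as nn

class FutureFlowPredictor(nn.Module):
    """Toy stand-in: (frames, instruction embedding) -> dense future flow."""
    def __init__(self, text_dim: int = 512):
        super().__init__()
        self.backbone = nn.Conv2d(6, 64, 3, padding=1)  # two stacked RGB frames
        self.film = nn.Linear(text_dim, 64)             # condition on the instruction
        self.head = nn.Conv2d(64, 2, 3, padding=1)      # (dx, dy) per pixel

    def forward(self, frames: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(frames)                   # (B, 64, H, W)
        feats = feats + self.film(text_emb)[:, :, None, None]
        return self.head(feats)                         # (B, 2, H, W) future flow

frames = torch.randn(1, 6, 128, 128)  # two RGB frames stacked on channels
text = torch.randn(1, 512)            # e.g. "move the bottle left to right"
flow = FutureFlowPredictor()(frames, text)
print(flow.shape)  # torch.Size([1, 2, 128, 128])
```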

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Intermediate
Christopher Clark, Jieyu Zhang et al. · Jan 15 · arXiv

Molmo2 is a family of vision-language models that can watch videos, understand them, and point to or track things over time using fully open weights, data, and code.

#vision-language model #video grounding #pointing and tracking
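
The first Molmo release expressed pointing as XML-like <point> tags inside the text output; assuming Molmo2 keeps a similar convention (an assumption, not confirmed here), the grounding can be recovered with a small parser:

```python
# Sketch: parsing pointing output from a Molmo-style model.
# The tag format below follows Molmo v1; Molmo2's may differ.
import re

answer = 'The mug is here: <point x="62.3" y="41.8" alt="mug">mug</point>'

for m in re.finditer(r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>(.*?)</point>', answer):
    x, y, label = float(m.group(1)), float(m.group(2)), m.group(3)
    # In Molmo's convention, coordinates are percentages of image width/height.
    print(f"{label}: ({x:.1f}%, {y:.1f}%)")
```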

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Intermediate
Yu Wang, Yi Wang et al. · Jan 15 · arXiv

Cities are full of places defined by people, like schools and parks, which are hard to identify from satellite imagery alone; this paper uses vision-language reasoning, refined with reinforcement learning, to segment these socio-semantic categories.

#socio-semantic segmentation #vision-language model #reinforcement learning

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

Intermediate
Sheng-Yu Huang, Jaesung Choe et al. · Jan 14 · arXiv

OpenVoxel is a training-free way to understand 3D scenes by grouping tiny 3D blocks (voxels) into objects and giving each object a clear caption.

#OpenVoxel #Sparse Voxel Rasterization #training-free 3D understanding
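
A rough sketch of the training-free recipe: cluster voxels by position and feature similarity, then hand each group to a frozen captioner. The stand-in data, clustering choice (DBSCAN), and stubs below are illustrative, not the paper's exact pipeline:

```python
# Training-free grouping sketch: no gradients, only clustering + a frozen VLM.
import numpy as np
from sklearn.cluster import DBSCAN

voxels = np.random.rand(5000, 3)   # voxel centers (stand-in data)
feats = np.random.rand(5000, 16)   # per-voxel features (e.g. lifted 2D features)

# Group voxels that are close in space AND similar in feature space.
joint = np.concatenate([voxels, 0.5 * feats], axis=1)
labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(joint)

for obj_id in set(labels) - {-1}:  # -1 = noise in DBSCAN
    mask = labels == obj_id
    # In the real system: rasterize this voxel group to an image (sparse voxel
    # rasterization) and ask a frozen VLM to caption the rendered object.
    print(f"object {obj_id}: {mask.sum()} voxels -> caption via frozen VLM")
```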

SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL

Intermediate
Lijun Liu, Linwei Chen et al. · Jan 14 · arXiv

SkinFlow is a 7B-parameter vision–language model that diagnoses skin conditions by sending the most useful visual information to the language brain, instead of just getting bigger.

#dermatology AI #vision-language model #Dynamic Visual Encoding
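
A generic sketch of the "send only the most useful visual information" idea: score image patch tokens against a task embedding and forward only the top-k to the language model. This illustrates token selection in general, not SkinFlow's Dynamic Visual Encoding specifically:

```python
# Sketch: keep only the k most task-relevant visual tokens.
import torch

def select_visual_tokens(patch_tokens: torch.Tensor, query: torch.Tensor, k: int):
    """patch_tokens: (N, D) image tokens; query: (D,) task embedding."""
    scores = patch_tokens @ query                # relevance of each patch
    top = scores.topk(k).indices                 # indices of the k best patches
    return patch_tokens[top]

tokens = torch.randn(576, 1024)                  # e.g. a 24x24 patch grid
task = torch.randn(1024)                         # "diagnose this lesion" embedding
kept = select_visual_tokens(tokens, task, k=64)  # 64 tokens instead of 576
print(kept.shape)  # torch.Size([64, 1024])
```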

VIBE: Visual Instruction Based Editor

Intermediate
Grigorii Alekseenko, Aleksandr Gordeev et al. · Jan 5 · arXiv

VIBE is a tiny but mighty image editor that listens to your words and changes pictures while keeping the original photo intact unless you ask otherwise.

#instruction-based image editing #vision-language model #diffusion model
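
For flavor, here is what instruction-based diffusion editing looks like with the generic InstructPix2Pix pipeline from the diffusers library; VIBE's own weights and inference stack may differ, so treat this as an analogous interaction pattern rather than VIBE's API:

```python
# Instruction-based editing with diffusers (InstructPix2Pix, not VIBE itself).
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB")
edited = pipe(
    "make the sky sunset orange",   # the instruction ("your words")
    image=image,
    image_guidance_scale=1.5,       # higher = stay closer to the original photo
).images[0]
edited.save("edited.png")
```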

PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Intermediate
Yuanhao Cai, Kunpeng Li et al. · Dec 31 · arXiv

This paper teaches text-to-video models to follow real-world physics, so people, balls, water, glass, and fire act the way they should.

#text-to-video generation #physical consistency #direct preference optimization
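
A sketch of the groupwise preference objective's shape: sample several videos for one prompt, score their physical plausibility, and apply a DPO-style loss between the best and worst of the group. The physics scorer and the paper's full groupwise weighting are not reproduced:

```python
# Groupwise DPO-style objective (generic shape, not PhyGDPO's full method).
import torch
import torch.nn.functional as F

def groupwise_dpo_loss(logp_policy, logp_ref, phys_scores, beta=0.1):
    """All inputs: (G,) tensors over one group of G videos for the same prompt."""
    best = phys_scores.argmax()    # most physics-consistent sample
    worst = phys_scores.argmin()   # least consistent sample
    # Standard DPO margin between chosen and rejected, offset by the reference model.
    margin = (logp_policy[best] - logp_policy[worst]) - (logp_ref[best] - logp_ref[worst])
    return -F.logsigmoid(beta * margin)

loss = groupwise_dpo_loss(
    torch.randn(4), torch.randn(4), torch.tensor([0.9, 0.2, 0.5, 0.1])
)
```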

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Intermediate
Yong Xien Chng, Tao Hu et al. · Dec 30 · arXiv

SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.

#multimodal agent #vision-language model #reinforcement learning
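
A skeleton of the agentic loop this implies: at each step the model chooses one of the three tools or a final answer, and tool observations feed the next thought. Tool bodies and the model interface are placeholders; the paper trains the decision policy with reinforcement learning:

```python
# Skeleton of a tool-using reasoning loop (placeholders throughout).
def text_search(query: str) -> str: ...
def image_search(image) -> str: ...
def crop(image, box): ...

TOOLS = {"text_search": text_search, "image_search": image_search, "crop": crop}

def agent_step(model, state):
    # `model.decide` is a hypothetical interface returning an action dict,
    # e.g. {"tool": "crop", "args": {...}} or {"tool": "final_answer", ...}.
    action = model.decide(state)
    if action["tool"] == "final_answer":
        return action["args"]["answer"], True
    observation = TOOLS[action["tool"]](**action["args"])
    state.append(observation)  # the tool result conditions the next step
    return state, False
```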

Figure It Out: Improve the Frontier of Reasoning with Executable Visual States

Intermediate
Meiqi Chen, Fandong Meng et al. · Dec 30 · arXiv

FIGR is a new way for AI to ‘think by drawing,’ using code to build clean, editable diagrams while it reasons.

#executable visual states #diagrammatic reasoning #reinforcement learning for reasoning
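
A skeleton of the reason-render-observe loop: at each step the model may edit executable diagram code, an executor renders it, and the rendered image conditions the next step. Model and renderer are placeholders here, not FIGR's actual components:

```python
# Skeleton of "thinking by drawing" with executable visual states.
def render(diagram_code: str): ...  # placeholder executor (e.g. runs plotting code)

def reason_with_visual_states(model, question, max_steps=5):
    diagram_code = ""                          # the executable visual state
    for _ in range(max_steps):
        step = model.generate(question, diagram_code)  # hypothetical interface
        if step.get("answer"):
            return step["answer"]
        diagram_code = step["code"]            # model edits the diagram...
        image = render(diagram_code)           # ...the executor re-renders it...
        model.observe(image)                   # ...and the result guides the next step
```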

Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

Intermediate
Yi Zhou, Xuechao Zou et al. · Dec 28 · arXiv

Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.

#semi-supervised segmentation #remote sensing #pseudo-label drift
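
A simplified sketch of co-guidance on unlabeled images: each student learns only from the other student's confident pseudo-labels, so neither can drift unchecked. Confidence gating and the paper's co-fusion step are reduced to a single threshold here:

```python
# Cross-teaching between two heterogeneous students (simplified).
import torch
import torch.nn.functional as F

def co_guidance_loss(logits_a, logits_b, conf_thresh=0.9):
    """logits_*: (B, C, H, W) predictions from the two students on the same crop."""
    prob_a, prob_b = logits_a.softmax(1), logits_b.softmax(1)
    pseudo_a, conf_a = prob_a.argmax(1), prob_a.max(1).values
    pseudo_b, conf_b = prob_b.argmax(1), prob_b.max(1).values
    # Each student learns only from the OTHER student's confident pixels.
    loss_a = (F.cross_entropy(logits_a, pseudo_b, reduction="none")
              * (conf_b > conf_thresh)).mean()
    loss_b = (F.cross_entropy(logits_b, pseudo_a, reduction="none")
              * (conf_a > conf_thresh)).mean()
    return loss_a + loss_b
```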

Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Intermediate
Jiacheng Ye, Shansan Gong et al. · Dec 27 · arXiv

Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, talk about them, and plan actions better than many regular (autoregressive) models.

#diffusion language model #vision-language model #vision-language-action
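
A toy illustration of why a diffusion LM backbone decodes differently from an autoregressive one: generation starts fully masked and iteratively fills in the most confident positions. Real samplers unmask many tokens per step; this one-position-per-step version, with a placeholder model, is for clarity only:

```python
# Toy masked-diffusion decoding (contrast with left-to-right generation).
import torch

def diffusion_decode(model, length, steps=8, mask_id=0):
    # `model` is assumed to map (1, L) token ids to (1, L, V) logits.
    tokens = torch.full((1, length), mask_id)        # start fully masked
    for _ in range(steps):
        logits = model(tokens)                       # predict every position at once
        conf, pred = logits.softmax(-1).max(-1)      # per-position confidence + guess
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        # Commit the single most confident still-masked position this step.
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.argmax()
        tokens.view(-1)[idx] = pred.view(-1)[idx]
    return tokens
```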