Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Intermediate · Boqiang Zhang, Lei Ke et al. · Mar 6 · arXiv
Penguin-VL shows that small vision-language models (2B and 8B parameters) can be very strong if you give them a better vision encoder, not just a bigger language model.
#Vision Language Model #LLM-based Vision Encoder #Contrastive Learning