Big models like Whisper are great for accuracy but too slow for live captions; this paper builds a smaller, faster Thai speech recognizer for real-time use.
This paper introduces Laser, a new way for vision-language models to think in their hidden space before speaking, so they see the whole βforestβ before picking out the βtrees.β