Most reinforcement learning agents get only a simple pass/fail reward, which hides how close each attempt actually came to succeeding.
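To make that gap concrete, here is a minimal, generic sketch (not from the paper; the task setup and scoring functions are hypothetical) showing how a binary pass/fail reward collapses attempts of very different quality into the same signal, while a graded reward keeps that information.

```python
# Hypothetical illustration: a binary pass/fail reward hides partial progress,
# while a graded reward reflects how close each attempt came to success.

def binary_reward(steps_completed: int, steps_required: int) -> float:
    """1.0 only if the whole task succeeded, otherwise 0.0."""
    return 1.0 if steps_completed >= steps_required else 0.0

def graded_reward(steps_completed: int, steps_required: int) -> float:
    """Fraction of the task completed, in [0, 1]."""
    return min(steps_completed, steps_required) / steps_required

# Two failed attempts of very different quality (9/10 vs. 1/10 steps done):
for done in (9, 1):
    print(done, binary_reward(done, 10), graded_reward(done, 10))
# Binary reward: both attempts score 0.0, so the learner can't tell them apart.
# Graded reward: 0.9 vs. 0.1, so the learner sees which attempt was closer.
```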
This paper builds a large, reusable library of computer-use skills so an AI agent can operate Windows apps more like a careful human than a clumsy robot.
STEP3-VL-10B is a small (10-billion-parameter) open multimodal model that sees images and reads text, yet matches the benchmark scores of much larger models.
Computer-using agents tend to forget important visual details over long tasks and cannot reliably find up-to-date, step-by-step guidance for unfamiliar apps.
MAI-UI is a family of AI agents that can see, understand, and control phone and computer screens by following plain-language instructions.