Dream2Flow lets a robot watch a short, AI-generated video of a task and then do that task in real life by following object motion in 3D.
Youtu-LLM is a small (1.96B) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.
Youtu-Agent is a build-and-grow factory for AI agents that cuts manual setup and keeps agents improving over time.
SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.
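A minimal sketch of what such an interleaved reasoning-and-tool loop can look like. The tool names, the inline `<tool>...</tool>` call convention, and the `model` callable are all assumptions for illustration, not SenseNova-MARS's actual interface:

```python
import re

# Stand-in tools mirroring the three abilities the summary names.
# Bodies are stubs; real versions would hit a search API or crop pixels.
def text_search(query: str) -> str:
    return f"[text results for {query!r}]"

def image_search(query: str) -> str:
    return f"[image results for {query!r}]"

def crop_image(box: str) -> str:
    return f"[cropped region {box}]"

TOOLS = {"text_search": text_search,
         "image_search": image_search,
         "crop_image": crop_image}

# Matches a call the model writes mid-reasoning,
# e.g. <tool>crop_image(120,40,300,200)</tool>
CALL = re.compile(r"<tool>(\w+)\((.*?)\)</tool>")

def reason_with_tools(model, prompt: str, max_steps: int = 8) -> str:
    """Alternate model reasoning steps with tool calls until a final answer."""
    context = prompt
    for _ in range(max_steps):
        step = model(context)            # model emits one reasoning step
        context += step
        call = CALL.search(step)
        if call is None:                 # no tool requested -> final answer
            return step
        name, arg = call.groups()
        result = TOOLS[name](arg)        # run the tool, feed the result back
        context += f"\n<result>{result}</result>\n"
    return context
```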
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
The paper teaches an AI to write strong research plans by letting it grade its own work using checklists (rubrics) pulled from real scientific papers.
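Rubric-based self-grading reduces to scoring a plan against a list of yes/no criteria. A sketch under stated assumptions: the criteria below are invented for illustration (a real rubric is mined from published papers), and in practice each check would be an LLM judge call rather than a keyword test:

```python
from typing import Callable

# A rubric is a list of (description, checker) pairs.
Rubric = list[tuple[str, Callable[[str], bool]]]

def grade_plan(plan: str, rubric: Rubric) -> float:
    """Score a research plan as the fraction of rubric criteria it satisfies."""
    passed = sum(check(plan) for _, check in rubric)
    return passed / len(rubric)

# Illustrative criteria only; keyword checks stand in for an LLM judge.
example_rubric: Rubric = [
    ("states a falsifiable hypothesis", lambda p: "hypothesis" in p.lower()),
    ("names baseline methods",          lambda p: "baseline" in p.lower()),
    ("specifies evaluation metrics",    lambda p: "metric" in p.lower()),
]

reward = grade_plan("Hypothesis: ... Baselines: ... Metrics: ...",
                    example_rubric)  # -> 1.0 if all three criteria pass
```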
ProGuard is a safety guard for text and images that doesn’t just spot known problems—it can also recognize and name new, never-seen-before risks.
This survey links how human brains remember things to how AI agents should remember things so they can act smarter over time.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools at answer time.
Coding agents that fix software rely on feedback to improve their patches, but unit tests give only pass/fail signals that are often noisy or missing.
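To make that limitation concrete, here is roughly how an agent harvests the signal today (a generic sketch; the pytest invocation and repo layout are assumptions, not this paper's setup). The whole test run collapses into one bit, so a patch that fixes nine of ten failures scores the same as one that fixes none:

```python
import subprocess

def unit_test_reward(repo_dir: str) -> float:
    """Run the project's tests and collapse the outcome into a 0/1 reward.

    pytest exits with code 0 only if every collected test passes, so this
    is exactly the sparse pass/fail signal described above.
    """
    result = subprocess.run(
        ["pytest", "-q"], cwd=repo_dir,
        capture_output=True, text=True, timeout=600,
    )
    return 1.0 if result.returncode == 0 else 0.0
```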
LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.
Large language models can say things that sound right but aren’t supported by the given document; this is called a faithfulness hallucination.
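A common way to test for this is an entailment check: is each generated claim actually supported by the source document? A minimal sketch using an off-the-shelf NLI model; the model choice and threshold are assumptions for illustration, not the paper's method:

```python
from transformers import pipeline

# Any natural-language-inference model can serve as the judge;
# roberta-large-mnli is just one widely available example.
nli = pipeline("text-classification", model="roberta-large-mnli")

def is_faithful(document: str, claim: str, threshold: float = 0.5) -> bool:
    """A claim counts as faithful if the document entails it, per the NLI model."""
    scores = nli({"text": document, "text_pair": claim}, top_k=None)
    entail = next(s["score"] for s in scores
                  if s["label"].upper() == "ENTAILMENT")
    return entail >= threshold

# A claim that merely "sounds right" but has no support in the document
# should fail this check.
print(is_faithful("The meeting was moved to Friday.",
                  "The meeting now takes place on Friday."))
```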