Legal RAG Bench is a new, end-to-end test that checks how well legal AI systems find information and use it to answer tough, real-world legal questions.
SciDER is a team of smart AI helpers that can run almost the whole research process: think of ideas, read raw data, write and run code, and improve its own work with feedback.
NanoKnow is a new benchmark that checks whether a language model’s answers come from what it saw during training or from extra text we give it at question time.
PaperBanana is a team of AI helpers that turns a paper’s method text and caption into a clean, accurate, publication-ready figure.
Typhoon-S is a simple, open recipe that turns a basic language model into a helpful assistant and then teaches it important local skills, all on small budgets.
Academic rebuttals are not just about being polite; they are about smart, strategic persuasion under hidden information.
MemGovern teaches code agents to learn from past human fixes on GitHub by turning messy discussions into clean, reusable 'experience cards.'
LLMs can sound confident yet still change their answers when the surrounding text nudges them, showing that confidence alone isn’t the same as truthfulness.
Long-term AI helpers remember past chats, but drawing on every stored memory can trap them in old ideas (Memory Anchoring).
Real people often ask vague questions with pictures, and today’s vision-language models (VLMs) struggle with them.
DeepCode is an AI coding system that turns long, complicated papers into full, working code repositories.