VisionTrim makes picture-and-text AI models run much faster by keeping only the most useful visual pieces (tokens) and smartly merging the rest.
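A minimal sketch of the keep-and-merge idea: score each visual token, keep the top-k, and collapse the rest into one summary token. The scoring rule and function names here are illustrative assumptions, not VisionTrim's actual method.

```python
import numpy as np

def trim_tokens(tokens: np.ndarray, scores: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` highest-scoring visual tokens and merge the rest.

    tokens: (n, d) array of visual token embeddings
    scores: (n,) importance scores (e.g., attention received from text tokens)
    keep:   number of tokens to keep unmerged
    """
    order = np.argsort(scores)[::-1]        # highest score first
    kept = tokens[order[:keep]]             # survivors pass through unchanged
    rest = tokens[order[keep:]]
    if len(rest) == 0:
        return kept
    # Collapse all pruned tokens into one score-weighted average token,
    # so their information is summarized rather than discarded.
    w = scores[order[keep:]]
    merged = (w[:, None] * rest).sum(axis=0) / (w.sum() + 1e-8)
    return np.vstack([kept, merged[None, :]])

# Example: 196 image-patch tokens of width 64, keep only 32.
rng = np.random.default_rng(0)
out = trim_tokens(rng.normal(size=(196, 64)), rng.random(196), keep=32)
print(out.shape)  # (33, 64): 32 kept + 1 merged summary token
```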
Large language models sometimes reach the right answer for the wrong reasons, which is risky and confusing.
Real attackers can try many prompts in parallel until a model slips, so testing safety with only one try badly underestimates risk.
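Why one try underestimates risk: if a single prompt slips past the model with probability p, then at least one of n independent parallel tries succeeds with probability 1 - (1-p)^n. A quick illustration with a made-up per-attempt rate:

```python
# If one attack prompt succeeds with probability p, the chance that at
# least one of n independent parallel attempts succeeds is 1 - (1-p)**n.
p = 0.01  # hypothetical per-attempt success rate
for n in (1, 10, 100, 1000):
    print(n, round(1 - (1 - p) ** n, 3))
# n=1 -> 0.01, n=10 -> 0.096, n=100 -> 0.634, n=1000 -> ~1.0
```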
TTCS is a way for a model to teach itself at test time: it first makes easier practice questions similar to the real hard question, then learns from solving them before attempting the real one.
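A rough sketch of that test-time loop. Every method name below (`generate`, `answer`, `finetune`) is a placeholder for whatever interface the model exposes; the paper's actual procedure will differ in its details.

```python
def test_time_curriculum(model, hard_question, n_practice=8, n_rounds=2):
    """Sketch: self-train on easier cousins of a hard question at test time.

    `model` is assumed to expose generate(), answer(), and finetune();
    these are stand-ins, not a real API.
    """
    for _ in range(n_rounds):
        # 1. Ask the model to write simpler questions resembling the hard one.
        practice = [model.generate(f"Write an easier variant of: {hard_question}")
                    for _ in range(n_practice)]
        # 2. Solve the practice questions; the (question, answer) pairs
        #    become the training signal.
        examples = [(q, model.answer(q)) for q in practice]
        # 3. Briefly update the model on its own practice work.
        model.finetune(examples)
    # 4. Only now attempt the real question.
    return model.answer(hard_question)
```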
Big models are often used to grade AI answers, but they are expensive, slow, and depend too much on tricky prompts.
RAPTOR is a simple, fast way to find a direction (a concept vector) inside a frozen language model that points toward a concept like 'sarcasm' or 'positivity.'
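One standard way to find such a direction is a difference-of-means over hidden states from texts with and without the concept; whether RAPTOR does exactly this is an assumption here, but it shows what "a concept vector inside a frozen model" means concretely.

```python
import numpy as np

def concept_vector(pos_states: np.ndarray, neg_states: np.ndarray) -> np.ndarray:
    """Direction pointing from 'concept absent' toward 'concept present'.

    pos_states: (n_pos, d) hidden states from texts showing the concept
    neg_states: (n_neg, d) hidden states from texts that do not
    """
    v = pos_states.mean(axis=0) - neg_states.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)   # unit-length concept vector

# Score a new activation by projecting onto the direction:
# a higher score means more of the concept.
rng = np.random.default_rng(1)
v = concept_vector(rng.normal(1.0, 1, (50, 16)), rng.normal(0.0, 1, (50, 16)))
print(rng.normal(size=16) @ v)  # scalar concept score for one activation
```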
Millions of public AI models exist, but downloads are concentrated on a tiny set of “official” checkpoints, which are not always the best performers.
This paper shows how to turn a big Transformer model into a faster hybrid model that mixes attention and RNN layers, using far less training data (about 2.3B tokens).
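The conversion idea in miniature: keep some attention layers and swap the rest for recurrent ones, then retrain on the small token budget. This sketch only shows a plausible layer schedule; the ratio and placement are assumptions, not the paper's recipe.

```python
def hybrid_schedule(n_layers: int, keep_every: int = 4) -> list[str]:
    """Replace most attention layers with RNN layers, keeping every
    `keep_every`-th layer as full attention (placement is illustrative)."""
    return ["attention" if i % keep_every == 0 else "rnn"
            for i in range(n_layers)]

print(hybrid_schedule(12))
# ['attention', 'rnn', 'rnn', 'rnn', 'attention', 'rnn', ...]
```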
This paper trains AI agents better by grading not just their final answers, but also how they think and use tools along the way.
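In reward terms, that means scoring the trajectory, not just the outcome. A toy blend (the weights and step scores are hypothetical):

```python
def trajectory_reward(step_scores, final_correct, w_process=0.3):
    """Blend step-level grades (reasoning, tool use) with the final outcome.

    step_scores:   per-step grades in [0, 1] for the agent's reasoning and
                   tool calls (hypothetical judge output)
    final_correct: 1.0 if the final answer is right, else 0.0
    w_process:     how much the process counts versus the outcome
    """
    process = sum(step_scores) / max(len(step_scores), 1)
    return w_process * process + (1 - w_process) * final_correct

# A right answer with sloppy steps scores lower than one with clean steps.
print(trajectory_reward([0.2, 0.3, 0.1], final_correct=1.0))  # 0.76
print(trajectory_reward([0.9, 1.0, 0.8], final_correct=1.0))  # 0.97
```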
DynamicVLA is a small and fast robot brain that sees, reads, and acts while things are moving.
Large language models usually learn by guessing the next word, then get a tiny bit of instruction-following practice; this paper flips that by turning web documents into instruction-and-answer pairs at massive scale.
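The flip, in pipeline form: instead of training on raw documents, each document is rewritten into (instruction, answer) pairs. A hedged sketch; `lm` and the prompt are placeholders, not the paper's pipeline.

```python
import json

PROMPT = ("Read the document below and write 3 instruction-answer pairs "
          "grounded in it, as a JSON list of objects with keys "
          "\"instruction\" and \"answer\".\n\nDocument:\n{doc}")

def document_to_pairs(lm, doc: str) -> list[dict]:
    """Turn one raw web document into supervised training pairs.
    `lm` is any callable text-in/text-out model (a stand-in, not a real API)."""
    raw = lm(PROMPT.format(doc=doc))
    return json.loads(raw)  # [{"instruction": ..., "answer": ...}, ...]

def build_corpus(lm, documents):
    """Map the synthesis step over a whole crawl, yielding instruction data
    at the same scale as next-word pretraining."""
    for doc in documents:
        yield from document_to_pairs(lm, doc)
```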
This paper shows a simple, one-model way to dub videos that keeps the new voice and the on-screen lips naturally in sync.