This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.
GRAPE is a new way to tell Transformers where each word is in a sentence by using neat math moves called group actions.
LongCat-Image is a small (6B-parameter) but mighty bilingual image generator that turns text into high-quality, realistic pictures and can also edit images very well.
Big language models use RoPE to remember word order, but the attention score keeps only the real part of a complex-valued product and throws away the imaginary part.
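To make the RoPE point concrete, here is a minimal Python sketch. It treats each pair of features as one complex number, rotates it by its position, and compares the complex inner product with the usual real-valued RoPE score. The function names, toy vectors, and frequencies are hypothetical and chosen only for illustration; this is not the paper's method, just the standard identity that the real-valued score equals the real part of the complex one.

```python
import cmath
import math

def rope_rotate(features, position, thetas):
    """Rotate each complex feature by the angle position * theta."""
    return [z * cmath.exp(1j * position * t) for z, t in zip(features, thetas)]

def complex_score(q, k):
    """Complex inner product: sum of q_i times the conjugate of k_i."""
    return sum(a * b.conjugate() for a, b in zip(q, k))

def real_rope_score(q, k, m, n, thetas):
    """Real-valued RoPE score: rotate each (x, y) pair with a 2x2 rotation, then dot."""
    total = 0.0
    for a, b, t in zip(q, k, thetas):
        qx = a.real * math.cos(m * t) - a.imag * math.sin(m * t)
        qy = a.real * math.sin(m * t) + a.imag * math.cos(m * t)
        kx = b.real * math.cos(n * t) - b.imag * math.sin(n * t)
        ky = b.real * math.sin(n * t) + b.imag * math.cos(n * t)
        total += qx * kx + qy * ky
    return total

# Toy query/key features and frequencies (hypothetical values).
q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.7j, -1.2 + 0.4j]
thetas = [1.0, 0.01]
m, n = 5, 2  # token positions of the query and the key

score = complex_score(rope_rotate(q, m, thetas), rope_rotate(k, n, thetas))
real_score = real_rope_score(q, k, m, n, thetas)

# Shifting both positions by the same amount leaves the score unchanged,
# because only the relative offset m - n matters.
shifted = complex_score(rope_rotate(q, m + 7, thetas), rope_rotate(k, n + 7, thetas))
```

Here `score.real` matches `real_score`, the number standard attention uses, while `score.imag` is the leftover part that never enters the attention score.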
VideoCoF is a new way to edit videos that first figures out WHERE to edit and then does the edit, like thinking before acting.
Saber is a new way to make videos that match a text description while keeping the look of people or objects from reference photos, without needing special triplet datasets.
Robots need lots of realistic, long videos to learn, but collecting them is slow and expensive.
OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how multimodal AI models get tricked (jailbroken) and how well different defenses stop that.
The paper shows that making a model write a number as a sequence of digits and then grading the whole number at the end works better than grading each digit separately.
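A toy Python sketch can show why the two ways of grading a number disagree. It is only an illustration under simple assumptions (equal-length digit strings, absolute difference as the whole-number grade); the helper names are hypothetical, and the paper's actual training signal may differ.

```python
def per_digit_errors(pred: str, target: str) -> int:
    """Count positions where the digits differ (assumes equal-length strings)."""
    return sum(p != t for p, t in zip(pred, target))

def whole_number_error(pred: str, target: str) -> int:
    """Grade the finished number: absolute difference between the two values."""
    return abs(int(pred) - int(target))

target = "200"
near_miss = "199"  # off by one, yet every digit is "wrong"
far_miss = "900"   # off by 700, yet only one digit is "wrong"

near_digit, near_whole = per_digit_errors(near_miss, target), whole_number_error(near_miss, target)
far_digit, far_whole = per_digit_errors(far_miss, target), whole_number_error(far_miss, target)
```

Grading digit by digit ranks `"900"` as the better guess, while grading the whole number ranks `"199"` as the better guess, which is the one that is actually closer to 200.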
This paper fixes two big problems in image-making AI that builds pictures step by step: during training it practices with perfect answers (teacher forcing) but at generation time it must build on its own imperfect guesses, and the earliest coarse steps are much harder than the later fine steps.
VG-Refiner is a new way for AI to find the right object in a picture when given a description, even if helper tools make mistakes.
EditThinker is a helper brain for any image editor that thinks, checks, and rewrites the instruction in multiple rounds until the picture looks right.