Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Key Summary
- Vector Prism helps computers animate SVG images by first discovering which tiny shapes belong together as meaningful parts.
- It shows each tiny SVG piece in several different ways (like zoomed-in or highlighted) and asks a vision-language model what it is.
- Because those answers can be noisy, it uses statistics to figure out which viewing method is more trustworthy and then combines answers smartly (not just by majority vote).
- With reliable part labels, it reorganizes the SVG into clean groups (like eyes, nose, buttons) without changing how it looks.
- A language model then turns an instruction (like "make the buttons bounce in one by one") into CSS animation code that moves those groups.
- This approach makes animations look more coherent and faithful to the instruction than prior methods, and even beats popular video models in user studies.
- Vector animations stay super small in file size compared to videos, which is great for fast, modern websites.
- The main limitation is that it can't split a single chunky SVG path into smaller pieces if the input SVG doesn't already have them.
Why This Research Matters
Websites need animations that are fast to load, crisp at any size, and easy to edit; SVGs are perfect for that, but only if we know which shapes belong together. Vector Prism recovers those missing part groupings so instructions like "blink," "bounce," or "spin" affect exactly the right elements. This makes animations look more natural and faithful to what designers want, without bloated video files. The small file sizes mean better performance on phones and in low-bandwidth areas. It can improve everything from app icons and logos to educational diagrams and data visuals, making the web both lighter and livelier.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you dump out a big box of LEGO bricks that once made a cute bunny. If someone mixed all the pieces and threw away the instruction booklet, it's hard to know which bricks are the ears or the nose when you want to make it wiggle its nose.
🥬 The Concept (Scalable Vector Graphics, SVG):
- What it is: SVGs are images made of shapes like lines and curves instead of pixels, so they stay sharp at any size.
- How it works: An SVG file lists shapes (paths, circles, rectangles) and how to draw them in order.
- Why it matters: Without understanding which shapes belong together (like both ears), animations get messy because the computer only sees separate tiny shapes.
🍞 Anchor: Think of a smiling emoji built from several small shapes. If you don't know which shapes make the eyes, you can't blink them properly.
🍞 Hook: You know how a friend who can read both pictures and text can understand a comic better than someone who sees only the pictures or only the speech bubbles?
🥬 The Concept (Vision–Language Models, VLMs):
- What it is: VLMs are computer models that connect what they see (images) with what they read (words).
- How it works: They look at a picture (or a rendered SVG) and produce text like plans or labels; they can also write code based on instructions.
- Why it matters: Even though VLMs can plan and code, they often fail on SVGs because SVGs don't tell them which shapes belong together in a meaningful way.
🍞 Anchor: If you say "make the compass needle spin," a VLM needs to know which tiny shapes together form the needle, not the frame.
🍞 Hook: Imagine sorting a jigsaw puzzle by colors and edges before you assemble it; it becomes easier to build the picture.
🥬 The Concept (Semantic Structure):
- What it is: Semantic structure is grouping shapes into meaningful parts (like nose, eyes, buttons) instead of random drawing order.
- How it works: We label which small shapes belong to which big idea (e.g., both triangles together are the ears).
- Why it matters: Without semantic groups, any movement (spin, bounce, blink) might grab the wrong pieces and look broken.
🍞 Anchor: Grouping "all star points" together lets you make the star twinkle instead of just one spike flickering.
The world before: People tried two main paths. First, they used image/video diffusion tricks to nudge motion in rendered images. That's like pushing on the final painting instead of moving the parts underneath; it can wobble or jitter and resists big rearrangements. Second, they had language models write animation code directly, but since raw SVGs are optimized for drawing speed, not meaning, the models guessed wrong about which shapes to move together, so results often looked stiff or incorrect. Others even generated videos from text, which can look nice but can't deliver tiny vector files or crisp scaling for the web.
The problem: SVGs usually group shapes for drawing order and efficiency, not meaning. So VLMs don't know which pieces move together (eyes blink, buttons bounce, needle spins) and produce animations that fall apart.
Failed attempts: Majority voting across a few guesses (e.g., several views of the same shape) helps a little, but if one view is very noisy, its wrong guesses can swing the decision, scrambling parts. Optimizing motion on pixel renderings keeps appearance but won't rewire parts semantically. Direct code generation without structure often animates everything the same way.
The gap: We need a reliable way to recover missing semantic structure from messy, low-level SVG shapes before asking a model to animate them.
Real stakes: Websites want small, fast, and delightful animations. Vector animations load quickly, scale cleanly, and can be interactive. Better grouping means better motion, which helps everything from logos and buttons to infographics and educational content.
🍞 Hook: Picture a librarian who can't find the right shelf because all the books are sorted by the day they arrived, not by topic.
🥬 The Concept (Semantic Recovery):
- What it is: Semantic recovery is the process of rebuilding the "by-topic" shelves from scattered items.
- How it works: Show each small SVG piece in several ways, ask a VLM what it is, then use statistics to combine the weak answers into strong labels.
- Why it matters: Without this step, the animator doesn't know which shapes belong together, so instructions like "bounce the buttons one by one" can't be done right.
🍞 Anchor: If weak answers say a rectangle is probably "button" in three views but "background" in two, smart combining can choose "button" with confidence.
02 Core Idea
🍞 Hook: You know how a science fair judge listens to many short presentations, figures out which speakers are usually right, and then trusts their votes a bit more?
🥬 The Concept (Key Insight):
- What it is: The aha! moment: Ask a VLM about each tiny shape through multiple focused views, estimate which view-styles are more reliable, then combine their answers with smart (Bayesian) weights to rebuild clean semantic groups.
- How it works: Render each shape several ways (highlight, isolation, zoom-in, outline, bounding box), collect labels, measure how often the views agree, estimate reliability for each view with a statistical model, then choose final labels with a reliability-weighted vote. Finally, reorganize the SVG by these labels to make animation easy and correct.
- Why it matters: Without weighting by reliability, a noisy view can flip decisions; with it, you get steady, coherent parts that animate beautifully.
🍞 Anchor: If four friends usually guess snack flavors correctly and one often mixes them up, you weight the accurate friends' votes higher before deciding the flavor.
Three analogies:
- Detective team: Multiple witnesses describe a suspect (the shape). Some witnesses are more reliable. The lead detective (Bayes) weighs testimonies by reliability and identifies the right suspect group (eyes, nose, button).
- Sports referees: Five refs watch the same play from different angles. A head ref studies how often each angle gets calls right, then uses weighted votes to make the final call.
- Classroom quiz: Students take mini-quizzes about an image. You track which quiz formats produce truer answers, then trust those more when grading the final label.
Before vs After:
- Before: VLMs tried to animate from messy, draw-order groups, leading to sways, jitters, or moving the wrong parts.
- After: Vector Prism gives VLMs neat, semantic shelves: eyes, ears, buttons. Now motion plans are attached to the correct parts, so animations are crisp and faithful to instructions.
🍞 Hook: Imagine tuning a radio: when there's static (noise), you need a smarter tuner to lock onto the real station (signal).
🥬 The Concept (Why It Works: the intuition):
- What it is: A method to separate signal from noise by studying agreement patterns across different views.
- How it works: If two views often agree, they're likely accurate. We model this with a classic tool (Dawid–Skene) to estimate which view is trustworthy; then Bayes' rule picks the most likely true label using those trust scores.
- Why it matters: Majority voting treats all views equally, so a weak view can mislead. Weighted voting resists noise and stabilizes labels.
🍞 Anchor: Even if the zoomed-in view is shaky on small shapes, the outline and highlight views might be steady; trust them more and you keep grouping correctly. (The formula below makes this precise.)
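In symbols, under a simple "one-coin" reliability model (an assumption made here for illustration; the paper's exact formulation may differ), each view v has accuracy p_v over K possible labels, and the reliability-weighted vote becomes:

```latex
% Bayes decision rule with reliability weights (one-coin simplification):
% view v casts weak label \ell_v; its vote counts with log-odds weight w_v.
\hat{y} = \arg\max_{y}\Bigl[\log P(y) + \sum_{v:\,\ell_v = y} w_v\Bigr],
\qquad
w_v = \log\frac{p_v\,(K-1)}{1 - p_v}
```

A view that is right 90% of the time earns a far bigger weight than one barely above chance, which is exactly why a few strong votes can outweigh several weak ones.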
Building blocks (mini Sandwiches):
- 🍞 Hook: You know how you check a toy from different angles to see what it really is? 🥬 The Concept (Multi-view rendering): Show each SVG shape five ways (highlight, isolation, zoom-in, outline, bounding box) to collect clues. Without multiple views, you miss crucial context or detail, and labels wobble. 🍞 Anchor: A small '+' sign hidden in a busy toolbar becomes clear in isolation or zoom-in.
- 🍞 Hook: If two students often give the same right answer, you trust them together more. 🥬 The Concept (Agreement matrix): Track how often each view agrees with another; frequent agreement hints they're reliable. Without tracking agreements, you can't estimate which view to trust. 🍞 Anchor: If outline and highlight match 90% of the time, they likely see the part well.
- 🍞 Hook: When picking team captains, you'd rather choose consistently skilled players than random picks. 🥬 The Concept (Reliability estimation via Dawid–Skene): Use a statistical model to estimate each view's accuracy from agreements. Without this, you can't downweight noisy views. 🍞 Anchor: Discover zoom-in is weaker on large shapes; adjust trust accordingly.
- 🍞 Hook: Imagine choosing a movie by combining star ratings, giving more weight to critics you trust. 🥬 The Concept (Bayesian decision rule): Combine labels with weights that reflect reliability to pick the final part label. Without Bayes, majority vote can flip on tricky shapes. 🍞 Anchor: Three strong "Plus" votes beat two weak "Minus" votes; final label: Plus.
- 🍞 Hook: After sorting LEGO bricks into labeled bins, building becomes easy. 🥬 The Concept (Restructuring SVG): Add class tags (like .eye, .nose), flatten tricky SVG nesting, and regroup by labels without changing appearance. Without restructuring, motions still tug the wrong sets. 🍞 Anchor: Now "blink" applies to all eye pieces together, cleanly.
03 Methodology
High-level recipe: Input (SVG + instruction) → Planning (what should move) → Vector Prism (recover parts) → Restructuring (clean groups) → Animation generation (CSS) → Output (animated SVG).
Step A: Animation planning (Instruction-to-plan)
- What happens: Render the whole SVG as an image so a VLM can see it. Give it the user instruction. It writes a plan describing which meaningful parts should move and how (e.g., sun rises up, sky brightens).
- Why this exists: VLMs are good at visual reasoning with raster images, but not at reading raw SVG code structures. Without a plan, code generation becomes guesswork.
- Example: Instruction: "Make the compass needle spin once quickly." Plan: Identify the needle (thin long triangle + circle center), set a 360-degree rotation around its center in 0.8s. (One possible plan structure is sketched below.)
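To make Step A concrete, here is one hypothetical shape such a plan could take; the paper does not publish its exact schema, so every field name below is illustrative. What matters is that the plan names semantic parts (which become Vector Prism's label set) and per-part motions (which later become CSS targets).

```python
# Hypothetical plan structure; field names are illustrative, not the authors' schema.
plan = {
    "parts": ["needle", "frame", "center-pin"],   # label set for Step B
    "motions": [
        {"part": "needle", "effect": "rotate",    # one motion per semantic part
         "degrees": 360, "duration_s": 0.8,
         "origin": "center-pin"},                 # rotate around the pin
    ],
}
```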
Step B: Vector Prism (Semantic recovery)
B1) Multi-view rendering and weak labels
- What happens: For each primitive (path, rect, circle, etc.), render it five ways: highlight (on original canvas), isolation (on blank background), zoom-in (crop), outline (stroke only), bounding box overlay. Ask the VLM to name the part (from the planās label set).
- Why this exists: A single view can be misleading; multiple views provide complementary clues. Without multi-view, many labels are unreliable.
- Example data: The same rectangle gets: Highlight → "Plus", Isolation → "Background", Zoom-in → "Plus", Outline → "Plus", BBox → "Minus". (A sketch of constructing these views follows.)
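Below is a minimal sketch of how the view variants might be built, assuming a flattened SVG whose primitives are direct children of the root (an assumption; real SVGs may need flattening first). The VLM query itself, plus the raster-side zoom-in crop and bounding-box overlay, are left as comments since they depend on the model and renderer used.

```python
# Sketch: build per-primitive diagnostic views by editing the SVG's XML.
# Each returned string would be rasterized (e.g., at 512x512) and sent to a
# VLM together with the plan's label set; that query is outside this sketch.
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG)

def views_for(svg_text: str, index: int) -> dict[str, str]:
    """Return SVG variants that each emphasize the primitive at `index`."""
    views = {}
    for name in ("highlight", "isolation", "outline"):
        root = ET.fromstring(svg_text)
        prims = list(root)
        target = prims[index]
        if name == "highlight":          # target on the original canvas, rest dimmed
            for i, el in enumerate(prims):
                if i != index:
                    el.set("opacity", "0.15")
        elif name == "isolation":        # target alone on a blank background
            for el in prims:
                if el is not target:
                    root.remove(el)
        elif name == "outline":          # stroke-only rendering of the target
            target.set("fill", "none")
            target.set("stroke", "black")
        views[name] = ET.tostring(root, encoding="unicode")
    # zoom-in (a raster crop around the target) and the bounding-box overlay
    # need the renderer's bbox query, so they are omitted from this XML-only sketch
    return views
```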
B2) Burn-in agreements and reliability estimation
- What happens: Build an agreement matrix counting how often each pair of views agrees across all shapes. Use a Dawid–Skene-style model to infer each view's reliability. Intuition: if two views agree much more than chance, they're likely good. (A toy estimator is sketched after this step.)
- Why this exists: Without estimating reliability, you canāt tell which views deserve more weight.
- Example: Highlight agrees with Outline 90% (strong), but Zoom-in agrees with others 50% (weak on this SVG).
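Here is a toy, one-coin Dawid–Skene-style estimator showing the mechanics; this is a simplification for illustration (the classic Dawid–Skene model fits a full confusion matrix per view, and the paper's exact variant may differ). The pairwise-agreement intuition enters through the E-step: views that often agree with the consensus end up with high estimated accuracy.

```python
# Toy one-coin Dawid–Skene-style EM. votes[i][v] is view v's weak label for
# primitive i. EM alternates: infer label posteriors from current view
# accuracies, then re-estimate each view's accuracy from those posteriors.
def estimate_reliability(votes, label_set, iters=20):
    n_views = len(votes[0])
    K = len(label_set)
    acc = [0.7] * n_views                      # initial guess: views mostly right
    for _ in range(iters):
        posteriors = []
        for item in votes:                     # E-step: P(true label | all votes)
            scores = {}
            for y in label_set:
                p = 1.0
                for v, vote in enumerate(item):
                    p *= acc[v] if vote == y else (1 - acc[v]) / (K - 1)
                scores[y] = p
            z = sum(scores.values()) or 1.0
            posteriors.append({y: s / z for y, s in scores.items()})
        for v in range(n_views):               # M-step: accuracy = expected hit rate
            hits = sum(post[item[v]] for item, post in zip(votes, posteriors))
            acc[v] = min(max(hits / len(votes), 1e-3), 1 - 1e-3)
    return acc, posteriors

views = ["highlight", "isolation", "zoom", "outline", "bbox"]
votes = [["plus", "plus", "minus", "plus", "minus"],
         ["minus", "minus", "plus", "minus", "minus"],
         ["plus", "plus", "plus", "plus", "background"]]
acc, _ = estimate_reliability(votes, ["plus", "minus", "background"])
print(dict(zip(views, (round(a, 2) for a in acc))))  # zoom and bbox come out less reliable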
B3) Bayes-weighted labeling
- What happens: Convert each view's reliability into a weight. For each primitive, add up weights of the views that voted for each label. Pick the label with the largest total (Bayes decision rule).
- Why this exists: Majority vote can be flipped by noisy views. Weighted decisions resist noise and produce stable clusters.
- Example: Three strong "Plus" votes beat two weak "Minus" votes → final label "Plus." (See the sketch below.)
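Continuing the sketch, the decision step turns each estimated accuracy into a log-odds weight (under the same one-coin assumption as above) and takes the heaviest label:

```python
import math

def bayes_label(item_votes, acc, K):
    """Reliability-weighted vote: each view's label earns weight
    log(p * (K - 1) / (1 - p)); the heaviest label wins (uniform prior assumed)."""
    totals = {}
    for vote, p in zip(item_votes, acc):
        w = math.log(p * (K - 1) / (1 - p))
        totals[vote] = totals.get(vote, 0.0) + w
    return max(totals, key=totals.get)

# three reliable "plus" views outweigh two shakier "minus" views
print(bayes_label(["plus", "plus", "minus", "plus", "minus"],
                  [0.90, 0.85, 0.55, 0.80, 0.50], K=3))   # -> plus
```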
Step C: Restructuring SVG (Turning meaning into organization)
- What happens: Flatten the SVG hierarchy so style is baked into each primitive (appearance preserved). Attach the chosen label as a class to each shape. Regroup primitives by label while maintaining original paint order and checking overlaps so nothing's visually altered.
- Why this exists: The original grouping is by draw order, not meaning. Without regrouping, you still canāt animate by parts (like eyes or buttons) cleanly.
- Example: All eye pieces (white, iris, pupil) now live under class .eye, so "blink" can lower their opacity together. (A minimal regrouping sketch follows.)
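A minimal regrouping sketch, assuming a flattened SVG (primitives as direct children, styles already baked in) and one label per primitive in paint order. It only merges contiguous same-label runs, which is the conservative way to regroup without disturbing paint order:

```python
# Sketch: tag each primitive with its semantic class and wrap contiguous
# same-label runs in <g> elements, preserving paint order (appearance).
import xml.etree.ElementTree as ET
from itertools import groupby

SVG = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG)

def restructure(svg_text: str, labels: list[str]) -> str:
    root = ET.fromstring(svg_text)
    prims = list(root)                          # assumes a flattened SVG
    for el in prims:
        root.remove(el)
    for label, run in groupby(zip(prims, labels), key=lambda t: t[1]):
        group = ET.SubElement(root, f"{{{SVG}}}g", {"class": label})
        for el, _ in run:
            el.set("class", label)              # e.g. class="eye" -> CSS target .eye
            group.append(el)
    return ET.tostring(root, encoding="unicode")
```

Non-contiguous runs become separate `<g>` elements, but they share the same class, so a selector like `.eye` still animates every eye piece together.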
Step D: Animation generation (Plan-to-code)
- What happens: An LLM writes CSS keyframes per semantic class, using a lanes pattern with custom properties (like --eye-rot1) to avoid conflicts across many animations. It generates code iteratively per class to stay within token limits.
- Why this exists: Writing all CSS at once can exceed model limits, and overlapping transforms can overwrite each other. Without lanes, animations collide.
- Example: It first writes .sun CSS to rise up, then .sky to brighten, ensuring transforms don't clash. (The lanes idea is sketched below.)
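Here is a sketch of the lanes idea in generated CSS; class names, property names, and timings are illustrative, not the authors' actual output. Each motion animates its own registered custom property, and the element's single transform rule composes all lanes, so two simultaneous animations cannot overwrite each other's transform. (Smoothly animating custom properties relies on @property registration in supporting browsers.)

```python
# Sketch: emit per-class keyframes where each motion drives its own custom
# property ("lane"); the one transform rule composes the lanes.
def bounce_in_css(cls: str, delay_s: float) -> str:
    return f"""
@property --{cls}-ty {{ syntax: "<length>"; inherits: false; initial-value: 0px; }}
.{cls} {{
  transform: translateY(var(--{cls}-ty)) rotate(var(--{cls}-rot, 0deg));
  animation: {cls}-bounce 0.6s ease-out {delay_s}s both;
}}
@keyframes {cls}-bounce {{
  from {{ --{cls}-ty: -40px; opacity: 0; }}
  60%  {{ --{cls}-ty: 8px;  opacity: 1; }}
  to   {{ --{cls}-ty: 0px; }}
}}"""

# staggered entrance: one button class at a time, 0.2s apart
print("\n".join(bounce_in_css(c, round(0.2 * i, 1))
                for i, c in enumerate(["plus-1", "plus-2", "minus-1"])))
```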
Secret Sauce (mini Sandwiches):
- 🍞 Hook: When picking a team, you'd rather know who's dependable than treat all players the same. 🥬 The Concept: Reliability weighting is the clever twist. It's better than majority vote whenever some views are stronger. 🍞 Anchor: If Outline and Highlight are strong and Zoom-in is weak, you still get the right label most of the time.
- 🍞 Hook: Imagine rearranging your backpack so you can grab what you need fast. 🥬 The Concept: Restructuring keeps the SVG looking identical but makes the animation step simple and safe. 🍞 Anchor: Now "button bounce-in one by one" is just addressing .button-1, .button-2, etc., in order.
Concrete walk-through (Plus/Minus icon):
- Input: SVG with many small shapes; instruction: "Bounce in the buttons one by one."
- Planning: VLM spots a grid of round-corner squares with plus/minus signs. It plans bounce-in per button in sequence.
- Prism views and labels: One rectangle gets 3× "Plus" and 2× "Minus." Agreement shows Zoom-in is weak here; Highlight/Outline strong.
- Bayes picks āPlusā for that rectangle.
- Restructure: Tag shapes as .plus, .minus, .background. Keep paint order.
- Code: CSS bounces .plus and .minus one by one, using lanes so translate and opacity donāt clash.
- Output: Clean, lively button entrance animation.
04 Experiments & Results
The test: The authors built a set of 114 instruction–SVG pairs from SVGRepo across lots of themes (animals, logos, UI elements, nature). They measured two things: how well the animation follows the instruction (CLIP-T2V and GPT-T2V) and how nice it looks (DOVER). They also ran a human preference study.
The competition: They compared against (1) AniClipart, which optimizes motion using diffusion priors on raster frames; (2) GPT-5, which can write code but struggles without recovered structure; (3) Wan 2.2 and (4) Sora 2, powerful video generators that produce raster videos, not vector animations.
The scoreboard (with context):
- Instruction following and quality: Vector Prism achieved the best combined scores (e.g., GPT-T2V 76.1 vs Sora 2 69.1), which is like getting an A when others got a B.
- Human study: In 760 pairwise comparisons by 19 people, the method was preferred over AniClipart 79.2% of the time, over Wan 2.2 76.5%, and over Sora 2 63.3%.
- File size vs fidelity: Compared to Sora 2 videos, Vector Prism's SVG+CSS files were about 54.8× smaller on average while scoring higher on instruction fidelity. That's like fitting a backpack's worth of stuff into a pencil case without losing what matters for the task.
Why these numbers matter:
- 76.1 GPT-T2V vs 69.1 (Sora 2) means the animation movements matched the text instructions more clearly.
- DOVER scores indicated pleasing visual quality while still moving enough to satisfy instructions (often a tricky balance).
- The huge compression advantage shows why vectors are superior for crisp, scalable web animation.
Surprising findings:
- Even strong video models sometimes produced static or distorted frames when given animation-style instructions (like "an opening scene of the SVG"), while Vector Prism stayed faithful because it moves symbolic parts, not pixels.
- Majority voting on labels improved over raw SVG groups, but the Bayes-weighted method delivered dramatic stability gains, reflected in a much better clustering quality (Davies–Bouldin index ~0.82 vs 12.6 for majority vote and 33.8 for original groups). That's like going from a messy drawer to neatly labeled bins. (A quick sketch of computing this score follows the list.)
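For context, a Davies–Bouldin score like the ones above can be computed off the shelf; lower means tighter, better-separated clusters. The feature choice here (primitive centroids) is a stand-in assumption for illustration, not necessarily what the paper measured.

```python
# Sketch: score how cleanly labels cluster primitives; lower is better.
import numpy as np
from sklearn.metrics import davies_bouldin_score

centroids = np.array([[10, 12], [11, 13], [80, 82], [79, 80]])  # one feature row per primitive
labels = [0, 0, 1, 1]                                            # semantic group ids
print(davies_bouldin_score(centroids, labels))                   # small value = neat bins
```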
Takeaway: Adding semantic structure first, then animating, beats pushing pixels or writing code blindly. The method consistently attached the right motions to the right parts, and people noticed.
05 Discussion & Limitations
Limitations (be specific):
- Granularity ceiling: If an entire object (like a lightning bolt) is a single path, the method can't split it into pieces (e.g., to "shatter") because it treats input primitives as atoms.
- View-dependent reliability: Reliability is estimated per SVG; if an SVG is extremely complex or stylized, the best views might shift, requiring a solid burn-in pass.
- Expressiveness boundaries: The provided pipeline focuses on CSS; super complex physics or path morphs may need JavaScript or libraries (which the approach can extend to, but is not the default).
- VLM dependence: It assumes a reasonably capable VLM to read the rendered views and a code-capable LLM to write CSS. Very weak models could degrade performance.
Required resources:
- A VLM for visual labeling from multi-view renders, an LLM for code generation, and an SVG renderer to produce the five views at about 512×512 (e.g., as sketched after this list).
- Modest compute: The method is efficient since it uses lightweight VLM variants for labeling and a one-pass burn-in for agreement.
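As one concrete option (an assumption here; the paper does not mandate a specific renderer), a lightweight library such as cairosvg can rasterize each view at 512×512:

```python
# Sketch: rasterize an SVG view string to a 512x512 PNG for the VLM.
import cairosvg

svg_view = ('<svg xmlns="http://www.w3.org/2000/svg" width="32" height="32">'
            '<circle cx="16" cy="16" r="12" fill="tomato"/></svg>')
cairosvg.svg2png(bytestring=svg_view.encode("utf-8"),
                 write_to="view.png",
                 output_width=512, output_height=512)
```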
When NOT to use:
- If you must create new geometry not present in the SVG (like fracture lines) or perform heavy path morphing unavailable in your setup.
- If target output must be a raster video with cinematic effects outside CSS's comfort zone.
- If the SVG is intentionally abstract or lacks recognizable parts and labels would be arbitrary.
Open questions:
- Automatic splitting: Can we learn to subdivide oversized paths into meaningful subparts when instructions demand it?
- Domain transfer: How well does this semantic recovery idea generalize to 3D scenes, GUIs, or diagram editors?
- Active view selection: Can the system adaptively choose the most informative views per primitive to cut down queries?
- Joint planning–labeling loops: Could re-planning after preliminary labels further improve reliability and motion design?
06 Conclusion & Future Work
Three-sentence summary: Vector Prism makes SVG animations reliable by first recovering the missing semantic structure of the artwork. It asks a VLM about each tiny shape from several views, estimates which views are trustworthy, and uses a Bayes-weighted decision to label parts, then restructures the SVG accordingly. With clean part groups, an LLM writes CSS that moves exactly the right pieces, producing coherent, instruction-faithful animations that beat prior vector and even video approaches.
Main achievement: Revealing and fixing the semantic–syntactic gap in SVGs, using multi-view weak labels plus statistical reliability estimation, so that animation can target meaningful parts, not just low-level shapes.
Future directions: Add automatic path subdivision for finer-grained motion, extend beyond CSS to richer animation stacks, and apply the same semantic recovery principle to other symbolic domains like 3D assets. Explore adaptive view strategies to reduce queries and tighter planner–labeler feedback loops.
Why remember this: Before Vector Prism, many systems tried to animate without truly knowing what the parts were; after it, we can recover that meaning first and make motion that feels correct, compact, and web-ready. It's a simple idea (trust the reliable views more), but it unlocks a big improvement in how AI works with vector graphics.
Practical Applications
- Animate brand logos smoothly on landing pages without heavy video files.
- Create engaging UI micro-interactions (button bounce, toggle flips, loader spins) that are semantically precise.
- Build educational diagrams (heart pumping, planets orbiting) where each labeled part moves correctly.
- Produce interactive infographics where elements reveal, fade, or rotate by semantic groups.
- Design lightweight onboarding animations that scale perfectly on any screen size.
- Automate motion for icon libraries by labeling parts (e.g., arrows, rings, stars) and reusing animation recipes.
- Enhance accessibility by ensuring key motions target the intended parts (e.g., focus rings pulsing).
- Speed up prototyping: designers can describe desired motion, and the system generates coherent CSS.
- Generate dynamic SVG stickers or badges that stay crisp when resized in apps.
- Localize motion behaviors (e.g., right-to-left UI mirroring) by adjusting semantic groups without redrawing.