SlideTailor: Personalized Presentation Slide Generation for Scientific Papers

Wenzheng Zeng; Mingyu Ouyang; Langyuan Cui; Hwee Tou Ng

12/23/2025

arXiv PDF

Key Summary

• SlideTailor is an AI system that turns a scientific paper into personalized presentation slides that match what a specific user likes.
• Instead of asking the user to write long instructions, it learns their taste from two easy things: one example paper-with-slides pair and a visual PowerPoint template.
• It first figures out hidden preferences (what to include and how it should look), then plans the talk, and finally edits a template to produce fully editable .pptx slides.
• A new chain-of-speech step writes a mini speech draft alongside each slide, so the visuals and narration fit together naturally.
• The system follows a human-like workflow: learn style → organize content → pick layouts → render slides.
• A new PSP benchmark with diverse preferences and clear metrics tests how well systems match user taste and overall quality.
• Across tests, SlideTailor beats prior systems like ChatGPT, AutoPresent, and PPTAgent on both alignment to preferences and presentation quality.
• Ablations show that preference distillation and chain-of-speech each provide big quality gains, especially for clarity and flow.
• The output slides are editable, and the speech script enables auto-generated video talks with cloned voices.
• Limitations include focus on scientific papers, reliance on large models, and the need for more human-aligned evaluation.

Why This Research Matters

SlideTailor turns slide-making from a chore into a tailored process that reflects your own voice. By learning from a single example deck and a template, it makes personalization simple and natural, with no long instruction lists. Teachers, researchers, and professionals can quickly get clear, coherent slides that match their storytelling habits and branding. The built-in speech drafts help people present confidently and even enable automatic video talks. Because the output is fully editable, users can fine-tune the final touches in minutes. This approach raises the bar for usability and quality in AI-powered slide creation.

Detailed Explanation

01 Background & Problem Definition

You know how everyone decorates their bedroom differently—same furniture pieces, totally different vibes? Presentations are like that. Two people can present the same paper, but the story, the highlights, and the look-and-feel can be completely different.

The Concept: Before this paper, most slide generators treated the job like copying: take a paper, shrink it, and pour it into slides. That helped a little but ignored personal taste—how much detail someone likes, which parts they emphasize, and what styles they find clear or beautiful. Without capturing the presenter’s style, the results often felt generic.

How it worked before (and why that broke):

One-size-fits-all summarizers: They squish the paper’s text into shorter text. They rarely think about layouts, color, or which figures to bring over.
Layout-only helpers: Some newer tools picked nice templates, but still didn’t capture how you, specifically, like to tell your story—like whether you love starting with a big-picture challenge or jumping straight to the results.
Super-limited personalization: A few tried simple toggles like “expert vs. non-expert” or “short vs. long,” but real humans have way richer, messier preferences.

Why that matters in daily life: Presentations are about persuasion and clarity. If slides don’t match your voice—maybe too wordy, too sparse, wrong focus—they can confuse your audience, waste time, and lower impact. Teachers, students, scientists, and professionals all need slides that fit their style to communicate well.

🍞 Hook: Imagine lending your friend your favorite hoodie, and they return it after washing it on super-hot. It still ‘fits,’ but it’s not you anymore. 🥬 The Concept (Preference-Guided Generation): SlideTailor is built on the idea that slide-making should follow the presenter’s personal taste. It aims to generate slides that fit a user’s style for content and visuals.

How it works: (1) Look at an example paper-with-slides pair and a PowerPoint template; (2) learn what the user likes to include/omit and how they like things to look; (3) plan the talk; (4) render slides that match both taste and message.
Why it matters: Without guiding by preference, the AI makes bland, off-target slides—like getting a haircut that ignores the style you wanted. 🍞 Anchor: If you always start talks with ‘Why this problem matters,’ SlideTailor learns that rhythm from your example and does it again for new papers.

But there’s a twist: asking users to write detailed instructions (“Please emphasize ablations, keep intros short, prefer bullets over paragraphs…”) is tiring and unnatural. People rarely work that way. They usually point to an example deck they like and a template they trust, so SlideTailor takes exactly those two things as input.

🍞 Hook: You know how your friend can guess your pizza order just by remembering what you picked last time? 🥬 The Concept (Implicit Preference Distillation): The system learns your style by studying a past paper-with-slides pair and your chosen template—no labels, no forms.

How it works: (1) From your old paper and slides, it learns your content style—narrative flow, level of detail, what to skip; (2) from the template, it reads aesthetics—layouts, colors, fonts, and typical placements.
Why it matters: Without learning from examples, the AI can’t capture subtle habits like “use short phrases” or “show results earlier.” 🍞 Anchor: If your past deck used short bullets with supporting images, SlideTailor will do the same for the new paper.

Finally, SlideTailor mirrors human behavior: first internalize preferences, then plan content and narration together, then pick layouts, and finally produce editable slides. It even drafts what you might say out loud, so speech and slides match.

🍞 Hook: Think about planning a school talk: you outline your story and also practice what you’ll say for each slide. 🥬 The Concept (Chain-of-Speech Mechanism): While planning slides, the system writes a mini speech draft for each one.

How it works: For each slide in the outline, it drafts narration that explains the point, ensuring the text-on-slide and spoken words support each other.
Why it matters: Without this, slides and speech can drift apart—pretty slides but awkward explanations, or vice versa. 🍞 Anchor: When the slide shows “Key Idea + Diagram,” the speech draft explains the idea in friendly words that match the diagram on-screen.

That is the world this paper builds: a practical, preference-aware way to generate slides that sound like you and look like you—without you doing all the heavy lifting.

02 Core Idea

The key insight in one sentence: You can personalize slide generation by learning a user’s hidden content and style preferences from one example paper-with-slides pair and a chosen template, then plan slides and speech together before editing a template to create polished, editable slides.

Three analogies for the same idea:

Tailored suit: A tailor measures one outfit you love and a fabric style you chose, then crafts a new suit that fits you, not just a mannequin.
Favorite playlist: By peeking at one playlist you adored and your preferred album covers (template), a DJ makes a new playlist for a different party with the same vibe.
Recipe remix: A chef watches how you season one dish and your plating style, then cooks a new recipe in your flavor and plating aesthetics.

🍞 Hook: You know how drawing a comic works better when you script the dialogue while sketching panels? 🥬 The Concept (Human Behavior-Inspired Framework): SlideTailor follows a human-like workflow: learn your style → reorganize the paper into a talk → pick layouts → render slides.

How it works: (1) Distill preferences from your example and template; (2) make a presentation-friendly summary and outline; (3) choose per-slide layouts from your template; (4) write into an editable .pptx.
Why it matters: Skipping these steps makes slides that look okay but don’t tell your story your way. 🍞 Anchor: Like a teacher prepping a lesson: first understanding the class’s needs, then structuring the lecture, choosing board layouts, and finally writing the content neatly.

Before vs. After:

Before: Systems summarized papers and sometimes chose layouts, but ignored your personal flow, emphasis, and style.
After: SlideTailor learns your narrative style and aesthetic rules from examples, plans with a narration in mind, and produces slides that feel like you made them.

Why it works (intuition, not equations):

Examples speak louder than instructions: Your past slides reveal your real habits better than a checklist.
Separate but combine: Content preferences (what and how to say) are learned from the sample pair; aesthetic preferences (how it looks) are learned from the template. Treating them as two branches keeps learning clearer.
Plan first, format second: Aligning slides with a speech draft ensures coherence when rendering the final deck.

Building blocks (the moving parts):

Preference distillation: learns a ‘map’ from paper sections to your talk structure and a schema of your favorite layouts.
Paper reorganizer: turns a dense paper into a presenter-friendly storyline following your flow (e.g., Motivation → Method → Results).
Slide outline designer with chain-of-speech: plans each slide’s key message and drafts what you will say.
Template selector: matches each slide’s needs (text, image, table) to one of the template’s layouts.
Slide renderer: edits the chosen template slides in place to produce a polished, fully editable .pptx.

🍞 Hook: Imagine your friend guessing your comic-book style from one of your old comics and your favorite page template. 🥬 The Concept (Implicit Preference Distillation): Infer content choices (what to keep or skip) from an old paper+slides and infer visual rules (layout, fonts, colors) from a template.

How it works: Language and vision models examine examples to build a structured ‘preference profile’ covering narrative flow, section emphasis, and layout schemas.
Why it matters: Without this, the system can’t generalize your style to new papers. 🍞 Anchor: If your sample deck always summarizes background quickly and spends more time on results, the new deck does too.

🍞 Hook: You know how good storytellers plan both what appears on a slide and how they’ll talk through it? 🥬 The Concept (Chain-of-Speech Mechanism): Drafts narration along with each slide plan.

How it works: For each outline item, create a brief speech that explains the visuals.
Why it matters: Keeps slides and speech in lockstep, improving clarity. 🍞 Anchor: A results slide gets a short narration that calls out exactly what the chart shows and why it matters.

In short, SlideTailor’s aha! moment is to learn your taste from what you naturally provide (an example and a template), then to plan content and speech together before committing to visuals.

03 Methodology

At a high level: Input (paper + example pair + template) → Distill preferences → Plan content and speech → Select template layouts → Edit template → Output editable slides.
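The flow above can be sketched as a chain of stage functions that each enrich a shared job object. This is only a structural sketch: the names `DeckJob`, `artifacts`, and the four stage functions are hypothetical, and each stage body is a placeholder for what would be a language or vision-language model call in the real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DeckJob:
    """Inputs plus intermediate results (all field names are hypothetical)."""
    paper: str
    sample_paper: str
    sample_slides: str
    template_path: str
    artifacts: Dict[str, str] = field(default_factory=dict)


def distill_preferences(job: DeckJob) -> DeckJob:
    # Placeholder: a real system would call language/vision models here.
    job.artifacts["profile"] = "content + aesthetic preferences"
    return job


def plan_content_and_speech(job: DeckJob) -> DeckJob:
    job.artifacts["outline"] = "per-slide key messages with speech drafts"
    return job


def select_layouts(job: DeckJob) -> DeckJob:
    job.artifacts["layouts"] = "one template layout per planned slide"
    return job


def edit_template(job: DeckJob) -> DeckJob:
    job.artifacts["deck"] = "editable .pptx"
    return job


# The order mirrors the high-level flow described in the text.
STAGES: List[Callable[[DeckJob], DeckJob]] = [
    distill_preferences,
    plan_content_and_speech,
    select_layouts,
    edit_template,
]


def run(job: DeckJob) -> DeckJob:
    for stage in STAGES:
        job = stage(job)
    return job


job = run(DeckJob("new paper", "old paper", "old slides", "template.pptx"))
print(list(job.artifacts))  # ['profile', 'outline', 'layouts', 'deck']
```

Keeping every stage behind the same signature makes it easy to swap one stage out, which is also how an ablation (for example, skipping preference distillation) could be set up.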

Stage 1: Implicit Preference Distillation 🍞 Hook: Think of learning to cook your parent’s soup by tasting it and seeing the bowls they use, not by reading a recipe card. 🥬 The Concept (Preference Distillation): Convert unlabeled examples into a structured profile of content and aesthetic preferences.

What happens: Two branches run in parallel.
1. Content branch: From the sample paper and its slides, a language model infers your narrative flow (e.g., Title → Motivation → Method → Results), what gets expanded vs. condensed, and stylistic choices like bullets vs. paragraphs.
2. Aesthetic branch: From the .pptx template, a vision-language model plus file parsing identifies slide types (title, content-with-image, table slide) and precise positions, colors, and fonts.
Why this step exists: Without the profile, later steps can’t stay true to your style.
Example: If your sample deck skips ‘Related Work’ and your template favors a two-column layout, the profile encodes those patterns. 🍞 Anchor: Next time, when creating a new deck, the system knows to keep ‘Related Work’ minimal and to pick two-column layouts for comparison slides.
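To make the idea of a structured profile concrete, here is a minimal sketch of what such a profile might hold. The class names, fields, and the set-difference heuristic are assumptions for illustration only; in the described system the inference is done by language and vision-language models, not by string matching.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContentProfile:
    flow: List[str]     # order of topics in the sample deck
    skipped: List[str]  # paper sections the sample deck left out


@dataclass
class AestheticProfile:
    layouts: List[str]  # layout types found in the template


@dataclass
class PreferenceProfile:
    content: ContentProfile
    aesthetic: AestheticProfile


def distill_profile(paper_sections: List[str],
                    slide_topics: List[str],
                    template_layouts: List[str]) -> PreferenceProfile:
    """Toy stand-in for model-based distillation (hypothetical heuristic)."""
    covered = set(slide_topics)
    # Sections present in the paper but absent from the slides count as skipped.
    skipped = [s for s in paper_sections if s not in covered]
    return PreferenceProfile(
        content=ContentProfile(flow=list(slide_topics), skipped=skipped),
        aesthetic=AestheticProfile(layouts=list(template_layouts)),
    )


profile = distill_profile(
    paper_sections=["Introduction", "Related Work", "Method", "Results"],
    slide_topics=["Introduction", "Method", "Results"],
    template_layouts=["title", "two_column", "table"],
)
print(profile.content.skipped)  # ['Related Work']
```

Keeping the content and aesthetic parts in separate objects mirrors the two-branch design: each can be inferred independently and then consulted together during planning.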

Stage 2: Preference-Guided Slide Planning This stage includes three agents: paper reorganizer, slide outline designer (with chain-of-speech), and template selector.

2.1 Paper Reorganizer 🍞 Hook: Imagine rearranging a long textbook chapter into a punchy class lecture. 🥬 The Concept: Turn the dense paper into a presenter-friendly storyline that matches your profile.

What happens: The agent orders sections to match your flow and adjusts detail levels—maybe short background, clear problem framing, focused methods, and highlighted results.
Why it matters: Raw papers are not talks; without reorganization, slides become cluttered and confusing.
Example: A methods-heavy paper becomes a talk that opens with motivation, gives a high-level method overview, then shows key results. 🍞 Anchor: Your reorganized content forms the backbone for slide-by-slide planning.

2.2 Slide Outline Designer with Chain-of-Speech 🍞 Hook: When you outline an essay, you also jot what you want to say under each bullet. 🥬 The Concept: For each slide, draft the key message, supporting points, visuals to include (like a figure from the paper), and a short speech draft.

What happens: The agent segments the reorganized content into slide-sized chunks and writes a mini narration that explains the point clearly.
Why it matters: Without narration planning, slides and speech drift apart; with it, they reinforce each other.
Example with data: Suppose Slide 6 is ‘Experimental Results.’ The outline lists the main finding, picks two figures from the paper, and drafts a 4–6 sentence speech calling out trends and takeaways. 🍞 Anchor: These per-slide plans directly drive layout selection and final rendering.
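A per-slide plan like this can be pictured as a small record that keeps the on-slide text and the speech draft side by side. The sketch below is an assumption-laden illustration: `SlidePlan`, `content_words`, and `unsupported_bullets` are hypothetical names, the slide content is made up, and the word-overlap check is a crude stand-in for the coherence that the chain-of-speech step aims for.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SlidePlan:
    title: str
    bullets: List[str]
    figures: List[str]
    speech: str  # the mini narration drafted alongside the slide


def content_words(text: str) -> Set[str]:
    """Lowercased words longer than three characters (a rough filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def unsupported_bullets(plan: SlidePlan) -> List[str]:
    """Bullets that share no content word with the speech draft."""
    spoken = content_words(plan.speech)
    return [b for b in plan.bullets if not (content_words(b) & spoken)]


plan = SlidePlan(
    title="Experimental Results",
    bullets=["Accuracy improves with preference distillation",
             "Lower cost per deck"],
    figures=["Figure 3", "Figure 4"],
    speech="Accuracy rises once preference distillation is switched on, "
           "as both figures show.",
)
print(unsupported_bullets(plan))  # ['Lower cost per deck']
```

A flagged bullet signals that the slide and the narration have drifted apart, so either the speech should mention the point or the bullet should go.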

2.3 Template-Aware Layout Selection 🍞 Hook: Picking the right container makes packing a suitcase neat and fast. 🥬 The Concept: Match each planned slide to the best-fitting template layout.

What happens: The agent compares the outline’s needs (text-only, text-plus-image, comparison, table) to the library of template layouts, choosing the closest match.
Why it matters: The right layout avoids crowded text, awkward images, or empty space.
Example: A ‘method overview + diagram’ slide gets a layout with a big image spot and bullet area; a ‘results table’ slide uses a table-first layout. 🍞 Anchor: The chosen layouts keep the deck visually consistent with your template.
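One simple way to picture this matching is as a scoring problem: prefer the layout that leaves the fewest required elements unplaced, then the one with the fewest empty slots. The function name, the layout names, and the scoring rule below are all hypothetical; the paper describes an agent making this choice, not this exact rule.

```python
from typing import Dict, Set, Tuple


def select_layout(needs: Set[str], layouts: Dict[str, Set[str]]) -> str:
    """Pick the layout whose slots best fit what the slide needs."""
    def score(item: Tuple[str, Set[str]]) -> Tuple[int, int, str]:
        name, slots = item
        missing = len(needs - slots)  # elements with nowhere to go
        unused = len(slots - needs)   # placeholders that would stay empty
        return (missing, unused, name)  # name breaks ties deterministically

    return min(layouts.items(), key=score)[0]


# Hypothetical template library: layout name -> element types it can hold.
template_layouts = {
    "title_only": {"title"},
    "text_image": {"title", "text", "image"},
    "table_first": {"title", "table"},
    "text_only": {"title", "text"},
}

print(select_layout({"title", "text", "image"}, template_layouts))  # text_image
print(select_layout({"title", "table"}, template_layouts))          # table_first
```

Penalizing missing slots before unused ones reflects the priorities in the text: crowded or unplaced content is a bigger problem than a little empty space.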

Stage 3: Slide Realization 🍞 Hook: After sketching your comic panels, you ink and color them carefully inside the lines. 🥬 The Concept: Edit the selected template slides with the planned content.

What happens: A layout-aware agent maps titles, bullets, and visuals to exact placeholders; a code agent writes the changes into a .pptx so everything remains editable.
Why it matters: Direct editing preserves the professional look of the template and lets you tweak anything afterward.
Example: The agent resizes a chart to fit its placeholder, injects the slide title, and formats bullets as per style (short phrases, not long paragraphs). 🍞 Anchor: You open the final .pptx and can adjust a word or move an image—nothing is a frozen screenshot.

Secret Sauce (what’s especially clever):

Learning from natural artifacts (one example deck and a template) instead of long, fussy instructions.
Separating content taste from visual taste, then harmonizing them during planning.
Writing a speech draft alongside each slide (chain-of-speech) to boost clarity and to enable instant video narration.

End-to-End Flow Recap: Input → (Learn preferences) → (Reorganize and outline with speech) → (Pick layouts) → (Render editable slides).

04 Experiments & Results

🍞 Hook: You know how judging a baking contest needs clear rules like ‘taste,’ ‘texture,’ and ‘presentation’? 🥬 The Concept (PSP Benchmark): The authors built a new benchmark to fairly test personalized slide generation.

What it is: PSP (Paper-to-Slides with Preferences) includes 200 target papers, 50 sample paper-with-slides pairs (for content taste), and 10 templates (for visual taste), creating up to 100,000 input combinations (200 × 50 × 10).
How it works: Systems must read a new paper, a style example pair, and a template, then produce slides; evaluation checks both preference-matching and general quality.
Why it matters: Without a diverse benchmark, it’s hard to know if a system really adapts to different users. 🍞 Anchor: Like testing many recipes with many tasters and plates to make sure the chef adapts well.
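The combination count follows directly from multiplying the three pool sizes. The short sketch below checks the arithmetic and shows how the input triples could be enumerated lazily; the constant and function names are hypothetical, and the integer indices simply stand in for the actual papers, pairs, and templates.

```python
from itertools import islice, product

N_TARGET_PAPERS = 200
N_SAMPLE_PAIRS = 50
N_TEMPLATES = 10

total = N_TARGET_PAPERS * N_SAMPLE_PAIRS * N_TEMPLATES
print(total)  # 100000


def iter_inputs():
    """Lazily yield (paper, sample_pair, template) index triples."""
    return product(range(N_TARGET_PAPERS),
                   range(N_SAMPLE_PAIRS),
                   range(N_TEMPLATES))


first_three = list(islice(iter_inputs(), 3))
print(first_three)  # [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
```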

The tests and why they matter:

Preference-based metrics (do you match the user’s style?):
1. Coverage: Do you include the same big topics as the sample deck?
2. Flow: Is the order of topics similar to the sample deck’s storyline?
3. Content Structure: Does your pacing/detail/formatting feel like the sample’s style?
4. Aesthetic: Do you visually follow the given template (layouts, colors, fonts)?
Preference-independent metrics (are your slides good, period?):
1. Content quality: Clear, accurate communication of the paper’s key ideas.
2. Aesthetic quality: Overall visual appeal and professional design.

🍞 Hook: Imagine checking a book report: Did you cover the same chapters? In the same order? Is your writing style similar? Does the layout look like the teacher asked? 🥬 The Concept (Coverage & Flow): Topic inclusion and ordering.

How it works: Compare the generated deck’s big sections with the sample’s sections and order.
Why it matters: Matching structure makes the talk feel like ‘your’ style. 🍞 Anchor: If the sample goes Title → Motivation → Method → Results, the new deck should feel similar.
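As a rough illustration of what topic inclusion and ordering could look like as numbers, the sketch below uses set overlap for coverage and a longest-common-subsequence ratio for flow. These formulas are an assumed proxy, not the benchmark's actual scoring procedure, and the topic lists are made up.

```python
from typing import List


def coverage(sample: List[str], generated: List[str]) -> float:
    """Fraction of the sample deck's topics that appear in the new deck."""
    if not sample:
        return 0.0
    present = set(generated)
    return sum(1 for topic in sample if topic in present) / len(sample)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def flow(sample: List[str], generated: List[str]) -> float:
    """Share of the sample's topic order that the new deck preserves."""
    if not sample:
        return 0.0
    return lcs_length(sample, generated) / len(sample)


sample_deck = ["Title", "Motivation", "Method", "Results"]
new_deck = ["Title", "Method", "Motivation", "Results", "Limitations"]

print(coverage(sample_deck, new_deck))  # 1.0
print(flow(sample_deck, new_deck))      # 0.75
```

Here the new deck covers every sample topic, but swapping Motivation and Method breaks the ordering, so only three of the four topics can be kept in sequence.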

🍞 Hook: Think of a band covering a song in the same style—same tempo, same groove. 🥬 The Concept (Content Structure Metric): Judge similarity of pacing, detail level, bullets vs. paragraphs, and transitions.

How it works: A language model scores structural similarity on a 1–5 scale.
Why it matters: Even with different topics, the style should feel like the same presenter. 🍞 Anchor: If you like short bullets and quick transitions, the metric rewards decks that do the same.

🍞 Hook: Matching the dress code at a party matters if you want to ‘fit’ the theme. 🥬 The Concept (Aesthetic Metric): Judge how well the deck adheres to the template’s look.

How it works: A vision-language model checks layout, background, colors, fonts, and recurring elements.
Why it matters: Great content can still look off if visuals ignore the template. 🍞 Anchor: If the template uses a top banner and specific fonts, the deck should too.

Competitors compared:

ChatGPT: Accepts multimodal inputs but struggles to reliably learn visual style and sometimes skips figures/tables.
AutoPresent: Strong at structuring text but text-only input makes aesthetic alignment weak and visuals less faithful.
PPTAgent: Good at using templates but weaker at matching content structure preferences from the sample pair.

Scoreboard with context:

Overall, SlideTailor (with GPT-4.1) reached an overall average score of about 75.8%. In relative terms, think of that as a solid A where the other systems land closer to B or C. It led on both preference-following and general quality.
SlideTailor with open-source Qwen2.5 models also performed strongly, suggesting the approach is robust across backbones.

Surprising/interesting findings:

Chain-of-speech is a big deal: Removing it dropped general content quality sharply in ablations, indicating that narration planning boosts clarity.
Content preference distillation matters: Turning it off reduced structure matching (coverage/flow) by around 10%, suggesting the sample pair really does capture the presenter’s style.
Costs are modest per deck, especially with open-source models, making the approach practical.

Human evaluations:

Graduate student judges preferred SlideTailor over PPTAgent in over 80% of cases, aligning with automated metrics and suggesting the benefits are noticeable to real people.

Takeaway: SlideTailor consistently matches user taste and produces clearer, more coherent talks than strong baselines—and the benchmark and metrics make those gains measurable.

05 Discussion & Limitations

Limitations:

Domain focus: The benchmark and demonstrations center on scientific papers; business decks, lessons, or marketing pitches may need different patterns and visuals.
Model reliance: The framework leans on large language/vision models; on smaller hardware or with limited APIs, performance and cost trade-offs appear.
Template dependence: If the provided template has too few useful layouts, some slides might be forced into imperfect shapes.
Evaluation gap: Automated judges (multimodal LLMs, or MLLMs) correlate with human ratings but still miss fine visual details; humans remain stricter, especially on aesthetics.

Required resources:

An example paper-with-slides pair and a .pptx template from the user.
Access to capable language and vision-language models (proprietary or strong open-source) and a simple .pptx editing toolchain.

When not to use:

If no example deck or template is available and the user cannot articulate preferences, a simpler general-purpose slide generator may suffice.
If the task requires highly custom visuals beyond what templates can support (e.g., bespoke infographics), manual design work might be faster.
If the content domain is extremely visual (e.g., art portfolios) or extremely code/math-heavy (dense derivations) without suitable templates, alignment may suffer.

Open questions:

Broader domains: How to extend preference profiles to business, education, and marketing where story arcs differ?
Learning preferences over time: Can the system fine-tune a lasting user profile from multiple decks, not just one pair?
Better evaluation: How to design human-aligned, fine-grained visual judges that notice typography, spacing, and subtle layout flaws?
End-to-end training: Would a fully trained multimodal model (instead of an agent pipeline) boost stability without losing adaptability?
Collaborative editing: How to combine AI-generated drafts with lightweight human feedback loops for rapid refinement?

In short, SlideTailor shows that example-driven personalization works well for slides, but scaling to new domains, improving aesthetic judging, and making preference learning continual are promising next steps.

06 Conclusion & Future Work

Three-sentence summary: SlideTailor generates personalized slides from scientific papers by learning a user’s content and visual preferences from a sample paper-with-slides pair and a chosen template. It plans slides and narration together, then edits the template to produce clean, fully editable .pptx decks that match the user’s style. A new benchmark and metrics show SlideTailor outperforms strong baselines on both preference alignment and overall quality.

Main achievement: Turning natural, easy-to-provide artifacts (an example deck and a template) into a robust preference profile—and coupling that with chain-of-speech planning—to deliver talks that feel authentically ‘yours.’

Future directions: Broaden to non-academic domains, build longer-term user profiles across many decks, explore end-to-end multimodal training, and create better human-aligned evaluators for fine-grained aesthetics. Also, integrate quick human-in-the-loop tweaks (like ‘more visuals on slides 4–5’) to refine drafts instantly.

Why remember this: SlideTailor reframes slide generation as tailoring, not shrinking—learning from what you already have and speaking in your voice. That shift makes automated slides more useful in classrooms, research talks, and professional settings, where clarity and personality matter most.

Practical Applications

• Create personalized research talk slides that match a lab’s usual flow and template.
• Generate lecture slides aligned with a teacher’s preferred pacing and visual style.
• Draft company update decks that follow brand templates and an executive’s emphasis habits.
• Produce grant or proposal presentations with the applicant’s typical narrative arc.
• Make thesis defenses that reflect the student’s preferred structure and slide density.
• Turn preprints into conference-ready slides that highlight results the way the presenter likes.
• Build webinar decks with consistent layouts and narration that fit the host’s tone.
• Auto-generate narrated video presentations using the slide-aligned speech drafts.
• Standardize team-wide slide quality by learning a shared style from one exemplar deck.
• Rapidly localize a presentation’s style for different audiences (expert vs. general) by swapping the example pair and template.
