Enhancing Sentiment Classification and Irony Detection in Large Language Models through Advanced Prompt Engineering Techniques
Key Summary
- Giving large language models a few good examples and step-by-step instructions can make them much better at spotting feelings in text.
- The paper compares two popular models (GPT-4o-mini and gemini-1.5-flash) on tasks like simple sentiment, aspect-based sentiment, and irony detection.
- Few-shot prompts (showing a couple of labeled examples) were the most reliable booster overall, especially for GPT-4o-mini.
- Chain-of-thought prompts (think step by step) made gemini-1.5-flash dramatically better at detecting irony, raising its F1-score by up to 46%.
- Neutral is hard: both models tended to avoid the neutral label until the prompts included neutral examples, which fixed a big chunk of errors.
- Zero-shot "just explain your reasoning" (zero-shot-CoT) often hurt performance, showing that unguided reasoning can wander off-track.
- Self-consistency (ask the model several times and vote) increased cost and sometimes locked in confident but wrong answers in GPT-4o-mini.
- Different models like different prompting styles, so the best prompt depends on both the model's design and the task's complexity.
- The study used careful testing and bootstrap confidence intervals to check that improvements were real and not just lucky.
- Bottom line: prompt engineering is not one-size-fits-all; it's a smart toolbox you tailor to the model and the job.
Why This Research Matters
Better prompts let AI actually hear what people mean, not just what they say. That helps customer support react kindly to frustrated users, even when the frustration is wrapped in sarcasm. Brands can discover exactly which product features people love or dislike, and in which languages, without building a new model every time. Safer moderation becomes possible when irony and context are recognized instead of ignored. Teams also save money and time by picking the right prompting style for each model, preventing wasted API calls and low-accuracy runs. Altogether, this turns powerful LLMs into practical, trustworthy tools for real communication.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're coaching two smart students for a feelings-detective contest. One is great with quick examples, the other shines when allowed to explain their thinking out loud. If you give both the same instructions, you won't get their best. But if you coach each one the way they learn best, both improve a lot.
The Concept (Prompt Engineering): It's the art of writing the model's instructions so it answers better. How it works: 1) You tell the model what role to take (like "be a sentiment expert"), 2) you show it what to do (maybe with examples), 3) you ask for the answer in a clear format. Why it matters: Without good prompts, even a powerful model can guess, get confused, or miss subtle clues like sarcasm.
Anchor: If you ask, "Good or bad review?" the model might wobble. If you say, "You are a movie-review judge. Here are two examples of good and bad. Now label this one as 'positive' or 'negative'," it steadies and improves.
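To make the "role + examples + format" recipe concrete, here is a minimal Python sketch; the build_prompt helper and its wording are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch of a role + labeled-examples + strict-output-format prompt.
# The role text and examples are made up for illustration.

def build_prompt(text, examples):
    """Assemble a prompt: role, a few labeled examples, then the item to classify."""
    lines = [
        "You are a movie-review sentiment judge.",
        "Label the text as 'positive' or 'negative'. Answer with the label only.",
        "",
    ]
    for example_text, label in examples:
        lines.append(f"Text: {example_text}\nLabel: {label}")
    lines.append(f"Text: {text}\nLabel:")
    return "\n".join(lines)

prompt = build_prompt(
    "Best day ever!",
    examples=[
        ("What a waste of time.", "negative"),
        ("I could watch this again tomorrow.", "positive"),
    ],
)
print(prompt)
```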
Hook: You know how you can tell if a text message sounds happy or grumpy? Computers need help to do that too.
The Concept (Sentiment Classification): It's deciding whether a text is positive, negative, or sometimes neutral. How it works: 1) Read the text, 2) look for clues (words, context), 3) pick a label like positive/negative/neutral. Why it matters: Without it, apps can't sort happy customers from upset ones or spot trends.
Anchor: "I loved the food but hated the wait" might be mixed, while "Best day ever!" is clearly positive.
Hook: When you review a laptop, you might like the battery but dislike the keyboard. That's two different feelings in one review.
The Concept (Aspect-Based Sentiment Analysis, ABSA): It finds the feeling about each part (aspect) of something, like battery life or service. How it works: 1) Identify the aspect (e.g., "battery"), 2) read the sentence around it, 3) label that aspect's sentiment. Why it matters: Without ABSA, we miss the details: companies can't fix what's broken or double down on what people love.
Anchor: "Battery life is amazing, but the screen is dim." ABSA says: battery=positive, screen=negative.
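As a toy illustration of what ABSA produces (one label per aspect rather than one per review), here is a short sketch; the review and labels are made-up examples, not items from the paper's datasets.

```python
# Toy ABSA output: one sentiment label per aspect, not one per review.
review = "Battery life is amazing, but the screen is dim."
aspect_sentiments = {"battery life": "positive", "screen": "negative"}

for aspect, sentiment in aspect_sentiments.items():
    print(f"{aspect}: {sentiment}")
```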
Hook: Have you ever said "Great job…" but meant the exact opposite? That's irony.
The Concept (Irony Detection): It spots when words say one thing but mean another. How it works: 1) Notice tone or context clues, 2) check if the literal meaning clashes with the situation, 3) decide if it's ironic or not. Why it matters: Without it, systems get fooled and think insults are compliments or miss a cry for help hidden in jokes.
Anchor: Tweet: "Oh perfect, my phone died right before the test." The words say "perfect," the meaning is "this is bad": that's irony.
The world before: Classic tools (like counting word pieces, called n-grams) did okay at simple cases, but struggled with tricky ones (neutral vs. slightly positive, or irony). New large language models (LLMs) are much smarter, but they still need the right instructions.
The problem: We didn't know which prompting tricks help which model on which task. People tried using LLMs "as-is," and results jumped around depending on the wording of the prompt.
Failed attempts: Zero-shot "just do it" prompts often missed subtlety, and even "think step by step" without examples could drift into overthinking. Some prior work found that small fine-tuned models beat big models with weak prompts.
The gap: We lacked a clear, apples-to-apples test of advanced prompt styles (few-shot, chain-of-thought, self-consistency) across different sentiment jobs and across different LLMs.
Real stakes: Better prompts mean kinder chatbots that spot sarcasm, smarter tools that pull out exactly what customers liked or disliked, safer systems that catch harmful content hidden behind jokes, and better support across languages (like English and German).
02 Core Idea
Hook: Think of prompts like recipe cards. If the card shows a couple of finished cookies (examples) or explains each baking step (reasoning), your cookies turn out tastier. But not every baker needs the same card.
The Concept: The key insight is that the best prompt depends on both the model and the task: a few examples make GPT-4o-mini shine, while step-by-step thinking supercharges gemini-1.5-flash for irony. How it works: 1) Pick a task (simple sentiment, ABSA, irony), 2) choose a prompt style (few-shot, chain-of-thought, self-consistency), 3) test and measure (accuracy, F1), 4) keep the style that fits the model and task best. Why it matters: Without tailoring, you may pick a prompt that actually makes things worse, even if it sounds fancy.
Anchor: Showing GPT-4o-mini a few labeled tweets boosts its scores; asking gemini-1.5-flash to "think aloud" catches irony much better.
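A minimal sketch of that "test each style, keep the best fit" step: score every candidate prompt style on held-out data and keep the winner. The style names follow the paper, but the F1 values below are placeholders, not reported results.

```python
# Pick the prompt style with the best validation score for one model on one task.
# The F1 values are placeholders, not the paper's numbers.

def pick_best_style(f1_by_style):
    """Return the style name with the highest weighted F1."""
    return max(f1_by_style, key=f1_by_style.get)

validation_f1 = {
    "baseline": 0.60,
    "few_shot": 0.71,
    "chain_of_thought": 0.66,
    "self_consistency": 0.65,
}
print(pick_best_style(validation_f1))  # -> "few_shot" for this made-up run
```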
Multiple analogies for the same idea:
- Coaching analogy: Different students learn best by examples (few-shot) or by explaining their steps (CoT). Good teachers match the method to the student and the subject.
- GPS analogy: On highways (simple tasks), a straight route is fine. In twisty side streets (irony), turn-by-turn directions (CoT) keep you on track.
- Eyeglasses analogy: Few-shot is like putting on lenses matched to the lighting; CoT is like switching on a reading lamp. Both help, but one may help more for your eyes and your book.
Before vs After:
- Before: Prompts were often one-size-fits-all and results were uneven: neutral was underused and irony was often missed.
- After: Few-shot reliably lifts performance (especially for GPT-4o-mini), and CoT unlocks big irony gains for gemini-1.5-flash. Neutral improves when you include neutral examples.
Why it works (intuition):
- Few-shot anchors the model: tiny, well-chosen examples act like a mini-lesson right before the test, shaping what the model pays attention to.
- CoT structures attention: step-by-step reasoning nudges the model to weigh context and contradictions, which is key for irony.
- Self-consistency reduces randomness by voting, but if the base reasoning path is wrong, voting can confidently lock in the wrong answer.
Building blocks:
- Clear role and output instructions (so answers are well-formed).
- Class-balanced examples (especially include neutral to fight polarity bias).
- Reasoning templates for tasks needing pragmatics (irony) or aspect focus (ABSA).
- Stability checks (bootstrap confidence intervals) to be sure gains are real, not luck.
03 Methodology
Hook: Picture an assembly line that turns raw sentences into useful labels, like a factory that sorts fruit by color and ripeness.
The Concept (High-Level Pipeline): Input text → choose prompt style → ask LLM → get label (and sometimes reasoning) → score against the correct answer. How it works: 1) Pick the dataset (movie reviews, tweets, ABSA, irony), 2) craft the prompt (baseline, few-shot, CoT, self-consistency), 3) set decoding to be steady (temperature=0.2), 4) run 1,000 sampled items per dataset through each model, 5) compute accuracy, precision, recall, F1; use bootstrap to check significance. Why it matters: Without a careful, repeatable recipe, we can't tell if a change helped or just looked lucky.
Anchor: Example item: "Battery lasts all day, but the screen is dim." ABSA prompt: "Aspect: screen. Label: negative."
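Here is a minimal sketch of that pipeline, assuming scikit-learn for scoring; call_llm is a hypothetical placeholder for whichever chat API you use (the paper uses GPT-4o-mini and gemini-1.5-flash), and the single ABSA item is a toy example.

```python
# Minimal pipeline sketch: build a prompt, ask the model, collect labels, score them.
from sklearn.metrics import f1_score

def call_llm(prompt, temperature=0.2):
    # Hypothetical placeholder: swap in a real API client (GPT-4o-mini, gemini-1.5-flash, ...).
    return "negative"

def baseline_prompt(text):
    return f"Classify the sentiment of the aspect as positive, neutral, or negative.\n{text}\nLabel:"

def classify(text, build_prompt):
    reply = call_llm(build_prompt(text), temperature=0.2)  # steady, low-temperature decoding
    return reply.strip().lower()

def evaluate(dataset, build_prompt):
    gold = [label for _, label in dataset]
    preds = [classify(text, build_prompt) for text, _ in dataset]
    return f1_score(gold, preds, average="weighted", zero_division=0)

toy_data = [("Battery lasts all day, but the screen is dim. Aspect: screen", "negative")]
print(evaluate(toy_data, baseline_prompt))
```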
Recipe steps in detail:
- Inputs and tasks
  - Datasets: SST-2 (positive/negative movie reviews), SB10k (German tweets: positive/neutral/negative), SemEval-2014 ABSA (aspect labels: positive/neutral/negative), SemEval-2018 irony (ironic vs. not).
  - Why this set: covers simple polarity, multilingual 3-class, aspect specifics, and tricky irony.
- Baseline prompt
  - What happens: Give a simple instruction like "Classify as positive or negative."
  - Why it exists: A fair starting line to beat.
  - Example: "Text: 'What a waste of time.' → negative."
- Few-shot prompting
  - What happens: Show 2 examples per class (more for neutral when needed), then ask for the label of the new text.
  - Why it exists: Tiny lessons right before the quiz anchor the model's expectations.
  - Example: Provide two neutral tweet examples; then classify "Meh. It's okay, I guess." as neutral.
- Chain-of-thought (CoT)
  - What happens: Ask the model to explain step-by-step before the final label.
  - Why it exists: Complex cues (like irony) need reasoning about context and contradictions.
  - Example: "Oh perfect, it's raining on game day" → steps: word "perfect" vs. bad situation → label: ironic.
- Zero-shot CoT
  - What happens: Same as CoT but with no examples, only "think step-by-step" instructions.
  - Why it exists: Tries to get reasoning without curating examples.
  - Example: "Analyze step-by-step before answering" for a new review.
- Self-consistency
  - What happens: Run the CoT prompt multiple times and take a majority vote (n=3, here with low temperature).
  - Why it exists: Reduce random flukes by aggregating answers.
  - Example: Three runs say ironic, ironic, not ironic → final: ironic (see the voting sketch after this list).
- Output formatting and scoring
  - Output: a clean label; sometimes an explanation (for CoT).
  - Metrics: accuracy, precision, recall, F1 (weighted and macro), plus class-wise results.
  - Significance: Bootstrap 1,000 resamples to form a 95% confidence interval of F1 differences. If it doesn't cross zero, the improvement is real.
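Below is a small sketch of the self-consistency vote described in the list above; cot_label is a hypothetical stand-in for a single chain-of-thought call, and n=3 matches the setup in the paper.

```python
# Self-consistency sketch: sample several chain-of-thought answers and keep the majority label.
from collections import Counter

def cot_label(text):
    # Hypothetical placeholder for one chain-of-thought run that ends in a final label.
    return "ironic"

def self_consistent_label(text, n=3):
    votes = Counter(cot_label(text) for _ in range(n))
    return votes.most_common(1)[0][0]

print(self_consistent_label("Oh perfect, my phone died right before the test."))
```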
The Secret Sauce:
- Class-balanced, task-specific examples (especially adding neutral examples) were the quiet hero. They corrected the model's natural tilt toward polarized labels.
- Matching method to model: few-shot unlocked steady gains in GPT-4o-mini; CoT unlocked big irony jumps in gemini-1.5-flash.
- Keeping temperature low (0.2) stabilized outputs; for self-consistency, higher temperature could explore more reasoning paths, but costs rise.
Concrete walkthroughs:
- SB10k (German) neutral fix: One-shot with a neutral example raised gemini's neutral recall from 0.37 to 0.51. Few-shot with extra neutral examples spread improvements across all three classes for GPT-4o-mini, lifting weighted F1 to about 0.72 (+14% over baseline).
- Irony (SemEval-2018): gemini-1.5-flash with CoT improved weighted F1 to ~0.60, a +46% jump over its baseline; crucially, recall for "no irony" rose from 0.06 (baseline) to 0.38 (CoT), reducing the model's habit of over-calling irony.
04 Experiments & Results
Hook: Think of a classroom scoreboard where each prompt style is a strategy card. We try each card and keep the ones that lift the grades.
The Concept: The study measured accuracy, precision, recall, and F1 to see which prompting styles beat a simple baseline across four datasets. How it works: 1) Test each style (baseline, one-shot, few-shot, CoT, zero-shot-CoT, self-consistency) on GPT-4o-mini and gemini-1.5-flash, 2) record scores per class and overall, 3) use bootstrap intervals to check if gains are truly better, not noise. Why it matters: Without fair tests and context, numbers can mislead.
Anchor: Saying "F1 = 0.95" is like saying "an A+ in a class where most students get a B." It tells you how far ahead it really is.
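As a small illustration of the scoring step (per-class precision/recall/F1 plus a confusion matrix), here is a scikit-learn sketch; the gold and predicted labels are toy placeholders, not data from the paper.

```python
# Per-class metrics and a confusion matrix for a toy three-class run.
from sklearn.metrics import classification_report, confusion_matrix

gold  = ["positive", "negative", "neutral", "positive", "neutral", "negative"]
preds = ["positive", "negative", "positive", "positive", "neutral", "positive"]

print(classification_report(gold, preds, zero_division=0))  # per-class and weighted averages
print(confusion_matrix(gold, preds, labels=["positive", "neutral", "negative"]))
```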
The scoreboard with context:
- SST-2 (binary, English movie reviews)
  - GPT-4o-mini: Few-shot reached weighted F1 ≈ 0.93 (+2%). That's like nudging an already strong A a little higher.
  - gemini-1.5-flash: CoT hit ≈ 0.95 (+12%), a clear A+ leap over baseline.
- SB10k (German, three-class)
  - GPT-4o-mini: Few-shot ≈ 0.72 F1 (+14%), a big jump: neutral improved and positives/negatives also got cleaner.
  - gemini-1.5-flash: Few-shot ≈ 0.61 (+15%), a meaningful gain, though smaller than GPT-4o-mini's.
- ABSA (SemEval-2014)
  - GPT-4o-mini: Few-shot ≈ 0.85 F1 (+2.4%). Helpful, but not dramatic.
  - gemini-1.5-flash: CoT/self-consistency hovered around 0.83 (+2.5%). Many improvements weren't statistically strong.
- Irony (SemEval-2018)
  - GPT-4o-mini: Few-shot modestly improved to ≈ 0.76 (+4%); CoT/self-consistency often hurt.
  - gemini-1.5-flash: CoT ≈ 0.60 (+46%), turning a struggling baseline into a much more reliable detector.
Surprising findings:
- Zero-shot-CoT sometimes underperformed the plain baseline, showing that step-by-step reasoning without examples can overthink and head down the wrong path.
- Self-consistency occasionally made GPT-4o-mini confidently wrong (the majority voted for the same mistake), proving that more votes don't help if the reasoning template is off.
- A single neutral example in one-shot already nudged models to use "neutral" more appropriately.
Statistical checks:
- Bootstrap confidence intervals confirmed many gains (like SB10k few-shot and gemini irony CoT) were significant: improvements unlikely to be due to luck.
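A minimal sketch of that bootstrap check, assuming numpy and scikit-learn: resample the paired predictions 1,000 times and inspect the 95% interval of the F1 difference. The label arrays below are toy placeholders, not the paper's data.

```python
# Bootstrap a 95% confidence interval for the F1 gain of few-shot over baseline.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
gold     = np.array(["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"])
baseline = np.array(["pos", "neg", "pos", "pos", "neu", "neg", "pos", "neg"])
few_shot = np.array(["pos", "neg", "neu", "pos", "neg", "neu", "neg", "neg"])

diffs = []
for _ in range(1000):
    idx = rng.integers(0, len(gold), len(gold))  # resample items with replacement
    f1_few = f1_score(gold[idx], few_shot[idx], average="weighted", zero_division=0)
    f1_base = f1_score(gold[idx], baseline[idx], average="weighted", zero_division=0)
    diffs.append(f1_few - f1_base)

low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for the F1 difference: [{low:.3f}, {high:.3f}]")  # significant if it excludes 0
```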
05 Discussion & Limitations
Hook: Even the best recipes have limits: if the oven runs too hot or the ingredients are unusual, you might still burn the cookies.
The Concept: The method works well, but it's not magic. There are limits, resource needs, and times when you shouldn't use a given trick. How it works: 1) Name the limits, 2) name the needed resources, 3) say when not to use which method, 4) list open questions we still need to answer. Why it matters: Knowing the edges prevents mistakes and saves time and money.
Anchor: If your model keeps misreading sarcasm, don't just add more voting; teach it better examples or change the reasoning style.
Limitations:
- Model-specific: Results are for GPT-4o-mini and gemini-1.5-flash; other LLMs may respond differently.
- Data were sampled to 1,000 items per task to control cost; rare patterns may be under-tested.
- Manually designed prompts; no automated prompt search or ablations to see which pieces mattered most.
- Fixed decoding settings (temperature=0.2); didn't explore temperature-method interactions.
- Limited linguistic error analysis, especially for irony's subtle cues.
Required resources:
- Access to LLM APIs, prompt engineering time, and budget for multiple runs (self-consistency multiplies cost/time).
When NOT to use certain methods:
- Don't use zero-shot-CoT for nuanced tasks if initial tests drop performance; it can encourage confident overthinking.
- Avoid self-consistency if base reasoning is shaky; it can lock in wrong answers at extra cost.
- Skip heavy CoT on simple, high-throughput tasks where few-shot already nails it; you'll pay more for little gain.
Open questions:
- Which prompt pieces (role, examples, wording) drive most of the gains?
- How do different LLM architectures respond to the same prompt blueprint?
- Can retrieval-augmented prompts reduce irony and ABSA mistakes by grounding context?
- What temperature and sampling settings best pair with self-consistency for balanced exploration vs. stability?
06 Conclusion & Future Work
Hook: Like matching the right shoes to the right sport, matching the right prompt to the right model and task makes a huge difference.
The Concept: This study shows that prompt engineering is a toolbox, not a single tool: few-shot is the reliable wrench for GPT-4o-mini, while chain-of-thought is the precision screwdriver for gemini-1.5-flash's irony task. How it works: 1) Test styles on each task, 2) keep class-balanced examples (neutral matters!), 3) use step-by-step only when it truly helps, 4) verify with statistics. Why it matters: One-size-fits-all prompts can waste money and lower accuracy; tailored prompts deliver dependable gains.
Anchor: Add neutral examples to fix neutral bias; turn on CoT for gemini irony detection; use few-shot as the everyday default for GPT-4o-mini.
3-sentence summary:
- Advanced prompting methods noticeably improve sentiment tasks without fine-tuning, but the best method depends on the model and the task.
- Few-shot prompting was the most consistently helpful, while chain-of-thought especially lifted gemini-1.5-flash on irony.
- Zero-shot-CoT and self-consistency can backfire if misapplied, so validate each method with careful tests.
Main achievement: A clear, side-by-side map showing which prompting techniques help which model on which kind of sentiment problem, and by how much.
Future directions: Automate prompt search and ablation, try retrieval-augmented prompting for ABSA and irony, explore temperature/self-consistency tradeoffs, and extend tests across more LLM architectures and languages.
Why remember this: Prompting isn't magic words; it's smart coaching. When you match the coaching style to the learner (the model) and the game (the task), performance jumps and trust in AI grows.
Practical Applications
- Customer support triage: Use few-shot prompts to reliably separate positive, neutral, and negative tickets.
- Social media monitoring: Add neutral examples to curb polarity bias and better capture brand sentiment.
- Product review mining: Use ABSA prompts to extract aspect-level likes and dislikes (e.g., battery vs. screen).
- Sarcasm-aware chatbots: Turn on CoT for models like gemini-1.5-flash to better detect ironic complaints.
- Crisis detection: Flag ironic or backhanded posts that might hide urgent issues behind jokes.
- Multilingual feedback: Prompt in the same language as the data (e.g., German for SB10k) to boost accuracy.
- Quality dashboards: Track per-class F1 and confusion matrices to pinpoint where prompts need tuning.
- Cost control: Prefer few-shot for routine sentiment tasks; reserve self-consistency for critical edge cases.
- A/B test prompts: Run bootstrap intervals to confirm real gains before deploying widely.
- Prompt playbooks: Build role + examples + format templates for repeatable, high-accuracy labeling.