HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Key Summary
- HyperVL is a small but smart model that understands images and text, designed to run fast on phones and tablets.
- It avoids wasting power by shrinking images to just the right size with a tiny helper called the Visual Resolution Compressor.
- It uses two vision encoders (a small and a large one) that both talk to the same language brain, so it can pick speed or accuracy on the fly.
- Dual Consistency Learning trains both vision encoders to agree, so switching between them doesn’t confuse the language brain.
- An image-tiling trick chops big pictures into manageable pieces, keeping memory steady even for huge screenshots.
- On public tests, HyperVL matches or beats bigger models, especially on OCR and document tasks like reading receipts and charts.
- The compressor cuts visual tokens by about 20% while keeping almost all accuracy, saving time and battery.
- On a real mobile chip, HyperVL shows about a 12.9× speed-up and around 6.8× lower peak memory than a baseline.
- Even with 4-bit weights (low precision), its accuracy barely drops, which is great for phones.
- This makes private, fast, and reliable on-device AI assistants more practical for everyday use.
Why This Research Matters
HyperVL brings powerful vision-and-language skills directly onto your phone, so sensitive images like bills or medical notes never have to leave your device. It saves battery and feels snappy by using just the right image detail for each task and switching to a faster or stronger “eye” when needed. This means better assistants that can read receipts, understand app screens, and summarize documents—even offline. Companies save cloud costs while users gain privacy and faster responses. The approach is robust under low-bit precision, making it practical for real-world phones. In short, it bridges the gap between cloud-level smarts and pocket-sized efficiency.
Detailed Explanation
01 Background & Problem Definition
You know how taking a super high-resolution photo looks amazing but makes your phone slow when you try to edit or share it? AI models that see and read (multimodal models) have the same struggle. Before this research, the strongest multimodal models—great at reading text in images, understanding charts, and doing reasoning—mostly ran in the cloud. They were huge and hungry for memory and compute, especially because their vision part (often a Vision Transformer) gets very slow as images get bigger. That made it hard to put them directly on phones, where memory and battery are limited.

What was the world like before? Big, cloud-sized models like GPT-4o, Gemini, or large open-source models could solve tough visual problems, but needed strong servers. Smaller on-device models started to appear and were getting smarter, but the vision encoders still slowed everything down when images were large—like long receipts, complex app screenshots, or detailed diagrams. The Vision Transformer’s cost grows fast with image size, so higher resolution meant waiting longer and using more memory.

The problem: On-device apps need to handle high-resolution, text-rich images (think: invoices, shipping labels, app UIs) quickly and privately. Standard tricks, like squishing images (aggressive downsampling) to reduce tokens, often throw away details you really need—tiny prices, dates, or labels—so accuracy drops. People tried a few things: compressing visual tokens with extra networks; trimming tokens; or building smaller encoders. These helped a bit but often lost key details or still struggled with worst-case latency spikes on high-res inputs. A consistent issue remained: the model didn’t adapt to the image. It used the same heavy vision path for both easy and hard pictures, and the same resolution even when a lower one would do fine.

The missing piece (the gap): adaptivity. Instead of always using the maximum resolution and the biggest vision encoder, could the model quickly decide how much detail is needed and which vision path to use—without breaking the language brain’s understanding? Also, could it process giant images without exploding memory, especially on mobile NPUs?

Why this matters to real life: When your phone runs the model, your photos, receipts, and app screens never leave your device—better privacy. It also saves cloud costs and works even with poor internet. That means faster, safer smart features: reading bills, filling forms, helping with homework from a photo, guiding you through an app’s settings, or ranking images for better search. But to be helpful, it must be fast and accurate—no one likes a slow, battery-draining assistant that misses the tiny text that matters.

Enter HyperVL. It rethinks the vision side for phones and tablets. It makes three big moves: (1) an adaptive image shrinker that picks the just-right resolution per image, (2) two coordinated vision routes (small and large) that produce consistent outputs so the shared language brain stays calm, and (3) a memory-friendly tiling trick so even huge screenshots get processed in steady, small chunks. Put together, this means lower latency, lower memory, and high accuracy on the tasks people actually use on-device.
02 Core Idea
🍞 Hook: Imagine reading a billboard. If the text is huge, you can stand far away; if it’s tiny, you step closer. You automatically pick the right viewing distance. AI should do that too. 🥬 The Concept (Aha!): HyperVL’s key insight is to be adaptive: it picks how much image detail to use and which of two vision paths to run—so it stays fast on easy images and precise on hard ones—while keeping the same language brain. How it works, big picture:
- A tiny helper (Visual Resolution Compressor) looks at an image and predicts how much to shrink it (from 10% to 100%) so we don’t overpay for detail we don’t need.
- Two vision encoders (small and large) are trained to agree with each other (Dual Consistency Learning), so we can swap them without confusing the language part.
- A tiling strategy chops big images into tiles to cap memory and keep latency predictable.

Why it matters: Without adaptivity, phones waste compute on simple images and choke on big ones. HyperVL uses just enough detail and just enough muscle. 🍞 Anchor: If you snap a restaurant bill, HyperVL may use a higher resolution and the bigger vision path to read tiny totals. If you show a clear sign that says “EXIT,” it shrinks the image and uses the lighter path for a quick answer.
Now let’s explain the building blocks in the best learning order, each with the sandwich pattern:
- Multimodal Large Language Model (MLLM) 🍞 Hook: You know how you understand a comic better than just words or just pictures? Because you combine both. 🥬 The Concept: An MLLM is a model that understands and generates text using clues from images (and sometimes other media).
- What it is: One brain that connects what it sees with what it reads/writes.
- How it works: (1) A vision part turns the image into features; (2) a projector maps those features into the language space; (3) a language model reasons and replies.
- Why it matters: Without a shared brain, the model can’t tie a picture of a chart to an answer about its trend. 🍞 Anchor: You show a chart and ask, “Did sales go up?” The model looks, matches the visual trend to language, and answers, “Yes, up 40%.”
- Vision Transformer (ViT) 🍞 Hook: Imagine scanning a poster with a magnifying glass, checking different spots and how they relate. 🥬 The Concept: A ViT is an image model that looks at small patches and learns how parts of the image relate to each other.
- What it is: A transformer for images; it processes patch tokens and models their relationships.
- How it works: (1) Split an image into patches; (2) turn them into tokens; (3) use attention to connect important parts; (4) output a visual embedding.
- Why it matters: Without ViT-quality features, the language brain won’t get reliable visual clues, especially for text and fine details. 🍞 Anchor: Reading a receipt image: the ViT helps spot where the total is and relate it to items above.
- Visual Resolution Compressor (VRC) 🍞 Hook: You don’t need a microscope to read a street sign, but you might need one for tiny print. 🥬 The Concept: VRC is a tiny model that predicts how much to shrink each image before the main vision model sees it.
- What it is: A lightweight scaler that picks a compression ratio (10%–100%).
- How it works: (1) Glance at the image; (2) estimate info density; (3) pick the smallest safe resolution; (4) feed that to the encoder.
- Why it matters: Without it, the model wastes time and memory on easy images or loses accuracy by always shrinking too much. 🍞 Anchor: A clean stop sign → shrink a lot, still read “STOP”; a dense invoice → keep it big to read tiny totals.
- Image Tiling Strategy 🍞 Hook: Cutting a giant pizza into slices makes it easy to hold and eat. 🥬 The Concept: Tiling splits big images into fixed-size pieces so the model processes them without blowing up memory.
- What it is: A way to cap peak memory by handling one tile at a time.
- How it works: (1) Keep aspect ratio with light padding; (2) split into tiles; (3) encode tiles; (4) combine features for the LLM.
- Why it matters: Without tiling, huge screenshots can exceed on-chip memory, causing big slowdowns and crashes. 🍞 Anchor: A 4K phone screenshot becomes neat tiles, each processed smoothly, so the app stays responsive.
- Dual Consistency Learning (DCL) 🍞 Hook: Two students study the same topic; the teacher checks that their answers match so either can present for the team. 🥬 The Concept: DCL trains a small and a large vision encoder to produce semantically consistent outputs for the same image.
- What it is: A training strategy that aligns the small branch to the large one (teacher–student) and alternates training so both map to the same language space.
- How it works: (1) Alternate which branch you train; (2) use the large branch as teacher; (3) nudge the small branch to match the teacher’s predictions on text outputs.
- Why it matters: Without DCL, swapping branches would confuse the language brain, hurting accuracy. 🍞 Anchor: Whether the small or large vision path sees your bus schedule photo, the LLM still answers, “Next bus at 3:42 PM.”
- Dynamic Switching Mechanism 🍞 Hook: Like choosing a bike’s gear—low gear for hills (power), high gear for flats (speed). 🥬 The Concept: The system can switch between the small and large vision encoders based on the task and device budget.
- What it is: A runtime choice: pick speed (small) or precision (large) without retraining the language part.
- How it works: (1) Policy decides branch (user setting, device profile, or task type); (2) shared projector and LLM keep outputs consistent; (3) you get a balanced result.
- Why it matters: Without switching, you pay too much compute on easy tasks or lose accuracy on hard ones. 🍞 Anchor: Quick icon ID on a phone uses the small path; a densely printed contract uses the large path for accuracy.
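To make the switching idea concrete, here is a minimal policy sketch. The paper treats branch choice as a runtime decision, but the exact rule is not spelled out here, so the `select_branch` helper, its thresholds, and the task/device labels below are illustrative assumptions, not HyperVL's actual policy.

```python
from dataclasses import dataclass

# Hypothetical runtime context; field names are illustrative, not from the paper.
@dataclass
class RuntimeContext:
    latency_budget_ms: float   # how long the caller is willing to wait
    device_tier: str           # e.g., "mid" or "flagship"
    task_hint: str             # e.g., "icon_id", "document_qa", "chart_qa"

# Tasks that tend to need fine detail (tiny text, dense layout).
DETAIL_HEAVY_TASKS = {"document_qa", "chart_qa", "ocr"}

def select_branch(ctx: RuntimeContext) -> str:
    """Pick the 'small' (fast) or 'large' (precise) vision branch.

    A minimal heuristic: prefer the large branch only when the task likely
    needs fine detail AND the device or latency budget can afford it.
    """
    if ctx.task_hint in DETAIL_HEAVY_TASKS and ctx.device_tier == "flagship":
        return "large"
    if ctx.task_hint in DETAIL_HEAVY_TASKS and ctx.latency_budget_ms >= 1500:
        return "large"
    return "small"

# Quick icon identification on a mid-tier phone -> small branch.
print(select_branch(RuntimeContext(300, "mid", "icon_id")))            # small
# Contract reading on a flagship -> large branch.
print(select_branch(RuntimeContext(2000, "flagship", "document_qa")))  # large
```

The point of the sketch: the decision itself uses cheap signals (task hint, device tier, latency budget), so choosing a branch costs almost nothing compared with running the wrong one.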
Before vs After: Before: One-size-fits-all vision path and resolution—slow on big images, wasteful on easy ones. After: Adaptive resolution and twin aligned encoders—fast when possible, precise when needed. Why it works: Most images don’t need maximum detail; saving tokens reduces the transformer’s cost dramatically. Tiling keeps memory flat. Alignment means one language brain works well with either visual path. Together, they turn a cloud-style model into a phone-ready helper.
03 Methodology
At a high level: Image + Prompt → Visual Resolution Compressor decides scale → AnyRes keeps aspect ratio + padding → Image Tiling splits into tiles → Choose Small or Large Vision Encoder → Vision features → Projector maps to LLM tokens (with token-length reduction) → Shared LLM generates answer.
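Before walking through each step, here is a minimal sketch of how those stages could be wired together. The modules are toy stand-ins (tiny sizes, a fixed compression ratio, no real LLM) and the interfaces are assumptions for illustration, not the paper's actual API; the token compaction inside the projector is omitted here and sketched further below.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy stand-ins for the real components; sizes are illustrative only.
TILE = 64              # tile side (the paper's examples use larger tiles, e.g. 512)
PATCH = 16             # ViT patch size (assumption)
V_DIM, L_DIM = 32, 48  # vision / language hidden sizes (toy values)

class TinyViT(nn.Module):
    """Stand-in vision encoder: patchify one tile and embed each patch."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=PATCH, stride=PATCH)
    def forward(self, tile):                      # tile: (3, TILE, TILE)
        feat = self.proj(tile.unsqueeze(0))       # (1, dim, TILE/PATCH, TILE/PATCH)
        return feat.flatten(2).transpose(1, 2)    # (1, n_patches, dim)

def predict_ratio(image):
    """Stub for the Visual Resolution Compressor: returns a scale in [0.1, 1.0]."""
    return 0.6  # fixed value standing in for the learned prediction

def resize_and_tile(image, ratio):
    """Scale the image, pad to a multiple of TILE, and split into tiles."""
    c, h, w = image.shape
    nh, nw = max(TILE, int(h * ratio)), max(TILE, int(w * ratio))
    image = F.interpolate(image.unsqueeze(0), size=(nh, nw), mode="bilinear",
                          align_corners=False).squeeze(0)
    ph, pw = (-nh) % TILE, (-nw) % TILE           # light padding to full tiles
    image = F.pad(image, (0, pw, 0, ph))
    tiles = image.unfold(1, TILE, TILE).unfold(2, TILE, TILE)
    return tiles.reshape(c, -1, TILE, TILE).permute(1, 0, 2, 3)  # (n_tiles, 3, T, T)

# Shared projector: maps vision features into the language embedding space.
projector = nn.Sequential(nn.Linear(V_DIM, L_DIM), nn.GELU(), nn.Linear(L_DIM, L_DIM))
encoders = {"small": TinyViT(V_DIM), "large": TinyViT(V_DIM)}  # same toy size here

def encode(image, branch="small"):
    ratio = predict_ratio(image)                       # 1) adaptive resolution
    tiles = resize_and_tile(image, ratio)              # 2) AnyRes-style pad + tiling
    vit = encoders[branch]                             # 3) small or large vision path
    feats = torch.cat([vit(t) for t in tiles], dim=1)  # encode tile by tile
    return projector(feats)                            # 4) tokens for the shared LLM

image = torch.rand(3, 200, 300)                        # a toy "screenshot"
visual_tokens = encode(image, branch="small")
print(visual_tokens.shape)                             # (1, n_visual_tokens, L_DIM)
```

Even in this toy form, the structure mirrors the flow above: predict a ratio, resize and tile, pick a branch, encode tile by tile, and project into the language space.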
Step-by-step details and why each exists:
- Input handling and scaling (VRC)
- What happens: A tiny MobileNet-based compressor looks at a resized preview and outputs a compression ratio between 0.1 and 1.0. The original image is then scaled accordingly.
- Why it exists: To avoid overpaying for resolution when the image is simple; to preserve detail when the image is dense (e.g., tiny text).
- Example: A street sign photo might get 0.3Ă— scale; a dense receipt might get 1.0Ă—.
- What breaks without it: Latency and memory spike on easy images; aggressive fixed shrinkage loses details on hard images.
- Aspect-ratio preservation (AnyRes-like preprocessing)
- What happens: The image is minimally scaled to fit a target side while keeping aspect ratio; if needed, padding fills the rest.
- Why it exists: To avoid stretching that distorts text or shapes; keeps geometry consistent for the encoder.
- Example: A tall phone screenshot remains tall (with padding), so UI elements stay proportional.
- What breaks without it: Distorted characters or icons reduce OCR and recognition accuracy.
- Image Tiling
- What happens: The scaled image is split into non-overlapping fixed-size tiles. Tiles are encoded sequentially or in small batches.
- Why it exists: To cap peak memory and keep activations within on-chip buffers, avoiding expensive memory swaps.
- Example: A 2048Ă—4096 screenshot becomes a grid of tiles (e.g., 512Ă—512 each), processed steadily.
- What breaks without it: Memory blow-ups and sharply rising (roughly quadratic) latency on high-resolution inputs.
- Dynamic branch selection (Small vs Large ViT)
- What happens: At runtime, the system picks the small or large Vision Transformer branch based on device, user setting, or task type (e.g., lightweight for icon ID, heavy for contracts).
- Why it exists: To balance speed and accuracy per use case while using one shared language model.
- Example: On a mid-tier phone, default to small; on a flagship, use large for document tasks.
- What breaks without it: Either wasted compute on easy tasks or poor accuracy on hard tasks.
- Visual encoding (ViT)
- What happens: Each tile is patchified into tokens; attention layers relate patches; outputs are pooled or kept as sequences.
- Why it exists: To produce rich visual features that capture both local detail (tiny text) and global context (layout).
- Example: For a receipt tile, it encodes digits, lines, and table structure.
- What breaks without it: The language model would lack reliable visual evidence.
- Vision-language projector + token reduction
- What happens: A two-layer MLP maps vision features into the LLM’s embedding space. A token compaction (e.g., pixel shuffle or similar) reduces the number of visual tokens sent to the LLM (about 4× shorter sequences); a minimal sketch of this step follows the list.
- Why it exists: To align modalities and cut the cost inside the LLM, where long sequences are expensive.
- Example: 4,000 visual tokens become 1,000 tokens before hitting the LLM.
- What breaks without it: The LLM struggles with very long sequences, slowing dramatically and using more memory.
- Shared LLM reasoning
- What happens: The LLM (e.g., Qwen3 ~1.7B) receives the visual tokens plus the text prompt, and generates the answer.
- Why it exists: To unify perception with reasoning and language output.
- Example: Given a chart image and “How much did it increase from 2005 to 2013?”, the LLM outputs “40%.”
- What breaks without it: No natural-language answers or multi-step reasoning.
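The token-compaction step referenced above deserves its own sketch. The paper names pixel shuffle "or similar" as the mechanism; below is a minimal pixel-unshuffle-style sketch that merges each 2×2 block of patch tokens into one token with 4× the channels, followed by a two-layer MLP projector. The dimensions and merge factor are assumptions for illustration.

```python
import torch
from torch import nn

def pixel_shuffle_compact(tokens: torch.Tensor, grid: int, factor: int = 2):
    """Merge each factor x factor block of visual tokens into one token.

    tokens: (batch, grid*grid, dim) patch tokens laid out on a grid x grid map.
    Returns (batch, (grid//factor)**2, dim*factor*factor): 4x fewer tokens for
    factor=2, with the spatial detail folded into the channel dimension.
    """
    b, n, d = tokens.shape
    assert n == grid * grid and grid % factor == 0
    x = tokens.view(b, grid, grid, d)
    x = x.view(b, grid // factor, factor, grid // factor, factor, d)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (grid // factor) ** 2, d * factor * factor)

# Two-layer MLP projector from (compacted) vision features to LLM embeddings.
vis_dim, llm_dim, factor = 64, 128, 2
projector = nn.Sequential(
    nn.Linear(vis_dim * factor * factor, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_tokens = torch.rand(1, 16 * 16, vis_dim)          # e.g. a 16x16 patch grid
compact = pixel_shuffle_compact(patch_tokens, grid=16)  # (1, 64, 256): 4x fewer tokens
llm_tokens = projector(compact)                         # (1, 64, 128) ready for the LLM
print(patch_tokens.shape[1], "->", llm_tokens.shape[1], "visual tokens")
```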
Training the twin vision branches to agree (the secret sauce):
- Alternating training: The small and large branches are trained in alternating steps so both learn a stable mapping into the same language space.
- Semantic consistency (teacher–student): The large branch provides soft guidance; the small branch is nudged to produce similar text predictions. Importantly, this alignment focuses on text outputs (not image tokens), which keeps the signal stable and avoids overfitting to low-level features.
- Why this is clever: It lets you swap branches at runtime without surprising the LLM, keeping accuracy steady while giving you control over speed.
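Here is a minimal sketch of what this alternating, text-level distillation could look like in code. It uses a toy stand-in model (two linear "branches" and one shared head instead of real ViTs and an LLM), and it assumes the soft guidance is a KL divergence between the branches' text-token distributions; the actual loss weights and schedule are not specified here.

```python
import torch
import torch.nn.functional as F
from torch import nn

def consistency_loss(student_logits, teacher_logits, tau: float = 1.0):
    """KL(teacher || student) over next-token distributions for the text answer.

    Aligning text-level predictions (not raw image features) is the spirit of
    the paper's semantic consistency between the two vision branches.
    """
    t = F.softmax(teacher_logits / tau, dim=-1)
    s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau

# Toy stand-in: two "vision branches" feeding one shared language head.
vocab, dim = 100, 16
branches = nn.ModuleDict({"small": nn.Linear(8, dim), "large": nn.Linear(8, dim)})
shared_head = nn.Linear(dim, vocab)
opts = {k: torch.optim.SGD(list(branches[k].parameters())
                           + list(shared_head.parameters()), lr=1e-2)
        for k in branches}

def run(branch, x):
    return shared_head(branches[branch](x))       # (batch, seq, vocab) logits

def dcl_step(step, x, target):
    """Alternate branches; on small-branch steps, also distill from the large branch."""
    if step % 2 == 0:                              # train the large (teacher) branch
        logits = run("large", x)
        loss = F.cross_entropy(logits.flatten(0, 1), target.flatten())
        opt = opts["large"]
    else:                                          # train the small (student) branch
        with torch.no_grad():
            teacher = run("large", x)              # frozen teacher predictions
        logits = run("small", x)
        ce = F.cross_entropy(logits.flatten(0, 1), target.flatten())
        loss = ce + consistency_loss(logits, teacher)
        opt = opts["small"]
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x = torch.rand(2, 5, 8)                            # toy "visual features"
y = torch.randint(0, vocab, (2, 5))                # toy text-token targets
for step in range(4):
    print(step, round(dcl_step(step, x, y), 3))
```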
Training the Visual Resolution Compressor (how it learns to pick scales):
- Data construction: For each training image–question–answer triple, create many compressed versions (10%–100%). Compute the loss of a reference model on each version and find the smallest resolution that doesn’t hurt performance beyond a tiny tolerance. That becomes the target ratio.
- Model: A lightweight MobileNet backbone + pooling + MLP predicts the ratio from a resized preview.
- Loss: Mean squared error between predicted and target ratio.
- Why this is clever: The compressor learns how much detail each image really needs for the downstream task, not just generic image quality.
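A minimal sketch of both pieces, assuming a torchvision MobileNetV3-Small backbone, a 2% loss tolerance, and a 0.1-1.0 candidate grid; the tolerance, grid spacing, and head sizes are illustrative, not the paper's exact values.

```python
import torch
from torch import nn
from torchvision.models import mobilenet_v3_small

# --- Target construction (offline): smallest ratio that doesn't hurt the task ---
CANDIDATE_RATIOS = [r / 10 for r in range(1, 11)]   # 0.1 ... 1.0 compressed versions
TOLERANCE = 0.02                                    # allowed loss increase (assumption)

def make_target_ratio(losses_by_ratio: dict[float, float]) -> float:
    """Given reference-model losses at each ratio (keys drawn from CANDIDATE_RATIOS),
    pick the smallest ratio whose loss stays within TOLERANCE of full resolution."""
    full_res_loss = losses_by_ratio[1.0]
    for r in sorted(losses_by_ratio):               # try smallest ratios first
        if losses_by_ratio[r] <= full_res_loss + TOLERANCE:
            return r
    return 1.0

# --- The compressor itself: MobileNet features -> pooling -> MLP -> ratio ---
class ResolutionCompressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = mobilenet_v3_small(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(576, 128),
                                  nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
    def forward(self, preview):                     # preview: (B, 3, H, W) small resize
        r = self.head(self.pool(self.backbone(preview)))
        return 0.1 + 0.9 * r.squeeze(-1)            # map into the [0.1, 1.0] range

# --- Training step: simple MSE between predicted and target ratios ---
model = ResolutionCompressor()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

previews = torch.rand(4, 3, 224, 224)               # resized previews of 4 images
# One target comes from the offline search above; the others are placeholders.
targets = torch.tensor([make_target_ratio({0.3: 1.90, 0.6: 1.72, 1.0: 1.71}),
                        0.4, 0.8, 1.0])
loss = nn.functional.mse_loss(model(previews), targets)
opt.zero_grad(); loss.backward(); opt.step()
print(round(loss.item(), 4))
```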
Overall training pipeline (3 phases):
- Phase 1 (Alignment): Freeze ViT and LLM; train the projector on caption-style data so visual features talk the LLM’s language.
- Phase 2 (Knowledge enhancement): Unfreeze and pretrain on varied multimodal data (and some text-only) to grow knowledge and robustness; compute loss only on text tokens.
- Phase 3 (Multi-task): Use high-quality reasoning and synthetic chain-of-thought data to strengthen step-by-step skills.
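A minimal sketch of how the Phase 1 freezing could be expressed, assuming a model object with `vit`, `projector`, and `llm` submodules (the attribute names are hypothetical):

```python
from torch import nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a submodule."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_phase(model: nn.Module, phase: int) -> None:
    """Phase 1: train only the projector (alignment).
    Phases 2-3: unfreeze everything; the data mix changes, not the freezing."""
    set_trainable(model.vit, phase != 1)
    set_trainable(model.llm, phase != 1)
    set_trainable(model.projector, True)

# Toy model with the assumed submodule names, just to show the call.
model = nn.Module()
model.vit, model.projector, model.llm = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
configure_phase(model, phase=1)
print([p.requires_grad for p in model.vit.parameters()])   # [False, False]
```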
Concrete walk-through example:
- Input: A 3000Ă—3000 invoice photo. VRC predicts 0.6, so it becomes 1800Ă—1800.
- AnyRes pads minimally to keep aspect ratio. Tiling splits into 512Ă—512 tiles.
- Branch choice: Since it’s a document, the large vision branch is chosen.
- ViT encodes each tile; projector maps to LLM space and reduces tokens 4Ă—.
- LLM reads: “What is the total due?” It locates the total line and outputs “$342.17.”
- Total effect: Much faster than full-res everywhere, but still accurate on tiny text.
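For a rough sense of scale, here is the back-of-envelope token arithmetic for this walkthrough, assuming 512×512 tiles, a 16-pixel ViT patch (an assumption; the patch size is not stated here), and the roughly 4× token compaction:

```python
import math

# Walkthrough numbers from the example above.
h = w = 3000                 # original invoice photo
ratio = 0.6                  # predicted by the VRC
tile = 512                   # tile side used for tiling
patch = 16                   # ViT patch size (assumption for illustration)
compaction = 4               # approx. token reduction in the projector

sh = sw = int(h * ratio)                              # 1800 x 1800 after scaling
tiles = math.ceil(sh / tile) * math.ceil(sw / tile)   # 4 x 4 = 16 tiles (with padding)
tokens_per_tile = (tile // patch) ** 2                # 32 x 32 = 1024 patch tokens
llm_tokens = tiles * tokens_per_tile // compaction    # ~4096 visual tokens for the LLM
print(tiles, tokens_per_tile, llm_tokens)             # 16 1024 4096
```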
The secret sauce: Adaptivity + agreement. HyperVL doesn’t waste detail where it’s not needed, and it can switch vision gears without throwing the language brain off balance. Tiling ensures the phone’s memory never panics, keeping latency smooth even for giant screenshots.
04 Experiments & Results
The test: Can a small, phone-friendly model stay accurate on hard visual tasks while being much faster and lighter on memory? The team measured standard benchmarks (reasoning, OCR, documents, charts, hallucination checks), plus real internal tasks (intent suggestion, UI parsing, creative writing, and image re-ranking). They also measured speed and memory on an actual mobile platform and checked how well low-bit quantization works.
The competition: Models around 2–3 billion parameters, including Qwen and InternVL variants and efficient baselines. HyperVL has about 1.8B with the small vision branch and 2.0B with the large one.
The scoreboard (with context):
- Overall OpenCompass average: HyperVL scores about 64.5, which is like getting a solid A- when many classmates (of similar size) get B to B+. The larger HyperVL variant reaches ~66.1.
- Strong suits: OCR, charts, and documents. Scores like ~83.8 on ChartQA and ~91.3 on DocVQA show the model reads and reasons over structured visuals very well (think: invoices, tables, diagrams).
- Reasoning and math: On MathVista, HyperVL reaches around mid-60s, competitive with similar-sized peers and not far behind bigger ones.
- Hallucination checks: It stays steady and comparable to larger models on tests like HallBench and POPE, suggesting it remains careful and grounded.
Ablations (what changed what):
- Dual Consistency Learning (DCL): Adding DCL to the base HyperVL substantially boosts scores on several benchmarks (for example, +5 points on AI2D diagrams, +2.6 on ChartQA, and a big jump on OCRBench). This shows the small branch really learns from the large branch, narrowing the gap.
- Visual Resolution Compressor (VRC): Adds only about 2 ms overhead but cuts visual tokens by roughly 20% on average while keeping about 98.7% of accuracy. Some tasks with dense details (e.g., charts) get little compression; simpler documents get more, showing the compressor adapts based on content.
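Why a ~20% token cut pays off more than it sounds: self-attention cost grows roughly quadratically with sequence length, so trimming tokens shrinks the quadratic part faster than the linear parts. A rough, illustrative calculation (not a measured number):

```python
# Rough effect of cutting visual tokens by ~20% (illustrative, not measured).
keep = 0.80                          # fraction of visual tokens kept on average
attention_cost = keep ** 2           # self-attention is ~quadratic in sequence length
per_token_cost = keep                # MLP/projection work is ~linear in tokens
print(f"attention ~{attention_cost:.2f}x, linear layers ~{per_token_cost:.2f}x")
# -> attention ~0.64x, linear layers ~0.80x of the original compute
```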
On-device measurements (real hardware):
- Speed: About 12.9× faster than a baseline on high resolutions—think moving from a sluggish load to snappy, near-real-time.
- Memory: Peak usage drops by around 6.8Ă—, and, thanks to tiling, stays almost constant even as images get larger (no more scary spikes).
- Why that matters: On phones, predictable latency and memory are key to user experience and battery life.
Quantization (low-bit check):
- With 4-bit weights (W4A16), the model keeps almost all its accuracy; for example, DocVQA drops by only ~0.1 points. That means even tighter memory and bandwidth without meaningful quality loss—perfect for edge devices.
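For readers unfamiliar with the notation, W4A16 means 4-bit weights with 16-bit activations. The paper's exact quantization recipe is not given here; the sketch below shows the common per-output-channel symmetric 4-bit weight quantization that the term generally implies, just to illustrate why the memory footprint shrinks so much.

```python
import torch

def quantize_w4(weight: torch.Tensor):
    """Per-output-channel symmetric 4-bit quantization of a weight matrix.

    weight: (out_features, in_features). Returns int4-range codes and scales.
    """
    qmax = 7                                            # int4 range: [-8, 7]
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(weight / scale), -8, 7)
    return q.to(torch.int8), scale                      # int8 storage of int4 codes

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Quick check of the reconstruction error on a random "layer".
w = torch.randn(256, 512)
q, s = quantize_w4(w)
w_hat = dequantize(q, s)
rel_err = (w - w_hat).norm() / w.norm()
print(f"relative weight error: {rel_err:.3f}")   # roughly 0.1 for random Gaussian weights
```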
Surprising findings:
- The compressor often chooses very different ratios per task: tiny reductions for detail-critical charts, large reductions for redundant documents—confirming that “one resolution doesn’t fit all.”
- Despite being small, HyperVL excels at OCR-heavy and document tasks—exactly the kind that matter on-device (receipts, forms, app screens).
Internal, real-world tasks:
- Intent recognition and recommendation (from screenshots): HyperVL tops or matches larger models—key for proactive helpers.
- UI understanding and structured parsing: Competitive accuracy while staying fast.
- Image–text creative writing: Leads the pack, showing strong alignment between vision and writing style.
- Image re-ranking: Best or near-best precision, which means better search and recommendation results.
05 Discussion & Limitations
Limitations:
- Tiny text at extreme compression can still be lost. The compressor aims to avoid that, but very detail-dense images might need the large branch and near-full resolution.
- The branch choice (small vs large) may not always be perfect without a good policy; wrong picks can trade off speed for accuracy (or vice versa) at the wrong time.
- Training relies on a stronger branch guiding a weaker one; if the teacher has biases or blind spots, the student may inherit them.
- Video or long sequences aren’t the main focus here; further work is needed for smooth multi-frame reasoning.
Required resources:
- Two vision encoders (small and large) plus a shared language model.
- A tiny compressor model (very light) and a projector.
- For best results on-device: quantization support and an inference stack that benefits from tiling and fixed-size buffers (e.g., mobile NPUs).
When not to use:
- Ultra-low-end devices with extremely tight compute/memory may still struggle, even with tiling.
- High-frame-rate video understanding or AR that needs full-resolution streams without latency tolerance.
- Tasks where the tiniest details always matter (e.g., microprint verification) might require always-on maximum resolution and the large branch.
Open questions:
- Can branch selection be fully automated with a learned policy that considers time budgets, battery, and task hints?
- Can we add token sparsity or attention pruning inside the LLM for even bigger gains without losing reasoning quality?
- How can we best extend to video: dynamic per-frame scaling, temporal tiling, and cross-frame consistency?
- Personalization: Can the compressor and switcher adapt to a user’s habits (e.g., they often scan receipts) while keeping privacy fully on-device?
- Fairness and multilingual robustness: How does compression interact with diverse scripts, low-resource languages, and accessibility needs?
06 Conclusion & Future Work
In three sentences: HyperVL makes multimodal AI phone-ready by adapting image resolution, switching between a small and large vision path that share one language brain, and tiling big images to keep memory stable. It delivers near–state-of-the-art accuracy for its size, shines on OCR and documents, and runs fast and light on real devices—with robust performance even under low-bit quantization. The result is a practical recipe for private, low-latency, on-device assistants that truly understand what they see and say.

Main achievement: Showing that adaptivity (resolution + switchable, consistency-trained vision branches) plus tiling can turn a compact MLLM into a high-performing, energy-efficient, and reliable on-device system.

Future directions: Smarter branch policies, token sparsity and attention pruning, strong video extensions, and user-aware personalization to push speed and accuracy even further. The team also highlights expanding to interactive and agentic scenarios on-device.

Why remember this: HyperVL proves we don’t need to choose between accuracy and efficiency on mobile—by being adaptive and consistent, small models can handle big, real-world visual problems quickly and privately.
Practical Applications
- On-device receipt and invoice reading that extracts totals, dates, and vendors privately.
- Smart UI helpers that understand app screens and guide users step-by-step without sending screenshots to the cloud.
- Offline translation and transcription of menus, signs, or labels while traveling.
- Form autofill from photos of IDs, tickets, or utility bills with high accuracy.
- Search and recommendation boosters that re-rank images on-device for better results and privacy.
- Content creation tools that write captions or posts aligned with an uploaded photo’s style and context.
- Accessibility aids that describe screens, read text aloud, and highlight important buttons.
- Personal photo library organization: find receipts, serial numbers, or event flyers from pictures.
- Field work digitization: scan equipment labels, parts lists, or shipment boxes on-site with a phone.
- Education helpers: explain a math diagram or science chart from a photo, step by step.