
Ministral 3

Beginner
Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian et al. Ā· 1/13/2026
arXiv Ā· PDF

Key Summary

  • Ministral 3 is a new family of small-but-mighty AI language models (3B, 8B, 14B) that learn from a larger model using a step-by-step tutoring method called Cascade Distillation.
  • Instead of training from scratch on huge data (15–36 trillion tokens), they reach strong results with about 1–3 trillion tokens by pruning the big model and distilling its knowledge.
  • Each size comes in three flavors: Base (general), Instruct (friendly and helpful at following directions), and Reasoning (better at math, code, and multi-step thinking).
  • All models can understand images through a frozen vision encoder and can read very long texts (up to 256k tokens; 128k for reasoning variants).
  • The training recipe is prune → distill on short contexts → distill on long contexts, then post-train with SFT + ODPO (Instruct) or SFT with Chain-of-Thought + GRPO + ODPO (Reasoning).
  • Careful pruning keeps the most important layers and neurons using activation-based scores, PCA rotations, and gated-MLP importance measures.
  • Ministral 3 beats or matches similarly sized open models on many benchmarks and even rivals a larger 24B teacher while being 40% smaller at the top size.
  • A surprising finding: using a stronger teacher isn’t always better during pretraining, but stronger preference-tuned teachers help more during post-training.
  • The models are open-weight (Apache 2.0), making them practical for resource-limited devices and private deployments.

Why This Research Matters

Smaller, open-weight models that still perform strongly mean more people and organizations can use advanced AI without massive servers. Long-context ability lets these models read entire reports, legal documents, or codebases at once, making research, compliance, and debugging faster. Image understanding enables practical assistants for forms, charts, and screenshots in everyday workflows. The efficient training approach reduces energy and cost, which is better for the environment and for teams with tight budgets. Because the models are open under Apache 2.0, schools, startups, and nonprofits can adapt them privately and responsibly. The methods also set a template for future efficient training, raising the bar for what small models can do.

Detailed Explanation

01Background & Problem Definition

šŸž Hook: You know how a smart student can teach a younger sibling, making the hard homework feel easier? What if big AIs could do that for smaller AIs—so the little ones still learn a lot without needing giant computers?

🄬 The Concept (Big Picture): Ministral 3 is a family of small, efficient AI models that learn from a larger ā€œparentā€ model through a careful, step-by-step tutoring process. How it works: (1) start with a strong big model, (2) trim it down safely so it keeps the good parts, (3) have the little model practice by imitating the big model’s answers, and (4) repeat to make even smaller but still capable models. Why it matters: Before, getting strong AI usually meant huge models trained on enormous data with massive compute; now, you can get similar quality with less cost, memory, and energy.

šŸž Anchor: Imagine a 14B-parameter model that performs close to a 24B teacher, trained on far fewer tokens, and still reads book-length documents—this is what the paper delivers.

The World Before:

  • Big AIs like popular open families (e.g., Llama, Qwen) were trained on vast amounts of text (15–36 trillion tokens). Great results—but expensive.
  • Smaller models trained from scratch often lagged behind or needed heavy data and compute to catch up.
  • Long-context reading (hundreds of pages) and image understanding were usually reserved for big models or required complex, costly setups.

The Problem:

  • How do we get small, dense models that perform well without the giant training bill?
  • How can we keep long-context reading and image abilities while shrinking size?
  • How do we teach smaller models efficiently so they don’t forget important skills when pruned?

Failed Attempts:

  • Train small models from scratch: cheap per-step, but they often end up weak unless trained on massive data.
  • One-shot pruning: cut down a big model all at once; results can drop sharply because important parts get lost.
  • Mixed training objectives: blending ā€œpredict next wordā€ with ā€œcopy the teacherā€ can be tricky and not always better than pure distillation.

The Gap:

  • We needed a practical, repeatable recipe to shrink a strong teacher into smaller students while transferring knowledge smoothly and preserving long-context and vision skills.

šŸž Hook (New Concept – Cascade Distillation): Imagine carving a statue from a big block of marble, not in a single chop, but through careful, guided steps while constantly checking a reference model.

🄬 The Concept (Cascade Distillation): It is a ā€œprune–distill–repeatā€ training strategy that starts from a strong parent model and produces a chain of smaller children, each taught by the parent’s answers (logits). How it works: (1) prune the big model to the target size, (2) distill by having the small one mimic the teacher’s output probabilities, (3) extend its context window, and (4) repeat to reach smaller sizes. Why it matters: Without this, small models either learn too slowly from scratch or lose too much when cut down.

šŸž Anchor: The paper derives 14B → 8B → 3B models, each time pruning first and then distilling from the same 24B teacher, keeping performance high and costs low.

Real Stakes (Why You Should Care):

  • Phones, laptops, and private servers can run smarter models without cloud-scale GPUs.
  • Long documents, logs, legal files, and codebases (up to 256k tokens) can be handled locally.
  • Image understanding unlocks helpful multimodal assistants without massive infrastructure.
  • Open weights (Apache 2.0) mean wider access for schools, startups, and researchers.

šŸž Hook (New Concept – Parameter-efficient Models): You know how a compact car can still take you everywhere while using less gas?

🄬 The Concept (Parameter-efficient Models): These are models designed to do more with fewer parameters and less compute. How it works: prune away low-importance parts, keep the best pathways, and train the student to imitate a strong teacher. Why it matters: Without parameter efficiency, helpful AI stays stuck on big servers, out of reach for everyday devices.

šŸž Anchor: A 3B model that understands images and follows instructions well can fit on modest hardware and still be useful for everyday tasks.

02Core Idea

šŸž Hook: Imagine a relay race where a champion sprinter starts, then passes the baton to a teammate who runs a shorter but well-guided segment, and so on—each runner benefits from the champion’s strategy.

🄬 The Concept (The ā€œAha!ā€ Moment): The key insight is to shrink a strong model step-by-step while it continuously tutors its smaller descendants—prune, then distill, then repeat—so each smaller model keeps surprisingly strong skills. How it works: (1) Carefully score and keep the most important layers and neurons, (2) train the pruned model to match the teacher’s output probabilities (logits), (3) extend context from short to very long windows, and (4) use the resulting Base model as the starting point for instruction-following or reasoning post-training. Why it matters: Without this staircase approach, small models would be either too weak (trained from scratch) or too damaged (aggressively pruned once).

šŸž Anchor: The 14B Base matches much of a 24B parent while being over 40% smaller and trained with fewer tokens.

Multiple Analogies:

  1. Chef and Recipe: A master chef simplifies a complex recipe for home cooks, removing only non-essential steps, then watches them cook and gives feedback—result: a dish that still tastes great in a smaller kitchen.
  2. Backpack Trimming: You only keep the gear that matters for the hike, then practice with a guide’s tips, and later tackle longer trails—light pack, strong performance.
  3. Sculpture: Start with a complete statue, chip away carefully using a scoring tool (importance metrics), and frequently compare with the original—ending with a smaller but still faithful sculpture.

Before vs After:

  • Before: Small models from scratch needed massive data to get decent; pruned models often underperformed; long contexts were rare for small models.
  • After: With Cascade Distillation, small models are trained efficiently, retain much of the big model’s capability, and get 256k-token context windows (128k for reasoning variants), plus image understanding.

Why It Works (Intuition):

  • The teacher’s logits carry soft hints about which answers are likely, not just the single correct next token; copying these probabilities gives the student richer guidance.
  • Pruning guided by activation statistics and PCA rotation keeps the most information-dense directions, so the student starts from a smart initialization rather than random.
  • Extending context after the core skills are learned lets the model generalize to long inputs without forgetting.
  • Post-training aligns the model with human preferences and strengthens step-by-step reasoning without bloating parameters.

Building Blocks (each explained with a sandwich):

šŸž Hook (Layer Pruning): Think about trimming a tree: you keep the strongest branches and remove the weak ones so the tree stays healthy. 🄬 The Concept: Layer pruning removes less important layers using a simple importance score—the ratio of output to input activation norms. How it works: (1) measure each layer’s input and output norms, (2) compute an importance score, (3) keep the top-k layers. Why it matters: Removing the wrong layers tanks performance. šŸž Anchor: They keep only 40, 34, or 26 layers (for 14B/8B/3B), preserving the heavy lifters and dropping weaker ones.

šŸž Hook (Hidden Dimension Pruning with PCA): Imagine compressing a messy closet by rotating and stacking clothes so the most-used outfits stay easy to reach. 🄬 The Concept: PCA finds the most informative directions in activations and rotates the model into a lower-dimensional space. How it works: (1) collect activations across attention and feed-forward norms, (2) run PCA to get a rotation, (3) project to a smaller hidden size. Why it matters: Without a good rotation, you might delete useful directions. šŸž Anchor: One rotation matrix is applied across layers to shrink hidden size while preserving variance.

šŸž Hook (Feedforward Dimension Pruning): Think of a factory line where some stations do more useful work; you keep those and remove the slack. 🄬 The Concept: In gated MLPs (SwiGLU), they score neuron importance using gated activations and keep only top contributors. How it works: (1) compute |silu(W1Ā·x) * W3Ā·x| averaged over a big batch, (2) keep top-k dimensions, (3) prune matching rows/cols. Why it matters: Cutting the wrong neurons ruins the model’s vocabulary and reasoning. šŸž Anchor: This preserves the strongest MLP channels that carry key features.

šŸž Hook (Logit Distillation): You know how copying a teacher’s full grading rubric gives you more clues than just seeing right/wrong answers? 🄬 The Concept: The student learns from the teacher’s probability distribution (logits) over next tokens. How it works: (1) run teacher and student on the same data, (2) minimize forward KL divergence so student matches teacher’s probabilities, (3) repeat over mixed text and multimodal data. Why it matters: Without logit distillation, the student learns slower and overfits to single answers. šŸž Anchor: They found pure distillation (forward KL) outperformed mixing it with next-token prediction.

šŸž Hook (Long Context Extension): Imagine practicing short stories first, then stretching to read entire novels. 🄬 The Concept: After short-context training (~16k), the model is extended to 256k tokens using YaRN and position-based temperature scaling. How it works: (1) master short contexts, (2) apply context-window tricks to stabilize attention over long ranges, (3) train on long-context data. Why it matters: Without this stage, the model can’t reliably handle book-length inputs. šŸž Anchor: Final Base models reach 256k context (Reasoning: 128k) while keeping quality high.

03Methodology

At a high level: Parent Model + Data → Prune (init child) → Distill on Short Context → Distill on Long Context → Base Model → Post-train to Instruct or Reasoning → Released Models.
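
Schematically, the whole cascade is a loop over target sizes. The sketch below is pseudocode dressed as Python: `prune` and `distill` are placeholders named after the stages in the text, not real training APIs, and the strings just trace the flow.

```python
def prune(model, size):
    """Placeholder for layer, hidden-dim, and MLP-channel pruning down to `size`."""
    return f"{model} -> pruned({size})"

def distill(student, teacher, data, ctx):
    """Placeholder for forward-KL logit distillation at context length `ctx`."""
    return f"{student} -> distilled({data}, ctx={ctx})"

def cascade_distillation(teacher, target_sizes):
    """Prune, distill on short context, distill on long context, then repeat."""
    parent = teacher
    base_models = []
    for size in target_sizes:                          # e.g. ["14B", "8B", "3B"]
        student = prune(parent, size)
        student = distill(student, teacher, data="short_ctx", ctx=16_384)
        student = distill(student, teacher, data="long_ctx", ctx=262_144)
        base_models.append(student)
        parent = student        # the next, smaller child is pruned from this one,
                                # while the same strong teacher keeps tutoring it
    return base_models

print(cascade_distillation("MS3.1-24B", ["14B", "8B", "3B"]))
```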

Step-by-step details with purpose and examples:

  1. Pruning to Initialize the Child
  • What happens: Start from Mistral Small 3.1 (24B). Compute layer importance (output/input norm), keep top layers; collect activations and run PCA to rotate and shrink hidden size; compute SwiGLU channel importance and prune MLP dimensions accordingly.
  • Why it exists: Gives the student a ā€œsmartā€ starting point that already encodes useful features, instead of random weights. Without it, the student would need far more data/compute to reach similar skill.
  • Example: Suppose a passage asks, ā€œWhat’s the capital of France?ā€ The pruned model still keeps language and world-knowledge circuits that know ā€œParis,ā€ because the pruning scores preserve high-signal parts.
  2. Short-Context Distillation (ā‰ˆ16k tokens)
  • What happens: Run teacher and student on the same text and multimodal data. Train the student to match the teacher’s logits using forward KL (pure distillation). Vision encoder is frozen; only the projection and LLM weights update.
  • Why it exists: Teaches the student core language/multimodal skills efficiently. Without this, pruned weights might drift or forget.
  • Example: For ā€œSummarize this article,ā€ the student learns to produce fluent, focused summaries that match the teacher’s style and content choices.
  3. Long-Context Distillation (→ 256k tokens)
  • What happens: Extend context from 16,384 to 262,144 using YaRN and position-based temperature scaling. Continue distillation so the student remains aligned while learning long-range attention.
  • Why it exists: Long inputs can cause attention to misbehave; temperature scaling and specialized schedules stabilize training. Without it, the model would stumble on book-length or multi-file code inputs.
  • Example: The model can now read a 500-page policy document or a large codebase index and answer cross-referenced questions in one go.

Secret Sauce: Cascade Distillation

  • The clever bit is repeating prune→distill, then using the up-trained 14B as the starting point to prune again into 8B, then 3B—each time still taught by the same strong parent (MS3.1). This avoids data repetition, stays compute-efficient, and preserves knowledge across size steps.

Post-Training for Instruct Variants

šŸž Hook (SFT – Supervised Fine-Tuning): Think of practicing with answer keys and teacher notes to learn the right tone and format. 🄬 The Concept: SFT teaches the model to follow instructions and tools using high-quality examples and a distillation loss from a stronger teacher (Mistral Medium 3). How it works: (1) fine-tune with curated instruction data (text and multimodal), (2) freeze vision encoder, train projection/LLM, (3) use fp8 and logit distillation for stability. Why it matters: Without SFT, the model may be knowledgeable but not helpful or safe in conversation. šŸž Anchor: After SFT, the model responds politely, follows steps, and uses tools when asked.

šŸž Hook (ODPO – Online Direct Preference Optimization): Imagine asking two friends to write answers and then using a judge to pick which sounds better; you repeat and improve over time. 🄬 The Concept: ODPO aligns the model with human preferences by sampling two candidate responses and ranking them with a Pairwise Reward Model (PWRM). How it works: (1) sample two responses at T=0.7, (2) PWRM outputs a probabilistic preference, (3) use a two-sided loss weighted by win probabilities with temperature calibration and beta-rescaling, (4) treat infinite loops as automatic losers, (5) allow tool execution during generation. Why it matters: Without ODPO, models may be correct but unhelpful, verbose, or prone to odd habits like looping. šŸž Anchor: ODPO boosts alignment benchmarks and reduces bad conversational quirks.

Post-Training for Reasoning Variants

šŸž Hook (Chain-of-Thought – CoT): Like showing your work on math problems so you don’t skip steps. 🄬 The Concept: CoT exposes the model to step-by-step reasoning traces. How it works: (1) fine-tune on clean short and long CoT samples across math, code, dialogue, tools, and vision, (2) filter for quality, (3) add reasoning-specific system prompts. Why it matters: Without CoT, the model may jump to answers and fail on tricky multi-step tasks. šŸž Anchor: The model explains intermediate steps for geometry proofs or coding tasks and gets more right.

šŸž Hook (GRPO – Group Relative Policy Optimization): Think of a class using shared rubrics where several answers are graded together to guide improvement. 🄬 The Concept: GRPO is reinforcement learning with group-based relative feedback. How it works: (1) STEM RL: train on math, code, and visual reasoning with curated problems and clear correctness checks, (2) General RL: use LLM-judge and rubrics (faithfulness, quality) to score rollouts and reward the fraction of satisfied criteria, (3) increase max generation length from 32k to 80k to avoid truncation. Why it matters: Without RL, the model’s reasoning may plateau; with RL, it learns to think longer and more accurately. šŸž Anchor: After GRPO, AIME and coding benchmarks jump significantly; fewer cut-off solutions.

šŸž Hook (ODPO after RL): Like polishing an essay after you’ve solved the hard math—make it clear, concise, and friendly. 🄬 The Concept: Apply ODPO again to align conversation quality while keeping reasoning strength. How it works: (1) strip hidden ā€œthinkingā€ from outputs before scoring, (2) run the same ODPO loop, (3) select checkpoints with best chat-alignment. Why it matters: Reasoning models can be brilliant but not chatty; ODPO makes them pleasant and aligned. šŸž Anchor: The 14B/8B reasoning models get big gains on chat benchmarks post-ODPO.

Architecture and Multimodal Bridge

šŸž Hook (Vision Transformer – ViT): Like using a camera sensor that feeds pictures into your language brain. 🄬 The Concept: A 410M ViT encodes images; it’s frozen, and a learned projection maps vision features into the language model. How it works: (1) keep ViT weights fixed (copied from Pixtral), (2) train a fresh projection per model size, (3) train LLM with interleaved text-image data. Why it matters: Without a good vision bridge, image understanding would lag or require huge compute. šŸž Anchor: The models answer chart questions, read screenshots, or solve math diagrams on benchmarks like MMMU/MathVista.

Other key choices: decoder-only transformer, Grouped Query Attention (32Q/8KV), RoPE positions, SwiGLU activations, RMSNorm, and long-context stabilization via YaRN + position-based softmax temperature scaling. Together they balance speed, memory, and long-input reliability.
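
To make the 32-query/8-KV-head arrangement concrete, here is a minimal grouped-query attention sketch: each group of four query heads shares one key/value head. Dimensions are illustrative, and RoPE, causal masking, and the long-context temperature scaling are omitted.

```python
import math
import torch

def grouped_query_attention(q, k, v, n_q_heads=32, n_kv_heads=8):
    """GQA: n_q_heads query heads share n_kv_heads key/value heads (here 32Q / 8KV).

    q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d).
    """
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # broadcast each KV head to its query group
    v = v.repeat_interleave(group, dim=1)
    d = q.size(-1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return attn @ v

# Toy usage: 128-token sequence, head dimension 64.
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = grouped_query_attention(q, k, v)      # (1, 32, 128, 64)
```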

04Experiments & Results

The Test: The authors evaluated Ministral 3 on a wide set of benchmarks to cover general knowledge (MMLU, MMLU-Redux, ARC-Challenge, RACE, TriviaQA, NaturalQS), math & code (MATH, GPQA Diamond, MBPP), multilingual (European languages, Chinese, Japanese, Korean), and multimodal reasoning (MMMU, MathVista). Post-training evaluations included Arena Hard, WildBench, MM MTBench, AIME 2024/2025, HMMT 2025, PhyBench, and LiveCodeBench.

The Competition: They compared against same-scale open-weight models from Qwen 3 and Gemma 3, and also showed how much capability from the 24B parent (Mistral Small 3.1) is retained by the 14B/8B/3B children.

The Scoreboard (with context):

  • Pretraining (Base models):

    • 14B Base: Outperforms Qwen 3 14B on TriviaQA and MATH and is stronger than Gemma 12B across the board. That’s like getting an A on trivia and math while carrying a lighter backpack than the bigger student.
    • 8B Base: Often beats Gemma 12B despite being smaller, showing strong parameter efficiency. Think: a smaller car winning a fuel economy race and still going fast.
    • 3B Base: Competitive for its size, keeping surprisingly high fractions of the 24B teacher’s capability. Understandably, gaps grow at the tiniest scale, but it holds up well.
    • Across general reasoning and multilingual, performance scales smoothly with size, and multimodal scores remain strong with the frozen ViT + learned projection.
  • Post-training (Instruct):

    • 14B Instruct: Scores around 55 on Arena Hard and 84.9 on MM MTBench, comfortably ahead of Gemma 12B Instruct and the non-thinking Qwen 14B in many cases.
    • 8B Instruct: Competitive with Qwen3-VL-8B Instruct (e.g., 80.8 on MM MTBench vs 80.0) and strong on WildBench.
    • 3B Instruct: Holds its own against 4B/2B baselines, often leading on challenging follow-the-instructions tests.
  • Post-training (Reasoning):

    • AIME 2024: 14B reaches ~89.8% vs Qwen 3 14B ~83.7% (pass@16), a notable jump showing the quality of CoT + GRPO + ODPO.
    • GPQA Diamond: 14B ~71.2% vs Qwen 3 14B ~66.3%, indicating stronger graduate-level QA reasoning.
    • LiveCodeBench v6: Ministral models keep up or surpass counterparts at matched sizes in pass@5, with solid coding performance after RL.

Surprising Findings:

  • Stronger teacher ≠ better pretraining: Distilling from Mistral Small 3.1 (24B) worked better than from a larger ā€œMediumā€ teacher during pretraining, echoing a ā€œcapacity gapā€ idea—students sometimes learn worse if the teacher is too advanced at that stage.
  • But during post-training, a stronger preference-tuned teacher helps: Using Mistral Medium 3.* as the SFT distillation teacher improved alignment and instruction quality.
  • Distilling from post-trained (instruct/reasoning) teachers during pretraining improves STEM and multimodal scores more than using a purely pretrained teacher, with negligible change for knowledge-only tasks.
  • ODPO after RL notably boosts conversational benchmarks for the 14B and 8B reasoning models; the 3B’s public benchmark gains were smaller, though human evals still preferred it.

What the numbers mean in plain terms:

  • The 14B Base getting close to a 24B teacher is like a junior player nearly matching a pro—while being lighter and cheaper to field.
  • The 8B beating a 12B competitor on many tasks shows that smart training beats raw size.
  • Long context + vision on all sizes means these models are practical for real documents, code repositories, and images, even when running on modest hardware.

05Discussion & Limitations

Limitations:

  • Verbosity and reflection: Pushing too much long Chain-of-Thought into Instruct models can cause overthinking, backtracking, and unnatural monologues. The team balanced gains in STEM with the risk of chatty, meandering outputs.
  • 3B sensitivity: The smallest model is more brittle during fine-tuning and RL; it needed extra stabilization (e.g., logit distillation from a smaller teacher) and careful hyperparameters.
  • Reasoning context cap: Reasoning variants use 128k context instead of 256k, which may limit ultra-long proofs or code audits in a single pass.
  • Frozen vision encoder: While efficient, a frozen ViT could cap peak multimodal performance compared to full end-to-end vision-language training.

Required Resources:

  • A good teacher checkpoint (Mistral Small 3.1 Base for pretraining; Medium 3.* for SFT distillation) and enough compute to run iterative prune + distill stages across billions of parameters—still far less than training from scratch on trillions of tokens, but not trivial.
  • High-quality instruction, CoT, and RL datasets; an LLM judge and rubrics for GRPO; and a Pairwise Reward Model for ODPO.

When NOT to Use:

  • If you need state-of-the-art vision performance with end-to-end finetuning of the image tower, a frozen ViT may be limiting.
  • If your app demands super concise replies always, aggressive CoT or reasoning variants may be too verbose without prompt controls.
  • If you have zero access to teacher models or RL infrastructure, you might prefer off-the-shelf instruct models.

Open Questions:

  • Capacity gap mechanics: Why do stronger teachers sometimes hurt pretraining while helping post-training? Is there an optimal teacher ā€œdistanceā€ per student size and stage?
  • Verbosity tuning: How best to gain reasoning accuracy without bloating outputs or inducing loops? Can ODPO or new losses better control length and style?
  • Long-context scaling: Can reasoning variants reach stable 256k contexts without regressions, and can attention temperature schedules be further improved?
  • Data recipes: What’s the best mix of post-trained vs pretrained teachers during different stages to maximize math/code vs knowledge vs multimodal gains?

06Conclusion & Future Work

3-Sentence Summary: Ministral 3 introduces a practical way to build small, strong language models by pruning a big teacher and distilling its knowledge step by step, then extending to very long contexts and adding image understanding. Each size—3B, 8B, and 14B—comes as Base, Instruct, and Reasoning variants, trained efficiently on far fewer tokens than typical large models. The result: open-weight models that run on modest hardware while competing with or beating similar-sized peers across general, STEM, multilingual, and multimodal tasks.

Main Achievement: The paper’s #1 contribution is Cascade Distillation: a ā€œprune → distill → repeatā€ recipe that reliably transfers a large model’s capability down to much smaller children, keeping performance high and costs low, and serving as a solid foundation for instruction-following and reasoning post-training.

Future Directions:

  • Sharpen verbosity control so reasoning gains don’t cause overthinking in general chat.
  • Explore teacher selection policies by stage and size to exploit the capacity gap wisely.
  • Push reasoning variants to full 256k contexts and evaluate broader long-form tasks.
  • Investigate partial vision finetuning or adapters for improved multimodal peaks.

Why Remember This: Ministral 3 shows that careful, staged knowledge transfer lets small models punch above their weight—bringing long-context, multimodal AI within reach for everyday devices and open communities. It reframes the path to capable, efficient LLMs as a guided staircase rather than a cliff climb from scratch.

Practical Applications

  • Run private chatbots on a single GPU or high-end laptop that can read entire PDFs and answer questions.
  • Build on-device assistants that summarize long emails, docs, or logs without sending data to the cloud.
  • Create coding copilots that analyze large repositories and propose fixes while fitting into local developer machines.
  • Deploy visual helpdesks that read screenshots or charts and explain steps to resolve issues.
  • Use reasoning-tuned models for math and science tutoring with step-by-step explanations.
  • Automate policy or contract review by feeding 100k+ token documents and querying cross-references.
  • Enable multimodal RAG systems that combine images (diagrams, forms) with long-text retrieval.
  • Power offline, privacy-preserving note summarizers and meeting analyzers on enterprise laptops.
  • Integrate tool-using assistants (search, calculators, code runners) aligned via ODPO to avoid bad loops.
  • Prototype domain-specific copilots (medical coding, financial analysis) with efficient fine-tuning of Base models.
#Cascade Distillation Ā· #Model pruning Ā· #Logit distillation Ā· #Parameter-efficient LLMs Ā· #Ministral 3 Ā· #Mistral Small 3.1 Ā· #Online Direct Preference Optimization (ODPO) Ā· #Group Relative Policy Optimization (GRPO) Ā· #Chain-of-Thought (CoT) Ā· #Long-context 256k Ā· #YaRN Ā· #Grouped Query Attention (GQA) Ā· #RoPE Ā· #RMSNorm Ā· #SwiGLU Ā· #Vision Transformer