How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (11)

#LPIPS

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Intermediate
Zehong Ma, Ruihan Xu et al. · Feb 2 · arXiv

PixelGen is a new image generator that works directly with pixels and uses what-looks-good-to-people guidance (perceptual loss) to improve quality.

#pixel diffusion · #perceptual loss · #LPIPS

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Intermediate
Letian Zhang, Sucheng Ren et al. · Jan 21 · arXiv

OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).

#Unified Visual Encoder · #VAE · #Vision Transformer

GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Intermediate
Yi-Chuan Huang, Hao-Jen Chien et al. · Dec 31 · arXiv

GaMO is a new way to rebuild 3D scenes from just a few photos by expanding each photo’s edges (outpainting) instead of inventing whole new camera views.

#3D reconstruction · #outpainting · #multi-view diffusion

Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Intermediate
Hau-Shiang Shiu, Chin-Yang Lin et al. · Dec 29 · arXiv

This paper makes diffusion-based video super-resolution (VSR) practical for live, low-latency use by removing the need for future frames and cutting denoising from ~50 steps down to just 4.

#video super-resolution · #diffusion model · #latent diffusion

Robust and Calibrated Detection of Authentic Multimedia Content

Intermediate
Sarim Hashmi, Abdelrahman Elsayed et al. · Dec 17 · arXiv

Deepfakes are getting so good that simple yes/no detectors are failing, especially when attackers add tiny, invisible changes.

#Authenticity Index · #calibrated resynthesis · #reconstruction-free inversion

Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Intermediate
Jialong Zuo, Haoyou Deng et al. · Dec 17 · arXiv

This paper checks whether a popular text-to-image model called Nano Banana Pro can restore degraded photos without any extra training.

#low-level vision · #zero-shot restoration · #generative models

SS4D: Native 4D Generative Model via Structured Spacetime Latents

Intermediate
Zhibing Li, Mengchen Zhang et al. · Dec 16 · arXiv

SS4D is a new AI model that turns a short single-camera video into a full 3D object that moves over time (that’s 4D), and it does this in about 2 minutes.

#4D generation · #structured spacetime latents · #temporal attention

Feedforward 3D Editing via Text-Steerable Image-to-3D

Intermediate
Ziqi Ma, Hongqiao Chen et al. · Dec 15 · arXiv

Steer3D lets you change a 3D object just by typing what you want, like “add a roof rack,” and it does it in one quick pass.

#3D editing · #image-to-3D · #ControlNet

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Intermediate
Peiying Zhang, Nanxuan Zhao et al. · Dec 11 · arXiv

DuetSVG is a new AI that learns to make SVG graphics by generating an image and the matching SVG code together, like sketching first and then tracing neatly.

#DuetSVG · #multimodal generation · #SVG generation

Sharp Monocular View Synthesis in Less Than a Second

Beginner
Lars Mescheder, Wei Dong et al. · Dec 11 · arXiv

SHARP turns a single photo into a 3D scene you can look around in, and it does this in under one second on a single GPU.

#monocular view synthesis · #3D Gaussians · #real-time neural rendering

Relational Visual Similarity

Intermediate
Thao Nguyen, Sicheng Mo et al. · Dec 8 · arXiv

Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.

#relational similarity · #visual analogy · #anonymous captions