Skip to content

Step1X-Edit

Open-source image editing foundation model by StepFun (Shanghai). De facto standard open backbone for instruction-based image editing (2025-2026). Qwen-Image-Edit-2511 by Alibaba/Qwen is a production-tuned variant of this architecture.

Architecture

Input Image → VAE Encode → image latent (z)
Text Instruction → Qwen 2.5 VL → text embeddings (c)
              MMDiT Transformer Blocks
         (joint attention over z and c)
              VAE Decode → edited image

Three core components:

Component Implementation Trainable? Size
Text Encoder [[Qwen 2.5 VL]] (vision-language model) Yes (usually) ~7B params
Transformer [[MMDiT]] with joint attention Yes / LoRA ~20B+ params
VAE Custom autoencoder (RealRestorerAutoencoderKL variant) Usually frozen ~200M params

Total weights: ~40-60 GB in bf16 depending on variant.

Why Qwen VL instead of CLIP

Previous editing models (InstructPix2Pix, MagicBrush) used CLIP as text encoder. CLIP encodes text only — it cannot "see" the input image at encoding time. Qwen 2.5 VL is a full vision-language model: - Processes both the text instruction AND the input image - Understands spatial relationships ("move the cup to the left") - Reasons about what to change vs what to preserve - Generates richer conditional embeddings

This is the key architectural innovation that separates Step1X-Edit generation models from prior approaches.

Scheduler

Uses [[Flow Matching]] instead of DDPM/DDIM. Default inference: 28 steps, guidance_scale 3.0. Faster convergence, more stable at low step counts.

Variants

Model Maintainer License Notes
Step1X-Edit StepFun Apache 2.0 Original
Qwen-Image-Edit-2511 Alibaba/Qwen Apache 2.0 Production variant, DiffSynth-Studio framework
RealRestorerTransformer2DModel RealRestorer team Academic only (weights) Modified for restoration

Inference

from diffsynth import QwenImageEditPlusPipeline
pipe = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2511", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
result = pipe(prompt="change the background to a beach", image=input_image, num_inference_steps=28, guidance_scale=3.0)

VRAM: ~40 GB at 1024x1024 in bf16. With CPU offload — fits on 24 GB but slower. FP8 quantization → ~20-30 GB.

Downstream Models Built on This Architecture

  • [[RealRestorer]] — image restoration (9 degradation types)
  • [[PixelSmile]] — facial expression editing (LoRA rank 64, 850 MB)
  • [[MACRO]] Qwen variant — multi-reference generation (full fine-tune)

Commercial Use

Apache 2.0 — fully permitted for commercial use. Both Step1X-Edit and Qwen-Image-Edit-2511. This makes it the go-to backbone for commercial editing products, unlike models with NC restrictions.

  • Step1X-Edit GitHub: github.com/stepfun-ai/Step1X-Edit
  • Qwen-Image-Edit HF: huggingface.co/Qwen/Qwen-Image-Edit-2511
  • Framework: DiffSynth-Studio (github.com/modelscope/DiffSynth-Studio)