MMDiT (Multi-Modal Diffusion Transformer)

Transformer architecture for diffusion models that processes multiple modalities (text, image) through joint attention in shared transformer blocks. Used in SD3, FLUX, [[Step1X-Edit]], and most modern diffusion models (2024-2026).

Architecture

vs Standard DiT

DiT (Peebles & Xie, 2023): text conditioning via cross-attention (separate Q from image, K/V from text). Text and image tokens live in different attention spaces.

MMDiT: text and image tokens concatenated into a single sequence, processed by the same self-attention layers. Both modalities attend to each other symmetrically.

DiT block:
  image_tokens → self_attn(Q=img, K=img, V=img) → cross_attn(Q=img, K=text, V=text) → FFN

MMDiT block:
  [image_tokens; text_tokens] → self_attn(Q=all, K=all, V=all) → split → img_FFN / txt_FFN
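The two attention patterns above can be sketched in a few lines. This is a minimal illustration with identity projections (no learned QKV weights), purely to show the shape of the information flow:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attn(img, txt):
    # DiT-style: queries from image, keys/values from text only
    a = softmax(img @ txt.T / np.sqrt(img.shape[-1]))
    return a @ txt                            # image tokens updated; text untouched

def joint_attn(img, txt):
    # MMDiT-style: one concatenated sequence, every token attends to every token
    x = np.concatenate([img, txt], axis=0)
    a = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    out = a @ x
    return out[: len(img)], out[len(img):]    # both streams come back updated
```

Note that in `joint_attn` the text rows of the output also change — the asymmetry of cross-attention is gone.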

Key Components per Block

Component | Purpose | LoRA target?
--- | --- | ---
to_q, to_k, to_v | Image-stream QKV projections | Yes
add_q_proj, add_k_proj, add_v_proj | Text-stream QKV projections | Yes
to_out.0 | Image attention output projection | Yes
to_add_out | Text attention output projection | Yes
img_mlp | Image-stream FFN (2 linear layers) | Yes
txt_mlp | Text-stream FFN (2 linear layers) | Yes

Each stream has its own projection weights (QKV, output, and FFN), but all tokens are mixed in a single joint attention operation — this lets the model learn modality-specific transformations while maintaining cross-modal attention.
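The component layout above can be sketched as follows — separate per-stream QKV weights feeding one joint attention, then a split back into per-stream FFNs. Weight names in comments map to the table; the function itself is an illustrative sketch, not any library's API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mmdit_attn(img, txt, img_qkv, txt_qkv):
    """img_qkv / txt_qkv: (Wq, Wk, Wv) tuples — the to_* and add_*_proj weights."""
    qi, ki, vi = (img @ W for W in img_qkv)   # image stream: to_q, to_k, to_v
    qt, kt, vt = (txt @ W for W in txt_qkv)   # text stream: add_q_proj, add_k_proj, add_v_proj
    # one joint attention over the concatenated sequence
    q = np.concatenate([qi, qt])
    k = np.concatenate([ki, kt])
    v = np.concatenate([vi, vt])
    out = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    # split back: each half then goes through its own FFN (img_mlp / txt_mlp)
    return out[: len(img)], out[len(img):]
```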

Attention Pattern

In joint attention, image tokens can attend to text tokens and vice versa. This creates bidirectional information flow:

  • Text informs image generation ("make it red")
  • Image informs text understanding (spatial context)

For editing models like [[Step1X-Edit]], this is critical: the model needs to understand BOTH what the image currently looks like AND what the text instruction asks to change.

LoRA Application Pattern

For fine-tuning MMDiT (as demonstrated by [[PixelSmile]]):

```python
# Standard LoRA targets for MMDiT editing models
target_modules = [
    "to_q", "to_k", "to_v",                    # image attention
    "add_q_proj", "add_k_proj", "add_v_proj",  # text attention
    "to_out.0", "to_add_out",                  # output projections
    "img_mlp.net.0.proj", "img_mlp.net.2",     # image FFN
    "txt_mlp.net.0.proj", "txt_mlp.net.2",     # text FFN
]
# PixelSmile: rank=64, alpha=128, dropout=0 → 850 MB LoRA
```

Targeting all projections plus both FFNs gives maximum expressivity. For lighter adaptation, attention-only (skipping the FFN targets) reduces LoRA size by roughly 40-55%, depending on the model's hidden and FFN dimensions.
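A rough back-of-the-envelope for where the parameters go. The dimensions below are illustrative (hidden size 3072 with a 4x FFN expansion is typical of FLUX-scale MMDiTs, but these are not Step1X-Edit's measured sizes), so the exact saving depends on the model:

```python
def lora_params(d_in, d_out, rank):
    # LoRA adds two low-rank factors: A (d_in x rank) and B (rank x d_out)
    return rank * (d_in + d_out)

# Hypothetical MMDiT block dims: hidden = 3072, FFN inner = 4 * hidden
hidden, ffn, rank = 3072, 12288, 64

attn = 8 * lora_params(hidden, hidden, rank)  # 3+3 QKV projections + 2 output projections
mlp  = 4 * lora_params(hidden, ffn, rank)     # 2 FFN layers x 2 streams (in/out dims symmetric)

print(round(attn / (attn + mlp), 2))          # fraction kept with attention-only targeting
```

With these dimensions the FFN targets are over half the adapter's parameters, which is why dropping them shrinks the LoRA file substantially.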

Models Using MMDiT

Model | Variant | Notes
--- | --- | ---
Stable Diffusion 3 | Original MMDiT | First major adoption
FLUX.1 | Modified MMDiT | Adds RoPE, different conditioning
[[Step1X-Edit]] | MMDiT + Qwen VL encoder | Image editing
[[MACRO]] Bagel variant | MoT (Mixture of Transformers) | Multi-reference, related architecture

Performance Characteristics

  • Quadratic attention: O(n^2) in total sequence length (image + text tokens). At 1024x1024 with the VAE's 8x downscale, that's 128x128 = 16384 latent positions (roughly 4096 tokens after the usual 2x2 patchification) plus ~200 text tokens
  • Flash Attention: critical for practical inference. Most implementations require flash-attn 2.x
  • Memory: dominated by attention maps. Tiling/chunked attention helps for high-res
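To see why flash attention is load-bearing here, consider the cost of materializing the full attention matrix at the sequence lengths above (the head count and fp16 element size are illustrative assumptions):

```python
def attn_matrix_bytes(n_img, n_txt, n_heads=24, bytes_per_el=2):
    # naive attention materializes an (n x n) score matrix per head
    n = n_img + n_txt
    return n_heads * n * n * bytes_per_el

# 1024x1024 image → 16384 latent positions, ~200 text tokens
gib = attn_matrix_bytes(16384, 200) / 2**30
print(f"{gib:.1f} GiB")  # ≈ 12.3 GiB just for the score matrices
```

Flash attention never materializes this matrix, computing the softmax in tiles instead, which is why memory stays linear in sequence length.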

Key Insight

MMDiT's joint attention is what makes instruction-following editing possible at high quality. Cross-attention (DiT-style) creates an information bottleneck — the model can only "ask" the text about specific queries. Joint attention lets the model freely mix both signals, discovering complex relationships between "what is" and "what should be."