MMDiT (Multi-Modal Diffusion Transformer)
Transformer architecture for diffusion models that processes multiple modalities (text, image) through joint attention in shared transformer blocks. Used in SD3, FLUX, [[Step1X-Edit]], and many other recent diffusion models (2024-2026).
Architecture
vs Standard DiT
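DiT (Peebles & Xie, 2023): the original model is class-conditional and injects conditioning mainly through adaptive layer norm (adaLN-Zero). Text-conditioned DiT variants such as PixArt-α add cross-attention (Q from image, K/V from text). The image stream reads from the text tokens, but the text tokens are never updated by the image.
MMDiT: text and image tokens are concatenated into a single sequence and processed by one joint self-attention per block, with each modality keeping its own projection weights. Both modalities attend to each other symmetrically.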
```text
DiT block (cross-attention variant):
image_tokens → self_attn(Q=img, K=img, V=img) → cross_attn(Q=img, K=text, V=text) → FFN
MMDiT block:
[image_tokens; text_tokens] → self_attn(Q=all, K=all, V=all) → split → img_FFN / txt_FFN
```
Key Components per Block
| Component | Purpose | LoRA target? |
|---|---|---|
| to_q, to_k, to_v | Image-stream QKV projections | Yes |
| add_q_proj, add_k_proj, add_v_proj | Text-stream QKV projections | Yes |
| to_out.0 | Image attention output projection | Yes |
| to_add_out | Text attention output projection | Yes |
| img_mlp | Image-stream FFN (2 linear layers) | Yes |
| txt_mlp | Text-stream FFN (2 linear layers) | Yes |
Both streams share one joint attention operation but keep separate projection weights and separate FFN layers. This lets the model learn modality-specific transformations while maintaining cross-modal attention.
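The data flow through one block can be sketched in NumPy. This is a minimal illustration, not any model's actual implementation: it uses a single attention head, random weights, and omits layer norm, timestep modulation, and positional encoding. The weight names follow the table above; the hidden size, token counts, and the choice to place text tokens first in the joint sequence are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size (illustrative assumption)

def linear(d_in, d_out):
    """Random matrix standing in for a trained linear layer (no bias)."""
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Separate weights per stream, keyed by the component names in the table.
W = {
    "to_q": linear(d, d), "to_k": linear(d, d), "to_v": linear(d, d),
    "add_q_proj": linear(d, d), "add_k_proj": linear(d, d), "add_v_proj": linear(d, d),
    "to_out.0": linear(d, d), "to_add_out": linear(d, d),
    "img_mlp": (linear(d, 4 * d), linear(4 * d, d)),
    "txt_mlp": (linear(d, 4 * d), linear(4 * d, d)),
}

def mlp(x, weights):
    w_in, w_out = weights
    return gelu(x @ w_in) @ w_out

def mmdit_block(img, txt):
    n_txt = txt.shape[0]
    # 1. Each stream projects to Q/K/V with its own weights.
    q = np.concatenate([txt @ W["add_q_proj"], img @ W["to_q"]])
    k = np.concatenate([txt @ W["add_k_proj"], img @ W["to_k"]])
    v = np.concatenate([txt @ W["add_v_proj"], img @ W["to_v"]])
    # 2. One attention over the joint sequence: every token sees every token.
    attn = softmax(q @ k.T / np.sqrt(d)) @ v
    # 3. Split back into streams and apply per-stream output projections.
    txt = txt + attn[:n_txt] @ W["to_add_out"]
    img = img + attn[n_txt:] @ W["to_out.0"]
    # 4. Per-stream FFNs with residual connections.
    txt = txt + mlp(txt, W["txt_mlp"])
    img = img + mlp(img, W["img_mlp"])
    return img, txt

img = rng.standard_normal((6, d))  # 6 image tokens
txt = rng.standard_normal((3, d))  # 3 text tokens
img_out, txt_out = mmdit_block(img, txt)
print(img_out.shape, txt_out.shape)  # (6, 16) (3, 16)
```

Only step 2 mixes the modalities; every learned weight belongs to exactly one stream.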
Attention Pattern
In joint attention, image tokens can attend to text tokens and vice versa. This creates bidirectional information flow:
- Text informs image generation ("make it red")
- Image informs text understanding (spatial context)
For editing models like [[Step1X-Edit]], this is critical: the model needs to understand BOTH what the image currently looks like AND what the text instruction asks to change.
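The bidirectional flow shows up in the shape of the attention weights. The sketch below compares a joint attention matrix with a cross-attention matrix on toy data. Projections are omitted for brevity, and the token counts and hidden size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_img, n_txt = 8, 4, 2  # illustrative sizes

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

img = rng.standard_normal((n_img, d))
txt = rng.standard_normal((n_txt, d))

# Joint attention: every token queries every token.
seq = np.concatenate([img, txt])
joint = softmax(seq @ seq.T / np.sqrt(d))  # shape (6, 6)

img_to_img = joint[:n_img, :n_img]
img_to_txt = joint[:n_img, n_img:]  # text informs image
txt_to_img = joint[n_img:, :n_img]  # image informs text
txt_to_txt = joint[n_img:, n_img:]

# Cross-attention: image queries over text keys only.
cross = softmax(img @ txt.T / np.sqrt(d))  # shape (4, 2)

print(joint.shape, cross.shape)  # (6, 6) (4, 2)
print(txt_to_img.shape)          # (2, 4): this quadrant has no cross-attention counterpart
```

The cross-attention matrix corresponds only to the `img_to_txt` direction. The `txt_to_img` quadrant is what allows text representations to be updated by image content.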
LoRA Application Pattern
For fine-tuning MMDiT (as demonstrated by [[PixelSmile]]):
```python
# Standard LoRA targets for MMDiT editing models
target_modules = [
    "to_q", "to_k", "to_v",                     # image attention
    "add_q_proj", "add_k_proj", "add_v_proj",   # text attention
    "to_out.0", "to_add_out",                   # output projections
    "img_mlp.net.0.proj", "img_mlp.net.2",      # image FFN
    "txt_mlp.net.0.proj", "txt_mlp.net.2",      # text FFN
]
# PixelSmile: rank=64, alpha=128, dropout=0 → 850 MB LoRA
```
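Targeting all projections + both FFNs gives maximum expressivity. For lighter adaptation, attention-only targeting (skipping the FFNs) shrinks the LoRA considerably. The exact saving depends on the model's dimensions: with square attention projections and a 4x FFN expansion, the FFN adapters hold about 55% of the LoRA parameters.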
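LoRA size can be estimated from first principles: a rank-r adapter on a d_in → d_out linear layer adds r × (d_in + d_out) parameters. The sketch below applies this to the target list above. The hidden size, block count, FFN expansion, and storage precision are assumptions chosen for illustration, not values stated in the source. Under them the total happens to land close to the 850 MB quoted for PixelSmile.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed model dimensions (hypothetical, for illustration only).
hidden = 3072
ffn_mult = 4
num_blocks = 60
rank = 64
bytes_per_param = 2  # bf16 / fp16 storage

ffn = hidden * ffn_mult

# 8 square projections per block: to_q/k/v, add_q/k/v_proj, to_out.0, to_add_out.
attn_per_block = 8 * lora_params(hidden, hidden, rank)
# 2 streams, each with an up-projection and a down-projection.
ffn_per_block = 2 * (lora_params(hidden, ffn, rank) + lora_params(ffn, hidden, rank))

full = num_blocks * (attn_per_block + ffn_per_block)
attn_only = num_blocks * attn_per_block

print(f"full:           {full:,} params, {full * bytes_per_param / 1e6:.0f} MB")
# full:           424,673,280 params, 849 MB
print(f"attention-only: {attn_only:,} params, {attn_only * bytes_per_param / 1e6:.0f} MB")
# attention-only: 188,743,680 params, 377 MB
print(f"saving from skipping FFNs: {1 - attn_only / full:.0%}")
# saving from skipping FFNs: 56%
```

The saving ratio does not depend on rank, hidden size, or block count in this setup. It is fixed by the FFN expansion factor, so a model with a different FFN width would give a different percentage.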
Models Using MMDiT
| Model | Variant | Notes |
|---|---|---|
| Stable Diffusion 3 | Original MMDiT | First major adoption |
| FLUX.1 | Modified MMDiT | Adds RoPE, different conditioning |
| [[Step1X-Edit]] | MMDiT + Qwen VL encoder | Image editing |
| [[MACRO]] Bagel variant | MoT (Mixture of Transformers) | Multi-reference, related architecture |
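Performance Characteristics
- Quadratic attention: O(n^2) in total sequence length (image + text tokens). At 1024x1024 with VAE 8x downscale, the latent is 128x128 = 16384 positions; with the 2x2 patching used by SD3 and FLUX this becomes 4096 image tokens, plus ~200 text tokens
- Flash Attention: critical for practical inference. Implementations typically rely on a fused attention kernel, such as PyTorch's scaled_dot_product_attention or flash-attn 2.x
- Memory: dominated by attention maps when they are materialized; fused kernels avoid storing the full n x n matrix. Tiling/chunked attention helps for high-res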
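The sequence-length arithmetic can be checked with a few lines of Python. This sketch estimates the token count and the memory needed if the attention scores were fully materialized for one layer. The defaults are assumptions: `patch_size=2` reflects SD3/FLUX-style 2x2 patching (set it to 1 to treat every latent position as a token), while `text_tokens=200`, `num_heads=24`, and 2-byte elements are illustrative values.

```python
def attention_cost(height, width, vae_downscale=8, patch_size=2,
                   text_tokens=200, num_heads=24, bytes_per_elem=2):
    """Return (image_tokens, seq_len, bytes for materialized scores in one layer)."""
    latent_h = height // vae_downscale
    latent_w = width // vae_downscale
    image_tokens = (latent_h // patch_size) * (latent_w // patch_size)
    seq_len = image_tokens + text_tokens
    # One (seq_len x seq_len) score matrix per head.
    score_bytes = num_heads * seq_len ** 2 * bytes_per_elem
    return image_tokens, seq_len, score_bytes

for side in (512, 1024, 2048):
    tokens, seq_len, score_bytes = attention_cost(side, side)
    print(f"{side}x{side}: {tokens} image tokens, n={seq_len}, "
          f"scores ~ {score_bytes / 1e9:.2f} GB per layer")
# 512x512: 1024 image tokens, n=1224, scores ~ 0.07 GB per layer
# 1024x1024: 4096 image tokens, n=4296, scores ~ 0.89 GB per layer
# 2048x2048: 16384 image tokens, n=16584, scores ~ 13.20 GB per layer
```

Doubling the image side roughly quadruples the token count and multiplies the score matrix by about 16, which is why fused kernels that never store the full matrix matter at high resolution.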
Key Insight
MMDiT's joint attention is a key enabler of high-quality instruction-following editing. Cross-attention creates an information bottleneck: text representations stay fixed, and the image stream can only read from them through its own queries. Joint attention updates both streams in every block, so the model can freely mix the two signals and discover complex relationships between "what is" and "what should be."