MMDiT (Multi-Modal Diffusion Transformer)

Transformer architecture for diffusion models that processes multiple modalities (text, image) through joint attention in shared transformer blocks. Used in SD3, FLUX, [[Step1X-Edit]], and many other recent diffusion models (2024-2026).

Architecture

vs Standard DiT

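DiT-style text conditioning: text is injected via cross-attention (Q from image tokens, K/V from text tokens), so text and image tokens never share one attention sequence. The original DiT (Peebles & Xie, 2023) was class-conditional and used adaLN-Zero; cross-attention is the usual choice in text-conditioned DiT variants such as PixArt-α.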

MMDiT: text and image tokens concatenated into a single sequence, processed by the same self-attention layers. Both modalities attend to each other symmetrically.

DiT block:
  image_tokens → self_attn(Q=img, K=img, V=img) → cross_attn(Q=img, K=text, V=text) → FFN

MMDiT block:
  [image_tokens; text_tokens] → self_attn(Q=all, K=all, V=all) → split → img_FFN / txt_FFN

Key Components per Block

| Component | Purpose | LoRA target? |
| --- | --- | --- |
| `to_q`, `to_k`, `to_v` | Image-stream QKV projections | Yes |
| `add_q_proj`, `add_k_proj`, `add_v_proj` | Text-stream QKV projections | Yes |
| `to_out.0` | Image attention output projection | Yes |
| `to_add_out` | Text attention output projection | Yes |
| `img_mlp` | Image-stream FFN (2 linear layers) | Yes |
| `txt_mlp` | Text-stream FFN (2 linear layers) | Yes |

Both streams share a single attention operation over the concatenated sequence, but each keeps its own QKV and output projections and its own FFN. This lets the model learn modality-specific transformations while maintaining cross-modal attention.
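The data flow described above can be sketched in NumPy. This is a minimal illustration, not any model's actual implementation: it is single-head, uses ReLU in the FFN, and omits normalization and timestep modulation. The weight-dictionary keys mirror the component names in the table; the `_in`/`_out` FFN keys, the toy sizes, and the helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear(rng, d_in, d_out):
    # Random weight matrix standing in for a trained linear layer.
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

def mmdit_block(img, txt, w):
    """Simplified dual-stream block. img: (n_img, d), txt: (n_txt, d)."""
    n_img = img.shape[0]
    # Each stream has its own Q/K/V projection weights.
    q = np.concatenate([img @ w["to_q"], txt @ w["add_q_proj"]])
    k = np.concatenate([img @ w["to_k"], txt @ w["add_k_proj"]])
    v = np.concatenate([img @ w["to_v"], txt @ w["add_v_proj"]])
    # One attention operation over the joint [image; text] sequence.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    # Split back into streams, each with its own output projection.
    img = img + attn[:n_img] @ w["to_out"]
    txt = txt + attn[n_img:] @ w["to_add_out"]
    # Separate two-layer FFN per stream.
    img = img + np.maximum(img @ w["img_mlp_in"], 0) @ w["img_mlp_out"]
    txt = txt + np.maximum(txt @ w["txt_mlp_in"], 0) @ w["txt_mlp_out"]
    return img, txt

rng = np.random.default_rng(0)
d = 16  # toy hidden size
square = ["to_q", "to_k", "to_v", "add_q_proj", "add_k_proj",
          "add_v_proj", "to_out", "to_add_out"]
w = {name: linear(rng, d, d) for name in square}
for stream in ("img", "txt"):
    w[f"{stream}_mlp_in"] = linear(rng, d, 4 * d)
    w[f"{stream}_mlp_out"] = linear(rng, 4 * d, d)

img_in = rng.standard_normal((8, d))  # 8 image tokens
txt_in = rng.standard_normal((3, d))  # 3 text tokens
img_out, txt_out = mmdit_block(img_in, txt_in, w)
print(img_out.shape, txt_out.shape)  # (8, 16) (3, 16)
```

Only the attention step mixes the two streams; everything before and after it is per-modality.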

Attention Pattern

In joint attention, image tokens can attend to text tokens and vice versa. This creates bidirectional information flow:

- Text informs image generation ("make it red")
- Image informs text understanding (spatial context)

For editing models like [[Step1X-Edit]], this is critical: the model needs to understand BOTH what the image currently looks like AND what the text instruction asks to change.
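One way to see the bidirectional flow is to slice a joint attention matrix into its four quadrants. The sketch below uses random toy queries and keys (the sizes are arbitrary assumptions); a cross-attention design would only have the `img->txt` quadrant.

```python
import numpy as np

rng = np.random.default_rng(0)
n_img, n_txt, d = 6, 3, 8  # toy sizes, chosen for illustration
q = rng.standard_normal((n_img + n_txt, d))
k = rng.standard_normal((n_img + n_txt, d))

# Softmax over the joint sequence: each row sums to 1.
scores = q @ k.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# Rows are queries, columns are keys; image tokens come first.
quadrants = {
    "img->img": weights[:n_img, :n_img],
    "img->txt": weights[:n_img, n_img:],  # text informs image
    "txt->img": weights[n_img:, :n_img],  # image informs text
    "txt->txt": weights[n_img:, n_img:],
}
for name, block in quadrants.items():
    # Mean attention mass each query row spends in this quadrant.
    print(name, block.shape, round(float(block.sum(axis=-1).mean()), 3))
```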

LoRA Application Pattern

For fine-tuning MMDiT (as demonstrated by [[PixelSmile]]):

# Standard LoRA targets for MMDiT editing models
target_modules = [
    "to_q", "to_k", "to_v",           # image attention
    "add_q_proj", "add_k_proj", "add_v_proj",  # text attention
    "to_out.0", "to_add_out",          # output projections
    "img_mlp.net.0.proj", "img_mlp.net.2",     # image FFN
    "txt_mlp.net.0.proj", "txt_mlp.net.2",     # text FFN
]
# PixelSmile: rank=64, alpha=128, dropout=0 → 850 MB LoRA
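Target names like these are typically resolved by suffix-matching against the model's full module paths, which is how PEFT-style libraries commonly interpret a list of targets. The sketch below assumes that matching rule; the module paths are hypothetical examples, and real paths depend on the implementation.

```python
# Hypothetical module paths, for illustration only.
module_names = [
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.attn.add_q_proj",
    "transformer_blocks.0.attn.to_out.0",
    "transformer_blocks.0.attn.to_add_out",
    "transformer_blocks.0.img_mlp.net.0.proj",
    "transformer_blocks.0.img_mlp.net.2",
    "transformer_blocks.0.txt_mlp.net.2",
    "transformer_blocks.0.attn.norm_q",
]

attention_only = [
    "to_q", "to_k", "to_v",
    "add_q_proj", "add_k_proj", "add_v_proj",
    "to_out.0", "to_add_out",
]

def select_targets(names, targets):
    # A module is selected if its path equals a target or ends with ".<target>".
    return [n for n in names
            if any(n == t or n.endswith("." + t) for t in targets)]

selected = select_targets(module_names, attention_only)
for name in selected:
    print(name)
```

Under this rule `to_q` does not match `add_q_proj` or `norm_q`, because the match is on whole dotted path components. This is why the image-stream and text-stream projections have to be listed separately.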

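Targeting all projections + both FFNs gives maximum expressivity. For lighter adaptation, attention-only (skip FFN) reduces LoRA size by roughly half: about 55% when the FFN uses a 4x expansion, since the FFN adapters then hold 20 of every 36 adapter parameters per block.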
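A back-of-the-envelope estimate of LoRA size follows from the adapter shapes: a rank-r adapter on a `d_in -> d_out` linear layer adds `r * (d_in + d_out)` parameters. The hidden size (3072), block count (60), 4x FFN expansion, and 2-byte (bf16) storage below are assumptions for illustration, not values stated in this note. The exact saving from dropping the FFN depends on the FFN expansion ratio.

```python
def lora_params(d_in, d_out, rank):
    # LoRA adds A (rank x d_in) and B (d_out x rank) per adapted layer.
    return rank * (d_in + d_out)

def lora_size_mb(hidden, n_blocks, rank, ffn_mult=4,
                 bytes_per_param=2, include_ffn=True):
    # 6 QKV projections + 2 output projections, all hidden -> hidden.
    attn = 8 * lora_params(hidden, hidden, rank)
    ffn = 0
    if include_ffn:
        up = lora_params(hidden, ffn_mult * hidden, rank)
        down = lora_params(ffn_mult * hidden, hidden, rank)
        ffn = 2 * (up + down)  # image FFN + text FFN
    return n_blocks * (attn + ffn) * bytes_per_param / 1e6

# Assumed configuration (hypothetical, see lead-in).
full = lora_size_mb(hidden=3072, n_blocks=60, rank=64)
attn_only = lora_size_mb(hidden=3072, n_blocks=60, rank=64, include_ffn=False)
print(f"{full:.0f} MB, {attn_only:.0f} MB, saving {1 - attn_only / full:.0%}")
# prints: 849 MB, 377 MB, saving 56%
```

Under these assumptions the full-target estimate lands close to the ~850 MB quoted for PixelSmile at rank 64.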

Models Using MMDiT

| Model | Variant | Notes |
| --- | --- | --- |
| Stable Diffusion 3 | Original MMDiT | First major adoption |
| FLUX.1 | Modified MMDiT | Adds RoPE, different conditioning |
| [[Step1X-Edit]] | MMDiT + Qwen VL encoder | Image editing |
| [[MACRO]] Bagel variant | MoT (Mixture of Transformers) | Multi-reference, related architecture |

Performance Characteristics

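- Quadratic attention: cost is O(n^2) in total sequence length (image + text tokens). At 1024x1024 with an 8x VAE downscale the latent grid is 128x128 = 16384 positions; with the 2x2 patchification used by SD3 and FLUX this becomes 64x64 = 4096 image tokens, plus ~200 text tokens.
- Flash Attention: critical for practical inference. Implementations typically rely on a fused kernel, either flash-attn 2.x or PyTorch's `scaled_dot_product_attention`.
- Memory: without a fused kernel, dominated by the materialized n x n attention maps. Tiling/chunked attention helps at high resolution.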
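The sequence-length arithmetic can be checked with a short sketch. The 2x2 patch size and 24 attention heads are assumptions for illustration; `patch_size=1` corresponds to counting one token per latent position.

```python
def image_tokens(height, width, vae_downscale=8, patch_size=2):
    # Latent grid after the VAE, then grouped into patch_size x patch_size patches.
    lat_h, lat_w = height // vae_downscale, width // vae_downscale
    return (lat_h // patch_size) * (lat_w // patch_size)

def attention_matrix_gb(n_tokens, n_heads, bytes_per_el=2):
    # Size of fully materialized attention maps for one layer, batch size 1.
    return n_heads * n_tokens ** 2 * bytes_per_el / 1e9

n_text = 200  # approximate text length used in this note
for patch in (1, 2):
    n = image_tokens(1024, 1024, patch_size=patch) + n_text
    gb = attention_matrix_gb(n, n_heads=24)  # 24 heads is an assumption
    print(f"patch={patch}: {n} tokens, ~{gb:.2f} GB per layer if materialized")
```

Because the cost is quadratic, doubling the image side length (4x the tokens) multiplies the attention-map size by roughly 16. This is why fused kernels that never materialize the full map matter at high resolution.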

Key Insight

MMDiT's joint attention is a large part of what makes high-quality instruction-following editing possible. Cross-attention (DiT-style) creates an information bottleneck: image tokens can query the text, but the text representation is never updated by the image. Joint attention lets the model freely mix both signals, discovering complex relationships between "what is" and "what should be."