FLUX Kontext¶

Image editing model from Black Forest Labs (BFL). Extends FLUX.1 architecture with sequence concatenation for context image conditioning. Best-in-class for text editing, character consistency, and multi-turn editing.

Paper: arXiv:2506.15742 (June 2025).

Architecture¶

Same 12B DiT as FLUX.1-dev: - 19 double-stream blocks (separate image/text weights, joint attention) - 38 single-stream blocks (fused FFN) - Hidden: 3072, heads: 24, 16-channel latent VAE - Text: CLIP (768d) + T5 (4096d)

Key Innovation: Sequence Concatenation¶

Context images encoded by FLUX VAE → latent tokens appended to target image tokens in visual stream:

Target tokens:  position (0, h, w)
Context image:  position (1, h, w)   ← "virtual timestep" via 3D RoPE
Context image 2: position (2, h, w)  ← architecturally supported, not yet released

Channel concatenation was tested and performed worse. Sequence concat preserves independent resolution/aspect ratio for input vs output.

When context is empty → falls back to pure text-to-image.

vs [[Step1X-Edit]]¶

Aspect	FLUX Kontext	Step1X-Edit / Qwen-Edit
Base	FLUX.1 12B DiT	Custom MMDiT
Text encoder	CLIP + T5	Qwen 2.5 VL (vision-language)
Conditioning	Sequence concat (latent tokens)	Joint attention (image + text)
Mask support	No explicit mask (text-driven)	DiffSynth pipeline supports masks
Speed	3-5s at 1024×1024	Slower (~40 GB VRAM)

Variants¶

Variant	Access	License
Kontext [dev]	Open weights (HF)	Non-commercial (BFL dev license)
Kontext [pro]	API only	Commercial via BFL
Kontext [max]	API only	Commercial via BFL

Dev is I2I only (no T2I). T2I only in pro/max.

Training¶

Starts from FLUX.1 T2I checkpoint
Joint fine-tune on I2I + T2I with rectified flow loss
Data: millions of relational pairs (not disclosed)
FSDP2, Flash Attention 3, selective activation checkpointing

VRAM¶

~24 GB minimum (transformer in bf16). Full pipeline ~32-40 GB. FP8/FP4 quantization available for 24 GB GPUs.

Gotchas¶

Modifies entire image, not just edited region. Use flux-kontext-diff-merge node for selective merging via LAB color space detection + Poisson blending
Multi-turn editing degrades after ~6 iterations
Dev model has distillation artifacts
Non-commercial license for open weights — commercial requires BFL licensing
[[ACE++]] development suspended on FLUX base due to training instability

Key Links¶

GitHub: github.com/black-forest-labs/flux
HF: huggingface.co/black-forest-labs/FLUX.1-Kontext-dev
Diff-merge: github.com/safzanpirani/flux-kontext-diff-merge