Calligrapher (Freestyle Text Image Customization)
Text generation and editing on images with a style reference. Built on [[FLUX Kontext|FLUX.1-Fill-dev]] plus a SigLIP style encoder. Takes a font/style sample from a reference image and generates new text in the same style.
Paper: arXiv:2506.24123 (June 2025).
Architecture
Style reference image → SigLIP (ViT, siglip-so400m-patch14-384)
→ Qformer (learnable queries)
→ Linear projection → K_E, V_E
↓
Input image + mask → FLUX.1-Fill-dev denoising ← cross-attention:
Q from denoiser, K/V from style encoder
↓
Output with styled text
Style injection: the encoder output replaces the K and V matrices in the style attention module (replacement, not concatenation).
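The replacement-style injection can be illustrated with a small, stdlib-only sketch. This is not the Calligrapher implementation: the function names, the single-head layout, and the random weights are assumptions for illustration. It only shows that Q is computed from denoiser tokens while K and V are computed solely from style tokens, so the attention weights span style tokens only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def style_attention(hidden, style_tokens, w_q, w_k, w_v):
    """Single-head attention with K and V taken only from style tokens.

    hidden:       [n_img, d] denoiser tokens (source of Q)
    style_tokens: [n_sty, d] style encoder output after projection
    Returns (output [n_img, d], weights [n_img, n_sty]).
    """
    q = matmul(hidden, w_q)          # Q from the denoiser
    k = matmul(style_tokens, w_k)    # K_E from the style encoder (replaces image K)
    v = matmul(style_tokens, w_v)    # V_E from the style encoder (replaces image V)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical helper: Gaussian matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d = 4
    hidden = random_matrix(6, d, rng)   # 6 denoiser tokens
    style = random_matrix(3, d, rng)    # 3 style tokens
    w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
    out, weights = style_attention(hidden, style, w_q, w_k, w_v)
    print(len(out), len(out[0]), len(weights[0]))
```

With concatenation, the weight rows would have length n_img + n_sty; here they have length n_sty, which is what "replacement" means in this sketch.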
Self-Distillation Training
No manual annotation needed:
1. LLM generates prompts with typographic style descriptors
2. Pretrained FLUX synthesizes stylized text images
3. Neural text detection locates text regions
4. Strategic cropping → style reference + transfer target
5. Model trains on self-generated pairs
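The cropping step can be sketched as follows. The exact cropping strategy is not described here, so this is one hypothetical scheme: the style reference is a padded crop around a detected text box, and the transfer target is the full image with a binary inpainting mask over that same box. `Box`, `clamp_box`, `make_training_pair`, and the `pad` parameter are illustrative names, not part of the Calligrapher code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def clamp_box(box, img_w, img_h):
    """Clip a box so it lies inside an img_w x img_h image."""
    return Box(max(0, box.left), max(0, box.top),
               min(img_w, box.right), min(img_h, box.bottom))


def make_training_pair(image_size, text_box, pad=8):
    """Derive (reference crop box, target box, mask) from one detected text box.

    Assumed scheme: the reference is the text box grown by `pad` pixels,
    and the mask marks the text box itself as the region to regenerate.
    """
    img_w, img_h = image_size
    target_box = clamp_box(text_box, img_w, img_h)
    grown = Box(text_box.left - pad, text_box.top - pad,
                text_box.right + pad, text_box.bottom + pad)
    ref_box = clamp_box(grown, img_w, img_h)
    # Binary mask as rows of 0/1: 1 inside the target box, 0 elsewhere.
    mask = [[1 if (target_box.left <= x < target_box.right
                   and target_box.top <= y < target_box.bottom) else 0
             for x in range(img_w)]
            for y in range(img_h)]
    return ref_box, target_box, mask


if __name__ == "__main__":
    ref_box, target_box, mask = make_training_pair((64, 32), Box(10, 4, 50, 20))
    print(ref_box)
    print(sum(map(sum, mask)), "masked pixels")
```

In a real pipeline the text box would come from the neural text detector in step 3, and the crops would be taken from the FLUX-synthesized image from step 2.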
Modes
- Self-reference: change text content, preserve original style
- Cross-reference: apply style from different image
- Non-text reference: transfer from arbitrary images (fire, water, etc.)
- Multilingual: Chinese, Korean, Japanese via TextFLUX
Results
FID: 38.09 vs 66-70 (baselines). OCR accuracy: 0.84 vs 0.45-0.81. 72% user preference.
VRAM / Speed
~4 s per image on an A6000 at 10 steps. Recommended resolution: 512px (the training resolution). 768px is acceptable; higher resolutions tend to produce spelling errors.
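A small helper for picking an input size in line with the guidance above might look like this. It is a sketch with two assumptions: snapping dimensions to a multiple of 16 (assumed here to suit FLUX latent packing) and treating everything up to 512px as "recommended". The function names are hypothetical.

```python
def fit_resolution(width, height, target_long_side=512, multiple=16):
    """Scale (width, height) so the longer side is near target_long_side.

    Both sides are snapped to a multiple of `multiple` (assumed requirement).
    """
    scale = target_long_side / max(width, height)

    def snap(value):
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(width), snap(height)


def resolution_note(long_side):
    """Map a long-side length in pixels to the guidance given in the text."""
    if long_side <= 512:
        return "recommended"
    if long_side <= 768:
        return "acceptable"
    return "risk of spelling errors"


def images_per_hour(seconds_per_image=4.0):
    """Rough throughput from the ~4 s per image figure (A6000, 10 steps)."""
    return 3600.0 / seconds_per_image


if __name__ == "__main__":
    print(fit_resolution(1920, 1080))
    print(resolution_note(1024))
    print(images_per_hour())
```

The throughput figure ignores model loading and I/O, so it is an upper bound for a single GPU rather than a measured number.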
License
Inherits FLUX.1-Fill-dev Non-Commercial License. Outputs can be used commercially.
Key Links
- GitHub: github.com/Calligrapher2025/Calligrapher
- HF: huggingface.co/Calligrapher2025/Calligrapher