SANA-Denoiser - Our Architecture Design¶

Repurposing [[SANA]] 1.6B DiT as an image restoration model. Combines efficient linear attention with [[Paired Training for Restoration]] and [[Temporal Tiling]] via [[Block Causal Linear Attention]].

Why SANA for Restoration¶

Property	SANA 1.6B	Step1X-Edit (RealRestorer)	FLUX-dev
Params	1.6B	~15B	12B
Attention	Linear O(N)	Quadratic O(N^2)	Quadratic O(N^2)
VAE compression	32x ([[DC-AE]])	8x	8x
Tokens at 1024px	1024	16384	4096
Tokens at 4K	16384	262144 (!)	65536
Speed (1024px)	1.2s	~15s	23s

SANA is 10x smaller, 4x fewer tokens, linear complexity. For restoration where we need high-res processing, this is decisive.

Architecture Changes (Minimal)¶

1. Input Conditioning: Channel Concat¶

degraded → DC-AE.encode → condition_latents [B, 32, H, W]
target   → DC-AE.encode → latents           [B, 32, H, W]

x_t = (1-σ)*noise + σ*latents               [B, 32, H, W]
model_input = concat([x_t, condition_latents], dim=1)  [B, 64, H, W]

projection = Conv2d(64, 32, 1)               # 1x1 conv, ~1K params
# Identity init for noise channels, zero init for condition channels
# At step 0: model = pretrained T2I behavior
# Condition signal learned gradually during fine-tuning

model(projection(model_input), timestep, text_embeddings)

Total new parameters: 1,024 (32 x 32 x 1 x 1 conv kernel). Compare: ControlNet = ~800M.

2. Text Conditioning for Degradation Type¶

Prompt describes what to restore: - "Remove gaussian noise, restore sharp details" - "Remove JPEG compression artifacts" - "Enhance this low-light image" - "Clean and restore this image"

Leverages SANA's Gemma-2-2B text encoder for degradation-type understanding.

3. Temporal Tiling for High-Resolution¶

For images > training resolution (e.g., 4K product photos):

4096x4096 image
  ↓ split into overlapping 1024px tiles (raster scan)
  ↓ each tile: DC-AE encode → 32x32x32 latent
  ↓ denoise with Block Causal Linear Attention
  ↓    (running sum S, Z from previous tiles = global context)
  ↓ stitch latents with linear blending in overlap
  ↓ DC-AE decode full stitched latent
4096x4096 restored image

Memory: constant O(D^2) cache + one tile latent. Processes any resolution.

Training Strategy¶

Phase 1: LoRA (fast iteration)¶

Rank 32, target: attn.to_q/k/v/out + input projection conv
512px, 10K steps, DIV2K + Flickr2K synthetic degradation
Evaluate: does it learn to denoise at all?

Phase 2: Full Fine-Tune (if LoRA insufficient)¶

Unfreeze all transformer params + projection
VAE stays frozen
Gradient checkpointing for memory
Curriculum: 512px → 1024px

Phase 3: Temporal Tiling (inference-only first)¶

No retraining needed - causal attention is native to linear attention
Just implement the tile loop + S, Z accumulation
If quality insufficient: fine-tune with multi-tile samples

Dataset¶

Source: DIV2K (800) + Flickr2K (2650) = 3450 clean images Degradation: 5-8 variants per image = 17K-28K pairs

Degradation	Params	Prompt
Gaussian noise	σ=10,15,25,35,50	"Remove gaussian noise sigma {σ}"
JPEG	q=15,25,40	"Remove JPEG artifacts quality {q}"
Blur	k=3,5,7,9	"Remove blur, restore sharpness"
Downscale	2x,3x,4x	"Upscale and restore details"
Combined	2-3 random	"Restore this degraded image"

Evaluation Targets¶

Benchmark	Metric	Target	SOTA Reference
SIDD val	PSNR	> 38 dB	NAFNet: 40.3
SIDD val	SSIM	> 0.95	NAFNet: 0.96
DIV2K (σ=25)	PSNR	> 30 dB	SwinIR: 30.9
Urban100 (σ=25)	PSNR	> 29 dB	SwinIR: 29.5
Temporal tiling	Seam PSNR	> 40 dB	MultiDiffusion baseline

Project Files¶

happyin-research/
├── sana-fm/
│   ├── data/paired_dataset.py      ← paired loader
│   ├── data/degradation.py         ← degradation functions
│   ├── configs/img2img_denoise.yaml
│   └── train_flowmatching.py       ← modified compute_loss
├── sana-denoiser/
│   ├── prepare_dataset.py          ← DIV2K + Flickr2K + degradations
│   ├── train.py                    ← wrapper
│   ├── temporal_tiling.py          ← tile-as-sequence inference
│   └── eval/benchmark.py           ← vs SwinIR, NAFNet

Risk Assessment¶

Risk	Likelihood	Mitigation
DC-AE 32x compression loses fine details	Medium	Compare DC-AE reconstruction vs 8x VAE on jewelry textures
Linear attention insufficient for restoration	Low	SANA matches quadratic models on generation; restoration is simpler
Temporal tiling adds latency	High	Acceptable: quality > speed for product photography
1.6B too small for complex degradations	Medium	Scale to 4.8B if needed; depth-pruning from 4.8B as fallback