MACRO (Multi-Reference Image Generation)¶

Dataset + benchmark + fine-tuning recipe that fixes quality degradation when generation models receive many (6-10) reference images. Not a new architecture — applied to existing models (Bagel, OmniGen2, Qwen-Image-Edit).

Paper: arXiv:2603.25319 (March 2026). Authors: HKU MMLab + Meituan.

Problem¶

Models like Bagel, OmniGen2, Qwen-Image-Edit support <image N> placeholders for multi-reference generation, but quality drops sharply at 6+ images. Root cause: training data bottleneck — existing datasets dominated by 1-2 reference pairs with no structured supervision for dense inter-reference dependencies.

Solution: Data-Centric¶

MACRO-400K Dataset¶

400K samples, up to 10 references per sample, average 5.44 references. Four task categories (100K each):

Task	Description	Sources
Customization	Multi-subject composition	OpenSubject, MVImgNet, DL3DV, WikiArt
Illustration	Image from multimodal context	OmniCorpus-CC-210M web crawl
Spatial	Novel view synthesis	G-buffer Objaverse, Pano360, Polyhaven
Temporal	Future frame prediction	OmniCorpus-YT videos

Balanced across reference count brackets: 1-3 / 4-5 / 6-7 / 8-10.

Construction pipeline: Split → Generate (Gemini + Nano APIs) → Filter (LLM scoring + bidirectional VLM assessment). The generation step uses proprietary APIs — pipeline not fully reproducible, but the resulting dataset is fully released.

Dynamic Resolution Scaling¶

At inference, input images automatically downsized as count increases: - 1-2 images: 1M px - 3-5 images: 590K px - 6+ images: 262K px

Training Recipe¶

Full fine-tune (not LoRA). Per-model framework:

Model	Framework	Training	Size
Bagel (14.7B, MoT)	FSDP + FLEX packing	LR 2e-5, 10 epochs, VAE frozen	~29.5 GB
OmniGen2	Native framework	Same hyperparams	—
Qwen-Image-Edit	DiffSynth + DeepSpeed	Same hyperparams	~98.6 GB

T2I co-training: 10% text-to-image data mixed in to preserve general T2I capability.

Results (MacroBench)¶

4000 samples, 16 sub-categories, LLM-scored:

Model	Open?	Score	vs Base
Nano Banana Pro	No	6.12	—
GPT-Image-1.5	No	5.89	—
Macro-Bagel	Yes	5.71	+88% (base: 3.03)
Macro-OmniGen2	Yes	—	significant improvement
Macro-Qwen	Yes	—	mitigates severe drops at 6-10

Macro-Bagel approaches Nano Banana Pro in Customization, surpasses it in Spatial tasks.

Ablation Insights¶

Sharpest gains between 1K-10K samples, diminishing returns 10K-20K
Upweighting large-input samples (2:2:3:3 ratio) helps without hurting low-input
Cross-task co-training provides synergistic benefits — spatial training helps customization

Inference¶

# Bagel variant
from inference_bagel import generate
result = generate(model, prompt="...", reference_images=[img1, img2, ...], resolution=768)
# Default: 768x768

VRAM: 40-80 GB depending on model variant. enable_model_cpu_offload supported for OmniGen2.

License¶

Component	License
Code	Apache 2.0 (HF; GitHub has no LICENSE file)
All 3 model weights	Apache 2.0
MACRO-400K dataset	CC-BY-4.0

Fully commercially usable — code, weights, and dataset.

Gotchas¶

Dataset construction uses proprietary Gemini/Nano APIs — cannot recreate dataset, but can use the released one
GitHub repo has no explicit LICENSE file yet (fresh project, 3 days old)
Full fine-tune requires multi-GPU setup (FSDP) — not a quick LoRA
Training code released but expects specific framework versions

Key Links¶

GitHub: github.com/HKU-MMLab/Macro
HF models: huggingface.co/Azily/ (Macro-Bagel, Macro-OmniGen2, Macro-Qwen-Image-Edit)
HF dataset: huggingface.co/datasets/Azily/Macro-Dataset
Project page: macro400k.github.io