
ATI (Any Trajectory Instruction)

Trajectory-based motion control for image-to-video (I2V) generation. A lightweight Gaussian motion injector module sits on top of a pretrained, frozen video DiT and controls both camera and object motion through a single unified trajectory representation.

Paper: arXiv:2505.22944 (May 2025). Authors: ByteDance.

Architecture

ATI inserts an injector module between the preprocessing and patchify stages of a frozen I2V DiT:

Input image → VAE encode → latent L_I (H×W×C)
For each trajectory point: bilinear interpolation → C-dim feature vector
Gaussian weighting across frames:
  P = exp(-||position_t - (i,j)||² / (2σ)),   σ = 1/440
Soft spatial guidance signals → injected into the latent space
Frozen Wan2.1-I2V-14B → generated video that follows the trajectories
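The pipeline above can be sketched in NumPy. This is a minimal illustration, not the ATI implementation: the function names, array shapes, and the use of [0, 1]-normalised coordinates (so that σ = 1/440 is resolution-independent) are my own assumptions.

```python
import numpy as np

def bilinear_sample(latent, x, y):
    """Sample a C-dim feature from latent (H, W, C) at fractional pixel (x, y)."""
    H, W, _ = latent.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * latent[y0, x0]
            + dx * (1 - dy) * latent[y0, x1]
            + (1 - dx) * dy * latent[y1, x0]
            + dx * dy * latent[y1, x1])

def gaussian_weight(pos, H, W, sigma=1 / 440):
    """Per-cell Gaussian weight map centred on a trajectory point, matching
    P = exp(-||position_t - (i,j)||^2 / (2*sigma)) in normalised coordinates."""
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs / W, ys / H], axis=-1)            # (H, W, 2)
    d2 = np.sum((grid - np.asarray(pos)) ** 2, axis=-1)   # squared distance
    return np.exp(-d2 / (2 * sigma))

def inject_trajectory(latent, trajectory):
    """Carry the first-frame feature of a trajectory point along its path
    as a soft (Gaussian-weighted) guidance signal, one map per frame."""
    H, W, C = latent.shape
    x0, y0 = trajectory[0]                                # normalised (x, y)
    feat = bilinear_sample(latent, x0 * W, y0 * H)        # C-dim feature vector
    frames = np.zeros((len(trajectory), H, W, C))
    for t, pos in enumerate(trajectory):
        frames[t] = gaussian_weight(pos, H, W)[..., None] * feat
    return frames
```

The guidance tensor returned here would be added to the DiT's latent input; in the real model the injector is the only trained component while the backbone stays frozen.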

Unified Representation

Camera motion and object motion share one mechanism. Camera motion is expressed as coordinated point trajectories (radial expansion = zoom, uniform translation = pan); object motion is point trajectories anchored to objects. There is no separate encoder per motion type.
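The two camera patterns mentioned above are easy to generate synthetically. A minimal sketch (the function names, rates, and normalised-coordinate convention are assumptions, not the paper's API):

```python
import numpy as np

def zoom_trajectories(points, frames, rate=0.02, center=(0.5, 0.5)):
    """Zoom-in: every point moves radially away from the image centre."""
    c = np.asarray(center)
    return np.stack([c + (np.asarray(points) - c) * (1 + rate) ** t
                     for t in range(frames)], axis=1)     # (N, T, 2)

def pan_trajectories(points, frames, velocity=(0.01, 0.0)):
    """Pan: every point translates by the same per-frame offset."""
    v = np.asarray(velocity)
    return np.stack([np.asarray(points) + v * t
                     for t in range(frames)], axis=1)     # (N, T, 2)
```

Because both functions emit plain (N, T, 2) point tracks, they feed the same injector as object trajectories, which is the point of the unified representation.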

Training

  • 2.4M video clips (filtered down from 5M), with 120 points per frame tracked by TAP-Net
  • 50K iterations on 64 GPUs (80 GB each)
  • 1–20 points sampled at random per video during training
  • "Tail Dropout Regularizer" (p=0.2): randomly truncates trajectories so the model does not hallucinate motion through occlusions
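The tail-dropout idea can be sketched as follows. This is my own minimal interpretation (marking dropped frames with NaN as "no guidance" is an assumption; the paper's exact masking scheme may differ):

```python
import numpy as np

def tail_dropout(trajectory, p=0.2, rng=None):
    """With probability p, truncate a (T, 2) trajectory at a random frame.
    The dropped tail is marked NaN, i.e. it provides no guidance signal,
    which teaches the model to handle points that disappear (occlusion)."""
    rng = rng or np.random.default_rng()
    traj = np.asarray(trajectory, dtype=float).copy()
    if rng.random() < p:
        cut = int(rng.integers(1, len(traj)))  # always keep the first point
        traj[cut:] = np.nan                    # tail dropped: no guidance
    return traj
```

Applied per trajectory during training, this forces the model to complete plausible motion even when the conditioning signal ends early.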

Base Model

Wan2.1-I2V-14B-480P (primary). Also validated on Seaweed-7B (internal ByteDance). Model-agnostic injector.

License

Apache 2.0; commercial use permitted.

Gotchas

  • Output 480P only
  • Very rapid movements (half image width in 2 frames) → failure
  • Requires full Wan2.1 14B model + ATI weights + VAE/T5/CLIP copied manually
  • No confirmed Wan 2.2 support yet
  • Trajectory editor = localhost only (security risk on remote)
  • ComfyUI nodes available via Kijai (ComfyUI-WanVideoWrapper)

Links

  • GitHub: github.com/bytedance/ATI
  • HF: huggingface.co/bytedance-research/ATI
  • ComfyUI: docs.comfy.org/tutorials/video/wan/wan-ati