Textual Latent Interpolation¶
Technique for continuous attribute control in diffusion models by interpolating between text embeddings. Enables smooth transitions between states (e.g., neutral → happy) with an intensity parameter alpha. Core mechanism behind [[PixelSmile]]'s expression control.
Principle¶
Same idea as word2vec arithmetic ("king - man + woman = queen") but in the text embedding space of a diffusion model's text encoder.
# Direction vector in embedding space
e_neutral = text_encoder("neutral expression")
e_target = text_encoder("happy expression")
delta = e_target - e_neutral
# Controlled generation at any intensity
e_conditioned = e_neutral + alpha * delta
# alpha=0 → neutral, alpha=1 → full expression, alpha>1 → exaggerated
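The arithmetic above can be made concrete with a minimal, runnable sketch. The `toy_text_encoder` stub is an assumption for illustration only (a real pipeline would call the diffusion model's actual text encoder); it just maps a prompt deterministically to a vector so the alpha behavior can be verified:

```python
import zlib
import numpy as np

def toy_text_encoder(prompt: str, dim: int = 8) -> np.ndarray:
    """Deterministic stand-in for a real text encoder (illustrative only)."""
    seed = zlib.crc32(prompt.encode("utf-8"))
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

e_neutral = toy_text_encoder("neutral expression")
e_target = toy_text_encoder("happy expression")
delta = e_target - e_neutral

def conditioned(alpha: float) -> np.ndarray:
    # alpha=0 -> neutral, alpha=1 -> full target, alpha>1 -> exaggerated
    return e_neutral + alpha * delta

assert np.allclose(conditioned(0.0), e_neutral)
assert np.allclose(conditioned(1.0), e_target)
```

The endpoints recover the original embeddings exactly, and every intermediate alpha lies on the straight line between them.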
Implementation Strategies¶
Strategy 1: All-token interpolation (PixelSmile default)¶
Interpolate across ALL token positions in the embedding sequence. This is the simplest approach and works well when the prompts differ only in the attribute word.
# score_one_all method
e_cond = (1 - alpha) * e_neutral + alpha * e_target
# Equivalent to: e_neutral + alpha * (e_target - e_neutral)
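The equivalence noted in the comment can be checked numerically. Random arrays stand in for real prompt embeddings here, and the CLIP-like `(1, 77, 768)` shape is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
e_neutral = rng.standard_normal((1, 77, 768))  # (batch, tokens, dim)
e_target = rng.standard_normal((1, 77, 768))

for alpha in (0.0, 0.3, 1.0, 1.5):
    lerp_form = (1 - alpha) * e_neutral + alpha * e_target
    delta_form = e_neutral + alpha * (e_target - e_neutral)
    # Both formulations produce identical conditioning.
    assert np.allclose(lerp_form, delta_form)
```

The two forms differ only in how the expression is factored, so either can be used interchangeably in an implementation.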
Strategy 2: Tail-token interpolation¶
Only interpolate the last N tokens (where the attribute-specific words are). Preserves the shared context tokens exactly.
# Interpolate only the last N tokens (here N = 7)
e_cond = e_neutral.clone()
e_cond[:, -7:, :] = (1 - alpha) * e_neutral[:, -7:, :] + alpha * e_target[:, -7:, :]
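A self-contained sketch of tail-token interpolation on a `(batch, tokens, dim)` array, verifying that the shared context tokens are untouched. The shapes and the 7-token tail are assumptions taken from the snippet above:

```python
import numpy as np

rng = np.random.default_rng(1)
e_neutral = rng.standard_normal((1, 77, 64))  # (batch, tokens, dim)
e_target = rng.standard_normal((1, 77, 64))
alpha, n_tail = 0.5, 7

e_cond = e_neutral.copy()
e_cond[:, -n_tail:, :] = (
    (1 - alpha) * e_neutral[:, -n_tail:, :] + alpha * e_target[:, -n_tail:, :]
)

# Shared context tokens are preserved exactly; only the tail moves.
assert np.array_equal(e_cond[:, :-n_tail, :], e_neutral[:, :-n_tail, :])
assert not np.allclose(e_cond[:, -n_tail:, :], e_neutral[:, -n_tail:, :])
```

In a PyTorch implementation, `copy()` becomes `clone()` as in the snippet above; the slicing semantics are the same.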
Strategy 3: Direction projection¶
Compute the delta, normalize it to a unit direction, and scale it by an explicit magnitude:
delta = e_target - e_neutral
delta_norm = delta / delta.norm() # unit direction
e_cond = e_neutral + alpha * magnitude * delta_norm
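A runnable sketch of the normalized-direction variant. The `magnitude` hyperparameter is an assumption not fixed by the source; its purpose is to decouple edit strength from the raw distance between the two prompt embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
e_neutral = rng.standard_normal(768)
e_target = rng.standard_normal(768)

delta = e_target - e_neutral
delta_norm = delta / np.linalg.norm(delta)  # unit direction
alpha, magnitude = 0.8, 5.0
e_cond = e_neutral + alpha * magnitude * delta_norm

# The applied edit always has length alpha * magnitude,
# regardless of how far apart the two prompts embed.
assert np.isclose(np.linalg.norm(e_cond - e_neutral), alpha * magnitude)
```

This makes alpha comparable across attribute pairs: "neutral → happy" and "neutral → ecstatic" get edits of the same geometric size at the same alpha.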
Multi-Attribute Blending¶
Extend to pairwise combinations (PixelSmile's expression blending):
e_angry = text_encoder("angry expression")
e_sad = text_encoder("sad expression")
e_neutral = text_encoder("neutral expression")
delta_angry = e_angry - e_neutral
delta_sad = e_sad - e_neutral
# Blend two emotions at different intensities
e_cond = e_neutral + 0.7 * delta_angry + 0.5 * delta_sad
9 of the 15 basic-expression pairs produce plausible results. Failures occur when the attributes are physiologically contradictory (they demand conflicting facial-muscle states).
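The blend above can be sketched end to end. Random vectors stand in for real embeddings, and the 0.7/0.5 weights come from the snippet; the key property demonstrated is that multi-attribute blending is just a weighted sum of per-attribute deltas:

```python
import numpy as np

rng = np.random.default_rng(3)
e_neutral = rng.standard_normal(768)
e_angry = rng.standard_normal(768)
e_sad = rng.standard_normal(768)

delta_angry = e_angry - e_neutral
delta_sad = e_sad - e_neutral

# Blend two emotions at different intensities.
e_cond = e_neutral + 0.7 * delta_angry + 0.5 * delta_sad

# The combined edit decomposes exactly into its per-attribute parts...
assert np.allclose(e_cond - e_neutral, 0.7 * delta_angry + 0.5 * delta_sad)
# ...and zeroing one weight recovers single-attribute interpolation.
assert np.allclose(e_neutral + 0.7 * delta_angry + 0.0 * delta_sad,
                   e_neutral + 0.7 * delta_angry)
```

Because the operation is linear, attribute edits compose in any order; whether the *result* is plausible depends on the attributes themselves, as noted above.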
Requirements¶
- Aligned embedding space: the text encoder must produce embeddings where linear interpolation is semantically meaningful. VLM-based encoders ([[Qwen 2.5 VL]]) work better than CLIP for this.
- LoRA training with the technique: the model must be trained to respond to interpolated embeddings. PixelSmile explicitly trains with varied alpha values — the model learns that intermediate embeddings mean intermediate expressions.
- Prompt structure: neutral and target prompts should be identical except for the controlled attribute. Keeps the delta clean.
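The training requirement can be sketched as a data-side step: each training example samples its own alpha and builds the interpolated conditioning. This is a sketch under assumptions — the uniform `[0, 1.2]` range and the helper name `sample_conditioning` are illustrative, not PixelSmile's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_conditioning(e_neutral: np.ndarray,
                        e_target: np.ndarray) -> tuple[np.ndarray, float]:
    # Sample an intensity per training step so the model sees the whole
    # alpha range, including mild extrapolation past 1.0.
    alpha = rng.uniform(0.0, 1.2)
    return e_neutral + alpha * (e_target - e_neutral), alpha

e_neutral = rng.standard_normal(768)
e_target = rng.standard_normal(768)

for _ in range(100):
    e_cond, alpha = sample_conditioning(e_neutral, e_target)
    # Every sampled conditioning lies on the neutral -> target line.
    assert np.isclose(
        np.linalg.norm(e_cond - e_neutral),
        alpha * np.linalg.norm(e_target - e_neutral),
    )
```

During LoRA training, `e_cond` would replace the usual prompt embedding for that step, paired with an image whose expression intensity matches the sampled alpha.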
Applications Beyond Expressions¶
The technique generalizes to any attribute controllable via text:
| Application | Neutral Prompt | Target Prompt | Alpha Controls |
|---|---|---|---|
| Expression | "neutral expression" | "happy expression" | Smile intensity |
| Age | "young person" | "elderly person" | Apparent age |
| Lighting | "normal lighting" | "dramatic lighting" | Light intensity |
| Style | "photograph" | "oil painting" | Stylization degree |
| Weather | "clear sky" | "heavy rain" | Rain intensity |
Relation to Other Techniques¶
| Technique | Mechanism | Granularity | Training Required |
|---|---|---|---|
| Textual Latent Interpolation | Embedding arithmetic | Continuous | Yes (for best results) |
| Prompt weighting | Token-level scale factors | Discrete steps | No |
| ControlNet | Spatial conditioning | Pixel-level | Yes (dedicated model) |
| IP-Adapter | Image embedding injection | Per-reference | Yes (adapter) |
Key advantages: zero additional parameters at inference (only the conditioning changes), continuous control, and composability across attributes.