UniFusion

Vision-Language Model as Unified Encoder for Image Generation

Kevin (Yu-Teng) Li* Manuel Brack* Sudeep Katakol Hareesh Ravi Ajinkya Kale

* Equal contributions. Kevin led the model design and ablations. Manuel led the evaluation and paper writing.

UniFusion Architecture

To the best of our knowledge, UniFusion is the first architecture to use only a VLM as the input-condition encoder, without auxiliary signals from a VAE or CLIP, while achieving competitive image editing quality. The unified encoder framework and our proposed Layerwise Attention Pooling (LAP) module enable emergent capabilities such as zero-shot multi-reference generation when trained only on single-reference pairs, and capability transfer, where training on editing improves text-to-image generation both quantitatively and qualitatively.

Text-to-Image Results

Single-reference editing results

Hover to see the full edited image and instruction prompts.

Results shown are generated by a UniFusion checkpoint trained for roughly 11k steps on editing; the model has never seen multi-reference pairs. Hover over the image to pause the carousel!

Abstract

Despite rapid advancements in visual generative models, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. To maximize the benefits of the joint multimodal reasoning and representation capacity of vision-language models (VLMs), we present UniFusion, a diffusion-based generative model conditioned on a frozen large VLM that serves as a unified multimodal encoder.

At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a VLM. We demonstrate that LAP outperforms other shallow fusion architectures both on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We also propose VLM-Enabled Rewriting Injection with Flexible Inference (Verifi), which leverages only the text tokens generated by the VLM during in-model prompt rewriting for increased capability and flexibility at inference.

UniFusion surpasses Flux.1 [dev] and BAGEL on DPG-Bench with a smaller training set, while comparing favorably against Flux.1 Kontext [dev] and Qwen-Image-Edit on editing tasks without any post-training. In addition, finetuning on the editing task not only improves text-image alignment for generation, indicative of cross-modal knowledge transfer, but also exhibits strong generalization: when trained on single-image editing, our model zero-shot generalizes to multiple image references, further motivating the unified encoder design of UniFusion.

Conditioning Architectures

We began by ablating four conditioning strategies, including the conventional last-layer approach, key-value fusion, and hidden-state injection, and found that Layerwise Attention Pooling (LAP) consistently performs best. However, LAP alone is not a simple drop-in replacement that surpasses T5, especially when the base model (e.g., Llama 3.1-8B) lacks sufficient capacity. By switching to InternVL 2.5-8B in a later section, which provides richer joint representations, we match and even exceed T5 performance once paired with our Verifi rewriting mechanism, achieving both stronger prompt alignment and improved visual grounding.

UniFusion Design Choices

One of the motivations for LAP was that high-level semantics and low-level concepts are dispersed across different layers, all of which are useful to the downstream image generation model. Our intuition was confirmed when we visualized query-key activation norms in a trained LAP module: different tokens activate different VLM layers, and for many tokens the model relies on implicit clusters of adjacent layers. Given this observation, along with the cosine-similarity plot of VLM layers (shown in the paper), we decided to subsample every N layers (e.g., N=3) from the VLM as input to LAP in the final UniFusion architecture. This reduces memory overhead and mitigates the risk of the model overfitting to a local cluster of layers and failing to incorporate all useful information.
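The idea can be sketched as follows: per token, a query attends over hidden states from VLM layers subsampled every N layers, and the attention weights select which layers to pool from. This is a minimal illustration, not the paper's exact implementation; the class name, the choice of query source, and all shapes are assumptions.

```python
# Minimal sketch of Layerwise Attention Pooling (LAP) with layer
# subsampling. Illustrative only: names, query source (deepest kept
# layer), and dimensions are assumptions, not the released code.
import torch
import torch.nn as nn

class LayerwiseAttentionPooling(nn.Module):
    def __init__(self, hidden_dim: int, stride: int = 3):
        super().__init__()
        self.stride = stride  # subsample every N VLM layers
        self.query = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.key = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.value = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, hidden_dim)
        sub = layer_states[:: self.stride]            # (L', B, T, D)
        q = self.query(sub[-1])                       # query from deepest kept layer
        k, v = self.key(sub), self.value(sub)         # (L', B, T, D)
        # per-token attention over the layer axis
        logits = torch.einsum("btd,lbtd->lbt", q, k) / q.shape[-1] ** 0.5
        weights = logits.softmax(dim=0)               # softmax across layers
        return torch.einsum("lbt,lbtd->btd", weights, v)

lap = LayerwiseAttentionPooling(hidden_dim=64, stride=3)
states = torch.randn(12, 2, 5, 64)                    # 12 layers, batch 2, 5 tokens
pooled = lap(states)
print(pooled.shape)                                   # torch.Size([2, 5, 64])
```

With stride 3, a 12-layer stack is pooled from only 4 layers, which is where the memory saving comes from.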

In addition to the activation analysis, we ran a layer-dropout study before LAP to understand each VLM layer's influence on the visual output. We observe that image generation does not strongly rely on the first and last layers: when we zero out the corresponding weights during pooling, the image composition remains largely unchanged. Interestingly, dropping the middle layers breaks prompt understanding but does not produce pure noise; instead, the early and late layers retain low-level concepts such as "mountain" in the top-row example.
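The probe described above amounts to zeroing a layer's pooling weight and renormalizing the rest. A minimal sketch, assuming uniform per-layer weights for illustration (the trained weights are of course not uniform):

```python
# Hedged sketch of the layer-dropout probe: zero out selected layers'
# pooling weights and renormalize, mimicking "dropping" those VLM
# layers at inference. All numbers are illustrative.
import numpy as np

def drop_layers(weights: np.ndarray, dropped: list[int]) -> np.ndarray:
    """weights: per-layer pooling weights summing to 1."""
    w = weights.copy()
    w[dropped] = 0.0
    return w / w.sum()  # redistribute mass over the remaining layers

weights = np.full(8, 1 / 8)                    # uniform weights over 8 layers
probed = drop_layers(weights, dropped=[0, 7])  # drop first and last layer
print(probed.round(3))                         # remaining six layers get ~0.167 each
```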

Can VLM layers reconstruct images? To answer this question, we included a small percentage of image-reconstruction batches (roughly 3%), conditioned on VLM-encoded image tiles, during the text-to-image training stage of UniFusion. The results indicate that not only can VLM layers reconstruct images, but there is also a clear scaling trend as the number of tiles increases: global structure can be captured with just one thumbnail tile, while fine-grained details require more tiles to retain. Interestingly, even when UniFusion was not trained with image-reconstruction batches in early experiments, providing a VLM-encoded image at test time led the downstream DiT to generate visually similar images.

UniFusion intentionally does not train on the raw user prompt for most batches. Instead, we propose VLM-Enabled Rewriting Injection with Flexible Inference (Verifi), which brings three significant benefits:

  • Unlike standalone rewriter models, Verifi performs a single forward pass without re-encoding, so the decoded text still attends to the image and text inputs directly, better aligning features across modalities and grounding potential hallucinations.
  • Repetition of important tokens from the original prompt further mitigates "position biases" in causal attention, where later tokens receive weaker activation products.
  • While we use the same system prompt during training, we can meaningfully influence the model's behavior by adopting a different system prompt at inference.
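The first benefit, conditioning on hidden states from the same forward pass that decodes the rewrite, can be sketched with a toy language model. The `TinyLM` class, its `rewrite` method, and all shapes below are stand-ins for the real VLM and are purely illustrative:

```python
# Conceptual sketch of Verifi: the VLM rewrites the prompt in one
# decoding pass, and the hidden states of the generated rewrite tokens
# condition the DiT directly, with no second encoding pass. TinyLM is
# a toy stand-in; names and signatures are assumptions.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab: int = 100, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def rewrite(self, prompt_ids: torch.Tensor, steps: int = 4):
        ids, hidden_trace = prompt_ids, []
        for _ in range(steps):               # greedy decode, token by token
            h = self.block(self.embed(ids))  # hidden states attend to full context
            nxt = self.head(h[:, -1:]).argmax(-1)
            ids = torch.cat([ids, nxt], dim=1)
            hidden_trace.append(h[:, -1])    # keep the new token's hidden state
        # conditioning = hidden states of generated tokens, reused as-is
        return ids, torch.stack(hidden_trace, dim=1)

lm = TinyLM()
prompt = torch.randint(0, 100, (1, 6))       # system + user prompt tokens
ids, cond = lm.rewrite(prompt)
print(ids.shape, cond.shape)                 # (1, 10) ids, (1, 4, 32) conditioning
```

A standalone rewriter would instead decode text, then re-encode that text from scratch, losing the direct attention link to the original image and text inputs.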

UniFusion achieves competitive performance on DPG-Bench against much larger models trained on more data. We report the average and best generation across four seeds at 1024px resolution. The macro average is the mean over per-category scores, whereas the micro average is taken over all prompts. Results are scored by Gemma-3-27B with extensive chain-of-thought (CoT) to reduce hallucinations in scoring.
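The difference between the two aggregates is easy to see on toy numbers (the scores below are made up): macro weights every category equally, while micro lets larger categories dominate.

```python
# Macro vs. micro averaging on illustrative (made-up) per-prompt scores.
scores = {
    "entity": [0.9, 0.8, 0.7, 0.6],  # 4 prompts, category mean 0.75
    "relation": [0.4, 0.5],          # 2 prompts, category mean 0.45
}

# Macro: average the per-category means (each category counts once).
macro = sum(sum(v) / len(v) for v in scores.values()) / len(scores)

# Micro: average over all prompts (large categories weigh more).
micro = sum(s for v in scores.values() for s in v) / sum(
    len(v) for v in scores.values()
)
print(round(macro, 3), round(micro, 3))  # 0.6 0.65
```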

How does UniFusion perform on editing with just VLM-encoded images? In an AB test with 200 annotators across 448 prompts of diverse editing types (as shown in the pie chart below),

  • UniFusion beats BAGEL in overall user preference, identity preservation and visual quality
  • UniFusion beats Qwen-Image-Edit 2509 in identity preservation and ties on visual quality
  • UniFusion remains competitive with Flux.1 Kontext [dev] despite being an 8B model with a smaller training set
Part of the remaining gap can be attributed to missing data types such as "portrait retouch", which explains why UniFusion lags behind on the Prompt Match aspect; we nevertheless retained these cases for a fair evaluation. UniFusion's strong performance despite this highlights its potential when equipped with a more comprehensive dataset and scaled up in parameters.

(AB testing plot on editing will be included in the updated version of ArXiv shortly)

Capability Transfer - Editing helps text-to-image generation

UniFusion demonstrates significant improvements in prompt alignment for text-to-image generation after finetuning on editing tasks for just 11k steps. Specifically, UniFusion-Edit leads UniFusion-Base by over 2 percentage points on the micro average of DPG-Bench. In an A/B test with 180 annotators across 616 prompts (2 seeds each), UniFusion-Edit achieves nearly double the win rate of UniFusion-Base.

We hypothesize that this improvement stems from the unified encoder framework. In a shared encoder space, concepts represented by different modalities become more semantically aligned compared to the conventional separation of VAE and T5 spaces. For DiT, this offers two theoretical benefits:

  1. Reduced disruption in parameter space when transitioning from text-to-image to editing tasks, and
  2. More coherent concept optimization: the DiT learns to optimize the same flow-matching (or denoising) loss given different modalities of the same concept, enabling a more well-rounded understanding and facilitating smoother capability transfer between tasks.
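The shared objective in point 2 is, in its simplest linear-interpolation form, the flow-matching loss below. This is a generic textbook formulation, not UniFusion's training code; the toy `model` stand-in and all shapes are assumptions. The key point is that the loss is identical regardless of whether the conditioning came from text tokens or VLM-encoded image tokens.

```python
# Generic flow-matching objective in its minimal form: the model
# predicts the velocity (x1 - x0) along a linear path from noise to
# data. Illustrative only; shapes and the toy model are assumptions.
import torch

def flow_matching_loss(model, x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # x0: noise sample, x1: data latent; t ~ U(0, 1) per example
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1    # linear interpolation between noise and data
    target = x1 - x0              # constant velocity of the linear path
    return ((model(xt, t) - target) ** 2).mean()

model = lambda xt, t: torch.zeros_like(xt)  # stand-in for the DiT
loss = flow_matching_loss(model, torch.randn(4, 8), torch.randn(4, 8))
print(loss.item() > 0)
```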