UniFusion

Vision-Language Model as Unified Encoder for Image Generation

Kevin (Yu-Teng) Li* Manuel Brack* Sudeep Katakol Hareesh Ravi Ajinkya Kale

* Equal contributions. Kevin led the model design and ablations. Manuel led the evaluation and paper writing.

UniFusion Architecture

UniFusion is the first architecture to use only a VLM as the input-condition encoder, without auxiliary signals from a VAE or CLIP, while achieving competitive image-editing quality. The unified encoder framework and our proposed Layerwise Attention Pooling (LAP) module enable emergent capabilities such as zero-shot multi-reference generation when trained only on single-reference pairs, and capability transfer, where training on editing improves text-to-image generation both quantitatively and qualitatively.

Text-to-Image Results

Single-reference editing results


Results shown are generated by a UniFusion checkpoint trained for roughly 11k steps on editing; this checkpoint has never seen multi-reference pairs.

Abstract

Despite rapid advancements in visual generative models, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use only the last-layer representations of a VLM, employ multiple visual encoders, or train large unified multimodal models that demand large-scale compute and data.

To maximize the benefits of the joint multimodal reasoning and representation capacity of VLMs, we present UniFusion, a novel framework for image generation that uses a frozen VLM as a unified encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow-fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, enabling high-quality editing.

With an 8B VLM and an 8B DiT, UniFusion surpasses Flux.1 [dev] and BAGEL on DPG-Bench despite a smaller training set, and compares favorably against Flux.1 Kontext [dev] and Qwen-Image-Edit on editing tasks without any post-training. We further find that the editing task not only improves text-image alignment for generation, indicative of cross-modal knowledge transfer, but also generalizes remarkably well: a UniFusion checkpoint trained on single-image references generalizes zero-shot to multi-reference inputs, further motivating the unified encoder design of UniFusion.

Conditioning Architectures

We begin by ablating four conditioning strategies (the conventional last-layer conditioning, key-value fusion, hidden-state injection, and Layerwise Attention Pooling) and find that LAP consistently performs best across tasks. However, LAP alone is not a simple drop-in replacement for T5, especially when the base model (e.g., Llama 3.1-8B) lacks sufficient capacity. By later switching to InternVL 2.5-8B, which provides richer joint representations, and pairing it with our VeriFi rewriting mechanism, we match and even exceed T5 performance, achieving both stronger prompt alignment and improved visual grounding.

UniFusion Design Choices

One of the motivations for LAP was that high-level semantics and low-level concepts are dispersed across different VLM layers, all of which are useful for the downstream image generation model. This intuition is confirmed by visualizing query-key activation norms in a trained LAP module: different tokens activate different VLM layers, and for many tokens the model relies on implicit clusters of adjacent layers. Given this observation, along with the VLM layer cosine-similarity plot (shown in the paper), we subsample every N layers (e.g., N=3) of the VLM as input to LAP in the final UniFusion architecture. This reduces memory overhead and mitigates the risk of the model overfitting to a local cluster of layers and failing to incorporate all useful information. A minimal sketch of such a module is shown below.
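The sketch below shows what a LAP-style module paired with layer subsampling could look like, assuming the frozen VLM exposes per-layer hidden states. The class and helper names, the tensor shapes, and the choice of deriving the per-token query from the last kept layer are illustrative assumptions rather than the exact parameterization used in UniFusion.

```python
import torch
import torch.nn as nn

class LayerwiseAttentionPooling(nn.Module):
    """Sketch of a LAP-style module: for every token, attend over the hidden states
    that the frozen VLM produced at a subsampled set of layers and pool them into a
    single conditioning vector for the diffusion model. Shapes and the query choice
    are illustrative assumptions, not the exact UniFusion parameterization."""

    def __init__(self, vlm_dim: int, cond_dim: int):
        super().__init__()
        self.to_q = nn.Linear(vlm_dim, cond_dim)  # query from the last kept layer
        self.to_k = nn.Linear(vlm_dim, cond_dim)  # keys from every kept layer
        self.to_v = nn.Linear(vlm_dim, cond_dim)  # values from every kept layer
        self.scale = cond_dim ** -0.5

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: [num_kept_layers, batch, seq_len, vlm_dim]
        q = self.to_q(layer_states[-1])                  # [B, T, C]
        k = self.to_k(layer_states).permute(1, 2, 0, 3)  # [B, T, L, C]
        v = self.to_v(layer_states).permute(1, 2, 0, 3)  # [B, T, L, C]
        # Per-token attention over the layer axis: which layers does each token use?
        attn = torch.einsum("btc,btlc->btl", q, k) * self.scale
        attn = attn.softmax(dim=-1)
        return torch.einsum("btl,btlc->btc", attn, v)    # [B, T, cond_dim]

def subsample_layers(all_hidden_states, stride: int = 3):
    """Keep every `stride`-th layer (e.g., N=3) of the VLM's per-layer hidden states."""
    # all_hidden_states: sequence of [batch, seq_len, vlm_dim] tensors, one per layer
    return torch.stack(list(all_hidden_states)[::stride], dim=0)
```

The key point the sketch captures is that attention runs over the layer axis, per token, so each token can draw on whichever VLM depths carry the information it needs.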

In addition to the activation analysis, we ran a layer-dropout study before LAP to understand the influence of each VLM layer on the visual output. We observe that image generation does not strongly rely on the first and last layers: when we zero out the corresponding weights during pooling, the image composition remains mostly unchanged. Interestingly, dropping the middle layers breaks prompt understanding but does not produce pure noise; the early and late layers still retain low-level concepts such as "mountain" in the top-row example. The probe sketched below shows one way to run such a study with the LAP module above.
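As a companion to the LAP sketch, here is a hedged illustration of a layer-dropout probe; the helper name and example layer indices are hypothetical, and the probe simply zeroes a kept layer's hidden states before pooling.

```python
def pool_with_layer_dropout(lap, layer_states, drop):
    """Zero out selected kept (subsampled) VLM layers before pooling.
    `lap` is a LayerwiseAttentionPooling instance, `layer_states` has shape
    [num_kept_layers, batch, seq_len, vlm_dim], and `drop` is a set of indices
    into the kept-layer axis. Hypothetical helper for the dropout probe."""
    masked = layer_states.clone()
    for idx in drop:
        masked[idx] = 0.0  # this layer now contributes nothing to the pooled condition
    return lap(masked)

# Example probes (indices are illustrative):
# cond_drop_edges  = pool_with_layer_dropout(lap, states, drop={0, states.shape[0] - 1})
# cond_drop_middle = pool_with_layer_dropout(lap, states, drop=set(range(3, 8)))
```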

Can VLM layers reconstruct images? To answer this question, we included a small percentage of image-reconstruction batches (roughly 3%), conditioned on VLM-encoded image tiles, during the text-to-image training stage of UniFusion (a sketch of this data mixing is shown below). The results indicate that not only can VLM layers reconstruct images, but there is also a clear scaling trend as the number of tiles increases: global structure is captured with just one thumbnail tile, while fine-grained details require more tiles to recover. Interestingly, even when UniFusion was not trained with image-reconstruction batches in early experiments, providing a VLM-encoded image at test time still led the downstream DiT to generate visually similar images.
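To make the data mixing concrete, here is a hedged sketch of how a small fraction of reconstruction batches could be interleaved with text-to-image batches; the iterator names, dictionary keys, and per-batch Bernoulli sampling are assumptions for illustration only.

```python
import random

RECON_FRACTION = 0.03  # roughly 3% of training batches are image reconstruction

def sample_batch(t2i_iter, recon_iter):
    """Pick the next training batch. `t2i_iter` yields (prompt_embeds, image) pairs,
    `recon_iter` yields (vlm_tile_embeds, image) pairs; both iterators are hypothetical."""
    if random.random() < RECON_FRACTION:
        tile_embeds, image = next(recon_iter)
        # Condition the DiT on VLM-encoded image tiles and ask it to reproduce the image.
        return {"task": "reconstruction", "condition": tile_embeds, "target": image}
    prompt_embeds, image = next(t2i_iter)
    return {"task": "t2i", "condition": prompt_embeds, "target": image}
```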

UniFusion intentionally does not train on the raw user prompt for most batches. Instead, we propose VLM-Enabled Rewriting Injection with Flexible Inference (VeriFi), which brings three significant benefits:

  • Unlike standalone rewriter models, VeriFi performs a single forward pass without re-encoding: the decoded text still attends directly to the image and text inputs, which better aligns features across modalities and grounds potential hallucinations (see the sketch after this list).
  • Repeating important tokens from the original prompt further mitigates the "position bias" of causal attention, where later tokens tend to receive weaker activation.
  • While the same system prompt is used during training, we can meaningfully influence the model's behavior by adopting a different system prompt at inference.
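The sketch below illustrates the single-pass idea with a generic Hugging Face causal LM standing in for the VLM; in UniFusion the context would also contain the encoded image tiles. The model name, system prompt, and token-stitching details are assumptions for illustration, not the setup from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model standing in for the VLM encoder; not the checkpoint used in the paper.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
vlm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

system_prompt = "Rewrite the user's prompt with explicit visual details."  # swappable at inference
user_prompt = "a corgi surfing at sunset"
inputs = tok(f"{system_prompt}\n{user_prompt}\n", return_tensors="pt")

out = vlm.generate(
    **inputs,
    max_new_tokens=128,
    return_dict_in_generate=True,
    output_hidden_states=True,  # hidden states are collected while decoding: no re-encoding pass
)

# The decoded rewrite attends to the original prompt (and, in UniFusion, the image tokens)
# through the KV cache. Stitch per-layer hidden states over the prompt plus the decoded
# rewrite, then subsample and pool them with LAP as in the sketches above.
num_layers = len(out.hidden_states[0])
layer_states = [
    torch.cat([step[layer] for step in out.hidden_states], dim=1)  # [batch, total_len, dim]
    for layer in range(num_layers)
]
```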

UniFusion achieves competitive performance on DPG-Bench against much larger models trained on more data. We report the average and the best generation across four seeds at 1024px resolution. The macro average is the mean over per-category scores, whereas the micro average is taken over all prompts; the snippet below illustrates the difference. Results are scored by Gemma-3-27B with extensive chain-of-thought to reduce hallucinations in scoring.
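A small illustration of the two averages (category names and scores here are toy numbers, not DPG-Bench data):

```python
scores_by_category = {
    "entity":    [0.9, 0.8],
    "attribute": [0.7, 0.6, 0.5],
    "relation":  [1.0],
}

# Macro: average the per-category means; micro: average over all prompts directly.
all_scores = [s for v in scores_by_category.values() for s in v]
macro_avg = sum(sum(v) / len(v) for v in scores_by_category.values()) / len(scores_by_category)
micro_avg = sum(all_scores) / len(all_scores)

print(f"macro = {macro_avg:.3f}, micro = {micro_avg:.3f}")  # macro = 0.817, micro = 0.750
```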

Capability Transfer - Editing helps text-to-image generation

UniFusion shows significant improvements in prompt alignment for text-to-image generation after fine-tuning on editing tasks for just 11k steps. Specifically, UniFusion-Edit leads UniFusion-Base by over 2 percentage points on the DPG-Bench micro average. In an A/B test with 180 annotators across 616 prompts (2 seeds each), UniFusion-Edit achieves nearly double the win rate of UniFusion-Base.

We hypothesize that this improvement stems from the unified encoder framework. In a shared encoder space, concepts expressed in different modalities become more semantically aligned than under the conventional separation of VAE and T5 spaces. For the DiT, this offers two theoretical benefits:

  1. Reduced disruption in parameter space when transitioning from text-to-image to editing tasks, and
  2. More coherent concept optimization, as the DiT learns to optimize the same flow-matching (or denoising) loss given different modalities of the same concept, enabling a more well-rounded understanding that facilitates smoother capability transfer between tasks.