UniFusion

Vision-Language Model as Unified Encoder for Image Generation

Kevin (Yu-Teng) Li* Manuel Brack* Sudeep Katakol Hareesh Ravi Ajinkya Kale

* Equal contributions. Kevin led the model design and ablations. Manuel led the evaluation and paper writing.

UniFusion Architecture

UniFusion is the first architecture to use only a VLM as the input-condition encoder, without auxiliary signals from a VAE or CLIP, while achieving competitive image-editing quality. The unified encoder framework and our proposed Layerwise Attention Pooling (LAP) module enable emergent capabilities such as zero-shot multi-reference generation when trained only on single-reference pairs, and capability transfer, where training on editing improves text-to-image generation both quantitatively and qualitatively.

Text-to-Image Results

Single-reference editing results


Results shown are generated by a UniFusion checkpoint trained for roughly 11k steps on editing; this checkpoint has never seen multi-reference pairs.

Abstract

Despite rapid advancements in visual generative models, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use only the last-layer representations of a VLM, employ multiple visual encoders, or train large unified multimodal models that demand large-scale compute and data.

To maximize the benefits of the joint multimodal reasoning and representation capacity of VLMs, we present UniFusion, a novel framework for image generation that uses a frozen VLM as a unified encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow-fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, enabling high-quality editing.

With an 8B VLM and an 8B DiT, UniFusion surpasses Flux.1 [dev] and BAGEL on DPG-Bench despite a smaller training set, and compares favorably against Flux.1 Kontext [dev] and Qwen-Image-Edit on editing tasks without any post-training. We further find that the editing task not only improves text-image alignment for generation, indicative of cross-modal knowledge transfer, but also generalizes remarkably well: a UniFusion checkpoint trained on single-image references generalizes zero-shot to multi-reference inputs, further motivating the unified encoder design of UniFusion.

Conditioning Architectures

We begin by ablating four conditioning strategies (the conventional last-layer conditioning, key-value fusion, hidden-state injection, and Layerwise Attention Pooling) and find that LAP consistently performs best across tasks. However, LAP alone is not a simple drop-in replacement for T5, especially when the base model (e.g., Llama 3.1-8B) lacks sufficient capacity. By later switching to InternVL 2.5-8B, which provides richer joint representations, and pairing it with our VeriFi rewriting mechanism, we match and even exceed T5 performance, achieving both stronger prompt alignment and improved visual grounding.

UniFusion Design Choices

One of the motivations for LAP was that high-level semantics and low-level concepts are dispersed across different VLM layers, all of which are useful for the downstream image generation model. This intuition is confirmed by visualizing query-key activation norms in a trained LAP module: different tokens activate different VLM layers, and for many tokens the model relies on implicit clusters of adjacent layers. Given this observation, along with the VLM layer cosine-similarity plot (shown in the paper), we subsample every N layers (e.g., N=3) of the VLM as input to LAP in the final UniFusion architecture. This reduces memory overhead and mitigates the risk of the model overfitting to a local cluster of layers and failing to incorporate all useful information. A minimal sketch of such a module is shown below.
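The sketch below shows what a LAP-style module paired with layer subsampling could look like, assuming the frozen VLM exposes per-layer hidden states. The class and helper names, the tensor shapes, and the choice of deriving the per-token query from the last kept layer are illustrative assumptions rather than the exact parameterization used in UniFusion.

```python
import torch
import torch.nn as nn

class LayerwiseAttentionPooling(nn.Module):
    """Sketch of a LAP-style module: for every token, attend over the hidden states
    that the frozen VLM produced at a subsampled set of layers and pool them into a
    single conditioning vector for the diffusion model. Shapes and the query choice
    are illustrative assumptions, not the exact UniFusion parameterization."""

    def __init__(self, vlm_dim: int, cond_dim: int):
        super().__init__()
        self.to_q = nn.Linear(vlm_dim, cond_dim)  # query from the last kept layer
        self.to_k = nn.Linear(vlm_dim, cond_dim)  # keys from every kept layer
        self.to_v = nn.Linear(vlm_dim, cond_dim)  # values from every kept layer
        self.scale = cond_dim ** -0.5

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: [num_kept_layers, batch, seq_len, vlm_dim]
        q = self.to_q(layer_states[-1])                  # [B, T, C]
        k = self.to_k(layer_states).permute(1, 2, 0, 3)  # [B, T, L, C]
        v = self.to_v(layer_states).permute(1, 2, 0, 3)  # [B, T, L, C]
        # Per-token attention over the layer axis: which layers does each token use?
        attn = torch.einsum("btc,btlc->btl", q, k) * self.scale
        attn = attn.softmax(dim=-1)
        return torch.einsum("btl,btlc->btc", attn, v)    # [B, T, cond_dim]

def subsample_layers(all_hidden_states, stride: int = 3):
    """Keep every `stride`-th layer (e.g., N=3) of the VLM's per-layer hidden states."""
    # all_hidden_states: sequence of [batch, seq_len, vlm_dim] tensors, one per layer
    return torch.stack(list(all_hidden_states)[::stride], dim=0)
```

The key point the sketch captures is that attention runs over the layer axis, per token, so each token can draw on whichever VLM depths carry the information it needs.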

In addition to the activation analysis, we ran a layer-dropout study before LAP to understand the influence of each VLM layer on the visual output. We observe that image generation does not strongly rely on the first and last layers: when we zero out the corresponding weights during pooling, the image composition remains mostly unchanged. Interestingly, dropping the middle layers breaks prompt understanding but does not produce pure noise; the early and late layers still retain low-level concepts such as "mountain" in the top-row example. The probe sketched below shows one way to run such a study with the LAP module above.
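As a companion to the LAP sketch, here is a hedged illustration of a layer-dropout probe; the helper name and example layer indices are hypothetical, and the probe simply zeroes a kept layer's hidden states before pooling.

```python
def pool_with_layer_dropout(lap, layer_states, drop):
    """Zero out selected kept (subsampled) VLM layers before pooling.
    `lap` is a LayerwiseAttentionPooling instance, `layer_states` has shape
    [num_kept_layers, batch, seq_len, vlm_dim], and `drop` is a set of indices
    into the kept-layer axis. Hypothetical helper for the dropout probe."""
    masked = layer_states.clone()
    for idx in drop:
        masked[idx] = 0.0  # this layer now contributes nothing to the pooled condition
    return lap(masked)

# Example probes (indices are illustrative):
# cond_drop_edges  = pool_with_layer_dropout(lap, states, drop={0, states.shape[0] - 1})
# cond_drop_middle = pool_with_layer_dropout(lap, states, drop=set(range(3, 8)))
```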

Can VLM layers reconstruct images? To answer this question, we included a small percentage of image-reconstruction batches (roughly 3%), conditioned on VLM-encoded image tiles, during the text-to-image training stage of UniFusion (a sketch of this data mixing is shown below). The results indicate that not only can VLM layers reconstruct images, but there is also a clear scaling trend as the number of tiles increases: global structure is captured with just one thumbnail tile, while fine-grained details require more tiles to recover. Interestingly, even when UniFusion was not trained with image-reconstruction batches in early experiments, providing a VLM-encoded image at test time still led the downstream DiT to generate visually similar images.
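To make the data mixing concrete, here is a hedged sketch of how a small fraction of reconstruction batches could be interleaved with text-to-image batches; the iterator names, dictionary keys, and per-batch Bernoulli sampling are assumptions for illustration only.

```python
import random

RECON_FRACTION = 0.03  # roughly 3% of training batches are image reconstruction

def sample_batch(t2i_iter, recon_iter):
    """Pick the next training batch. `t2i_iter` yields (prompt_embeds, image) pairs,
    `recon_iter` yields (vlm_tile_embeds, image) pairs; both iterators are hypothetical."""
    if random.random() < RECON_FRACTION:
        tile_embeds, image = next(recon_iter)
        # Condition the DiT on VLM-encoded image tiles and ask it to reproduce the image.
        return {"task": "reconstruction", "condition": tile_embeds, "target": image}
    prompt_embeds, image = next(t2i_iter)
    return {"task": "t2i", "condition": prompt_embeds, "target": image}
```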

UniFusion intentionally does not train on the raw user prompt for most batches. Instead, we propose VLM-Enabled Rewriting Injection with Flexible Inference (VeriFi), which brings three significant benefits:

  • Unlike standalone rewriter models, VeriFi performs a single forward pass without re-encoding: the decoded text still attends directly to the image and text inputs, which better aligns features across modalities and grounds potential hallucinations (see the sketch after this list).
  • Repeating important tokens from the original prompt further mitigates the "position bias" of causal attention, where later tokens tend to receive weaker activation.
  • While the same system prompt is used during training, we can meaningfully influence the model's behavior by adopting a different system prompt at inference.
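The sketch below illustrates the single-pass idea with a generic Hugging Face causal LM standing in for the VLM; in UniFusion the context would also contain the encoded image tiles. The model name, system prompt, and token-stitching details are assumptions for illustration, not the setup from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model standing in for the VLM encoder; not the checkpoint used in the paper.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
vlm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

system_prompt = "Rewrite the user's prompt with explicit visual details."  # swappable at inference
user_prompt = "a corgi surfing at sunset"
inputs = tok(f"{system_prompt}\n{user_prompt}\n", return_tensors="pt")

out = vlm.generate(
    **inputs,
    max_new_tokens=128,
    return_dict_in_generate=True,
    output_hidden_states=True,  # hidden states are collected while decoding: no re-encoding pass
)

# The decoded rewrite attends to the original prompt (and, in UniFusion, the image tokens)
# through the KV cache. Stitch per-layer hidden states over the prompt plus the decoded
# rewrite, then subsample and pool them with LAP as in the sketches above.
num_layers = len(out.hidden_states[0])
layer_states = [
    torch.cat([step[layer] for step in out.hidden_states], dim=1)  # [batch, total_len, dim]
    for layer in range(num_layers)
]
```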

UniFusion achieves competitive performance on DPG-Bench against much larger models trained on more data. We report the average and the best generation across four seeds at 1024px resolution. The macro average is the mean over per-category scores, whereas the micro average is taken over all prompts; the snippet below illustrates the difference. Results are scored by Gemma-3-27B with extensive chain-of-thought to reduce hallucinations in scoring.
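A small illustration of the two averages (category names and scores here are toy numbers, not DPG-Bench data):

```python
scores_by_category = {
    "entity":    [0.9, 0.8],
    "attribute": [0.7, 0.6, 0.5],
    "relation":  [1.0],
}

# Macro: average the per-category means; micro: average over all prompts directly.
all_scores = [s for v in scores_by_category.values() for s in v]
macro_avg = sum(sum(v) / len(v) for v in scores_by_category.values()) / len(scores_by_category)
micro_avg = sum(all_scores) / len(all_scores)

print(f"macro = {macro_avg:.3f}, micro = {micro_avg:.3f}")  # macro = 0.817, micro = 0.750
```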

Capability Transfer - Editing helps text-to-image generation

UniFusion shows significant improvements in prompt alignment for text-to-image generation after fine-tuning on editing tasks for just 11k steps. Specifically, UniFusion-Edit leads UniFusion-Base by over 2 percentage points on the DPG-Bench micro average. In an A/B test with 180 annotators across 616 prompts (2 seeds each), UniFusion-Edit achieves nearly double the win rate of UniFusion-Base.

We hypothesize that this improvement stems from the unified encoder framework. In a shared encoder space, concepts expressed in different modalities become more semantically aligned than under the conventional separation of VAE and T5 spaces. For the DiT, this offers two theoretical benefits:

  1. Reduced disruption in parameter space when transitioning from text-to-image to editing tasks, and
  2. More coherent concept optimization, as the DiT learns to optimize the same flow-matching (or denoising) loss given different modalities of the same concept, enabling a more well-rounded understanding that facilitates smoother capability transfer between tasks.