Kevin Li

Kevin (Yu-Teng) Li

I am a Senior Applied Research Scientist at Adobe, working on visual foundation model training and multimodal research. My current focus is on building better video generation models and data.

Recently, I co-led the multimodal pretraining of Firefly Image 5, released in preview at Adobe MAX 2025, which enables instruction editing, character-reference generation, layered generation, and text-to-image at 2K resolution.

Previously, I graduated from the University of California, Berkeley with a B.S. in Electrical Engineering and Computer Sciences, where I conducted research on active learning under the supervision of Trevor Darrell.

I have also served as a reviewer for CVPR, ECCV, NeurIPS and ICLR.

Email / CV / Twitter / LinkedIn / Google Scholar

Industry Projects

Firefly Image 5
Multimodal pretraining (textual editing, character-reference generation, layer generation) October 2025

I co-led the training of the Firefly Image 5 model for multimodal workflows such as instruction editing, character-reference generation, and layered editing. Throughout model development, I led ablation studies on architecture and data combinations and drove decisions on the final production model's training recipe, scaling to ~1000-GPU distributed training on a daily basis (July-Oct 2025).

Firefly Image 4
Foundation model pretraining & post-training April 2025

I was a core member of the foundation model training team. I developed recipes for synthetic data handling and aesthetics fine-tuning (SFT), as well as sampling improvements. Firefly Image 4 is, as of Oct 2025, one of the most advanced T2I models in the industry, ahead of competitors such as Qwen-Image, Runway Gen-4 Image, and Luma Photon in the "General & Photorealistic" category on Text-to-Image Arena.

Firefly Image 3 Custom Models
August 2024

I led the personalization effort for Firefly Image 3, which enables generation of copyrighted content for Adobe's enterprise customers. I developed the training recipe (e.g. improved optimizer memory efficiency and DreamBooth's stability with a VLM-predicted superclass) and integrated the fine-tuning pipeline into production.

Research

UniFusion: Vision-Language Model as Unified Encoder for Image Generation and Editing
Yu-Teng Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale
ICLR Workshop on Multimodal Intelligence, 2026

UniFusion challenges the premise that semantic embeddings are insufficient for pixel generation tasks, and is the first architecture that uses only a VLM as a unified semantic encoder, without inputs from a VAE or CLIP, to do editing. The unified encoder framework enables competitive generation and editing that rivals Flux.1[dev] and Bagel, with emergent capabilities such as zero-shot multi-ref generation when trained on single-ref pairs.

Towards Text-Guided Attribute-Disentangled Multimodal Representation Learning
Yibing Wei, Sudeep Katakol, Manuel Brack, Jinhong Lin, Haoyue Bai, Yu-Teng Li, Richard Zhang, Eli Shechtman, Hareesh Ravi, Ajinkya Kale
CVPR, 2026

We formulated Queryable Attribute Representation Extraction (QARE) and introduced the first benchmark for attribute-disentangled multimodal embeddings. We proposed a training-free method that substantially outperforms SigLIP and contrastively trained methods such as VLM2Vec on attribute-conditioned retrieval.

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski
arXiv, 2026

Conditioning diffusion models on MLLMs that jointly encode text and images better harmonizes identities in subject-driven generation and reduces copy-paste artifacts. We introduced a Dual Layer Aggregation (DLA) module to fuse multi-layer MLLM features and a multi-stage denoising strategy that balances MLLM semantics with details from the VAE.

Hyperbolic Active Learning for Semantic Segmentation under Domain Shift
Luca Franco, Paolo Mandica, Konstantinos Kallidromitis, Devin Guillory, Yu-Teng Li, Trevor Darrell, Fabio Galasso
ICML, 2024

HALO introduces a hyperbolic neural network approach to pixel-based active learning (AL) for semantic segmentation, and is the first AL method to surpass the performance of the fully supervised baseline on synthetic-to-real domain adaptation benchmarks such as GTAV → Cityscapes, with just 1% of labeled pixels.

Neighboring State-based Exploration for Reinforcement Learning
Yu-Teng Li, Justin Lin, Jeffery Cheng, Pedro Pachuca
arXiv, 2022

Inspired by the adversarial attack literature, we proposed ρ-explore, a simple but effective on-policy exploration method that surveys a bounded region of nearby states during early training of an agent. Our method consistently outperforms the Double DQN baseline by 49% in reward return in discrete environments [code].

Teaching

CS 182/282A Deep Neural Networks | UC Berkeley
Head Teaching Assistant of Discussions

I led the curriculum design of weekly discussion sections in the Deep Learning course at UC Berkeley, with 300+ graduate & undergraduate students in Spring 2023. I also designed various exam and homework questions on denoising diffusion models (DDPM), Transformers, and more.

Website source code from here.