ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models

Qin Zhou1, Zhiyang Zhang1, Jinglong Wang1, Xiaobin Li1, Jing Zhang1*, Member, IEEE,
Qian Yu1, Lu Sheng1, Dong Xu2, Fellow, IEEE,
1Beihang University, 2The University of Hong Kong
*Corresponding Author
Main Comparison with Baseline

Pixel-text alignment heatmaps generated by our proposed ELBO-T2IAlign and the state-of-the-art diffusion-based segmentation methods

Abstract

Diffusion models excel at image generation. Recent studies have shown that these models not only generate high-quality images but also encode text-image alignment information through attention maps or loss functions. This information is valuable for various downstream tasks, including segmentation, text-guided image editing, and compositional image generation. However, current methods heavily rely on the assumption of perfect text-image alignment in diffusion models, an assumption that often does not hold in practice.

In this paper, we propose using zero-shot referring image segmentation as a proxy task to evaluate the pixel-level image and class-level text alignment of popular diffusion models. We conduct an in-depth analysis of pixel-text misalignment in diffusion models from the perspective of training data bias. We find that misalignment occurs in images with small-sized, occluded, or rare object classes. Therefore, we propose ELBO-T2IAlign, a simple yet effective method to calibrate pixel-text alignment in diffusion models based on the evidence lower bound (ELBO) of likelihood. Our method is training-free and generic: it eliminates the need to identify the specific cause of misalignment and works well across various diffusion model architectures. Extensive experiments on commonly used image segmentation and generation benchmarks verify the effectiveness of our proposed calibration approach.

Methodology


Pipeline

Pipeline of ELBO-T2IAlign. Given a pre-trained frozen diffusion model, we first approximate a rough pixel-text alignment from the cross-attention maps. We then compute the ELBO of the likelihood \( p_\theta(x|c_i) \) for each class \( c_i \) and define an ELBO-based alignment score, which is used to calibrate the cross-attention maps. Segmentation masks are generated by applying a threshold to \( p_\theta(c_i|x_k) \).
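The calibration step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact formulation: the softmax-over-ELBO alignment score, the per-pixel normalization, and the function names are assumptions made for clarity.

```python
import numpy as np

def elbo_alignment_scores(elbo_per_class):
    """Turn per-class ELBO estimates of log p(x|c_i) into relative alignment
    scores via a softmax (illustrative choice, not the paper's exact formula)."""
    e = np.asarray(elbo_per_class, dtype=np.float64)
    e = np.exp(e - e.max())  # subtract max for numerical stability
    return e / e.sum()

def calibrate_attention(attn_maps, elbo_per_class, threshold=0.5):
    """attn_maps:      (C, H, W) cross-attention maps, one per class prompt c_i
    elbo_per_class: (C,)      ELBO estimates of p(x|c_i), e.g. from the
                              diffusion model's denoising loss.
    Returns per-pixel class probabilities p(c_i|x_k) and thresholded masks."""
    scores = elbo_alignment_scores(elbo_per_class)
    # Reweight each class map by its alignment score, then normalize per pixel
    weighted = attn_maps * scores[:, None, None]
    probs = weighted / (weighted.sum(axis=0, keepdims=True) + 1e-8)
    masks = probs > threshold
    return probs, masks
```

Classes with a low ELBO (e.g. small or rare objects) are down-weighted relative to their inflated attention responses, which is the intuition behind using the likelihood bound as a calibration signal.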

Better Segmentation


Below are qualitative results for zero-shot referring image segmentation. Our proposed ELBO-T2IAlign achieves better segmentation results than previous methods.


Segmentation results

Better Editing


For editing, our method extracts entities from the source text and generates calibrated heatmaps for them; these heatmaps are scaled and used to replace the target cross-attention maps, enabling more accurate generation and editing.

Downstream Task Example 1

Comparison results of image editing based on PTP before and after calibration using our ELBO-T2IAlign.
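The attention-swap step can be sketched as follows. This is a hypothetical Prompt-to-Prompt-style injection written for illustration; the energy-matching normalization and all names here are assumptions, not the paper's exact procedure.

```python
import numpy as np

def inject_calibrated_maps(target_attn, source_heatmaps, entity_token_ids, scale=1.0):
    """Replace the cross-attention map of each entity token in the target prompt
    with its calibrated source heatmap, rescaled to the target map's magnitude.

    target_attn:     (T, H, W) cross-attention maps, one per target token
    source_heatmaps: dict mapping token index -> (H, W) calibrated heatmap
    entity_token_ids: indices of the entity tokens to overwrite
    """
    out = target_attn.copy()
    for tok in entity_token_ids:
        heat = source_heatmaps[tok]
        # Match the total attention mass of the original map so the injection
        # changes the spatial layout without changing the overall scale.
        norm = out[tok].sum() / (heat.sum() + 1e-8)
        out[tok] = scale * norm * heat
    return out
```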

Better Generation


For generation, our method uses the alignment scores to balance the semantics of entities during sampling, enhancing text-image consistency.

Downstream Task Example 2

Qualitative compositional generation results before and after calibration using our ELBO-T2IAlign.
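One simple way to realize this balancing, sketched below under our own assumptions (the inverse-score weighting and normalization are illustrative, not the paper's exact rule), is to boost the attention logits of entities with low alignment scores so that under-expressed concepts are not dominated during generation.

```python
import numpy as np

def rebalance_entity_logits(attn_logits, alignment_scores, tau=1.0):
    """attn_logits:      (E, H, W) pre-softmax attention logits, one per entity
    alignment_scores: (E,)      scores in (0, 1]; low score = poorly aligned
    tau:              sharpness of the rebalancing (illustrative parameter)
    """
    scores = np.clip(np.asarray(alignment_scores, dtype=np.float64), 1e-6, None)
    # Inverse weighting: weakly aligned entities receive a larger boost
    weights = (1.0 / scores) ** tau
    weights = weights / weights.mean()  # keep the overall logit scale comparable
    return attn_logits * weights[:, None, None]
```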

BibTeX

@article{zhou2025elbo,
  title={ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models},
  author={Zhou, Qin and Zhang, Zhiyang and Wang, Jinglong and Li, Xiaobin and Zhang, Jing and Yu, Qian and Sheng, Lu and Xu, Dong},
  journal={arXiv preprint arXiv:2506.09740},
  year={2025}
}