ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models

Qin Zhou1, Zhiyang Zhang1, Jinglong Wang1, Xiaobin Li1, Jing Zhang1*, Member, IEEE,
Qian Yu1, Lu Sheng1, Dong Xu2, Fellow, IEEE,
1Beihang University, 2The University of Hong Kong
*Corresponding Author
Main Comparison with Baseline

Pixel-text alignment heatmaps generated by our proposed ELBO-T2IAlign and the state-of-the-art diffusion-based segmentation methods

Abstract

Diffusion models excel at image generation. Recent studies have shown that these models not only generate high-quality images but also encode text-image alignment information through attention maps or loss functions. This information is valuable for various downstream tasks, including segmentation, text-guided image editing, and compositional image generation. However, current methods heavily rely on the assumption of perfect text-image alignment in diffusion models, an assumption that often does not hold in practice.

In this paper, we propose using zero-shot referring image segmentation as a proxy task to evaluate the pixel-level image and class-level text alignment of popular diffusion models. We conduct an in-depth analysis of pixel-text misalignment in diffusion models from the perspective of training data bias. We find that misalignment occurs in images with small-sized, occluded, or rare object classes. Therefore, we propose ELBO-T2IAlign, a simple yet effective method to calibrate pixel-text alignment in diffusion models based on the evidence lower bound (ELBO) of likelihood. Our method is training-free and generic: it eliminates the need to identify the specific cause of misalignment and works well across various diffusion model architectures. Extensive experiments on commonly used image segmentation and generation benchmarks verify the effectiveness of our proposed calibration approach.

Methodology


Pipeline

Pipeline of ELBO-T2IAlign. Given a pre-trained frozen diffusion model, we first approximate a rough pixel-text alignment from the cross-attention maps. We then compute the ELBO of the likelihood \( p_\theta(x|c_i) \) for each class \( c_i \) and define an ELBO-based alignment score, which is used to calibrate the cross-attention maps. Segmentation masks are generated by applying a threshold to \( p_\theta(c_i|x_k) \).
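The calibration step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact formulation: the softmax-over-ELBO alignment score, the per-pixel normalization, and the function names are assumptions made for clarity.

```python
import numpy as np

def elbo_alignment_scores(elbo_per_class):
    """Turn per-class ELBO estimates of log p(x|c_i) into relative alignment
    scores via a softmax (illustrative choice, not the paper's exact formula)."""
    e = np.asarray(elbo_per_class, dtype=np.float64)
    e = np.exp(e - e.max())  # subtract max for numerical stability
    return e / e.sum()

def calibrate_attention(attn_maps, elbo_per_class, threshold=0.5):
    """attn_maps:      (C, H, W) cross-attention maps, one per class prompt c_i
    elbo_per_class: (C,)      ELBO estimates of p(x|c_i), e.g. from the
                              diffusion model's denoising loss.
    Returns per-pixel class probabilities p(c_i|x_k) and thresholded masks."""
    scores = elbo_alignment_scores(elbo_per_class)
    # Reweight each class map by its alignment score, then normalize per pixel
    weighted = attn_maps * scores[:, None, None]
    probs = weighted / (weighted.sum(axis=0, keepdims=True) + 1e-8)
    masks = probs > threshold
    return probs, masks
```

Classes with a low ELBO (e.g. small or rare objects) are down-weighted relative to their inflated attention responses, which is the intuition behind using the likelihood bound as a calibration signal.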

Better Segmentation


Below are qualitative results for zero-shot referring image segmentation. Our proposed ELBO-T2IAlign achieves better segmentation results than previous methods.


Segmentation results

Better Editing


For editing, our method extracts entities from the source text and generates calibrated heatmaps for them; these heatmaps are scaled and used to replace the target cross-attention maps, enabling more accurate generation and editing.

Downstream Task Example 1

Comparison results of image editing based on PTP before and after calibration using our ELBO-T2IAlign.
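The attention-swap step can be sketched as follows. This is a hypothetical Prompt-to-Prompt-style injection written for illustration; the energy-matching normalization and all names here are assumptions, not the paper's exact procedure.

```python
import numpy as np

def inject_calibrated_maps(target_attn, source_heatmaps, entity_token_ids, scale=1.0):
    """Replace the cross-attention map of each entity token in the target prompt
    with its calibrated source heatmap, rescaled to the target map's magnitude.

    target_attn:     (T, H, W) cross-attention maps, one per target token
    source_heatmaps: dict mapping token index -> (H, W) calibrated heatmap
    entity_token_ids: indices of the entity tokens to overwrite
    """
    out = target_attn.copy()
    for tok in entity_token_ids:
        heat = source_heatmaps[tok]
        # Match the total attention mass of the original map so the injection
        # changes the spatial layout without changing the overall scale.
        norm = out[tok].sum() / (heat.sum() + 1e-8)
        out[tok] = scale * norm * heat
    return out
```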

Better Generation


For generation, our method uses the alignment scores to balance the semantics of entities during sampling, enhancing text-image consistency.

Downstream Task Example 2

Qualitative compositional generation results before and after calibration using our ELBO-T2IAlign.
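One simple way to realize this balancing, sketched below under our own assumptions (the inverse-score weighting and normalization are illustrative, not the paper's exact rule), is to boost the attention logits of entities with low alignment scores so that under-expressed concepts are not dominated during generation.

```python
import numpy as np

def rebalance_entity_logits(attn_logits, alignment_scores, tau=1.0):
    """attn_logits:      (E, H, W) pre-softmax attention logits, one per entity
    alignment_scores: (E,)      scores in (0, 1]; low score = poorly aligned
    tau:              sharpness of the rebalancing (illustrative parameter)
    """
    scores = np.clip(np.asarray(alignment_scores, dtype=np.float64), 1e-6, None)
    # Inverse weighting: weakly aligned entities receive a larger boost
    weights = (1.0 / scores) ** tau
    weights = weights / weights.mean()  # keep the overall logit scale comparable
    return attn_logits * weights[:, None, None]
```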

BibTeX

@article{zhou2025elbo,
  title={ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models},
  author={Zhou, Qin and Zhang, Zhiyang and Wang, Jinglong and Li, Xiaobin and Zhang, Jing and Yu, Qian and Sheng, Lu and Xu, Dong},
  journal={arXiv preprint arXiv:2506.09740},
  year={2025}
}