Diffusion models and fixing catastrophic neglect
Some deep learning-based systems screw up on a fairly regular basis. On Monday I was introduced to an example called “catastrophic neglect & incorrect attribute binding”. This was at Controlling Diffusion Models, a presentation by Sayak Paul of Hugging Face at the UCL Centre for Artificial Intelligence.
The problem is: you ask for an image (say a lion wearing a crown), and you get no crown. Or you ask for a yellow bow on a brown bench, but the colours are swapped or applied to both objects. The first failure is the catastrophic neglect; the second is the incorrect attribute binding. What’s going on?
If you look at the cross-attention maps at each denoising step, you see low attention values for the neglected token:
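To make that concrete, here is a toy sketch (mine, not from the talk) of what “low attention” means. Cross-attention gives each image location a distribution over prompt tokens; a neglected token is one whose column never wins much mass anywhere. The shapes and the 0.3 threshold are purely illustrative:

```python
import torch

# Toy illustration (not the real Stable Diffusion internals): cross-attention
# assigns each image location a distribution over the prompt tokens.
n_pixels, n_tokens, d = 16 * 16, 8, 64
queries = torch.randn(n_pixels, d)   # image-patch queries
keys = torch.randn(n_tokens, d)      # text-token keys
attn = torch.softmax(queries @ keys.T / d ** 0.5, dim=-1)  # (pixels, tokens)

# Strongest response each token gets anywhere in the image.
per_token_max = attn.max(dim=0).values
# Illustrative threshold: tokens below it are candidates for neglect.
neglected = (per_token_max < 0.3).nonzero().flatten()
print("possibly neglected token indices:", neglected.tolist())
```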
The solution, “generative semantic nursing”, is to ensure every subject token is attended to somewhere in the image: at each denoising step, a loss is computed from the attention maps and the latent is nudged along its gradient. Essentially, find a way to strengthen the neglected parts:
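A minimal sketch of one such nursing step, following the paper’s idea. Here `get_attention_maps` is a hypothetical differentiable hook into the UNet’s cross-attention (the real implementation registers attention processors for this), and the step size is just a plausible placeholder:

```python
import torch

def nursing_update(latents, get_attention_maps, token_indices, step_size=20.0):
    """One generative semantic nursing step (a sketch, not the reference code).

    get_attention_maps: hypothetical differentiable map from the current latent
        to averaged cross-attention maps of shape (H*W, n_tokens).
    token_indices: positions of the subject tokens that must not be neglected.
    """
    latents = latents.detach().requires_grad_(True)
    attn = get_attention_maps(latents)                 # (H*W, n_tokens)
    # A token's "presence" is its strongest attention value anywhere.
    max_per_token = attn[:, token_indices].max(dim=0).values
    # Optimise the single most neglected token, as Attend-and-Excite does.
    loss = (1.0 - max_per_token).max()
    # Shift the latent to strengthen that token before the next denoising step.
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()
```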
Looks like a neat hack, and may be one reason why we start to get more reliable content generation.
The paper behind this (which I’ve not yet read) is: Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al. (2023).
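As it happens, diffusers (the library Sayak works on) ships the technique as StableDiffusionAttendAndExcitePipeline. Something along these lines should reproduce the lion example (parameter names as in the diffusers docs at the time of writing; details may drift between versions):

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a lion wearing a crown"
# pipe.get_indices(prompt) prints the token-to-index mapping for the prompt.
print(pipe.get_indices(prompt))
token_indices = [2, 5]  # "lion" and "crown" -- verify against get_indices

generator = torch.Generator("cuda").manual_seed(0)
image = pipe(
    prompt=prompt,
    token_indices=token_indices,       # tokens to "excite"
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,              # nursing only in the early steps
    generator=generator,
).images[0]
image.save("lion_with_crown.png")
```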