Diffusion models and fixing catastrophic neglect
Some deep learning-based systems screw up on a fairly regular basis. On Monday I was introduced to an example called “catastrophic neglect & incorrect attribute binding”. This was at Controlling Diffusion Models, a presentation by Sayak Paul of Hugging Face at the UCL Centre for Artificial Intelligence.
The problem is: you ask for an image (say a lion wearing a crown), and you get no crown. Or you ask for a yellow bow on a brown bench, but the colours are swapped or applied to both objects. The first failure is the catastrophic neglect; the second is the incorrect attribute binding. What’s going on?
If you look at the cross-attention maps at each denoising step, you see low attention values for the neglected token:
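To make that concrete, here is a toy sketch (mine, not from the talk) of what “low attention” means. Cross-attention gives each image location a distribution over prompt tokens; a neglected token is one whose column never wins much mass anywhere. The shapes and the 0.3 threshold are purely illustrative:

```python
import torch

# Toy illustration (not the real Stable Diffusion internals): cross-attention
# assigns each image location a distribution over the prompt tokens.
n_pixels, n_tokens, d = 16 * 16, 8, 64
queries = torch.randn(n_pixels, d)   # image-patch queries
keys = torch.randn(n_tokens, d)      # text-token keys
attn = torch.softmax(queries @ keys.T / d ** 0.5, dim=-1)  # (pixels, tokens)

# Strongest response each token gets anywhere in the image.
per_token_max = attn.max(dim=0).values
# Illustrative threshold: tokens below it are candidates for neglect.
neglected = (per_token_max < 0.3).nonzero().flatten()
print("possibly neglected token indices:", neglected.tolist())
```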
The solution, “generative semantic nursing”, is to ensure every subject token is attended to somewhere in the image: at each denoising step, a loss is computed from the attention maps and the latent is nudged along its gradient. Essentially, find a way to strengthen the neglected parts:
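A minimal sketch of one such nursing step, following the paper’s idea. Here `get_attention_maps` is a hypothetical differentiable hook into the UNet’s cross-attention (the real implementation registers attention processors for this), and the step size is just a plausible placeholder:

```python
import torch

def nursing_update(latents, get_attention_maps, token_indices, step_size=20.0):
    """One generative semantic nursing step (a sketch, not the reference code).

    get_attention_maps: hypothetical differentiable map from the current latent
        to averaged cross-attention maps of shape (H*W, n_tokens).
    token_indices: positions of the subject tokens that must not be neglected.
    """
    latents = latents.detach().requires_grad_(True)
    attn = get_attention_maps(latents)                 # (H*W, n_tokens)
    # A token's "presence" is its strongest attention value anywhere.
    max_per_token = attn[:, token_indices].max(dim=0).values
    # Optimise the single most neglected token, as Attend-and-Excite does.
    loss = (1.0 - max_per_token).max()
    # Shift the latent to strengthen that token before the next denoising step.
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()
```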
Looks like a neat hack, and may be one reason why we start to get more reliable content generation.
The paper behind this (which I’ve not yet read) is: Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al. (2023).
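As it happens, diffusers (the library Sayak works on) ships the technique as StableDiffusionAttendAndExcitePipeline. Something along these lines should reproduce the lion example (parameter names as in the diffusers docs at the time of writing; details may drift between versions):

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a lion wearing a crown"
# pipe.get_indices(prompt) prints the token-to-index mapping for the prompt.
print(pipe.get_indices(prompt))
token_indices = [2, 5]  # "lion" and "crown" -- verify against get_indices

generator = torch.Generator("cuda").manual_seed(0)
image = pipe(
    prompt=prompt,
    token_indices=token_indices,       # tokens to "excite"
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,              # nursing only in the early steps
    generator=generator,
).images[0]
image.save("lion_with_crown.png")
```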