Scott Alexander’s A Guide To Asking Robots To Design Stained Glass Windows has some great examples of both the stunning images that OpenAI’s DALL-E-2 system can produce and some of the issues it still has. Some of the problems are fairly straightforward (text, for instance, will often come out as well-rendered gibberish), but the most interesting issue he runs into is that DALL-E’s content (what the image represents) isn’t orthogonal to its style (or medium). Mention a raven in the query and you’ll get an image that’s all goth and edgy. Ask for “hot air balloon, stained glass window” and you’ll get something that looks like someone’s first project in a continuing education art class. And if your content strays too far from what you typically see in stained glass windows, you’ll instead get an image that happens to have a stained glass window in the background.
He concludes:
DALL-E is clearly capable of incredible work. That having been said, I mostly couldn’t access it. Although it can make stunning stained glass of traditional stained-glass subjects, trying to get it to do anything at all unusual degrades the style until you end up picking the best of a bad lot and being grateful for it. You can try to recover the style by adding more and more stylistic cues, but it’s really hit-or-miss, and you can never be sure you’ll get two scenes depicted in the same (or even remotely similar) styles, which kind of sinks my plan for a twelve-part window.
Honestly this isn’t too surprising. DALL-E-2 uses a subsystem called CLIP, which was trained on about 400 million image-text pairs scraped from the Internet to predict whether or not a given text string co-occurs with a particular image. Scott would ideally like something that takes an image request like “a stained glass window depicting Charles Darwin studying finches” and produces the requested image (prescriptive text → desired image), but what DALL-E-2 really does is take a sample caption and produce images that are likely to have that caption text (descriptive text → described image). That’s kind of the same task, but not quite. In a prescriptive system you really want to be able to specify style and content separately, regardless of any biases in the original training data. But for the prediction task that CLIP is actually trained on, that orthogonality would be a disadvantage. (The usefulness of captions for requests will also probably vary across genres, since stock image databases like Shutterstock use captions that are already hand-tuned for search engines, with text like “Icebergs under the Northern Lights”, while Flickr captions are more often titles like “Off Into The Sunset”.)
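To make that “descriptive text → described image” framing concrete, here’s a rough sketch of how the open-source CLIP release is typically used at inference time (the filename and candidate captions below are mine, and DALL-E-2’s internal use of CLIP is more involved than this): you hand it one image and several candidate captions, and it scores how likely each caption is to co-occur with the image. Nothing in that interface lets you specify content and style independently.

```python
import torch
import clip  # the openai/CLIP package
from PIL import Image

# Load a public CLIP model; image and text encoders share an embedding space.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One image, several candidate captions: CLIP's job is to say which caption
# most plausibly co-occurs with the image, not to generate anything.
image = preprocess(Image.open("stained_glass.jpg")).unsqueeze(0).to(device)
captions = [
    "a stained glass window depicting Charles Darwin studying finches",
    "a raven perched on a gothic cathedral window",
    "a hot air balloon over a mountain valley",
]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # Similarity is a temperature-scaled dot product of normalized embeddings.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for caption, p in zip(captions, probs[0]):
    print(f"{p:.3f}  {caption}")
```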
I expect this problem can be at least partially fixed with additional training, especially if OpenAI can get examples that specifically disambiguate content from style (perhaps even from DALL-E-2 usage data itself). In the meantime there’s a whole community endeavor to discover so-called prompt engineering tips and tricks to coax the system into doing what you want. (I especially love that you can improve the quality of the resulting image by adding the words “trending on ArtStation” to your query.)
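As a toy sketch of what that prompt engineering amounts to in practice (the particular cue list here is just an illustration I made up, not anything OpenAI documents), you end up mechanically bolting stylistic modifiers onto the content you actually care about and hoping the caption-prediction machinery picks up on them:

```python
# Illustrative style cues the community tends to stack onto prompts;
# these are examples, not an official list from OpenAI.
STYLE_CUES = [
    "stained glass window",
    "intricate leadwork",
    "trending on ArtStation",
]

def build_prompt(content: str, cues: list[str] = STYLE_CUES) -> str:
    """Append comma-separated stylistic cues to the content description."""
    return ", ".join([content, *cues])

print(build_prompt("Charles Darwin studying finches"))
# Charles Darwin studying finches, stained glass window, intricate leadwork,
# trending on ArtStation
```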