There is a new trend in AI: text-to-image generators. Give these programs any text you want and they will generate remarkably accurate images that fit that description. They can match a range of styles, from oil paintings to CGI renderings and even photographs, and – although it sounds cliché – in many ways the only limit is your imagination.
To date, the leader in the field has been DALL-E, a program created by the commercial AI lab OpenAI (and updated just this April). Yesterday, however, Google announced its own take on the genre, Imagen, and it may have just dethroned DALL-E in the quality of its output.
The best way to understand the amazing capabilities of these models is to simply take a look at some of the images they can generate. There are some generated by Imagen above and several more below (you can see more examples on Google’s dedicated landing page).
In each case, the text at the bottom of the image was the prompt entered into the program, and the image above it was the output. Just to emphasize: that's all it takes. You type what you want to see and the program generates it. Pretty fantastic, right?
But while these pictures are undeniably impressive in their coherence and accuracy, they should also be taken with a grain of salt. When research teams like Google Brain release a new AI model, they tend to cherry-pick the best results. So while these pictures all look perfectly polished, they may not represent the average output of the Imagen system.
Often, images generated by text-to-image models look unfinished, smeared, or blurry — issues we've seen with images generated by OpenAI's DALL-E program. (For more information on the problem spots for text-to-image systems, check out this interesting Twitter thread that dives into trouble with DALL-E. Among other things, it highlights the system's tendency to misunderstand prompts and to struggle with both text and faces.)
However, Google claims that Imagen consistently produces better images than DALL-E 2, based on a new benchmark it created for this project called DrawBench.
DrawBench isn't a particularly complex benchmark: it's essentially a list of some 200 text prompts that the Google team fed into Imagen and other text-to-image generators, with the output of each program then judged by human evaluators. As the charts below show, Google found that people generally preferred Imagen's output over that of its rivals.
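The mechanics of such a benchmark are simple enough to sketch in a few lines. This is not Google's actual evaluation code, just an illustrative tally (with hypothetical votes) of how pairwise human-preference judgements become the percentages shown on the charts:

```python
from collections import Counter

def preference_rates(judgements):
    """Tally human judgements into per-model preference rates.

    `judgements` is a list of model names, one entry per (prompt, evaluator)
    vote, each naming the model whose output the evaluator preferred.
    """
    counts = Counter(judgements)
    total = sum(counts.values())
    return {model: count / total for model, count in counts.items()}

# Hypothetical votes over a handful of DrawBench-style prompts:
votes = ["Imagen", "Imagen", "DALL-E 2", "Imagen"]
print(preference_rates(votes))  # {'Imagen': 0.75, 'DALL-E 2': 0.25}
```

In the real study, each of the roughly 200 prompts is judged by multiple evaluators, but the aggregation reduces to the same kind of tally.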
However, it will be difficult to judge this for ourselves, as Google is not making the Imagen model available to the public. There's a good reason for that, too. While text-to-image models certainly have fantastic creative potential, they also have a range of troubling uses. Imagine a system that generates just about any image you want being used, for example, for fake news, hoaxes, or harassment. As Google points out, these systems also encode social biases, and their output is often racist, sexist, or toxic in some other inventive way.
Much of this is due to the way these systems are built. Essentially, they are trained on huge amounts of data (in this case: lots of pairs of images and captions), which they study for patterns and learn to replicate. But these models require an awful lot of data, and most researchers — even those working for well-funded tech giants like Google — have decided it's too onerous to filter these inputs completely. So they scrape massive amounts of data from the web, and as a result their models absorb (and learn to replicate) all the hateful content you'd expect to find online.
As the researchers at Google summarize the problem in their paper: "[T]he large-scale data requirements of text-to-image models […] have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets […] Dataset audits have shown that these datasets tend to reflect social stereotypes, oppressive viewpoints, and derogatory or otherwise harmful associations with marginalized identity groups."
In other words, the age-old adage of computer scientists still applies in the fast-paced world of AI: garbage in, garbage out.
Google won't go into too much detail about the disturbing content generated by Imagen, but notes that the model "encodes several social biases and stereotypes, including a general preference for generating images of people with lighter skin tones and a propensity for images aligning different professions with Western gender stereotypes."
This is something researchers have also found when evaluating DALL-E. Ask DALL-E to generate images of, for example, a ‘stewardess’, and almost all subjects will be women. Ask for photos of a ‘CEO’ and, surprise, surprise, you get a bunch of white men.
For this reason, OpenAI has also decided not to release DALL-E publicly, though the company does give access to select beta testers. It also filters certain text inputs in an attempt to prevent the model from being used to generate racist, violent, or pornographic images. These measures somewhat limit potentially harmful uses of the technology, but the history of AI tells us that such text-to-image models will almost certainly become public at some point in the future, with all the troubling implications that wider access entails.
Google's own conclusion is that Imagen "isn't fit for public use right now," and the company says it plans to develop a new way to benchmark "social and cultural biases in future work" and to test future iterations. For now, though, we'll have to be content with the company's cheerful selection of images: raccoon kings and cacti wearing sunglasses. That, however, is just the tip of the iceberg — the iceberg made of the unintended consequences of technological research, should Imagen ever want to start generating that.