Recent research published in PNAS Nexus suggests that while artificial intelligence chatbots can match or exceed human creativity on individual tasks, they produce strikingly similar responses to one another. The findings provide evidence that widespread reliance on artificial intelligence in creative work could lead to a loss of unique ideas.
Researchers Emily Wenger and Yoed N. Kenett designed the study to understand how large language models affect the diversity of human thought. Large language models are the technology behind popular AI chatbots, which predict and generate text in response to user prompts.
Large language models are complex computer programs designed to process and generate human language. Developers build these systems by training them on billions of sentences from books, articles, and websites. By analyzing this vast amount of text, the model learns mathematical patterns and relationships between words.
When a user prompts a chatbot, the model calculates the most likely next word in the sequence, building its response one word at a time from the associations learned during training. Wenger suspected that this overlap in training data and methods across different systems could cause broader problems.
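To make the idea concrete, here is a minimal toy sketch of next-word prediction. It is not how any production chatbot is implemented: real LLMs score subword tokens with a large neural network, and the vocabulary and scores below are invented purely for illustration.

```python
import math
import random

# Toy next-word predictor. Real LLMs compute these scores with a neural
# network over subword tokens; this vocabulary and these raw scores
# ("logits") are made up for demonstration.
def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "mat", "sat"]
logits = [2.0, 1.0, 0.5, 3.0]  # hypothetical scores for each candidate word

# Convert scores to probabilities, then sample the next word.
probs = softmax(logits)
next_word = random.choices(vocab, weights=probs, k=1)[0]
print(dict(zip(vocab, [round(p, 3) for p in probs])), "->", next_word)
```

Repeating this loop, appending each sampled word to the prompt, is the basic generation process the article describes.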
“Most of today’s LLMs are trained on large datasets of scraped Internet data, which means that functionally, all LLMs are trained on roughly the same data,” said Wenger, the Kew Family Assistant Professor at Duke University. “Traditional machine learning research shows that training models on the same dataset produces models with similar properties. I wondered whether this phenomenon occurs in commercial LLMs, and what the implications might be.”
To investigate this, the researchers recruited 102 human participants through Prolific, an online platform for research studies. They screened the participants to weed out bots and ensured that everyone passed a basic attention check. They also selected 22 different language models from various companies, including well-known chatbots created by Google, Meta, and OpenAI.
Both humans and language models completed three standard tests of verbal creativity. The first was the Alternative Uses Task, which asks participants to list as many creative uses as possible for everyday objects such as a fork, a book, and a pair of pants. This assessment tests divergent thinking, the ability to generate multiple unique solutions to a single problem.
The second assessment was the Forward Flow task, which measures associative thinking. Participants receive a starting word, such as “snow” or “candle,” and must produce a chain of up to 20 words, each following naturally from the one before it. Associative thinking helps individuals search their memories and combine different concepts to generate new ideas.
The final assessment was the Divergent Association Task. In this exercise, participants had to generate 10 nouns that were as unrelated to each other as possible. Generating unrelated words demonstrates cognitive flexibility, which is strongly linked to human creative ability.
The scientists then evaluated the responses using computerized text-analysis tools. These tools embed words in a mathematical space and measure the semantic distance between them, quantifying how different words and concepts are from one another. The researchers measured both the originality of each individual answer and the overall variation among all answers within a group.
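The article does not specify the exact embedding model or scoring pipeline the researchers used, but a common way to quantify group-level diversity is the average pairwise semantic distance between responses. Here is a minimal sketch of that idea, with made-up 3-D vectors standing in for real word embeddings.

```python
import numpy as np

# Illustrative diversity measure: mean pairwise cosine distance between
# embedding vectors. The study's actual embedding model and metrics are
# not detailed in the article; these vectors are invented for demonstration.
def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def mean_pairwise_distance(vectors):
    dists = [cosine_distance(vectors[i], vectors[j])
             for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    return float(np.mean(dists))

# Two hypothetical response sets: a spread-out "human" group and a
# tightly clustered "model" group. Higher mean distance = more diverse.
human_embeddings = [np.array([1.0, 0.1, 0.0]),
                    np.array([0.0, 1.0, 0.3]),
                    np.array([0.2, 0.0, 1.0])]
model_embeddings = [np.array([1.0, 0.9, 0.8]),
                    np.array([0.9, 1.0, 0.9]),
                    np.array([0.8, 0.9, 1.0])]

print("human diversity:", round(mean_pairwise_distance(human_embeddings), 3))
print("model diversity:", round(mean_pairwise_distance(model_embeddings), 3))
```

With these toy numbers, the nearly orthogonal human set scores a much higher mean distance than the nearly parallel model set, which is the shape of the pattern the study reports.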
The researchers found that individual language models performed at or slightly above the average human level on most tasks. Taken one at a time, the chatbots’ answers were often highly creative. But when the scientists compared the responses from different models against each other, a pattern of similarity emerged.
Across all tasks, the models produced answers that were far more similar to one another than those provided by humans. The chatbots frequently drew on the same overlapping vocabulary, clustering their creative output into a homogeneous set. The similarity was even more pronounced when the researchers compared models built by the same company.
“My hypothesis was that there would be some homogeneity in LLM responses compared to humans, but I was surprised by the extent,” Wenger said.
Wenger and Kenett also tested whether they could force the models to be more diverse. They adjusted the models’ “temperature” setting, a parameter that controls the amount of randomness in the text-generation process. Lower temperatures produce more predictable text, while higher temperatures lead to more random word choices.
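A brief sketch of how temperature works in practice: the model’s raw scores are divided by the temperature before being converted to probabilities, so low temperatures sharpen the distribution and high temperatures flatten it. The scores below are invented for illustration.

```python
import math

# Temperature scaling: divide raw scores ("logits") by T before softmax.
# Low T concentrates probability on the top word (predictable text);
# high T spreads probability more evenly (more random picks).
def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.5, 0.5]  # hypothetical scores for three candidate words
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running this shows the top-scoring word absorbing nearly all the probability at T=0.2 and the distribution flattening out at T=2.0, mirroring the predictable-versus-random trade-off described above.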
Increasing the randomness did make responses more diverse, but the models soon began producing gibberish that no longer met the basic requirements of a creative prompt. True creativity requires ideas that are both novel and appropriate to the situation, so gibberish does not count as a successful creative output.
The researchers also tried changing the initial instructions given to the models, explicitly tasking the chatbots with acting as creative assistants and providing bold, out-of-the-box answers. Although this slightly improved individual originality, it did not resolve the broader homogeneity problem: the models still produced responses similar to one another.
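The article does not reproduce the researchers’ actual prompts, but this kind of intervention resembles setting a system prompt through a chat API. The sketch below uses the OpenAI Python client with a hypothetical prompt and a placeholder model name, just to show the general pattern.

```python
from openai import OpenAI  # requires an OPENAI_API_KEY in the environment

# Sketch of the prompt intervention described above. The exact wording the
# researchers used is not given in the article; this system prompt and the
# model name are illustrative placeholders only.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": ("You are a creative assistant. Give bold, "
                     "out-of-the-box answers and avoid common responses.")},
        {"role": "user",
         "content": "List five creative alternative uses for a fork."},
    ],
)
print(response.choices[0].message.content)
```

As the study found, this kind of instruction can nudge individual answers toward more originality, but it does not stop different models from converging on similar ideas.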
These findings suggest that relying on generative AI for brainstorming and problem-solving may limit the scope of human creativity. If everyone uses these tools to write drafts and generate ideas, the pool of ideas circulating in society could narrow significantly.
“If you’re using an AI chatbot (built on an LLM) for creative tasks, know that the results you get from these models are likely to be very similar to the results other people get from AI chatbots, even if they use different chatbots than you do,” Wenger said. “If you want your content to be truly unique, you should probably avoid using an AI chatbot to generate it.”
The researchers note several caveats and limitations of their study. It measured performance only on specific linguistic creativity tasks, so the results may not apply to all forms of creative behavior. For example, language models may not show the same homogenization when asked to perform nonverbal tasks like drawing a picture or composing music.
Additionally, the scientists tested only commercially available models that have been tuned to follow strict safety and conversation guidelines. This safety training is known to influence model behavior in experimental settings. Raw, untuned models might display different creative properties, but most everyday users do not have access to those versions.
Future research will need to explore other aspects of creativity, such as fluency and flexibility, in addition to originality. Fluency refers to the sheer number of ideas generated, while flexibility refers to how many different categories those ideas span. The scientists also hope to investigate the extent of this homogenization in other types of artificial intelligence and to explore potential engineering solutions to the problem.
The study “Large Language Models Are Uniformly Creative” was authored by Emily Wenger and Yoed N. Kenett.

