Leah’s PhD: Working with Word Clouds

Word Cloud 14072017

I’ve never liked word clouds. They look a bit tacky, many of the articles they’re thrown into aren’t much enhanced by their presence, and they always make me think of a beginner’s quilting club where none of the participants actually knows what quilting is.

And then I made my own word cloud. See above.

Rest assured, dear reader, that this word cloud is in no way going to pop up in my final thesis. It is not the result of much academic deliberation, but is rather the result of me testing all of the available tools (even the tacky ones) for quantitative literary analysis.

This word cloud was generated using the body text (titles, bylines, and long-form quotations omitted) of the news articles listed below, which I found by searching ‘NaNoGenMo [National Novel Generation Month]’ in Google News on July 14, 2017. The word cloud generator pulled from a corpus of 10,701 words, but from these words it ignored stop words like ‘the’ and ‘and’.

November 11, 2014: https://www.theguardian.com/books/2014/nov/11/can-computers-write-fiction-artificial-intelligence
November 9, 2014: https://singularityhub.com/2014/11/09/computers-are-writing-novels-but-do-you-really-want-to-read-them
November 25, 2014: https://www.theverge.com/2014/11/25/7276157/nanogenmo-robot-author-novel
December 3, 2014: https://www.theatlantic.com/technology/archive/2014/12/moby-dick-in-50000-meows-and-other-tales-that-computers-tell/383340
December 16, 2014: https://www.washingtonpost.com/news/the-intersect/wp/2014/12/16/this-is-what-happens-when-a-bot-writes-an-article-about-journalism
January 22, 2015: http://www.bbc.com/culture/story/20150122-could-a-robot-write-a-novel
May 4, 2016: https://www.buzzfeed.com/alexkantrowitz/googles-artificial-intelligence-engine-reads-romance-novels
August 25, 2016: https://www.theverge.com/2016/8/25/12646462/twitter-bot-unchartedatlas-fantasy-map-generator
May 1, 2017: https://www.strategy-business.com/article/AI-Is-Already-Entertaining-You
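For anyone curious what the counting behind a word cloud looks like, here is a minimal sketch in Python. It is not how www.wordclouds.com works internally (that isn't documented in this post); the stop list, the tokenizing rule, and the sample sentence are all assumptions made purely for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop list; the real generator's list is unknown.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it", "that"}

def count_words(text, stop_words=STOP_WORDS):
    """Count word frequencies, skipping stop words (compared case-insensitively)."""
    # A word is a run of letters, optionally joined by internal apostrophes or hyphens.
    tokens = re.findall(r"[A-Za-z]+(?:['’-][A-Za-z]+)*", text)
    return Counter(t for t in tokens if t.lower() not in stop_words)

# Hypothetical sample text, not taken from the articles.
sample = "The computer can write a novel, and the human can read the novel."
counts = count_words(sample)

# Print the three most frequent words, count first.
for word, n in counts.most_common(3):
    print(n, word)
```

Swapping in the full article text for `sample` and a longer stop list would give counts comparable in spirit to the ones reported in this post, though the exact numbers depend on how the tool splits words.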

My hope when generating a word cloud of news article text was that words revealing underlying social attitudes towards natural language generation and computer-generated texts would rise to the surface. We often regard news agencies as simply reporting on world events, but writers’ biases are inevitable: writers choose what words to use for rhetorical effect, and they choose what content to include and omit to best make their points. Quite honestly, I was hoping for/anticipating words like ‘fear’, ‘excitement’, and ‘fantasy’ to pop up so that I’d have something interesting to use as a conversation starter at my next cocktail party.

No such luck.

Instead, I got words that are obviously related to National Novel Generation Month, like ‘Kazemi’ (NaNoGenMo is Internet artist Darius Kazemi’s brainchild), ‘NaNoGenMo’, and ‘novel’. I also got words that are frequently associated with anything related to computers, such as ‘computer’, ‘technology’, ‘algorithm’, and ‘program’. And then I got some words that relate to the program/content developers and consumers themselves, like ‘people’, ‘human’, ‘companies’, and ‘business’.

The problem with word clouds is that the comprising words are removed from their semantic contexts. It’s all well and good that the word ‘human’ pops up, but that doesn’t necessarily tell us very much. ‘Human’ can be applied to many contexts. For example, some uses of the word ‘human’ from the cited articles are included below. I’ve sorted them into rough categories, because I care about you.

Human as Adjective

  • ‘McCann Japan decided to pit its AI director against the human creative director to create a 30-second ad for Clorets gum.’
  • ‘The result is simple but effective, and on a quick read, perfectly human.’

Human Agency

  • ‘Ray Kurzweil has predicted that by 2029, computers will be able to outsmart even the cleverest human.’
  • ‘He also points out that the writers of the algorithms bring their own human experiences to bear on their coding, adding a necessarily human element.’

Human as Collaborator

  • ‘I’m actually excited about novels that are co-authored between humans and code.’
  • ‘The NaNoGenMo works require varying degrees of human input.’
  • ‘The purported trade-off between humans and machines may in fact be a set of synergies.’


  • ‘We turn to literature in part to deepen our understanding of the human condition.’
  • ‘Many companies start in the lower left quadrant — automating processes, often in human resources and finance.’

Looking at just these nine examples (there are 39 uses of ‘human’ in these articles, and eight of ‘humans’), we can see that the big letters the word cloud uses for ‘human’ (letter size represents the relative frequency of the word in the inputted corpus) actually don’t tell us much about what that word means in a conversation about NaNoGenMo. What the big letters do show when we take a closer look at how ‘human’ is used, though, is that articles’ authors feel inclined to distinguish human authors – presumably from computer authors – and that the discussion appears deeply rooted within conceptions of what humanity means.
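Putting a word back into its context, as in the examples above, is usually called a keyword-in-context (KWIC) listing. The sketch below is one simple way to produce it in Python; the function name, the character-based context window, and the bracket formatting are my own assumptions, and the sample text just stitches together two of the sentences quoted above.

```python
import re

def keyword_in_context(text, keyword, width=30):
    """Return each occurrence of keyword with `width` characters of context on each side."""
    lines = []
    # \b stops 'human' from matching inside 'humans' or 'humanity'.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}}[{m.group()}]{right}")
    return lines

sample = (
    "The NaNoGenMo works require varying degrees of human input. "
    "We turn to literature in part to deepen our understanding of the human condition."
)
hits = keyword_in_context(sample, "human")
for line in hits:
    print(line)
```

Sorting the resulting lines into categories like the ones above is still a job for a human reader; the listing only gathers the evidence in one place.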

Note that the only word larger than ‘human’ is ‘can’ (which appears 46 times throughout the articles). Note also that the word ‘will’ is the same size as ‘human’ (as it also appears 39 times). Some applications of these words within the cited articles follow.

Can

  • ‘AI can work alongside people to carry out tasks that are complex yet repeatable, thus generating powerful insights and freeing up time that professionals can use to make more intelligent decisions.’
  • ‘The AI can detect which sentences contain similar meanings and gain a more nuanced understanding of language.’
  • ‘How can we teach them to talk figuratively?’
  • ‘In building a computer that can write, we are exposing the computer within the writer.’

Will

  • ‘Even if one day the computer will pass muster at the level of the sentence, there is, on this evidence, no foreseeable way as yet that it will be able to construct a narrative that is both plausible and gripping.’
  • ‘But will they ever make anything resembling art?’
  • ‘But maybe art will forever remain a human endeavor, not because machines can’t master it – but because they just don’t give a damn.’
  • ‘But a hundred or so people are taking a very different approach to the challenge, writing computer programs that will write their texts for them.’

Given that these articles are all reporting current events, it makes sense that ‘can’ appears frequently. The authors, after all, need to make sure that their readers are all on the same page, and explaining what computers are currently capable of (e.g. ‘AI can work…’, ‘AI can detect…’) therefore seems a good use of time. Yet the use of ‘can’ also reveals skepticism: ‘How can we teach them to talk figuratively?’ and ‘In building a computer that can write…’ are both examples of the author looking into the future, applying the kind of blue-sky thinking typical of a PhD thesis.

A look at the uses of ‘will’ similarly reveals skepticism: ‘Even if one day the computer will pass…’; ‘But will they ever make anything resembling art?’ The authors ponder the future, not necessarily in fear, but with one critical eye looking at the present and the other looking towards potential developments in light of what is currently possible (the ‘cans’, if you will).

When I first saw ‘can’ and ‘will’ as some of the biggest words in my word cloud, I assumed they were there as a result of sensationalist journalistic forecasting, or as a result of either the extreme fear or the extreme optimism I’m often faced with when I converse with people about artificial intelligence. I was wrong. These articles, all from widely read sources, indicate that mass media outlets are, in fact, taking a somewhat calculated and critical stance on natural language generation and computer-generated texts. Sure, the comments sections of these articles may show that there are still indeed individuals with strong opinions about computational agency and creation, and sure, there are going to be more niche media outlets that adopt a more sensationalist stance. Nevertheless, it is reassuring to know that in an age of clickbait there are some thoughtful pieces being written about the present and future human condition.

What have I learned about word clouds from this exercise? They don’t completely suck.

I don’t have the patience or inclination to generate an academically sound word cloud (if one even exists), and I sure as heck don’t want anything like the visual assault that is the above word cloud anywhere near my final thesis. However, the basic quantitative literary analysis that a word cloud offers – Hey! These are the words most commonly used in what you’re reading! You should figure out why they’re being used so much! – has proven to be rather enlightening.

In short, word clouds can offer a good starting point to a critical analysis of literary texts and social attitudes. They may serve as useful tools when you feel stuck in a research rut, and they may be an alright addition to the kind of PowerPoint presentation that uses Comic Sans.

Give ‘er a go.

Note: The word cloud in this post was generated and customized using www.wordclouds.com, a free tool that allows you to generate word clouds in a bunch of fun shapes and colours. The word cloud was generated on July 14, 2017.

Also, for those of you who are interested, below is a list of all the words used ten or more times throughout the cited articles, along with the number of times the word is used (in descending order).

46           can
39           human
39           will
30           new
29           work
27           people
26           one
25           like
24           says
23           novels
22           companies
22           Kazemi
22           novel
21           words
20           computer
20           content
18           sentences
18           NaNoGenMo
18           program
18           media
18           text
17           technology
17           algorithms
17           business
17           write
17           many
17           just
17           make
16           creative
16           writing
16           data
16           word
16           said
14           create
14           story
14           It’s
14           it’s
14           use
13           intelligence
13           industry
13           machine
13           better
13           little
13           might
13           think
13           see
12           National
12           Novel
12           don’t
12           Month
12           must
12           way
11           something
11           company
11           Google
11           first
11           sense
11           every
11           code
11           also
11           two
11           E&M
11           may
11           get
10           algorithm
10           consumers
10           generate
10           quadrant
10           working
10           online
10           using
10           time
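One thing the list above makes visible is that the counts appear to be case-sensitive: ‘novel’ and ‘Novel’ get separate rows, as do ‘It’s’ and ‘it’s’. The sketch below shows, under that assumption, how such a ten-or-more listing could be built in Python and what folding case would change. The function name and the token list are hypothetical; the token counts are picked only to mimic a few rows of the list.

```python
from collections import Counter

def frequent_words(tokens, minimum=10, fold_case=False):
    """Return (count, word) rows for words used at least `minimum` times, most frequent first."""
    if fold_case:
        tokens = [t.lower() for t in tokens]
    counts = Counter(tokens)
    rows = [(n, w) for w, n in counts.items() if n >= minimum]
    # Sort by descending count; ties fall back to alphabetical order.
    return sorted(rows, key=lambda row: (-row[0], row[1]))

# Illustrative tokens only: 'robot' falls below the threshold and is dropped.
tokens = ["novel"] * 22 + ["Novel"] * 12 + ["quadrant"] * 10 + ["robot"] * 3

case_sensitive = frequent_words(tokens)
case_folded = frequent_words(tokens, fold_case=True)

for n, word in case_sensitive:
    print(f"{n:<12} {word}")
```

With case folding switched on, the two ‘novel’ rows merge into a single larger one, which would also change how big the word appears in a cloud. Whether that is desirable depends on whether ‘Novel’ in ‘National Novel Generation Month’ should count as the same word as ‘novel’.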

Sunday Book-Thought 53

Communication is the process by which culture is developed and maintained. For it is only when people develop language, and thus a way of communicating, that a culture can, in fact, emerge and be imparted. Information, the content of communications, is the basic source of all human intercourse. Over the course of human history, it has been embodied and communicated in an ever expanding variety of media, including among them spoken words, graphics, artifacts, music, dance, written text, film, recordings, and computer hardware and software. Together, these media and the channels through which they are distributed, constitute the web of society, which determine the direction and pace of social development. Seen from this perspective, the communication of information permeates the cultural environment and is essential to all aspects of social life.
– United States Office of Technology Assessment, Intellectual Property Rights in an Age of Electronics and Information (Washington: U.S. Government Printing Office, April 1986), p. 49.