Mapping word clusters through contextual word clouds

Cluster words based on co-occurrence to generate a contextual word cloud, providing a visual map of topics.

Written by Tomas Larsson

Keywords: contextual word cloud, word network, word co-occurrence

While a classic word cloud can be a useful way of getting an overview of the content of a text, it lacks information about context. A contextual word cloud clusters words by co-occurrence, giving information not only about how often words are used but also about how they are used. Seeing which words tend to be used together helps us understand what the text is about.

Technically, a contextual word cloud is a network in which nodes correspond to words and links correspond to the strength of the connections between those words. Connected words are pulled together, so that the position of a word, and not only its size, conveys meaning.

πŸ’‘ Quick tip: The fastest way of generating a semantic word cloud is to drag your text field to the "Word network" drop zone in the Bubble View. This triggers the Tokenize & Tag operation, filters the tokens by part-of-speech, aggregates them, and calculates links strengths between words. For full control over the parameters used in this sequence of operation, follow the steps below.

Step-by-step guide

1. Generate a classic word cloud

As a starting point, use your text field of interest to generate a classic word cloud, following the steps in this tutorial.

2. Open the network settings

Click the "Network" icon in the header of the Bubble View. From here, you can set the parameters that determine how words will be linked together. If you're not sure, use the default parameters and skip steps 3-6 below. You can always adjust later.



3. Specify the similarity measure

How close two words appear in the contextual word cloud depends on the measure used to calculate how similar they are. The following similarity measures are available for calculating the similarity between words (the sketch after this list shows how the first two differ in practice):

  • Co-occurrence similarity measures the absolute number of times two tokens appear together. Its drawback when trying to capture context is that frequent words tend to be strongly connected to everything else, hiding more subtle patterns.

  • Cosine similarity is a relative measure that takes the sizes of the words into account. It's useful for finding strong connections between smaller tokens, which otherwise drown in the links of the densest nodes.

  • CKC similarity is a measure not only of whether words co-occur, but whether they occur in similar contexts. It is the default option because it does a good job of connecting contextually similar words.
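To make the difference between the first two measures concrete, here is a small Python sketch with made-up counts. The cosine formula shown (raw co-occurrence divided by the square root of the product of the two words' total frequencies) is a common formulation and an assumption about the exact calculation; CKC similarity is Dcipher-specific and not reproduced here.

```python
import math

# Toy counts: total occurrences per word ("size") and raw pair co-occurrences.
word_totals = {"energy": 120, "costs": 60, "solar": 10, "panels": 8}
pair_counts = {("energy", "costs"): 30, ("solar", "panels"): 6}

def cooccurrence(a, b):
    return pair_counts.get((a, b), 0) or pair_counts.get((b, a), 0)

def cosine(a, b):
    # Normalises the raw count by the overall sizes of both words.
    return cooccurrence(a, b) / math.sqrt(word_totals[a] * word_totals[b])

print(cooccurrence("energy", "costs"), cooccurrence("solar", "panels"))          # 30 vs 6
print(round(cosine("energy", "costs"), 2), round(cosine("solar", "panels"), 2))  # 0.35 vs 0.67
```

The absolute measure ranks the frequent pair higher, while cosine ranks the small pair higher because it corrects for word size.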

4. Specify the level of analysis

The level of analysis refers to the level of the data at which co-occurrence is counted. For example, if articles have been split into paragraphs and sentences, the similarity between word pairs can be calculated based on how words co-occur in articles, paragraphs, or sentences respectively.

It is also possible to calculate similarity at the level of dates, authors, sources, or any other categorical field present in the data. These yield word clouds where words are connected if they tend to be used on the same dates, by the same authors, or in the same sources, respectively.
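The sketch below illustrates how the level of analysis changes the counts, assuming a toy corpus where each article is a list of sentences; it is not the app's implementation.

```python
from collections import Counter
from itertools import combinations

articles = [
    ["solar energy is growing", "costs keep falling"],
    ["energy costs are rising", "solar adoption follows"],
]

def cooccurrence_counts(units):
    counts = Counter()
    for unit in units:
        tokens = set(unit.lower().split())
        for pair in combinations(sorted(tokens), 2):
            counts[pair] += 1
    return counts

# Unit = article: join each article's sentences into one text.
article_level = cooccurrence_counts(" ".join(a) for a in articles)
# Unit = sentence: every sentence is its own text.
sentence_level = cooccurrence_counts(s for a in articles for s in a)

print(article_level[("energy", "solar")])    # 2: the pair shares an article twice
print(sentence_level[("energy", "solar")])   # 1: the pair shares a sentence only once
```

The finer the unit, the more the links reflect words used in the same breath rather than merely in the same document.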

5. Set the link filtration level

Link filtration filters out the weakest links, letting patterns in the network emerge more clearly. Changing the link filtration level thus changes the density of the network. With a high filtration level, only the strongest links are included, which is useful when networks are very dense, as word networks often are. You can change the network settings later, making it easy to experiment with different filtration levels.
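As a rough illustration, assuming a filtration level between 0 and 1 and a percentile-based cut-off (the exact rule used by Dcipher Studio may differ):

```python
links = {("energy", "costs"): 30, ("energy", "solar"): 12,
         ("solar", "panels"): 6, ("wind", "turbines"): 2}

def filter_links(links, filtration_level):
    """filtration_level in [0, 1]: 0 keeps every link, 1 keeps only the strongest."""
    weights = sorted(links.values())
    cutoff = weights[int(filtration_level * (len(weights) - 1))]
    return {pair: w for pair, w in links.items() if w >= cutoff}

print(filter_links(links, 0.0))   # all four links
print(filter_links(links, 0.75))  # only the two strongest links remain
```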

6. Set the individualization level

Individualization specifies whether link filtration is applied to all links in the network at once, or individually to each node. A high individualization level means that even small nodes keep their strongest links, even if those links are weaker in absolute terms than the links of bigger nodes. This helps smaller clusters emerge more clearly, which is often useful when visualizing text, where a few large topics otherwise tend to dominate the network.
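A minimal sketch of the difference, assuming a "keep each node's strongest link" rule for the fully individualized case (the exact rule is an assumption):

```python
from collections import defaultdict

links = {("energy", "costs"): 30, ("energy", "solar"): 12, ("energy", "wind"): 10,
         ("costs", "solar"): 4, ("solar", "panels"): 3, ("wind", "turbines"): 2}

def filter_per_node(links, k=1):
    """Each node keeps its k strongest links; a link survives if either endpoint keeps it."""
    by_node = defaultdict(list)
    for pair, w in links.items():
        for node in pair:
            by_node[node].append((pair, w))
    kept = set()
    for node_links in by_node.values():
        node_links.sort(key=lambda item: item[1], reverse=True)
        kept.update(pair for pair, _ in node_links[:k])
    return {pair: w for pair, w in links.items() if pair in kept}

# ("solar", "panels") survives at weight 3 because "panels" has no stronger link,
# while the stronger ("costs", "solar") link at weight 4 is dropped: small nodes
# keep their best links instead of being cut by a single global threshold.
print(filter_per_node(links, k=1))
```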

7. Explore the contextual word cloud

You can use the icons in the workbench header to hide or display links between nodes (with links it looks more like a network; without, more like a word cloud) and to choose whether nodes without any links should be displayed.



Pan over and zoom into the word cloud to explore it in the Bubble View. To see the documents that use one or a combination of words, select the word or words that you are interested in and drag them to the "Score by occurrence" or "Score by similarity" drop zones in the Table View, where your documents are displayed.
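For intuition about what "Score by occurrence" does with the words you drag over, here is a hedged sketch that simply counts how often the selected words appear in each document; the actual scoring used by the drop zone may differ.

```python
documents = {
    "doc1": "solar panels cut energy costs for households",
    "doc2": "wind turbines and energy storage",
    "doc3": "solar energy adoption and falling solar costs",
}
selected = {"solar", "costs"}  # the words dragged from the word cloud

scores = {
    name: sum(text.lower().split().count(word) for word in selected)
    for name, text in documents.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, score)   # doc3 ranks first: it mentions the selected words most
```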
