Topic modeling

Use Detect Topics to identify topics in large volumes of unlabeled text.

Written by Zafer Çavdar
Updated over a week ago

Keywords: topic detection, topic modeling, Latent Dirichlet Allocation (LDA), Correlation Explanation (CorEx), seeded topic modeling

The Detect Topics operation analyzes text input and identifies "topics", that is, clusters of words that frequently occur together in the text. The operation performs topic modeling with Latent Dirichlet Allocation (LDA) or Correlation Explanation (CorEx) and uses coherence optimization to choose the number of topics. Topics emerge most clearly when the input consists of large volumes of text.

Step-by-step guide

1. Open the operation configuration window

Select the text field that you want to analyze and click the "Add operation" button at the top of the workspace.

Search for "Detect Topics" or find the operation under "Text enrichment" and click it.

2. Specify the level of analysis

Under "Traversal", select the field that you want to be your level of analysis. For example, if you are looking for words that occur together on the level of sentences, select your sentence field; on the level of paragraphs, select your paragraph field.
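To illustrate what the level of analysis changes, here is a minimal Python sketch that splits the same text into paragraph units or sentence units. It is not how the platform builds its fields: the function name `split_units` and the naive splitting rules are assumptions made for illustration only.

```python
import re


def split_units(text, level):
    """Split raw text into units of analysis (hypothetical helper).

    level is "paragraph" or "sentence". The splitting rules are
    deliberately naive and only meant to show the idea.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if level == "paragraph":
        return paragraphs
    if level == "sentence":
        sentences = []
        for paragraph in paragraphs:
            # A sentence ends at ".", "!" or "?" followed by whitespace.
            parts = re.split(r"(?<=[.!?])\s+", paragraph)
            sentences.extend(s.strip() for s in parts if s.strip())
        return sentences
    raise ValueError("level must be 'paragraph' or 'sentence'")


text = (
    "Prices rose sharply. Wages stayed flat.\n\n"
    "The team won the final. Fans celebrated all night."
)

paragraph_units = split_units(text, "paragraph")
sentence_units = split_units(text, "sentence")
```

With paragraph units, "prices" and "wages" count as occurring together; with sentence units they do not, because they fall into different units. The finer the level, the stricter the notion of co-occurrence.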

3. Specify the input tokens

Topics consist of clusters of tokens that occur together. Under "Tokens source", select the field containing the tokens to be used for the topic modeling. If you have not already tokenized the text, run the Tokenize & Tag operation prior to topic detection.

4. Specify the method

Two methods are available: Latent Dirichlet Allocation (LDA) and Correlation Explanation (CorEx). Both methods seek to generate semantically coherent, abstract topics from a collection of documents. If you select CorEx, topics can be seeded: you choose tokens around which the topics crystallize.

5. Specify a range of topics

Topics can be more or less specific. If you are looking for a handful of topics, they will be broader than if you are looking for dozens of topics. Under "Minimum topic count" and "Maximum topic count", specify the range of topics that you want the operation to output. Within this range, a topic model is trained for different topic counts, and the model with the highest coherence score, that is, the one with the most cohesive topics, is selected.
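The selection logic can be sketched in Python as follows. This is an illustration under assumptions, not the platform's implementation: the article does not say which coherence measure is used, so the sketch uses a UMass-style score. The `toy_trainer` function is a hypothetical stand-in for a real LDA or CorEx trainer and simply returns hand-written topics.

```python
from math import log


def umass_coherence(topic_tokens, documents):
    """UMass-style coherence for one topic (higher means more coherent)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*tokens):
        # Number of documents that contain all of the given tokens.
        return sum(1 for d in doc_sets if all(t in d for t in tokens))

    score = 0.0
    for i in range(1, len(topic_tokens)):
        for j in range(i):
            later, earlier = topic_tokens[i], topic_tokens[j]
            earlier_count = doc_freq(earlier)
            if earlier_count == 0:
                continue  # Token never occurs; skip to avoid dividing by zero.
            score += log((doc_freq(later, earlier) + 1) / earlier_count)
    return score


def select_topic_count(documents, min_topics, max_topics, train_model):
    """Train one model per topic count and keep the most coherent one."""
    best = None
    for k in range(min_topics, max_topics + 1):
        topics = train_model(documents, k)  # list of token lists, one per topic
        mean_score = sum(umass_coherence(t, documents) for t in topics) / len(topics)
        if best is None or mean_score > best[1]:
            best = (k, mean_score, topics)
    return best


def toy_trainer(documents, k):
    """Hypothetical stand-in for a real trainer: returns fixed topics per k."""
    candidates = {
        1: [["price", "team", "inflation", "fan"]],
        2: [["price", "inflation"], ["team", "fan"]],
        3: [["price", "inflation"], ["team", "fan"], ["wage", "match"]],
    }
    return candidates[k]


documents = [
    ["price", "wage", "inflation"],
    ["price", "inflation", "bank"],
    ["team", "goal", "fan"],
    ["team", "fan", "match"],
]

best_k, best_score, best_topics = select_topic_count(documents, 1, 3, toy_trainer)
```

In this toy data, the single mixed topic scores poorly because "price" and "team" never share a document, and the three-topic model is diluted by a topic whose tokens never co-occur, so the two-topic model is selected.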

6. Name the output field

Under "Output collection name", type the name of the output field.

7. Apply the operation

Click "Apply" to run the operation. A collection of topics is extracted for each document. Each topic is represented by a label (stored in the "Label" field) consisting of the 15 most dominant tokens for the topic in descending order. The "Strength" field contains a numeric value between 0 and 1 for each document, indicating how strongly a given topic is associated with the document. The new "Words" collection contains the words associated with the topics that are present in each document, along with their relevance scores with regard to a given topic. The relevance score is useful for sorting the words of a document that belong to a specific topic.
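As a rough sketch of how this output can be read, the Python below builds a label from the most dominant tokens of a topic and sorts a document's words by relevance within one topic. The record layout, the key names (`word`, `topic`, `relevance`), the comma separator, and both helper functions are assumptions for illustration; the actual export format may differ.

```python
def topic_label(token_weights, top_n=15):
    """Join the top_n most dominant tokens in descending order of weight."""
    ranked = sorted(token_weights.items(), key=lambda item: item[1], reverse=True)
    return ", ".join(token for token, _ in ranked[:top_n])


def words_for_topic(words, topic):
    """Return a document's words for one topic, most relevant first."""
    selected = [w for w in words if w["topic"] == topic]
    return sorted(selected, key=lambda w: w["relevance"], reverse=True)


# Hypothetical token weights for one topic (more dominant = higher weight).
economy_weights = {"inflation": 0.21, "price": 0.34, "wage": 0.12, "bank": 0.08}
economy_label = topic_label(economy_weights)

# Hypothetical "Words" collection for one document.
document_words = [
    {"word": "wage", "topic": "economy", "relevance": 0.41},
    {"word": "match", "topic": "sports", "relevance": 0.77},
    {"word": "price", "topic": "economy", "relevance": 0.93},
]
economy_words = [w["word"] for w in words_for_topic(document_words, "economy")]
```

Sorting by relevance in this way surfaces the words that best explain why a topic was assigned to a particular document.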
