Keywords: topic detection, topic modeling, Latent Dirichlet Allocation (LDA), Correlation Explanation (CorEx), seeded topic modeling
The Detect Topics operation analyzes text input and identifies "topics", i.e. clusters of words that frequently occur together in the text. The operation performs topic modeling with Latent Dirichlet Allocation (LDA) or Correlation Explanation (CorEx) and selects the number of topics by optimizing coherence. Topics emerge most clearly when the input consists of large volumes of text.
Step-by-step guide
1. Open the operation configuration window
Select the text field that you want to analyze and click the "Add operation" button at the top of the workspace.
Search for "Detect Topics" or find the operation under "Text enrichment" and click it.
2. Specify the level of analysis
Under "Traversal", select the field that you want to be your level of analysis. For example, if you are looking for words that occur together on the level of sentences, select your sentence field; on the level of paragraphs, select your paragraph field.
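To illustrate why the level of analysis matters, here is a minimal Python sketch that counts which tokens occur together at sentence level versus paragraph level. The sample document and the cooccurrence helper are hypothetical and are not part of the operation itself; they only show how the chosen unit changes what counts as "occurring together".

```python
from collections import Counter
from itertools import combinations

# Hypothetical document: two paragraphs, each a list of tokenized sentences.
document = [
    [["solar", "panel", "cost"], ["battery", "storage"]],
    [["wind", "turbine", "cost"]],
]

def cooccurrence(units):
    """Count unordered token pairs that appear within the same unit."""
    counts = Counter()
    for tokens in units:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts

# Sentence level: each sentence is one unit of analysis.
sentences = [sentence for paragraph in document for sentence in paragraph]
by_sentence = cooccurrence(sentences)

# Paragraph level: all sentences of a paragraph are merged into one unit.
paragraphs = [
    [token for sentence in paragraph for token in sentence]
    for paragraph in document
]
by_paragraph = cooccurrence(paragraphs)

print(by_sentence[("battery", "solar")])   # 0: never in the same sentence
print(by_paragraph[("battery", "solar")])  # 1: together in the first paragraph
```

In this sketch, "battery" and "solar" are linked only at paragraph level, so a broader level of analysis tends to connect more tokens and produce broader topics.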
3. Specify the input tokens
Topics consist of clusters of tokens that occur together. Under "Tokens source", select the field containing the tokens to be used for the topic modeling. If you have not already tokenized the text, run the Tokenize & Tag operation prior to topic detection.
4. Specify the method
Two methods are available: Latent Dirichlet Allocation (LDA) and Correlation Explanation (CorEx). Both methods seek to generate semantically coherent, abstract topics from a collection of documents. If CorEx is selected, topics can be seeded: you can choose tokens around which the topics crystallize.
5. Specify a range of topics
Topics can be more or less specific. If you are looking for a handful of topics, they will be broader than if you are looking for dozens of topics. Under "Minimum topic count" and "Maximum topic count", specify the range of topics that you want the operation to output. Within this range, topic models are trained for different topic counts, and the model with the highest coherence score, in other words the one in which the most cohesive topics emerge, is selected.
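The selection over a range of topic counts can be sketched roughly as follows. This is an illustration under stated assumptions, not the operation's actual implementation: the guide does not say which coherence measure is used, so a UMass-style score stands in here, and toy_train is a hypothetical stand-in that returns fixed candidate topics instead of training a real LDA or CorEx model.

```python
import math
from itertools import combinations

def umass_coherence(topic, docs):
    """UMass-style coherence of one topic (a ranked token list) over tokenized docs."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for higher, lower in combinations(topic, 2):
        n_higher = sum(1 for d in doc_sets if higher in d)
        n_both = sum(1 for d in doc_sets if higher in d and lower in d)
        if n_higher:
            score += math.log((n_both + 1) / n_higher)
    return score

def model_coherence(topics, docs):
    """Average the per-topic coherence to score a whole model."""
    return sum(umass_coherence(t, docs) for t in topics) / len(topics)

def select_topic_count(docs, min_topics, max_topics, train):
    """Train one model per topic count in the range and keep the most coherent."""
    best = None
    for k in range(min_topics, max_topics + 1):
        topics = train(docs, k)
        score = model_coherence(topics, docs)
        if best is None or score > best[1]:
            best = (k, score, topics)
    return best

docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "vet"],
    ["stock", "bond", "market"],
    ["stock", "bond", "fund"],
]

def toy_train(docs, k):
    """Hypothetical trainer: returns fixed topics per k (assumption, not real training)."""
    candidates = {
        1: [["cat", "stock"]],
        2: [["cat", "dog"], ["stock", "bond"]],
        3: [["cat", "dog"], ["stock", "bond"], ["pet", "fund"]],
    }
    return candidates[k]

best_k, best_score, best_topics = select_topic_count(docs, 1, 3, toy_train)
print(best_k, best_topics)
```

With this toy data, the two-topic model scores highest because both of its topics group tokens that really do appear in the same documents. A wider range means more models to train, so it costs more time.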
6. Name the output field
Under "Output collection name", type the name of the output field.
7. Apply the operation
Click "Apply" to run the operation. A collection of topics is extracted for each document. Each topic is represented by a label (inserted into the "Label" field) consisting of the 15 most dominant tokens for the topic, in descending order. The "Strength" field contains a numeric value between 0 and 1 for each document, indicating how strongly a given topic is associated with the document. The new "Words" collection contains the words associated with the topics that are present in each document, along with their relevance scores with regard to a given topic. The relevance score is useful for sorting the words of a document within a specific topic.
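As a rough picture of the output, the Python sketch below builds a label from a topic's most dominant tokens and sorts a document's words by relevance. Only the "Label", "Strength" and "Words" names come from this guide. The token weights, the record layout, the "word" and "relevance" keys and the comma separator in the label are all assumptions made for illustration.

```python
# Hypothetical token weights for one topic, as a trained model might report them.
topic_weights = {"energy": 0.21, "solar": 0.18, "grid": 0.09, "wind": 0.15, "cost": 0.05}

def make_label(weights, top_n=15):
    """Join the top_n most dominant tokens in descending order of weight."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ", ".join(ranked[:top_n])  # separator is an assumption

label = make_label(topic_weights)

# Hypothetical per-document output mirroring the "Label", "Strength" and "Words" fields.
document_topics = [
    {
        "Label": label,
        "Strength": 0.72,  # between 0 and 1: association of the topic with this document
        "Words": [
            {"word": "grid", "relevance": 0.09},
            {"word": "solar", "relevance": 0.18},
            {"word": "energy", "relevance": 0.21},
        ],
    }
]

# Sort the document's words for one topic by relevance, most relevant first.
topic = document_topics[0]
ranked_words = sorted(topic["Words"], key=lambda w: w["relevance"], reverse=True)

print(topic["Label"])
print([w["word"] for w in ranked_words])
```

The same sort on "Strength" across a document's topics would give the topics most strongly associated with that document first.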