Keywords: tokenization, lemmatization, lowercasing, punctuation removal, stop word removal, emoji removal, short token removal, part-of-speech (POS) tagging, named entity recognition (NER), phrase recognition (chunking)

Splitting text into smaller units such as words and phrases, called tokens, is useful for analyzing the content of the text. This is the task of the Tokenize & Tag operation.

The operation offers several preprocessing options for reducing noise: removal of punctuation, emojis, stop words, and short tokens, as well as lowercasing and lemmatization of the tokens (whereby inflected forms of a word are converted to its base form). It can also detect and tag parts of speech, named entities, and phrases.

Language coverage:

Tokenization: 185 languages
Lemmatization: 27 languages
Stop word removal: 58 languages
Part-of-speech tagging: 23 languages
Named entity recognition: 39 languages
Phrase detection: Language independent

Step-by-step guide

Before running the operation, consider cleaning and preprocessing your text data by using the Preprocessing Wizard. Removing long texts will speed up tokenization. Noise in the data can be reduced by cleaning URLs, @ tags, and other elements from the texts, as well as removing texts that are not in the desired language(s).
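For readers who want to see what this kind of cleaning amounts to, here is a minimal Python sketch. It is not how the Preprocessing Wizard is implemented; the function names, the regular expressions, and the 5000-character limit are all assumptions chosen for illustration.

```python
import re

# Assumed patterns: simple approximations of URLs and @ tags.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def clean_text(text: str) -> str:
    """Remove URLs and @ tags, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return " ".join(text.split())


def drop_long_texts(texts, max_chars=5000):
    """Keep only texts of at most max_chars characters (hypothetical limit)."""
    return [t for t in texts if len(t) <= max_chars]


if __name__ == "__main__":
    print(clean_text("Thanks @anna see https://example.com now"))
```

Real-world URLs and tags are messier than these patterns assume, so treat the sketch as a way to reason about the cleaning step rather than a replacement for it.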

1. Select the text field

Select the text field you want to tokenize in the Schema View.

2. Open the operation configuration window

Click the "Add operations" button at the top of the workspace, then click the "text segmentation & tokenization" tab, then click "Tokenize and Tag".

3. Set the parameters of the operation

The following parameters can be set:

Text source: Name of the input text source field.
Language: The text language. If no language is specified, the language is automatically detected from a sample of the data.
Language source (optional): If a language field exists, it can be provided to specify the language for each row. Suitable when the text on different rows is in different languages. If no language field exists, it can be created using the Detect Language operation.

Lemmatize: Performs lemmatization on the text tokens. Lemmatization refers to the reduction of inflectional forms of a word so that analysis can be carried out on its basic dictionary form (lemma). For example, "cat" and "cats" both have the lemma "cat".
Lowercase: Lowercases all the characters in the text. This is recommended because without it, "cat" and "Cat" will count as two separate tokens.
Clean punctuation: Removes punctuation from the text. This is recommended because without it, "cat" and "cat." will count as two separate tokens.
Clean stop words: Removes stop words from the text. A stop word is a commonly used word such as "the", "a", "an", and "in". Such words are seldom useful for the analysis.

Clean emojis: Removes emojis from the text. Recommended if the text consists of social media posts and the emojis are irrelevant for the analysis.
Remove short tokens: Removes tokens that are shorter than a user-specified number of characters.
Part-of-speech: Assigns a part of speech (e.g. noun or adjective) to each token. Some parts of speech (such as nouns) tend to be more useful for analysis than others (such as conjunctions). POS tags can be used to filter tokens afterwards.
Named entity recognition: Recognizes and tags tokens that are named entities (such as "New York" and "Dcipher Analytics").
Phrase detection: Recognizes and tags phrases (rather than just individual words) in text.
Word/phrase sentiment detection: Provides the polarity of each token based on sentiment lexicons.
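The effect of the cleaning parameters above can be illustrated with a small Python sketch. This is an assumption-laden toy, not the operation's actual implementation: it splits on whitespace, uses a four-word stop list taken from the examples above, and replaces real lemmatization with a hypothetical one-entry lookup table. The function and parameter names are invented for the example.

```python
import re

# Tiny illustrative stop list; real stop word lists are far longer.
STOP_WORDS = {"the", "a", "an", "in"}
# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {"cats": "cat"}


def tokenize_and_clean(text, lowercase=True, clean_punctuation=True,
                       clean_stop_words=True, lemmatize=True, min_length=2):
    """Split text on whitespace and apply the selected cleaning steps."""
    cleaned = []
    for token in text.split():
        if lowercase:
            token = token.lower()
        if clean_punctuation:
            # Drop every character that is neither a word character nor whitespace.
            token = re.sub(r"[^\w\s]", "", token)
        if lemmatize:
            token = LEMMAS.get(token, token)
        if not token:
            continue  # the token consisted only of punctuation
        if clean_stop_words and token.lower() in STOP_WORDS:
            continue
        if len(token) < min_length:
            continue  # short token removal
        cleaned.append(token)
    return cleaned


if __name__ == "__main__":
    print(tokenize_and_clean("The Cat sat in a box. Cats!"))
```

With every option switched off, "Cat", "box." and "Cats!" remain distinct from "cat", "box" and "cats", which is the fragmentation the lowercasing and punctuation options are meant to prevent.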

4. Apply the operation

Click "Apply". A new tokens collection (named "tokens" by default) has now been generated and added to the schema.

What to do next

To aggregate and view the tokens, drag the "tokens.value" field from the Schema View to the "Display as bubble" drop zone in the Bubble View or the "Group by field" drop zone in the Table View.
Filter tokens based on POS tags to keep only the tokens that convey the most meaning.
Create a token network in the Bubble View to visually identify topics and themes.
Run the Detect topics operation to find topics in the text.
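
The aggregation and POS filtering steps listed above can be pictured with a short Python sketch. It assumes the tokens have been exported as records with a "value" field and a "pos" field; only "value" echoes the "tokens.value" field mentioned above, while the "pos" field name, the tag labels (NOUN, ADJ, VERB, DET), and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical exported tokens; the tag labels are assumed, not taken from the tool.
SAMPLE_TOKENS = [
    {"value": "cat", "pos": "NOUN"},
    {"value": "sit", "pos": "VERB"},
    {"value": "black", "pos": "ADJ"},
    {"value": "cat", "pos": "NOUN"},
    {"value": "the", "pos": "DET"},
]


def count_tokens(tokens, keep_pos=None):
    """Count token values, optionally keeping only the given POS tags."""
    counts = Counter()
    for token in tokens:
        if keep_pos is None or token["pos"] in keep_pos:
            counts[token["value"]] += 1
    return counts


if __name__ == "__main__":
    # Keep nouns and adjectives only, then list values by frequency.
    for value, count in count_tokens(SAMPLE_TOKENS, {"NOUN", "ADJ"}).most_common():
        print(value, count)
```

Grouping by value and counting is roughly what the Table View's group-by does for you; the sketch only shows why filtering on POS first leaves a smaller, more meaningful set of tokens.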