Text cleaning

The Clean Text operation provides options for removing elements from text, such as URLs, @ tags, hashtags, punctuation, tabs, and emojis.

Written by Tomas Larsson
Updated over a week ago

Keywords: clean text, remove emojis, remove URLs, remove XML tags, remove line breaks, remove tabs, remove punctuation, remove @ tags, remove hashtag.

Text often contains elements that add noise and are not relevant to the analysis. The Clean Text operation lets you select which of these elements to remove.

Step-by-step guide

1. Open the operation configuration window

Select the field with the text that you want to clean and click the "Add operation" button at the top of the workspace.

Search for "Clean Text" or find the operation under "Preprocessing & cleaning" and click it.

2. Name the output field

Under "Output field name", type the name of the output field.

3. Choose elements to be cleaned

Select one or more elements to be removed:

  • Emojis

  • URLs

  • Line breaks

  • Hashtag sequence ending (useful for removing a sequence of hashtags at the end of texts, which are often used as content tags, while keeping hashtags inside the text, which are often needed for context)

  • XML tags

  • Tabs

  • Punctuation

  • Prefixes (for example "#" to remove the hash symbol from hashtags)

  • Words with prefixes (for example "#" to remove entire hashtags)

  • Substrings (to remove a particular string of characters), with the choice of case-sensitive or case-insensitive matching.
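
To make the differences between these options concrete, here is a minimal Python sketch of what each one might do to a text. It is an illustration only, not the actual implementation of the Clean Text operation: all function names are hypothetical, the regular expressions are simplified assumptions, and the emoji pattern covers only a few common Unicode ranges.

```python
import re
import string

# Approximate emoji ranges (assumption: real emoji detection is broader).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def remove_emojis(text: str) -> str:
    return EMOJI_PATTERN.sub("", text)


def remove_urls(text: str) -> str:
    # Matches http(s):// and www. links up to the next whitespace.
    return re.sub(r"(?:https?://|www\.)\S+", "", text)


def remove_line_breaks_and_tabs(text: str) -> str:
    # Assumption: each run of line breaks or tabs becomes a single space.
    return re.sub(r"[\r\n\t]+", " ", text)


def remove_xml_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)


def remove_punctuation(text: str) -> str:
    # string.punctuation is ASCII only and includes "#" and "@".
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_hashtag_sequence_ending(text: str) -> str:
    # Removes only a run of hashtags at the very end of the text.
    return re.sub(r"(?:\s*#\w+)+\s*$", "", text)


def remove_prefix(text: str, prefix: str = "#") -> str:
    # Removes the prefix symbol but keeps the word: "#python" -> "python".
    return re.sub(re.escape(prefix) + r"(?=\w)", "", text)


def remove_words_with_prefix(text: str, prefix: str = "#") -> str:
    # Removes the prefix and the word attached to it.
    return re.sub(re.escape(prefix) + r"\w+", "", text)


def remove_substring(text: str, substring: str, case_sensitive: bool = True) -> str:
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.sub(re.escape(substring), "", text, flags=flags)


post = "Loving the #sunset tonight #travel #photo"
print(remove_hashtag_sequence_ending(post))  # keeps "#sunset" inside the text
print(remove_prefix(post))                   # strips only the "#" symbols
```

Note how "Hashtag sequence ending" leaves the in-text hashtag untouched, whereas "Words with prefixes" would remove every hashtag, including the one inside the sentence.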

4. Run the operation

Click "Apply" to run the operation. The cleaned text is inserted into the output field.
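
As a rough picture of what running the operation produces, the following Python sketch applies a small cleaning function to one field of each record and writes the result to a new output field. The names `clean_text`, `apply_clean`, `text`, and `text_clean` are hypothetical, and keeping the source field unchanged and collapsing repeated spaces are assumptions made for this example.

```python
import re


def clean_text(text: str) -> str:
    # Example selection: remove URLs, line breaks, and tabs.
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    text = re.sub(r"[\r\n\t]+", " ", text)
    # Assumption: tidy up leftover spacing after removal.
    return re.sub(r" {2,}", " ", text).strip()


def apply_clean(rows, source_field, output_field):
    # Returns new records with the cleaned text added under output_field.
    return [{**row, output_field: clean_text(row[source_field])} for row in rows]


rows = [{"text": "Read more:\thttps://example.com\nThanks"}]
result = apply_clean(rows, "text", "text_clean")
print(result[0]["text_clean"])
```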
