Preprocessing & cleaning

Everything you need to clean and preprocess your text data before getting started with analysis or model training.

8 articles in this collection
Written by Tomas Larsson and Esra Binici

Near duplicates removal

Deletes near-duplicate texts based on the similarity threshold you set.
Written by Esra Binici
Updated over a week ago
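
As a rough illustration of threshold-based near-duplicate removal, the sketch below keeps the first text in each group of similar texts. It is not the platform's implementation: the function name `remove_near_duplicates`, the default threshold of 0.9, and the use of Python's `difflib` character-level similarity are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def remove_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is not too similar to one already kept.

    Hypothetical helper: similarity is difflib's ratio (0.0 to 1.0),
    compared case-insensitively against every previously kept text.
    """
    kept = []
    for text in texts:
        is_near_duplicate = any(
            SequenceMatcher(None, text.lower(), other.lower()).ratio() >= threshold
            for other in kept
        )
        if not is_near_duplicate:
            kept.append(text)
    return kept


sample = [
    "The quick brown fox",
    "The quick brown fox!",
    "Something else entirely",
]
result = remove_near_duplicates(sample, threshold=0.9)
print(result)
```

A lower threshold removes more texts, because looser matches then count as duplicates. This pairwise comparison is quadratic in the number of texts, so it only suits small datasets.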

The Preprocessing Wizard: Fast text cleaning & preprocessing

Clean and preprocess text through outlier removal, deduplication, language filtering, text cleaning, text splitting, and content extraction.
Written by Tomas Larsson
Updated over a week ago

Semantic filtering & disambiguation

Filter text data based on meaning rather than keywords to account for the ambiguity and richness of natural languages.
Written by Tomas Larsson
Updated over a week ago

Case conversion

The Change Case operation converts the case of text, with options for lowercasing, uppercasing, and various mixed-case formats.
Written by Tomas Larsson
Updated over a week ago
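
For comparison, case conversion outside the platform might look like the sketch below. The function `change_case` and its mode names are hypothetical and do not reflect the options the Change Case operation actually exposes.

```python
def change_case(text, mode):
    """Convert text to the requested case (hypothetical mode names)."""
    converters = {
        "lower": str.lower,          # "hello world"
        "upper": str.upper,          # "HELLO WORLD"
        "title": str.title,          # "Hello World"
        "sentence": str.capitalize,  # "Hello world"
    }
    return converters[mode](text)


print(change_case("hello WORLD", "title"))
```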

Duplicate removal

The Delete Duplicates operation deletes duplicate values.
Written by Tomas Larsson
Updated over a week ago
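
A minimal sketch of exact duplicate removal, assuming that the first occurrence of each value is kept and the original order is preserved. The operation's actual behavior may differ, and `delete_duplicates` is a hypothetical name.

```python
def delete_duplicates(values):
    """Drop repeated values, keeping the first occurrence of each."""
    # dict preserves insertion order, so this also preserves input order.
    return list(dict.fromkeys(values))


print(delete_duplicates(["a", "b", "a", "c", "b"]))
```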

Pattern matching & replacement

The Replace Pattern operation replaces a user-specified pattern in text with a user-specified text string.
Written by Tomas Larsson
Updated over a week ago
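
Pattern replacement of this kind is commonly done with regular expressions. The sketch below assumes regex patterns and uses Python's `re.sub`. Whether the Replace Pattern operation accepts regular expressions or only literal strings is not stated here, and `replace_pattern` is a hypothetical name.

```python
import re


def replace_pattern(text, pattern, replacement):
    """Replace every match of a regex pattern with the replacement string."""
    return re.sub(pattern, replacement, text)


masked = replace_pattern("Order 123 shipped, order 4567 pending", r"\d+", "<NUM>")
print(masked)
```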

Text cleaning

The Clean Text operation provides options for removing elements such as URLs, @ tags, hashtags, punctuation, tabs, and emojis from text.
Written by Tomas Larsson
Updated over a week ago
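
To make the listed cleaning options concrete, here is a simplified sketch that strips each kind of element with regular expressions. The patterns are assumptions for illustration, the emoji range in particular is only approximate, and `clean_text` is a hypothetical name rather than the operation's real logic.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji coverage only: common pictograph and symbol blocks.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Remove URLs, @ tags, hashtags, emojis, punctuation, and tabs."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    text = text.translate(PUNCTUATION_TABLE)  # ASCII punctuation only
    # Splitting on whitespace drops tabs and collapses repeated spaces.
    return " ".join(text.split())


cleaned = clean_text("Hi @anna!\tCheck https://example.com #news \U0001F600 now.")
print(cleaned)
```

URLs, @ tags, and hashtags are removed before punctuation so that their leading symbols are still present when the patterns run.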

Text length measurement

The Calculate Text Length operation counts the number of letters, digits, spaces, other characters, words, and sentences in text.
Written by Tomas Larsson
Updated over a week ago
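
The counts described above can be approximated as in the sketch below. It assumes that words are whitespace-separated and that sentences end with `.`, `!`, or `?`. The Calculate Text Length operation may define these differently, and `text_length_stats` is a hypothetical name.

```python
import re


def text_length_stats(text):
    """Count character classes, words, and sentences in a text."""
    letters = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    spaces = sum(ch.isspace() for ch in text)
    return {
        "letters": letters,
        "digits": digits,
        "spaces": spaces,
        # Everything that is not a letter, digit, or whitespace character.
        "other": len(text) - letters - digits - spaces,
        "words": len(text.split()),
        # Naive sentence split on runs of terminal punctuation.
        "sentences": len([s for s in re.split(r"[.!?]+", text) if s.strip()]),
    }


stats = text_length_stats("Hello world. It is 2024!")
print(stats)
```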