Everything you need to clean and preprocess your text data before getting started with analysis or model training.
Near-duplicate removal
Deletes near-duplicate texts based on the similarity threshold you set.
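As a rough illustration of threshold-based near-duplicate removal, here is a minimal Python sketch. It is not the product's implementation: the function name `remove_near_duplicates`, the default threshold of 0.9, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def remove_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is not too similar to any text already kept.

    `threshold` is a hypothetical parameter: a similarity ratio in [0, 1]
    at or above which two texts are treated as near duplicates.
    """
    kept = []
    for text in texts:
        is_near_duplicate = any(
            SequenceMatcher(None, text.lower(), other.lower()).ratio() >= threshold
            for other in kept
        )
        if not is_near_duplicate:
            kept.append(text)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",
    "An entirely different sentence.",
]
unique_docs = remove_near_duplicates(docs, threshold=0.9)
```

Pairwise comparison like this is quadratic in the number of texts, so it only suits small collections; lowering the threshold removes more aggressively.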
The Preprocessing Wizard: Fast text cleaning & preprocessing
Clean and preprocess text through outlier removal, deduplication, language filtering, text cleaning, text splitting, and content extraction.
Semantic filtering & disambiguation
Filter text data based on meaning rather than keywords to account for the ambiguity and richness of natural languages.
Case conversion
The Change Case operation converts the case of text, with options for lowercasing, uppercasing, and various mixed-case formats.
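The kind of conversion described here can be sketched in Python with built-in string methods. The function name `change_case` and the mode names ("lower", "upper", "title", "sentence") are hypothetical and chosen for this example; they are not the operation's actual option names.

```python
def change_case(text, mode):
    """Convert the case of `text` using one of a few illustrative modes."""
    converters = {
        "lower": str.lower,          # all lowercase
        "upper": str.upper,          # all uppercase
        "title": str.title,          # first letter of each word capitalized
        "sentence": str.capitalize,  # first letter only, rest lowercased
    }
    if mode not in converters:
        raise ValueError(f"Unknown case mode: {mode!r}")
    return converters[mode](text)


lowered = change_case("hello WORLD", "lower")
titled = change_case("hello WORLD", "title")
```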
Duplicate removal
The Delete Duplicates operation deletes duplicate values.
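For exact duplicates, a minimal Python sketch might look like the following. The function name `delete_duplicates` is hypothetical, and keeping the first occurrence of each value in its original order is an assumption for this example, not documented behavior of the operation.

```python
def delete_duplicates(values):
    """Drop repeated values, keeping the first occurrence of each in order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(values))


deduplicated = delete_duplicates(["a", "b", "a", "c", "b"])
```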
Pattern matching & replacement
The Replace Pattern operation replaces a user-specified pattern in text with a user-specified text string.
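Pattern replacement of this kind is commonly done with regular expressions. The sketch below uses Python's `re.sub`; the function name `replace_pattern`, the phone-number pattern, and the `<PHONE>` placeholder are illustrative assumptions, and the operation itself may use a different pattern syntax.

```python
import re


def replace_pattern(text, pattern, replacement):
    """Replace every match of a regular expression with a fixed string."""
    return re.sub(pattern, replacement, text)


masked = replace_pattern(
    "Call 555-1234 or 555-9876",
    r"\d{3}-\d{4}",  # three digits, a hyphen, four digits
    "<PHONE>",
)
```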
Text cleaning
The Clean Text operation provides options for cleaning elements from text, such as URLs, @ tags, hashtags, punctuation, tabs, and emojis.
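The cleaning options listed above could be approximated in Python roughly as follows. This is a sketch under stated assumptions, not the Clean Text operation itself: the function name `clean_text`, its keyword flags, and the regular expressions are invented for this example, and the emoji ranges cover only common blocks rather than every emoji.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Assumption: a few common emoji and symbol blocks, not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, urls=True, mentions=True, hashtags=True,
               emojis=True, punctuation=True):
    """Remove the selected element types, then normalize whitespace."""
    if urls:
        text = URL_RE.sub(" ", text)
    if mentions:
        text = MENTION_RE.sub(" ", text)
    if hashtags:
        text = HASHTAG_RE.sub(" ", text)
    if emojis:
        text = EMOJI_RE.sub(" ", text)
    if punctuation:
        # Runs last so that "@" and "#" are still present for the steps above.
        text = text.translate(PUNCT_TABLE)
    # Collapse tabs, newlines, and repeated spaces into single spaces.
    return " ".join(text.split())


cleaned = clean_text("Check https://example.com\t@user #nlp great!! \U0001F600")
```

The order of the steps matters: stripping punctuation first would break URL, mention, and hashtag detection.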
Text length measurement
The Calculate Text Length operation counts the number of letters, digits, spaces, other characters, words, and sentences in text.
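A simple way to compute these counts in Python is sketched below. The function name `calculate_text_length`, the dictionary keys, and the rules used (words split on whitespace, sentences split on runs of ".", "!" or "?") are assumptions for this example; the operation may define words and sentences differently.

```python
import re


def calculate_text_length(text):
    """Count letters, digits, spaces, other characters, words, and sentences."""
    letters = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    spaces = sum(ch.isspace() for ch in text)
    # Assumption: a sentence is any non-empty chunk between ., ! or ? runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "letters": letters,
        "digits": digits,
        "spaces": spaces,
        "other": len(text) - letters - digits - spaces,
        "words": len(text.split()),
        "sentences": len(sentences),
    }


stats = calculate_text_length("Hi there. It is 2024!")
```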