Keywords: preprocessing, outlier removal, deduplication, language filtering, text cleaning, text splitting, content extraction.
Raw text usually contains a lot of material that, for most purposes, is noise. For most text sources and text analytics use cases, it is therefore recommended to clean and preprocess text data before applying text mining and machine learning techniques. The Preprocessing Wizard provides five preprocessing and text cleaning steps: outlier removal, deduplication, language filtering, text cleaning, and segmentation/extraction. If you don't need all of these steps, you can select which ones to use. The outputs are added as new fields in the schema.
Step-by-step guide
Content of this guide
Removing outliers (steps 1-3, 8)
Removing duplicates (steps 1-2, 4, 8)
Language filtering (steps 1-2, 5, 8)
Text cleaning (steps 1-2, 6, 8)
Splitting text into smaller units (steps 1-2, 7-8)
Extracting content from the text (steps 1-2, 7-8)
☝️ Note: Each of the steps in the Preprocessing Wizard can also be run as an individual operation. You can find links to the tutorial for each operation under the corresponding step below.
1. Select the text field
Start by selecting, in the Schema View, the text field that you want to preprocess and clean.
2. Open the Preprocessing Wizard
Click the "Preprocess" button at the top of the workspace. When the wizard opens, make sure the text field you want to preprocess and clean is selected and click the "Continue" button. The wizard will guide you through the preprocessing steps one by one. If you want to skip a step, click "Skip". To move to a particular preprocessing step, click the corresponding tab.
3. Outlier removal
Very short texts often add little to the analysis and very long texts can slow operations down. Outlier removal calculates text length and filters out texts that are shorter or longer than a specified range.
Select "Number of words" or "Number of characters" to see a distribution with the number of posts (vertical axis) over the number of words or characters (horizontal axis). Filtering based on characters is especially useful for character-based languages such as Chinese.
Adjust the filter range to specify what text lengths should be filtered out.
If you plan to split the text into smaller units or extract parts of it in a subsequent preprocessing step, it is recommended not to set an upper limit on the filter range. Instead, consider running outlier removal on the resulting text segments to remove very long segments.
Click "Preview" to see the number of rows that will be removed with the given settings and click "Continue" to proceed to the next preprocessing step.
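To make the idea of a length filter concrete, here is a minimal Python sketch of what outlier removal does conceptually. It is not Dcipher's implementation: the function names, the whitespace-based word count, and the inclusive bounds are all assumptions made for illustration.

```python
def text_length(text, unit="words"):
    """Length of a text in whitespace-separated words or in characters."""
    if unit == "words":
        return len(text.split())
    if unit == "characters":
        return len(text)
    raise ValueError(f"unknown unit: {unit!r}")


def remove_outliers(texts, min_len=None, max_len=None, unit="words"):
    """Keep texts whose length lies within [min_len, max_len] (inclusive).

    A bound of None means "no limit on that side" (hypothetical convention).
    """
    kept = []
    for text in texts:
        n = text_length(text, unit)
        if min_len is not None and n < min_len:
            continue  # too short
        if max_len is not None and n > max_len:
            continue  # too long
        kept.append(text)
    return kept


posts = [
    "ok",                                              # 1 word
    "The delivery was late but support was helpful.",  # 8 words
    "word " * 500,                                     # 500 words
]
kept = remove_outliers(posts, min_len=3, max_len=100)
print(f"{len(posts) - len(kept)} of {len(posts)} rows would be removed")
# prints "2 of 3 rows would be removed"
```

Leaving `max_len` unset mirrors the advice above about not setting an upper limit when the texts will be split or extracted from later.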
4. Deduplication
It is common that text data contains duplicate entries. This is for example the case in data from social media (where posts are often reposted) and news media (where some articles are published by multiple outlets).
To keep one entry among the duplicates, select the "Among the duplicate values..." option and specify the sorting order. For example, to keep the earliest entry, keep the "first" when sorted by the "time" field in the dataset.
To remove all duplicate texts, select "Remove all duplicates".
Click "Preview" to see the number of rows that will be removed with the given settings and click "Continue" to proceed to the next preprocessing step.
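The two deduplication options can be sketched in Python as follows. This is an illustration under assumptions, not the wizard's actual code: the field names `text` and `time` are hypothetical, duplicates are taken to mean exactly identical texts, and "Remove all duplicates" is read here as dropping every copy of a duplicated text.

```python
from collections import Counter


def keep_one_per_duplicate(rows, text_field="text", sort_field="time", keep="first"):
    """Among rows with identical text, keep the first or last one by sort_field."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    ordered = sorted(rows, key=lambda r: r[sort_field], reverse=(keep == "last"))
    seen = set()
    kept = []
    for row in ordered:
        if row[text_field] not in seen:
            seen.add(row[text_field])
            kept.append(row)
    # Return the surviving rows in ascending sort order.
    return sorted(kept, key=lambda r: r[sort_field])


def remove_all_duplicates(rows, text_field="text"):
    """Drop every row whose text occurs more than once (assumed reading)."""
    counts = Counter(r[text_field] for r in rows)
    return [r for r in rows if counts[r[text_field]] == 1]


rows = [
    {"time": 3, "text": "Big news today"},
    {"time": 1, "text": "Big news today"},
    {"time": 2, "text": "Something else"},
]
print([r["time"] for r in keep_one_per_duplicate(rows)])  # [1, 2]
print([r["time"] for r in remove_all_duplicates(rows)])   # [2]
```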
5. Language filtering
While multi-language text analysis is possible, you will sometimes want to work with text in a specific language. In the language filtering step, you can choose which language(s) to keep.
Select the language(s). Dcipher will detect the language of each text and remove those in other languages.
Click "Preview" to see the number of rows that will be removed with the given settings and click "Continue" to proceed to the next preprocessing step.
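The filtering logic itself is simple once each text has a language label. The Python sketch below shows the shape of the step with a deliberately crude stand-in detector based on stopword counts. The detector, the tiny stopword lists, and the function names are assumptions for illustration only; how Dcipher actually detects languages is not described in this guide.

```python
import re

# Tiny, hand-picked stopword lists (illustrative only; far too small for real use).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "sv": {"och", "det", "att", "är", "som", "på", "en", "jag"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ich", "zu"},
}


def detect_language(text):
    """Guess the language with the most stopword hits, or None if there are none."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def filter_by_language(texts, keep_languages):
    """Keep only texts whose detected language is in keep_languages."""
    return [t for t in texts if detect_language(t) in keep_languages]


texts = [
    "The weather is nice and it is warm",
    "Det är en fin dag och jag är glad",
    "Das ist nicht der Weg und ich bleibe",
]
kept = filter_by_language(texts, {"en", "sv"})
print(f"{len(texts) - len(kept)} of {len(texts)} rows would be removed")
# prints "1 of 3 rows would be removed"
```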
6. Text cleaning
Texts tend to contain many elements that create noise and are not relevant to the analysis. In the text cleaning step, you can specify what elements to remove from the texts.
Select the elements you want to remove: emojis, URLs, XML tags, line breaks, tabs, punctuation, @ tags, hashtags, or hashtag sequences at the end of texts. The latter allows removing sequences of hashtags at the end of the text (commonly used in social media as content tags) while keeping those inside texts (which tend to provide context).
Note that removing punctuation makes it impossible to subsequently split the text into sentences or extract sentences, and removing line breaks makes it impossible to subsequently split the text into paragraphs.
You can also remove specified prefixes, words that start with specified prefixes, or words that contain a specified substring.
Click "Continue" to proceed to the next preprocessing step.
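As a rough illustration of what a few of these cleaning options do, the Python sketch below removes URLs, XML tags, @ tags, and a trailing hashtag sequence while leaving hashtags inside the text untouched. The regular expressions are simplified assumptions, not the patterns Dcipher uses, and the function name is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
XML_TAG_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
# One or more hashtags, separated only by whitespace, at the very end of the text.
TRAILING_HASHTAGS_RE = re.compile(r"(?:\s*#\w+)+\s*$")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs, XML tags, @ tags, and a trailing hashtag sequence."""
    text = URL_RE.sub(" ", text)
    text = XML_TAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = TRAILING_HASHTAGS_RE.sub("", text)
    # Collapse the gaps left behind by the removals.
    return WHITESPACE_RE.sub(" ", text).strip()


raw = "<p>Thanks @acme for the new #electric bike!</p> https://example.com/x #cycling #commute"
print(clean_text(raw))
# prints "Thanks for the new #electric bike!"
```

The in-text hashtag `#electric` survives because the trailing-hashtag pattern only matches a run of hashtags that reaches the end of the text.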
7. Segmentation or extraction
The original texts are often too long to be ideal units of analysis. For example, topics emerge more clearly on the level of sentences or paragraphs than entire reports.
There are two ways of generating more relevant units from texts: segmentation and extraction. Segmentation is useful in cases where the entire text is relevant for the analysis. Extraction is the preferred choice when only parts of the text, such as sentences containing a certain keyword, are useful for the analysis.
Segmentation splits texts into smaller pieces: into sentences, into paragraphs, at a user-specified pattern, or through smart segmentation. Smart segmentation considers the semantic similarity between adjacent sentences, as well as references between sentences, to split the text into coherent segments. It is a powerful but time-consuming method.
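The three simpler segmentation modes can be approximated in a few lines of Python. This is a naive sketch under assumptions: sentences are taken to end at ".", "!" or "?" followed by whitespace, paragraphs are taken to be separated by line breaks, and the function names are hypothetical. Smart segmentation is not attempted here.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_paragraphs(text):
    """Split on one or more line breaks."""
    return [p.strip() for p in re.split(r"\n+", text) if p.strip()]


def split_by_pattern(text, pattern):
    """Split wherever a user-specified regular expression matches."""
    return [p.strip() for p in re.split(pattern, text) if p.strip()]


report = "Revenue grew. Costs fell!\nOutlook is stable."
print(split_sentences(report))
# ['Revenue grew.', 'Costs fell!', 'Outlook is stable.']
print(split_paragraphs(report))
# ['Revenue grew. Costs fell!', 'Outlook is stable.']
print(split_by_pattern("intro ## part one ## part two", r"##"))
# ['intro', 'part one', 'part two']
```

The sketch also shows why removing punctuation or line breaks during text cleaning would leave these splitters with nothing to split on.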
Extraction finds and extracts relevant content in the texts. This can be done based on a user-specified pattern or based on keywords. The window size (number of sentences before and after matching sentences) can be specified. Setting the window size to 1, for example, includes one sentence before and one sentence after a matching sentence.
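The effect of the window size can be seen in a small Python sketch of keyword-based extraction. The sentence splitter is a naive assumption, the function names are hypothetical, and returning one extract per matching sentence (even when windows overlap) is a design choice made for this example; the guide does not say how overlapping windows are handled.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_by_keyword(text, keyword, window=1):
    """Return each sentence containing keyword plus `window` sentences on each side."""
    sentences = split_sentences(text)
    needle = keyword.lower()
    extracts = []
    for i, sentence in enumerate(sentences):
        if needle in sentence.lower():
            start = max(0, i - window)                   # clamp at the first sentence
            end = min(len(sentences), i + window + 1)    # clamp at the last sentence
            extracts.append(" ".join(sentences[start:end]))
    return extracts


report = (
    "Sales rose in Q1. Battery costs fell sharply. "
    "Margins improved as a result. Hiring slowed. The office moved."
)
print(extract_by_keyword(report, "battery", window=1))
# ['Sales rose in Q1. Battery costs fell sharply. Margins improved as a result.']
print(extract_by_keyword(report, "battery", window=0))
# ['Battery costs fell sharply.']
```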
Click "Continue" to finalize the preprocessing.
8. Preview the results
At the end of the Preprocessing Wizard, you are presented with a summary of the results of the preprocessing steps. Click "Done" to apply the filters and operations to your data. The outputs are then added as new fields in the schema.
What to do next
Preprocessing & cleaning is the starting point for almost everything you may want to do with your text data, so from here, the opportunities are endless.