The near-duplicates removal operation deletes texts that are almost identical to another text in the data, based on the similarity threshold you set. This operation is helpful when texts are similar but not identical, so exact-match deduplication would miss them.
1. Open the operation configuration window
Click the "Add operation" button at the top of the workspace.
Search for "Delete near duplicates" or find the operation under "Preprocessing & cleaning" and click it.
2. Choose text source
In the "Text source" section, select the input text you want to process.
3. Set similarity threshold
The similarity threshold is a value between 0.0 and 1.0 that Dcipher uses to decide whether a pair of texts counts as near-duplicates, based on a semantic similarity measure. It defaults to 0.97. Higher threshold values require texts to be more similar before one is removed, so fewer texts are removed. Lower threshold values, such as 0.6, may remove almost all of the text.
4. Apply the operation
Click "Apply" to run the operation. The near-duplicates are filtered out of the text data according to the semantic similarity threshold you set, and the remaining dataset is shown in Table View.
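To build intuition for how the threshold affects the result, the following Python sketch applies the same idea outside Dcipher. It is only an illustration, not Dcipher's implementation: the function names (embed, cosine_similarity, remove_near_duplicates) are hypothetical, word counts stand in for a real semantic embedding, and treating a similarity equal to or above the threshold as a near-duplicate is an assumption.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding: lowercase word counts (assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_near_duplicates(texts, threshold=0.97):
    """Keep a text only if its similarity to every text kept so far is below the threshold."""
    kept = []
    kept_vectors = []
    for text in texts:
        vec = embed(text)
        if all(cosine_similarity(vec, kv) < threshold for kv in kept_vectors):
            kept.append(text)
            kept_vectors.append(vec)
    return kept


texts = [
    "The delivery arrived late",
    "the delivery arrived late",
    "The delivery arrived very late",
    "Great product and fast shipping",
]

# High threshold: only the text that differs by letter case alone is dropped.
strict = remove_near_duplicates(texts, threshold=0.97)

# Low threshold: the text that differs by a single word is dropped as well.
loose = remove_near_duplicates(texts, threshold=0.6)

print(strict)
print(loose)
```

With this toy measure, the third text has a similarity of about 0.89 to the first, so it survives at the default 0.97 but is removed at 0.6. A real semantic measure will score pairs differently, so it is worth checking the remaining rows in Table View after changing the threshold.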