Near-duplicates removal

Deletes near-duplicate texts based on the similarity threshold you set.

Written by Zafer Çavdar
Updated over a week ago

The near-duplicates removal operation deletes texts that are almost duplicates of one another, based on the similarity threshold you set. It is helpful when texts in your data are similar but not identical.

Step-by-step guide

1. Open the operation configuration window

Click the "Add operation" button at the top of the workspace.

Search for "Delete near duplicates" or find the operation under "Preprocessing & cleaning" and click it.

2. Choose text source

In the "Text source" section, select the input text you want to process.

3. Set similarity threshold

The similarity threshold is a value between 0.0 and 1.0 that Dcipher uses to decide whether a pair of texts counts as near-duplicates, based on a semantic similarity measure. The default is 0.97. Higher thresholds raise the bar for removal, so fewer texts are deleted. Lower thresholds such as 0.6 may remove almost all of the text.
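To build intuition for how the threshold changes the outcome, here is a minimal Python sketch of threshold-based near-duplicate filtering. It is not Dcipher's implementation: the function names are hypothetical, a simple bag-of-words cosine similarity stands in for the semantic similarity measure, and treating a score at or above the threshold as a near-duplicate is an assumption.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts; an assumed stand-in for a semantic embedding."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_near_duplicates(texts, threshold=0.97):
    """Keep a text only if its similarity to every kept text is below the threshold."""
    kept = []
    kept_vectors = []
    for text in texts:
        vec = vectorize(text)
        # Assumption: a score at or above the threshold marks a near-duplicate.
        if all(cosine_similarity(vec, other) < threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vec)
    return kept


texts = [
    "The delivery was late and the box was damaged",
    "The delivery was late and the box was damaged!",
    "Great product, works as described",
]

# The first two texts score about 0.92 under this toy measure.
print(len(remove_near_duplicates(texts, threshold=0.97)))  # 3: nothing is similar enough
print(len(remove_near_duplicates(texts, threshold=0.9)))   # 2: the second text is removed
print(len(remove_near_duplicates(texts, threshold=0.0)))   # 1: everything after the first is removed
```

In this toy example, lowering the threshold from 0.97 to 0.9 removes one text, and a threshold of 0.0 leaves only the first text, which mirrors the behavior described above. Scores from a real semantic similarity measure will differ from these word-count scores.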

4. Apply the operation

Click "Apply" to run the operation. Near-duplicates are filtered out of the text data according to the semantic similarity threshold you set, and the remaining dataset is shown in Table View.
