Keywords: deduplication, duplicate removal

The Delete Duplicates operation removes duplicates among the values in the selected field. Among other things, it can be used to remove duplicate documents in a dataset, such as reposted articles and posts.

Step-by-step guide

1. Open the operation configuration window

Select the field that you want to apply the operation to in the Schema workbench and click the "Add operation" button at the top of the workspace.

Search for "Delete duplicates" or find the operation under "Preprocessing & cleaning" and click it.

2. Specify the traversal

Click the drop-down menu under "Traversal" and select the level of the dataset on which to search for duplicates.

👉 Example 1. Your dataset contains articles that you've split into sentences, and you want to remove duplicate sentences across articles. In this case, the traversal should be set to the article (root) level.

👉 Example 2. Your dataset contains articles that you've split into sentences, and you want to remove duplicate sentences within articles. In this case, the traversal should be set to the sentence level.
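To make the difference between the two examples concrete, here is a minimal Python sketch of deduplicating across articles versus within each article. It only illustrates the idea and is not how the operation is implemented: the sample data and the function names `dedupe_across` and `dedupe_within` are hypothetical, and the sketch assumes that the first occurrence of each sentence is kept.

```python
# Hypothetical sample data: each article is a list of sentences.
articles = [
    ["Hello world.", "Prices rose.", "Hello world."],
    ["Prices rose.", "New rules apply."],
]


def dedupe_across(articles):
    """Keep a sentence only the first time it appears anywhere in the dataset."""
    seen = set()  # shared by all articles
    result = []
    for sentences in articles:
        kept = []
        for sentence in sentences:
            if sentence not in seen:
                seen.add(sentence)
                kept.append(sentence)
        result.append(kept)
    return result


def dedupe_within(articles):
    """Keep a sentence the first time it appears inside each article."""
    result = []
    for sentences in articles:
        seen = set()  # reset for every article
        kept = []
        for sentence in sentences:
            if sentence not in seen:
                seen.add(sentence)
                kept.append(sentence)
        result.append(kept)
    return result


print(dedupe_across(articles))
# [['Hello world.', 'Prices rose.'], ['New rules apply.']]
print(dedupe_within(articles))
# [['Hello world.', 'Prices rose.'], ['Prices rose.', 'New rules apply.']]
```

In the first call, "Prices rose." disappears from the second article because it already occurred in the first one (Example 1). In the second call, it stays because each article is checked on its own (Example 2).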

3. Specify the field to delete duplicates in

Click the drop-down menu under "Field" and select the field that you want to remove duplicates in.

👉 Example. Your dataset contains articles that have been split into sentences, and you want to remove duplicate sentences. You select the field containing the sentences (typically called "sentences.value").

4. Specify whether to leave one or delete all

To keep one value from each set of duplicates rather than removing all of them, make sure the "Leaving first" switch is turned on.
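The effect of the switch can be sketched in a few lines of Python. This is an illustration of the two behaviours described above, not the operation's actual implementation. The function `delete_duplicates` and its `leaving_first` parameter are hypothetical names chosen to mirror the switch.

```python
from collections import Counter


def delete_duplicates(values, leaving_first=True):
    """Remove duplicated values, optionally keeping the first of each set."""
    counts = Counter(values)
    seen = set()
    kept = []
    for value in values:
        if counts[value] == 1:
            # Value is unique, so it is never affected.
            kept.append(value)
        elif leaving_first and value not in seen:
            # First occurrence of a duplicated value: keep it once.
            seen.add(value)
            kept.append(value)
        # Otherwise the value is a duplicate and is dropped.
    return kept


values = ["a", "b", "a", "c", "b"]
print(delete_duplicates(values, leaving_first=True))   # ['a', 'b', 'c']
print(delete_duplicates(values, leaving_first=False))  # ['c']
```

With the switch on, one copy of "a" and "b" survives. With it off, every value that has a duplicate is removed and only "c" remains.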

5. Specify the sorting field and sorting order

If the "Leaving first" switch is turned on, define what "first" means by selecting a field from the two options under "Sorting field" and clicking either the "Ascending" or "Descending" button.

👉 Example. In a dataset with articles, some of which exist in duplicate copies, you want to keep the first article within each set of duplicates. To do this, you set your date field as the sorting field and click "Ascending" to sort in ascending order, i.e. from the earliest date.
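As a rough sketch of how the sorting field and sorting order decide which copy counts as "first", the Python snippet below sorts the records before keeping one per set of duplicates. The sample records, the field names `text` and `date`, and the function `keep_first` are assumptions made for illustration only.

```python
from datetime import date

# Hypothetical sample data: two copies of one article plus a unique article.
articles = [
    {"text": "Storm hits coast", "date": date(2023, 5, 3)},
    {"text": "Storm hits coast", "date": date(2023, 5, 1)},
    {"text": "Election results", "date": date(2023, 5, 2)},
]


def keep_first(records, field, sort_field, descending=False):
    """Sort by sort_field, then keep the first record for each value of field."""
    ordered = sorted(records, key=lambda r: r[sort_field], reverse=descending)
    seen = set()
    kept = []
    for record in ordered:
        if record[field] not in seen:
            seen.add(record[field])
            kept.append(record)
    return kept


# Ascending: the earliest copy of each article is kept.
earliest = keep_first(articles, "text", "date")
print([(r["text"], r["date"].isoformat()) for r in earliest])
# [('Storm hits coast', '2023-05-01'), ('Election results', '2023-05-02')]

# Descending: the latest copy of each article is kept.
latest = keep_first(articles, "text", "date", descending=True)
print([(r["text"], r["date"].isoformat()) for r in latest])
# [('Storm hits coast', '2023-05-03'), ('Election results', '2023-05-02')]
```

Choosing "Ascending" on a date field corresponds to the first call and keeps the original post. "Descending" corresponds to the second call and keeps the most recent repost.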

6. Apply the operation

Click "Apply" to run the operation. Duplicates are now removed from the selected field.
