Keywords: extract duplicates

Whereas the Delete Duplicates operation removes duplicates, the Extract Duplicates operation extracts them instead. This is useful for understanding which sets of duplicates are present in the data.

Step-by-step guide

1. Open the operation configuration window

Select the field that you want to apply the operation to in the Schema workbench and click the "Add operation" button at the top of the workspace.

Search for "Extract duplicates" or find the operation under "Content extraction" and click it.

2. Specify the traversal

Click the drop-down menu under "Traversal" and select the level of the dataset on which to search for duplicates.

Example 1: Your dataset contains articles that you've split into sentences, and you want to extract duplicate sentences across articles. In this case, the traversal should be set to the article (root) level.

Example 2: Your dataset contains articles that you've split into sentences, and you want to extract duplicate sentences within articles. In this case, the traversal should be set to the sentence level.
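To make the difference between the two examples concrete, here is a minimal Python sketch of what "across articles" and "within articles" mean. It is only an illustration of the concept, not how the operation is implemented; the sample data and the `duplicates` helper are made up for this example.

```python
from collections import Counter

# Hypothetical data: each article is a list of sentences.
articles = {
    "article_1": ["Prices rose.", "Markets reacted.", "Prices rose."],
    "article_2": ["Markets reacted.", "Rain is expected."],
}

def duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    found = []
    for value in values:
        if counts[value] > 1 and value not in found:
            found.append(value)
    return found

# Across articles: pool every sentence in the dataset, then look for repeats.
across = duplicates([s for sentences in articles.values() for s in sentences])

# Within articles: look for repeats separately inside each article.
within = {name: duplicates(sentences) for name, sentences in articles.items()}

print(across)  # ['Prices rose.', 'Markets reacted.']
print(within)  # {'article_1': ['Prices rose.'], 'article_2': []}
```

Note that "Markets reacted." only counts as a duplicate when sentences are compared across articles, because it appears once in each article.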

3. Specify the field to extract duplicates from

Click the drop-down menu under "Field" and select the field that you want to extract duplicates from.

Example: Your dataset contains articles that have been split into sentences, and you want to extract duplicate sentences. To do this, select the field containing the sentences (let's call it "sentences.value") under "Field".

4. Specify whether to extract all duplicates or only the first

To extract only the first item in each set of duplicate values rather than all of them, make sure the "Only first" switch is turned on.

5. Specify the sorting field and sorting order

If the "Only first" switch is turned on, define what "first" means by selecting one of the two options under "Sorting field" and clicking either the "Ascending" or "Descending" button.

Example: In a dataset with articles, some of which exist in duplicate copies, you want to extract the first article within each set of duplicates. To do this, you set your date field as the sorting field and click "Ascending" to sort in ascending order, i.e. from the earliest date.
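The effect of "Only first" combined with a sorting field and order can be sketched in Python as follows. This is an assumption-laden illustration of the idea, not the operation's actual implementation; the records, field names, and the `first_of_each_duplicate_set` function are hypothetical.

```python
from datetime import date

# Hypothetical records: "text" is the field checked for duplicates,
# "date" plays the role of the sorting field.
articles = [
    {"id": 1, "text": "Storm hits coast", "date": date(2021, 3, 5)},
    {"id": 2, "text": "Storm hits coast", "date": date(2021, 3, 1)},
    {"id": 3, "text": "Election results", "date": date(2021, 4, 2)},
    {"id": 4, "text": "Storm hits coast", "date": date(2021, 3, 9)},
]

def first_of_each_duplicate_set(records, field, sort_field, descending=False):
    """Sort the records, group them by `field`, and keep the first record
    of every group that contains more than one record."""
    groups = {}
    for record in sorted(records, key=lambda r: r[sort_field], reverse=descending):
        groups.setdefault(record[field], []).append(record)
    return [group[0] for group in groups.values() if len(group) > 1]

# Ascending order: "first" means the earliest date.
earliest = first_of_each_duplicate_set(articles, "text", "date")

# Descending order: "first" means the latest date.
latest = first_of_each_duplicate_set(articles, "text", "date", descending=True)

print([r["id"] for r in earliest])  # [2]
print([r["id"] for r in latest])    # [4]
```

The article with id 3 never appears in the result because its text is unique, so it does not belong to any set of duplicates.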

6. Apply the operation

Click "Apply" to run the operation. Duplicates are now extracted from the selected field.
