Keywords: extract duplicates
Unlike the Delete Duplicates operation, Extract Duplicates does not remove duplicate values but extracts them. This can be useful for understanding what sets of duplicates are present in the data.
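To make the idea of "sets of duplicates" concrete, here is a minimal Python sketch of the concept. It is not how the operation is implemented; the function name `extract_duplicates` and the sample sentences are hypothetical and used only for illustration.

```python
from collections import defaultdict

def extract_duplicates(values):
    """Map each value that occurs more than once to the positions where it occurs."""
    positions = defaultdict(list)
    for index, value in enumerate(values):
        positions[value].append(index)
    # Keep only the values that appear at least twice (the duplicate sets).
    return {value: idxs for value, idxs in positions.items() if len(idxs) > 1}

# Hypothetical sample data.
sentences = ["A cat sat.", "It rained.", "A cat sat.", "The end.", "It rained.", "A cat sat."]
duplicate_sets = extract_duplicates(sentences)
print(duplicate_sets)
# {'A cat sat.': [0, 2, 5], 'It rained.': [1, 4]}
```

Values that occur only once ("The end." in this sample) are not part of any duplicate set and are therefore left out.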
Step-by-step guide
1. Open the operation configuration window
Select the field that you want to apply the operation to in the Schema workbench and click the "Add operation" button at the top of the workspace.
Search for "Extract duplicates" or find the operation under "Content extraction" and click it.
2. Specify the traversal
Click the drop-down menu under "Traversal" and select the level of the dataset on which to search for duplicates.
Example 1: Your dataset contains articles that you've split into sentences, and you want to extract duplicate sentences across articles. In this case, the traversal should be set to the article (root) level.
Example 2: Your dataset contains articles that you've split into sentences, and you want to extract duplicate sentences within articles. In this case, the traversal should be set to the sentence level.
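The difference between the two examples above (duplicates across articles versus duplicates within each article) can be sketched in plain Python. This only illustrates the two scopes and does not show how the traversal setting works internally; the data layout and the function names `duplicates_across` and `duplicates_within` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical dataset: articles that have been split into sentences.
articles = [
    {"id": 1, "sentences": ["Hello.", "Prices rose.", "Hello."]},
    {"id": 2, "sentences": ["Prices rose.", "Goodbye."]},
]

def duplicates_across(articles):
    """Sentences that occur more than once anywhere in the dataset."""
    counts = Counter(s for a in articles for s in a["sentences"])
    return sorted(s for s, n in counts.items() if n > 1)

def duplicates_within(articles):
    """Sentences that occur more than once inside the same article, per article id."""
    result = {}
    for a in articles:
        counts = Counter(a["sentences"])
        dups = sorted(s for s, n in counts.items() if n > 1)
        if dups:
            result[a["id"]] = dups
    return result

print(duplicates_across(articles))  # ['Hello.', 'Prices rose.']
print(duplicates_within(articles))  # {1: ['Hello.']}
```

"Prices rose." counts as a duplicate only in the across-articles scope, because it appears once in each article.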
3. Specify the field to extract duplicates from
Click the drop-down menu under "Field" and select the field that you want to extract duplicates from.
Example: In a dataset with articles that have been split into sentences, you want to extract duplicate sentences. To do this, you select the field containing the sentences (let's call it "sentences.value") under "Field".
4. Specify whether to extract all duplicates or only the first
To extract only the first value within each set of duplicates rather than all of them, make sure the "Only first" switch is turned on.
5. Specify the sorting field and sorting order
If the "Only first" switch is turned on, define what "first" means by selecting one of the two options under "Sorting field" and clicking either the "Ascending" or the "Descending" button.
Example: In a dataset with articles, some of which exist in duplicate copies, you want to extract the first article within each set of duplicates. To do this, you set your date field as the sorting field and click "Asc" to sort in ascending order, i.e. from the earliest date.
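The effect of combining "Only first" with a sorting field and order can be sketched as follows. This is an illustration under assumptions, not the operation's actual implementation: the function `first_of_each_duplicate_set`, the field names `title` and `published`, and the sample records are all hypothetical.

```python
from datetime import date

# Hypothetical articles; records with the same title are treated as duplicates.
articles = [
    {"title": "Storm hits coast", "published": date(2021, 3, 5)},
    {"title": "Markets rally", "published": date(2021, 3, 2)},
    {"title": "Storm hits coast", "published": date(2021, 3, 1)},
    {"title": "Storm hits coast", "published": date(2021, 3, 9)},
    {"title": "Markets rally", "published": date(2021, 3, 4)},
    {"title": "Local news", "published": date(2021, 3, 3)},
]

def first_of_each_duplicate_set(records, field, sort_field, descending=False):
    """Return the first record of every duplicate set, ordered by sort_field."""
    groups = {}
    for record in records:
        groups.setdefault(record[field], []).append(record)
    firsts = []
    for group in groups.values():
        if len(group) > 1:  # only sets that actually contain duplicates
            ordered = sorted(group, key=lambda r: r[sort_field], reverse=descending)
            firsts.append(ordered[0])
    return firsts

# Ascending order: "first" means the earliest date in each set.
earliest = first_of_each_duplicate_set(articles, "title", "published")
# Descending order: "first" means the latest date in each set.
latest = first_of_each_duplicate_set(articles, "title", "published", descending=True)

print([str(r["published"]) for r in earliest])  # ['2021-03-01', '2021-03-02']
print([str(r["published"]) for r in latest])    # ['2021-03-09', '2021-03-04']
```

Switching the sort order changes which record counts as "first" in each set, while the sets themselves stay the same.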
6. Apply the operation
Click "Apply" to run the operation. Duplicates are now extracted from the selected field.