Keywords: extract by pattern, extract by regex
The Extract Properties by Pattern operation extracts one or more properties from text based on user-specified patterns written as regular expressions (regex). It is useful for extracting information that follows a predictable pattern, such as dates, number sequences, or fields in unstructured forms.
Step-by-step guide
1. Open the operation configuration window
In the Schema workbench, select the field that you want to extract properties from, then click the "Add operation" button at the top of the workspace.
Search for "Extract Properties by Pattern" or find the operation under "Preprocessing & cleaning" and click it.
2. Name the output field
Under "Output field name", type the name of the field that you want the extracted patterns to be inserted into.
3. Specify the pattern
Under "Regexp", type the regular expression specifying the patterns that you want to extract. Use parentheses to mark the parts that should be extracted.
For example, the regular expression (\d{4})-(\d{2})-(\d{2})
matches every sequence of four, two, and two digits separated by dashes, and each pair of parentheses marks one part to extract. If "2021-01-01" appeared in the text, the properties 2021, 01, and 01 would be extracted.
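To check what the parenthesized groups in a pattern capture before entering it under "Regexp", the pattern can be tried outside the tool. The following is a minimal sketch using Python's standard `re` module. It illustrates the regular expression only, not the operation's internals, and the regex engine used by the operation may differ in some details. The sample text is made up for illustration.

```python
import re

# Pattern from the example: three capture groups separated by dashes.
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Hypothetical sample text, for illustration only.
text = "Invoice issued 2021-01-01, due 2021-02-15."

# findall returns one tuple per match, with one entry per capture group.
matches = pattern.findall(text)
print(matches)  # [('2021', '01', '01'), ('2021', '02', '15')]
```

Each tuple corresponds to one match in the text, and each entry in a tuple corresponds to one pair of parentheses in the pattern.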
4. Label the extracted properties
Under "Labels", type the names of the fields that the properties should be inserted into, one label per pair of parentheses in the pattern. In the example above, we may name the labels "Year", "Month", and "Date".
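The relationship between labels and extracted parts can be sketched in Python as follows. This is an illustration under assumptions, not the operation's actual implementation: it assumes that labels are paired with parenthesized groups in order, and the function name `extract_labeled` and the sample text are hypothetical.

```python
import re

# Date pattern with three capture groups, and one label per group.
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
labels = ["Year", "Month", "Date"]

def extract_labeled(text):
    """Return one dict per match, mapping each label to its captured group.

    Assumption: labels are matched to groups by position (first label to
    first group, and so on).
    """
    return [dict(zip(labels, m.groups())) for m in pattern.finditer(text)]

# Hypothetical sample text, for illustration only.
records = extract_labeled("Signed on 2021-01-01.")
print(records)  # [{'Year': '2021', 'Month': '01', 'Date': '01'}]
```

Note that the captured values are strings, so leading zeros such as "01" are preserved.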
5. Apply the operation
Click "Apply" to run the operation. The output fields are created and the extracted properties are inserted into the corresponding fields.