Property extraction by pattern

Extract Properties by Pattern is used to extract properties from text based on user-specified patterns in the form of regular expressions.

Written by Tomas Larsson
Updated over a week ago

Keywords: extract by pattern, extract by regex

The Extract Properties by Pattern operation extracts one or more properties from text based on user-specified patterns written as regular expressions (regex). It is useful for extracting information that follows a predictable format, such as dates, number sequences, or fields in unstructured forms.

Step-by-step guide

1. Open the operation configuration window

Select the field that you want to extract properties from in the Schema workbench and click the "Add operation" button at the top of the workspace.


Search for "Extract Properties by Pattern" or find the operation under "Preprocessing & cleaning" and click it.


2. Name the output collection

Under "Output field name", type the name of the field that the extracted matches should be inserted into.

3. Specify the pattern

Under "Regexp", type the regular expression specifying the patterns that you want to extract. Use parentheses to mark the parts that should be extracted.

For example, the regular expression (\d{4})-(\d{2})-(\d{2}) matches four digits, two digits, and two digits separated by dashes, and captures each group of digits separately. If "2021-01-01" appears in the text, the properties 2021, 01, and 01 are extracted.
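To try a pattern before entering it under "Regexp", it can help to test it outside the tool. The sketch below uses Python's re module, on the assumption that it treats this particular pattern the same way the operation does. The regex engine behind the operation is not documented here, and the function name extract_groups and the sample text are made up for illustration.

```python
import re

# The pattern from the example above: three capture groups separated by dashes.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def extract_groups(text):
    """Return one tuple of captured groups for each match found in text."""
    return [match.groups() for match in DATE_PATTERN.finditer(text)]

sample = "Invoice issued 2021-01-01, due 2021-02-15."
print(extract_groups(sample))
# [('2021', '01', '01'), ('2021', '02', '15')]
```

Only the parts inside parentheses are returned. The dashes are needed for a match but are not extracted.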

4. Label the extracted properties

Under "Labels", type the names of the fields that the properties should be inserted into. In the example above, the labels could be "Year", "Month", and "Date".
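One way to picture how labels relate to the parenthesized groups is to pair them up by position. The sketch below is in Python and is an illustration only: the helper extract_labeled is hypothetical, and both the pairing by position and the check that the counts match are assumptions, not documented behavior of the operation.

```python
import re

def extract_labeled(pattern, labels, text):
    """Pair each capture group with a label by position, one dict per match."""
    results = []
    for match in re.finditer(pattern, text):
        groups = match.groups()
        # Assumption: one label is expected per capture group.
        if len(groups) != len(labels):
            raise ValueError("number of labels must equal number of capture groups")
        results.append(dict(zip(labels, groups)))
    return results

rows = extract_labeled(
    r"(\d{4})-(\d{2})-(\d{2})",
    ["Year", "Month", "Date"],
    "Signed on 2021-01-01.",
)
print(rows)
# [{'Year': '2021', 'Month': '01', 'Date': '01'}]
```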

5. Apply the operation

Click "Apply" to run the operation. The output fields are created, and each extracted property is inserted into its corresponding field.
