Keywords: split by pattern, split by regex
Splits text at every match of a user-specified regular expression. Useful for splitting text into paragraphs or sentences, or at a delimiter.
Two examples of how the operation can be used:
👉 Example 1. In a dataset with academic articles, we have a text string with author names separated by a semicolon, e.g. "R Collobert; J Weston; L Bottou; M Karlen". By splitting the string by the pattern "; " we get the collection [R Collobert, J Weston, L Bottou, M Karlen]. Splitting into individual author names is useful for answering questions such as "Who is the most cited author in relation to a research topic?" and "What does the co-authorship network in relation to the topic look like?"
👉 Example 2. In a report, chapters start with "Chapter 1: ...", etc. To split the report into chapters, we can use the pattern "Chapter \d+:", where "\d+" matches one or more digits.
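To preview what a pattern will do before applying it, the two examples can be reproduced outside the tool. The following is a minimal sketch using Python's built-in re module, not the operation's own implementation; the sample report text is made up for illustration, and the operation may treat empty pieces or surrounding whitespace differently.

```python
import re

# Example 1: split the author string on the "; " delimiter.
authors = "R Collobert; J Weston; L Bottou; M Karlen"
author_list = re.split(r"; ", authors)
print(author_list)

# Example 2: split a report on its chapter headings.
# The report text below is invented sample data.
report = "Chapter 1: Introduction text. Chapter 2: Methods text."
pieces = re.split(r"Chapter \d+:", report)

# In Python, the text before the first match is an empty string and the
# matched headings themselves are dropped, so we strip and filter here.
chapters = [piece.strip() for piece in pieces if piece.strip()]
print(chapters)
```

Note that in this sketch the matched text ("; " or "Chapter 1:") does not appear in the output. Whether the operation also discards the matched text is worth checking on a small sample first.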
1. Open the operation configuration window
Select the text that you want to split in the Schema workbench and click the "Add operation" button at the top of the workspace.
Search for "Split by pattern" or find the operation under "Text segmentation & tokenization" and click it.
2. Name the output collection
Under "Output field name", type the name of the field that the collection should be inserted into. In the examples above, suitable output field names might be "Authors" and "Chapters".
3. Define the pattern to split by
Under "Regexp", type the regular expression that you want to split the input text by.
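For the paragraph and sentence use cases mentioned at the top, the patterns below are one possible starting point. This is a sketch in Python's regular expression syntax; the dialect used by the "Regexp" field is not stated here, so constructs such as the lookbehind "(?<=...)" are an assumption and may need adjusting. The sample text is invented.

```python
import re

# Invented sample text: two paragraphs separated by a blank line.
text = "First paragraph. It has two sentences.\n\nSecond paragraph."

# Paragraphs: split on a blank line (a newline, optional whitespace, a newline).
paragraph_pattern = r"\n\s*\n"
paragraphs = re.split(paragraph_pattern, text)

# Sentences: split on whitespace that follows ".", "!" or "?".
# The lookbehind keeps the punctuation attached to the sentence.
# This is a rough heuristic and will mis-split abbreviations such as "e.g. this".
sentence_pattern = r"(?<=[.!?])\s+"
sentences = re.split(sentence_pattern, paragraphs[0])

print(paragraphs)
print(sentences)
```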
4. Apply the operation
Click "Apply" to run the operation. The output collections are added as a new field with the name you chose in step 2.