Pattern-based text splitting

The Split by Pattern operation splits text by pattern, for example into sentences or paragraphs.

Written by Tomas Larsson
Updated over a week ago

Keywords: split by pattern, split by regex

The operation splits the input text wherever a user-specified regular expression matches. It is useful for splitting text into paragraphs or sentences, or by a delimiter.

A couple of examples of how the operation can be used:

👉 Example 1. In a dataset with academic articles, we have a text string with author names separated by a semicolon, e.g. "R Collobert; J Weston; L Bottou; M Karlen". By splitting the string by the pattern "; " we get the collection [R Collobert, J Weston, L Bottou, M Karlen]. Splitting into individual author names is useful for answering questions such as "Who is the most cited author in relation to a research topic?" and "What does the co-authorship network in relation to the topic look like?"
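
As a rough illustration of Example 1, the same split can be reproduced with Python's standard `re` module. This is only a sketch of the idea, not a description of how the operation is implemented; the variable names are made up for the example.

```python
import re

# Author string from Example 1
authors_text = "R Collobert; J Weston; L Bottou; M Karlen"

# Split wherever the pattern "; " (a semicolon followed by a space) matches
authors = re.split(r"; ", authors_text)

print(authors)
# ['R Collobert', 'J Weston', 'L Bottou', 'M Karlen']
```

Including the space in the pattern means the names come out without leading whitespace.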

👉 Example 2. In a report, chapters start with "Chapter 1: ...", etc. To split the report into chapters, we can use the pattern "Chapter \d+:".
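
To get a feel for what the pattern in Example 2 matches, here is a small sketch using Python's `re` module on an invented report text. It assumes plain regular-expression splitting; the article does not say whether the operation itself keeps the chapter headings or empty pieces, so treat the clean-up step as an assumption.

```python
import re

# Hypothetical report text, invented for this example
report = "Chapter 1: Getting started. Chapter 2: Going further."

# Split wherever "Chapter ", one or more digits, and a colon match
pieces = re.split(r"Chapter \d+:", report)

# re.split drops the matched headings and returns an empty first piece
# because the text starts with a match; tidy both up here
chapters = [piece.strip() for piece in pieces if piece.strip()]

print(chapters)
# ['Getting started.', 'Going further.']
```

Here `\d+` matches one or more digits, so the same pattern also covers "Chapter 10:", "Chapter 11:", and so on.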

Step-by-step guide

1. Open the operation configuration window

Select the text that you want to split in the Schema workbench and click the "Add operation" button at the top of the workspace.

Search for "Split by pattern" or find the operation under "Text segmentation & tokenization" and click it.

2. Name the output collection

Under "Output field name", type the name of the field that the collection should be inserted into. In the examples above, suitable output field names might be "Authors" and "Chapters".

3. Define the pattern to split by

Under "Regexp", type the regular expression that the input text should be split by. In the examples above, the patterns are "; " and "Chapter \d+:".

4. Apply the operation

Click "Apply" to run the operation. A new field containing the output collections has now been added.