Keywords: segmentation, split text
A classic text preprocessing problem is splitting text into relevant segments. Splitting text into sentences and paragraphs is easy, but meaningful segments can span several of these units or be contained within a single one. Identifying relevant segments is particularly important in text where newlines are missing or paragraphs are long. To address this, the Segment Text operation uses semantic similarity and references between adjacent sentences to split the text into coherent segments.
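To make the idea concrete, here is a minimal Python sketch of similarity-based segmentation. It is not the Segment Text operation's actual implementation: it assumes a simple word-overlap (bag-of-words cosine) measure in place of true semantic similarity, ignores references between sentences, and uses a naive sentence splitter. The function names and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    # Lowercased word counts; a stand-in for a real semantic representation.
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_text(text, threshold=0.1):
    # Start a new segment whenever two adjacent sentences fall below
    # the similarity threshold (hypothetical parameter).
    sentences = split_sentences(text)
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) >= threshold:
            segments[-1].append(cur)
        else:
            segments.append([cur])
    return [" ".join(seg) for seg in segments]


sample = (
    "The cat sat on the mat. The cat chased the mouse. "
    "Stocks fell sharply today. Stocks fell again on Friday."
)
for segment in segment_text(sample):
    print(segment)
```

In this sketch the two sentences about the cat end up in one segment and the two about stocks in another, even though the input contains no newlines. A word-overlap measure is easily misled by common words such as "the", which is one reason a real segmentation operation needs a richer notion of similarity.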
Because of the complexity of the task, the Segment Text operation is significantly more time-consuming than splitting text into paragraphs or sentences with the Preprocessing Wizard or the Split by Pattern operation. It is therefore best suited to cases where sentences and paragraphs do not capture the units of meaning in the text of interest well.
Step-by-step guide
1. Open the operation configuration window
Select the text field that you want to segment and click the "Add operation" button at the top of the workspace.
Search for "Segment text" or find the operation under "Text segmentation & tokenization" and click it.
2. Name the output field
Under "Output field name", type the name of the output field.
3. Apply the operation
Click "Apply" to run the operation. The segments are then generated and inserted into the output collection.