Keywords: sentence boundaries, split into sentences, sentencizers
"Split into sentences" operation supports language-specific sentence boundary detection and splits text into sentences at the correct boundaries for 185 languages.
Step-by-step guide
Before running the operation, consider cleaning and preprocessing your text data with the Preprocessing Wizard. Removing very long texts speeds up sentence boundary detection. You can also reduce noise by stripping URLs, @ tags, and similar elements from the texts, and by removing texts that are not in the desired language(s).
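For readers who prepare their data outside the application, the kind of cleaning described above can be sketched in Python. This is only an illustration and not how the Preprocessing Wizard is implemented: the regular expressions, the function names clean_text and preprocess, and the 5000-character cutoff are all assumptions chosen for the example.

```python
import re

# Assumed patterns for this sketch; the Preprocessing Wizard may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # "@name", but not the "@" inside an email address
MAX_CHARS = 5000  # hypothetical cutoff for "long" texts


def clean_text(text):
    """Remove URLs and @ tags, then collapse the leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(texts, max_chars=MAX_CHARS):
    """Clean every text and drop the ones that end up empty or are too long."""
    cleaned = (clean_text(t) for t in texts)
    return [t for t in cleaned if t and len(t) <= max_chars]
```

For example, preprocess(["@bob hi", "http://t.co/abc"]) keeps only "hi", because the second text is empty once its URL has been removed.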
1. Select the text field
Select the text field you want to split into sentences in the Schema View.
2. Open the operation configuration window
Click the "Add operations" button at the top of the workspace, open the "text segmentation & tokenization" tab, and click "Split into sentences".
3. Set the parameters of the operation
The following parameters can be set:
Text source: Name of the input text source field.
Language: The text language. If no language is specified, the language is automatically detected from a sample of the data.
Language source (optional): A field that specifies the language of each row. Use it when different rows contain text in different languages. If no such field exists, you can create one with the Detect Language operation.
Output field name: The name of the field under which the detected sentences are stored as an entity collection.
4. Apply the operation
Click "Apply". A new sentence collection (named "sentences" by default) is generated and added to the schema.
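To give a rough idea of the shape of the result, the Python sketch below adds a "sentences" list to each row of a small dataset. It is a simplified stand-in, not the operation's actual algorithm: it uses one naive punctuation rule instead of language-specific boundary detection, and the field name "text" and the function names are assumptions made for the example.

```python
import re

# Naive stand-in rule: a sentence ends at ".", "!" or "?" when followed by
# whitespace and then an uppercase ASCII letter or a digit. Real
# language-specific detection also handles abbreviations, other scripts, etc.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_into_sentences(text):
    """Split one text into a list of sentences using the naive rule above."""
    return [s.strip() for s in BOUNDARY_RE.split(text) if s.strip()]


def add_sentence_field(rows, text_field="text", output_field="sentences"):
    """Return copies of the rows with the sentence list stored under output_field."""
    return [
        {**row, output_field: split_into_sentences(row[text_field])}
        for row in rows
    ]


rows = [{"text": "It works. Does it? Yes!"}]
print(add_sentence_field(rows)[0]["sentences"])
# ['It works.', 'Does it?', 'Yes!']
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason language-specific detection is preferable.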