Keywords: content extraction, keyword-based extraction

The Extract Content by Keyword operation extracts the sentences that contain one or more keywords that you provide. You can also set an extraction window: the number of sentences to include on each side of every matching sentence.

For example, consider the following text:

"The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, funding was dramatically reduced."

With the keyword "machine translation" and window size 0, a single sentence is extracted:

"The authors claimed that within three or five years, machine translation would be a solved problem."

With window size 1, all three sentences in the paragraph are extracted: the matching sentence plus one sentence on each side.
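The example above can be reproduced with a minimal Python sketch. This is an illustration only, not the operation's actual implementation: the function names, the naive punctuation-based sentence splitter, and the case-insensitive substring matching are all assumptions made here.

```python
import re

def split_sentences(text):
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def extract_by_keyword(text, keywords, window=0):
    # Hypothetical helper: returns the matching sentences plus `window`
    # sentences on each side, in their original order.
    sentences = split_sentences(text)
    lowered = [k.lower() for k in keywords]  # assumption: case-insensitive match
    keep = set()
    for i, sentence in enumerate(sentences):
        if any(k in sentence.lower() for k in lowered):
            first = max(0, i - window)
            last = min(len(sentences) - 1, i + window)
            keep.update(range(first, last + 1))
    return [sentences[i] for i in sorted(keep)]

text = (
    "The Georgetown experiment in 1954 involved fully automatic translation "
    "of more than sixty Russian sentences into English. The authors claimed "
    "that within three or five years, machine translation would be a solved "
    "problem. However, real progress was much slower, and after the ALPAC "
    "report in 1966, funding was dramatically reduced."
)

print(len(extract_by_keyword(text, ["machine translation"], window=0)))  # 1
print(len(extract_by_keyword(text, ["machine translation"], window=1)))  # 3
```

Note that the first sentence contains "automatic translation" but not "machine translation", so it is only included when the window reaches it.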

When the windows of two matching sentences overlap, they are merged. Suppose a text has six consecutive sentences, numbered 1-6, and the window size is 1. If a keyword appears in sentences 2 and 4, the windows 1-3 and 3-5 share sentence 3, so sentences 1-5 are extracted as a single segment. If instead the keyword appears in sentences 2 and 5, the windows 1-3 and 4-6 do not overlap, so two segments are extracted: the first with sentences 1-3 and the second with sentences 4-6.
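The segment behavior in this example can be sketched as interval merging over sentence numbers. The function below is hypothetical and only mirrors the two cases described here; it assumes that windows are merged when they share at least one sentence and are kept apart when they are merely adjacent.

```python
def merge_segments(hits, window, total):
    # hits: 1-based numbers of the sentences that contain a keyword.
    # total: number of sentences in the text.
    # Returns a list of (first, last) sentence ranges, one per segment.
    segments = []
    for hit in sorted(hits):
        start = max(1, hit - window)
        end = min(total, hit + window)
        if segments and start <= segments[-1][1]:
            # This window shares a sentence with the previous segment: extend it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments

print(merge_segments([2, 4], window=1, total=6))  # [(1, 5)]
print(merge_segments([2, 5], window=1, total=6))  # [(1, 3), (4, 6)]
```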

Step-by-step guide

1. Open the operation configuration window

Select the text field that you want to extract content from and click the "Add operation" button at the top of the workspace.

Search for "Extract Content by Keyword" or find the operation under "Content extraction" and click it.

2. Name the output field

Under "Output field name", type the name of the output field.

3. Specify the keywords

Under "Keywords", type the keywords that the extracted sentences must contain. Press Enter after each keyword.

4. Set the number of sentences

Under "Number of sentences", specify how many sentences you want to include on each side of every matching sentence.

5. Apply the operation

Click "Apply" to run the operation. The content matching your keywords is extracted from the input text and inserted into the output collection.
