Keywords: semantic filtering, contextual filtering, disambiguation

Natural languages can be vague and ambiguous. It is nearly impossible to capture what is relevant and eliminate what is not through keyword-based filters. Semantic filtering, by contrast, does not seek to explicitly define meaning. Instead, it clusters semantically similar texts and lets the user apply filters to these clusters.

Semantic filtering is used for locating and filtering different meanings of the same expression. For example, "Paris" when referring to the capital of France is contextually different from "Paris" when used for the city in Texas, USA. Similarly, "5g" when referring to 5G in telecom is contextually distinct from "5g" when referring to "5 grams" in recipes. Texts using these terms cluster semantically into different groups depending on their meaning. Thus, rather than relying on keyword-based criteria to capture a certain meaning of "5g" or "Paris", semantic filtering relies on human interpretation and filtering of these semantic clusters.

Semantic filtering also provides a way of locating and filtering the same meaning of different expressions. Natural language offers a wide variety of ways of speaking about any particular thing. Think about, for example, the many ways in which a restaurant guest may complain about the service. Even if two texts do not have a single word in common, they can still have similar meanings and thus be semantically similar. Semantic filtering, therefore, enables filtering based on meaning rather than keywords.
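The intuition behind "similar meaning without shared words" can be sketched in a few lines of Python. The embedding vectors below are made-up toy values standing in for real sentence embeddings, not the output of any actual model:

```python
import math

def cosine(a, b):
    # Cosine similarity: close to 1.0 = same direction (similar meaning),
    # close to 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two complaints about restaurant service with no words in common,
# plus one unrelated compliment. The vectors are hypothetical toy values.
embeddings = {
    "The waiter ignored us all evening":   [0.90, 0.10, 0.80],
    "Nobody came to take our order":       [0.85, 0.15, 0.75],
    "The tiramisu was wonderfully creamy": [0.10, 0.90, 0.20],
}

texts = list(embeddings)
print(cosine(embeddings[texts[0]], embeddings[texts[1]]))  # high: similar meaning
print(cosine(embeddings[texts[0]], embeddings[texts[2]]))  # low: different topic
```

The two complaints score high despite sharing no keywords, which is exactly what a keyword filter would miss.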

Step-by-step guide

1. Add a Scatter Plot

Click "Add workbench" at the top of the workspace and click the "Scatter Plot" icon. A Scatter Plot workbench is now added to the workspace.

2. Trigger Document landscaping

Drag your text field of interest to the "Document landscaping" drop zone in the Scatter Plot workbench. If you have tens of thousands of relatively short texts, this may take a few minutes.

💡 Quick tip. While waiting for the document landscape to load, we recommend resizing the workbench from its bottom-right corner to make it as large as possible, enabling nuances to emerge more clearly.

When the document landscaping operation is completed, texts are displayed as dots in the scatter plot. The closer two texts are to each other in the plot, the more semantically similar they are.

The two axes have no concrete meaning. They simply represent the two dimensions of a surface onto which the texts have been projected from a higher-dimensional vector space. The projection methods available in Dcipher all strive to preserve the distances and structures of the original space. You can read more about the methods available for dimension reduction in Dcipher and how they work here.

☝️ Technical note: By default, document landscaping uses fastText for vectorization and UMAP for reduction. You can view and change the method used by expanding the respective operation in the Manage Pipeline sidebar. Other available options include BERT and GloVe for vectorization and t-SNE for reduction.
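The idea behind the projection can be illustrated with PCA, a simpler linear method than UMAP or t-SNE. The data and method here are illustrative only, not what Dcipher runs internally:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "document vectors": two clusters in a 50-dimensional space,
# standing in for fastText/BERT embeddings (hypothetical data).
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 50))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(20, 50))
vectors = np.vstack([cluster_a, cluster_b])

# PCA via SVD: project onto the two directions of greatest variance.
# (UMAP and t-SNE are nonlinear, but the goal -- a 2D map that keeps
# similar documents close together -- is the same.)
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T  # one (x, y) dot per document

# Documents from the same cluster land close together in 2D,
# documents from different clusters land far apart.
dist_within = np.linalg.norm(coords_2d[0] - coords_2d[1])
dist_between = np.linalg.norm(coords_2d[0] - coords_2d[20])
print(dist_within < dist_between)
```

Nonlinear methods such as UMAP and t-SNE can preserve more complex structure than this linear projection, which is why they are used for document landscaping.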

3. Set the number of texts to be displayed

If the data contains more rows than the max value shown in the top-left corner, a random sample of texts is displayed in the view. You can increase the number, but since everything displayed in the workbench is read into your browser's memory, it is recommended to keep the number at or below 5,000.

4. Interpret semantic clusters through individual texts

Hover over a dot to read the beginning of the text it represents. To read texts in full, select dots and drag them to the "View data" drop zone in the Table View.

5. Interpret semantic clusters through aggregation

Some regions of the plot are dense, meaning they contain many semantically similar texts, while others are sparse. By clicking "Show contour line and heatmap" under "View options" in the workbench header, dense regions are displayed as hills in a landscape and sparse regions as valleys.
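What counts as a hill or a valley comes down to point density, which can be sketched with a simple grid count. The coordinates and cell size below are toy values, and real heatmaps typically use smoother density estimates:

```python
from collections import Counter

# Toy 2D document coordinates in the landscape (hypothetical values).
points = [(0.10, 0.10), (0.12, 0.15), (0.09, 0.11), (0.11, 0.13),  # dense region
          (0.80, 0.90), (0.30, 0.70)]                              # sparse outliers

def cell(p, size=0.25):
    # Map a point to the coarse grid cell it falls into.
    return (int(p[0] // size), int(p[1] // size))

# High-count cells are the "hills" of the landscape,
# low-count cells the "valleys".
density = Counter(cell(p) for p in points)
print(density.most_common(1))  # densest cell and its point count
```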

If you untick "Show dots" under "View options", you can instead select and drag contours. The contours in the landscape represent all texts located inside them, even texts that are not among the samples displayed as dots in the view. So if you drag a contour to the Table View, all the associated rows are displayed there.

In order to interpret what a hill in the document landscape is about, it is useful to understand what tokens are characteristic of the texts in that part of the landscape.

To get this information, display tokens in the Bubble View, then drag the contours you're interested in to the "Find overrepresented values" drop zone in the Bubble View. The tokens that are most overrepresented in the corresponding texts are then shown, making it easier to interpret the content of each cluster.
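"Overrepresented" here means a token is more frequent inside the selected contour than in the corpus overall. A minimal sketch of that idea, using a simple frequency ratio on toy data (Dcipher's actual scoring method may differ):

```python
from collections import Counter

# Texts behind a selected contour vs. the full corpus (toy data).
cluster = ["5g network rollout", "5g coverage in rural areas", "5g antenna upgrade"]
corpus = cluster + ["add 5 g of sugar", "paris travel tips", "stir in 5 g of salt"]

def token_freqs(texts):
    # Relative frequency of each whitespace token across the given texts.
    tokens = [tok for text in texts for tok in text.split()]
    total = len(tokens)
    return {tok: n / total for tok, n in Counter(tokens).items()}

in_cluster, overall = token_freqs(cluster), token_freqs(corpus)

# Overrepresentation score: how much more frequent a token is inside
# the contour than in the corpus as a whole.
scores = {tok: in_cluster[tok] / overall[tok] for tok in in_cluster}
print(max(scores, key=scores.get))  # a token characteristic of the cluster
```

Tokens like "5g" score high for the telecom contour because they rarely occur elsewhere, which is what makes them useful labels for a cluster.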

6. Apply semantic filter

Select contours representing texts to be included or excluded and click the filter icon in the workbench header.

Specify where in the dataset the filter should be applied (e.g. root-level texts or split/extracted segments further down in the dataset's nested structure) and select "Global" to apply the filter to the global pipeline rather than locally in the workbench. Click "Apply filter". The rows corresponding to the selected texts have now been removed from the dataset.

If you want to remove texts from the landscape without removing them from the dataset, for example as a way to "zoom" into a particular part of the landscape, select "Local" before applying the filter.
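Conceptually, a global "exclude" filter drops the selected rows from the dataset itself, while a local filter only restricts what the workbench displays. A minimal sketch with a hypothetical dataset and row ids:

```python
# Hypothetical ids of rows selected via contours in the scatter plot.
selected_ids = {2, 4}

dataset = [
    {"id": 1, "text": "5g coverage in rural areas"},
    {"id": 2, "text": "add 5 g of sugar to the batter"},
    {"id": 3, "text": "5g antenna upgrade schedule"},
    {"id": 4, "text": "stir in 5 g of salt"},
]

# Global exclude filter: the dataset itself shrinks.
filtered_global = [row for row in dataset if row["id"] not in selected_ids]

# A local filter would instead leave `dataset` untouched and only
# restrict which rows the workbench displays.
print([row["id"] for row in filtered_global])
```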

What to do next

Once semantic filtering is complete, your text data has been rid of irrelevant content and noise in the data has been reduced. It is now ready to be used for text enrichment, annotation, analytics, or model training.
