Keywords: document landscaping, semantic similarity, semantic clustering, theme detection

Document landscaping is a way of organizing texts based on their semantic similarity. It is useful for finding clusters of similar texts and the themes they correspond to. It gives a visual overview of the data.

Step-by-step guide

1. Generate a document landscape

Drag the text field of interest to the "Document landscaping" drop zone in the Scatter Plot workbench. With tens of thousands of relatively short texts, this may take a few minutes.

💡 Quick tip. While waiting for the document landscape to load, we recommend resizing the workbench from its bottom-right corner to make it as large as possible, so that nuances emerge more clearly.

When the document landscaping operation is completed, texts are displayed as dots in the scatter plot. The closer two texts are to each other in the plot, the more semantically similar they are.
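To make "semantically similar" more concrete, the sketch below compares texts by the cosine similarity of their vectors. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and cosine similarity is an assumption here, since this guide does not state which similarity measure Dcipher uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example only.
vectors = {
    "refund delayed": [0.9, 0.1, 0.0],
    "money not returned": [0.8, 0.2, 0.1],
    "great hiking trail": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(vectors["refund delayed"], vectors["money not returned"])
sim_far = cosine_similarity(vectors["refund delayed"], vectors["great hiking trail"])
```

In this toy data the two refund-related texts score close to 1, while the hiking text scores close to 0, which is the kind of relationship the plot turns into short and long distances.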

The two axes do not have a concrete meaning. They simply represent the two dimensions of a surface that the texts have been projected onto from a higher-dimensional vector space. The projection methods available in Dcipher differ in their details, but all of them aim to preserve the distances or structures of the original space.

☝️ Technical note: By default, document landscaping uses fastText for vectorization and UMAP for reduction. You can view and change the techniques used by expanding the respective operations in the Manage Pipeline sidebar. Other available options include BERT and GloVe for vectorization and t-SNE for reduction.
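The reduction step can be illustrated with a much simpler technique than the ones named above. The sketch below projects toy vectors to two dimensions with PCA computed by power iteration. PCA is a linear stand-in chosen only because it fits in a few lines of standard-library Python; it is not UMAP or t-SNE, and the four-dimensional input vectors are hypothetical.

```python
import math
import random

def center(rows):
    """Subtract the column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def top_component(rows, iterations=200, seed=0):
    """Leading principal direction of centered rows, found by power iteration."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.uniform(-1, 1) for _ in range(d)]
    length = math.sqrt(sum(x * x for x in v)) or 1.0
    v = [x / length for x in v]
    for _ in range(iterations):
        # Multiply v by X^T X without building the covariance matrix.
        scores = [sum(r[j] * v[j] for j in range(d)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left in the data
        v = [x / norm for x in w]
    return v

def project_2d(vectors):
    """Project vectors onto their first two principal components."""
    rows = center(vectors)
    d = len(rows[0])
    pc1 = top_component(rows)
    s1 = [sum(r[j] * pc1[j] for j in range(d)) for r in rows]
    # Deflation: remove the pc1 part of each row, then find the next direction.
    residual = [[r[j] - s * pc1[j] for j in range(d)] for r, s in zip(rows, s1)]
    pc2 = top_component(residual, seed=1)
    s2 = [sum(r[j] * pc2[j] for j in range(d)) for r in rows]
    return list(zip(s1, s2))

# Hypothetical 4-dimensional text vectors forming two groups.
vectors = [
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
points = project_2d(vectors)
```

The first two points land near each other and far from the last two, so the grouping present in the original space survives the projection. Nonlinear methods such as UMAP and t-SNE pursue the same goal with different trade-offs.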

2. Set the number of texts to be displayed

If the data contains more rows than the max value shown in the top-left corner of the workbench, a random sample of the texts is displayed in the view. You can increase the number, but since everything displayed in the workbench is kept in your browser's memory, it is usually a good idea to keep it below 5,000 to 10,000.
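The sampling behavior described above can be sketched as follows. The function name and the fixed seed are hypothetical, and the sketch assumes uniform sampling without replacement, which this guide does not confirm.

```python
import random

def sample_rows_for_display(rows, max_displayed, seed=None):
    """Return all rows if they fit, otherwise a uniform random sample without replacement."""
    if len(rows) <= max_displayed:
        return list(rows)
    return random.Random(seed).sample(rows, max_displayed)

rows = [f"text {i}" for i in range(50_000)]
displayed = sample_rows_for_display(rows, max_displayed=5_000, seed=42)
```

The full list of rows is left untouched; only the displayed subset is limited.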

3. Display the density of the landscape

Some regions of the plot are denser than others. Click "Show contour lines and heatmap" to make the dense regions emerge as hills in a landscape, while the sparse regions are displayed as valleys. Each hill represents a cluster of semantically similar texts and can be interpreted as a theme in the text data.
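As a rough illustration of what "density" means here, the sketch below counts points per cell of a coarse grid. A grid count is a crude stand-in for a proper density estimate; how Dcipher actually computes its contours and heatmap is not stated in this guide, and the points are invented.

```python
def density_grid(points, bins=4):
    """Count points per cell of a bins x bins grid spanning the data's bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0  # avoid division by zero for degenerate data
    y_span = (max(ys) - y_min) or 1.0
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        # Points on the upper edge are clamped into the last cell.
        col = min(int((x - x_min) / x_span * bins), bins - 1)
        row = min(int((y - y_min) / y_span * bins), bins - 1)
        grid[row][col] += 1
    return grid

# Hypothetical plot coordinates: a dense group, a small group, and a lone point.
points = [(0.1, 0.1), (0.2, 0.15), (0.15, 0.2), (0.12, 0.18),
          (3.9, 3.8), (4.0, 4.0), (2.0, 2.0)]
grid = density_grid(points, bins=4)
```

Cells with high counts correspond to hills, and empty or nearly empty cells correspond to valleys.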

4. Explore the landscape

Hover over a dot to read the beginning of the text it corresponds to.

You can select and drag dots to other workbenches, for example to the Table View to display the corresponding rows in the dataset.

Zoom into the plot for a more detailed view.

5. Filter out peripheral and irrelevant clusters

Some types of text data include semantic outliers that form peripheral islands in the landscape. These can make nuances in the central "landmass" more difficult to distinguish.

By unchecking "Show dots" under "View options" you can select and drag contours.

To remove peripheral islands or other irrelevant clusters of the data, select the corresponding contours, and click the filter icon in the header to apply an exclude filter. If you apply the filter locally, the texts are removed from the view without removing them from the global dataset.
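Conceptually, a local exclude filter on a contour keeps every row that lies outside the selected region and leaves the underlying dataset alone. The sketch below treats a contour as a simple polygon and uses a standard ray-casting test; the row structure, the "xy" field, and the coordinates are all hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside the polygon (list of (x, y) vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal ray going right from the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def exclude_inside(rows, polygon):
    """Return a new list without the rows positioned inside the polygon."""
    return [r for r in rows if not point_in_polygon(r["xy"], polygon)]

dataset = [
    {"id": 1, "xy": (0.5, 0.5)},
    {"id": 2, "xy": (0.6, 0.4)},
    {"id": 3, "xy": (5.0, 5.0)},  # part of a peripheral island
    {"id": 4, "xy": (5.2, 4.9)},  # part of a peripheral island
]
island_contour = [(4.5, 4.5), (5.5, 4.5), (5.5, 5.5), (4.5, 5.5)]

local_view = exclude_inside(dataset, island_contour)
```

Because the filter builds a new list, `dataset` still holds all four rows afterwards, mirroring the difference between a local filter and a change to the global dataset.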

Read more in our tutorial on semantic filtering.

6. Characterize hills in the landscape

The contours represent all texts located inside them, even if they are not among the samples of text displayed as dots in the view. So if you drag a contour to the Table View, all the associated rows are displayed there.

To help interpret the meaning of a hill in the landscape, select the contours you are interested in and click the "Summarize" icon in the workbench header. This triggers extractive summarization of the documents inside the selected contours. The summary is shown when you hover over the selected contours.
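Extractive summarization selects existing sentences rather than writing new ones. The sketch below shows one simple, frequency-based way to do that: score each sentence by the average frequency of its words across the selected texts and keep the top scorers. This is an illustrative approach only, since the guide does not describe the algorithm behind the "Summarize" icon, and the example texts are invented.

```python
import re
from collections import Counter

def extractive_summary(texts, max_sentences=2):
    """Pick the sentences whose words are most frequent across the selected texts."""
    sentences = []
    for text in texts:
        parts = re.split(r"(?<=[.!?])\s+", text)
        sentences.extend(s.strip() for s in parts if s.strip())
    word_counts = Counter(
        w for s in sentences for w in re.findall(r"[a-z']+", s.lower())
    )

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(word_counts[w] for w in words) / len(words) if words else 0.0

    return sorted(sentences, key=score, reverse=True)[:max_sentences]

texts = [
    "The refund took three weeks. Support never answered.",
    "My refund is still missing. The refund process is slow.",
    "Refund requests get lost.",
]
summary = extractive_summary(texts, max_sentences=1)
```

This sketch does not remove stop words, so common words such as "the" influence the ranking; a more careful version would filter them out first.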

To drill down further, it is useful to understand which words are characteristic of the texts in that part of the landscape. To get this information, run tokenization on your text field if you have not already done so. Then display your tokens in the Bubble View and drag the contours to its "Find overrepresented values" drop zone. The tokens most characteristic of the selected texts are now shown, which makes it easier to interpret the theme corresponding to each hill.
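One simple way to think about over-representation is to compare a token's share of the selected texts with its share of the whole dataset. The sketch below ranks tokens by that ratio. It is an assumption made for illustration; the statistic that the "Find overrepresented values" operation actually uses is not described here, and the tokenized documents are invented.

```python
from collections import Counter

def overrepresented_tokens(selected_docs, all_docs, min_count=2):
    """Rank tokens by how much more often they occur in the selection than overall.

    Assumes selected_docs is a subset of all_docs, so every selected token
    also occurs in the overall counts.
    """
    selected = Counter(t for doc in selected_docs for t in doc)
    overall = Counter(t for doc in all_docs for t in doc)
    n_selected = sum(selected.values())
    n_overall = sum(overall.values())
    ratios = {}
    for token, count in selected.items():
        if count < min_count:
            continue  # skip rare tokens, whose ratios are unreliable
        share_selected = count / n_selected
        share_overall = overall[token] / n_overall
        ratios[token] = share_selected / share_overall
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Hypothetical tokenized texts; the first two lie inside the selected contour.
all_docs = [
    ["refund", "late", "refund", "support"],
    ["refund", "missing", "support"],
    ["trail", "hiking", "view"],
    ["hiking", "trail", "support"],
]
selected_docs = all_docs[:2]
ranked = overrepresented_tokens(selected_docs, all_docs)
```

A ratio above 1 means the token is more common inside the selection than in the data as a whole. Here "refund" ranks above "support" because "support" also appears outside the selection.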

For further characterization, run the Find overrepresented values operation on additional values such as time units, locations, sources, etc.