Keywords: dimensionality reduction, MDS, UMAP, tSNE, SVD, PCA
Reducing dimensionality is useful for a number of purposes, including making data easier for humans to interpret and reducing the number of input variables to improve performance when training a predictive model. The Reduce Vectors operation provides several methods for this.
This is useful, for example, for visually exploring clusters of similar texts. The Document landscaping feature is based on reducing text vectors onto a two-dimensional surface.
Overview of the characteristics of the different dimension reduction methods available in Dcipher Analytics:
Distance-based: tSNE, UMAP, MDS
Variance-based: PCA, SVD
Preserves global structure: MDS, UMAP, SVD, PCA
Preserves local structure: UMAP, tSNE
Deterministic: PCA, SVD
Non-deterministic (due to hyperparameters and randomness): tSNE, UMAP, MDS
Multidimensional scaling (MDS) seeks to preserve the pairwise distances between higher-dimensional vectors in a lower-dimensional space through eigenanalysis of the pairwise distance matrix. The results tend to preserve global structure at the expense of local structure (which tSNE, described below, tends to retain).
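For readers who want to try the method outside Dcipher Analytics, here is a minimal sketch using scikit-learn's MDS implementation (an equivalent of the method, not the operation's internal code; the random matrix is a hypothetical stand-in for text vectors):

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical stand-in for high-dimensional text vectors
# (100 documents, 300 dimensions each).
X = np.random.rand(100, 300)

# MDS fits a 2-D configuration whose pairwise distances
# approximate the pairwise distances in the original space.
mds = MDS(n_components=2, random_state=42)
X_2d = mds.fit_transform(X)
print(X_2d.shape)  # (100, 2)
```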
t-Distributed Stochastic Neighbor Embedding (tSNE) strives to preserve the local structure (clusters) of the data using non-linear transformations. It is able to capture the structure of tricky manifolds, while ignoring inter-cluster distances during dimension reduction, and is suited for large datasets.
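A corresponding sketch using scikit-learn's TSNE (again an equivalent implementation with hypothetical stand-in data, not the operation's internal code):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for text vectors (500 documents, 300 dimensions).
X = np.random.rand(500, 300)

# perplexity balances local versus global aspects of the data;
# different values (and random seeds) give different layouts,
# which is why the method is listed as non-deterministic above.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)
```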
Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) searches for a low-dimensional projection of high-dimensional data that has the closest possible equivalent fuzzy topological structure. The method is similar to tSNE but, unlike tSNE, also preserves global structure, which makes it suitable for general non-linear dimension reduction. Hence, distances between data points within clusters (local structure) as well as distances between clusters (global structure) are preserved.
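The reference implementation is available as the umap-learn Python package; a minimal sketch with stand-in data:

```python
import numpy as np
import umap  # pip install umap-learn

# Hypothetical stand-in for text vectors.
X = np.random.rand(500, 300)

# n_neighbors trades off local versus global structure: small values
# emphasize local clusters, larger values the global layout.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)
```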
Singular Value Decomposition (SVD) achieves dimension reduction through matrix factorization: the data matrix is decomposed into singular vectors and singular values, and only the components associated with the largest singular values are kept. It is a popular dimensionality reduction technique in machine learning, particularly for sparse data, and originates from the field of linear algebra.
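Truncated SVD, as implemented in scikit-learn's TruncatedSVD, is a common concrete form of this; the sketch below applies it to a sparse matrix, a hypothetical stand-in for e.g. a TF-IDF document-term matrix:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse stand-in for a TF-IDF document-term matrix.
X = sparse_random(100, 5000, density=0.01, format="csr", random_state=42)

# Truncated SVD keeps the top-k singular components; unlike PCA,
# it works directly on sparse matrices without centering them.
svd = TruncatedSVD(n_components=2, random_state=42)
X_2d = svd.fit_transform(X)
print(X_2d.shape)  # (100, 2)
```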
Principal Component Analysis (PCA) disentangles correlation between variables in the data by linearly transforming it into orthogonal, uncorrelated dimensions (principal components). It is useful for displaying observations along the uncorrelated dimensions that account for most of the variance in the data, but a single low-dimensional projection has the drawback of discarding the information embedded in the less significant components.
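A minimal sketch with scikit-learn's PCA, showing how much variance the first two components capture (stand-in data again):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for text vectors.
X = np.random.rand(100, 300)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)
```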
Step-by-step guide
1. Open the operation configuration window
Click the "Add operation" button at the top of the workspace and search for "Reduce Vectors" or find the operation under "Machine Learning" and click it.
2. Specify the vector field
In the "Input vector field" drop-down, select the field with your vectors, for example the vectors resulting from the text vectorization.
3. Name the output field
Under "Name for reduced vector", type the name of the output vector field.
4. Specify dimension reduction method
In the "Method drop-down", select the dimension reduction method that you want to use. Read about the different methods in the description above.
5. Specify the number of dimensions
Under "Desired dimensions", type the number of dimensions that you want to project the higher dimensional space onto. The default number is 2, meaning you will be able to display the results in two dimensions.
6. Apply the operation
Click "Apply" to run the operation. The reduced vectors are now inserted into the output field.