Keywords: text transformers, word embedding, text embedding, Doc2Vec, Word2Vec, FastText, ELMo, GloVe, BERT, bag-of-words

Text vectorization is the process of representing text mathematically, as numeric vectors, in a form that is more useful for computation than a string of characters. Classic bag-of-words vectorization represents a text by the words it contains. Newer text embedding techniques represent text more compactly, typically in a lower-dimensional space with a few hundred dimensions, which tends to improve the performance of machine learning models. Text embeddings allow texts with similar meanings to have similar representations, regardless of whether they contain the same words.
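To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is an illustration only and not how Dcipher Analytics implements the operation; the example sentences and the helper names bag_of_words and cosine_similarity are made up for this purpose.

```python
from collections import Counter
import math


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already tokenized example texts.
docs = [
    "the film was great".split(),
    "the movie was excellent".split(),  # same meaning, different words
    "the film was great fun".split(),   # mostly the same words
]

# One vector dimension per distinct word in the collection.
vocabulary = sorted({token for doc in docs for token in doc})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

sim_paraphrase = cosine_similarity(vectors[0], vectors[1])
sim_overlap = cosine_similarity(vectors[0], vectors[2])
```

In this toy case the paraphrase scores lower than the text that merely shares more words, because bag of words only sees word overlap. This is the limitation that text embeddings are meant to address.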

Text vectorization is the foundation of many features in Dcipher Analytics, including semantic search and document landscaping.

Dcipher Analytics offers a number of text vectorization techniques: bag of words, Doc2Vec, Word2Vec, fastText, ELMo, GloVe, and BERT. The usefulness of each depends on the use case and dataset characteristics.

fastText is a relatively fast method that represents words through their character n-grams (short sequences of characters) rather than as indivisible units. It is therefore less sensitive to spelling errors and works well for agglutinative languages.
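The following sketch shows why working at the character level helps with spelling variation. It only extracts character trigrams with word-boundary markers and compares the resulting sets. It does not train or load a fastText model, and the helper names char_ngrams and jaccard are hypothetical.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, with < and > as boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Share of n-grams that two sets have in common."""
    return len(a & b) / len(a | b)


correct = char_ngrams("vectorization")
variant = char_ngrams("vectorisation")  # spelling variant of the same word
unrelated = char_ngrams("landscape")

overlap_variant = jaccard(correct, variant)
overlap_unrelated = jaccard(correct, unrelated)
```

The two spellings share most of their trigrams, so a model built on such units can place them close together even if one spelling never appeared in the training data.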

BERT, often described as a game-changer in NLP with impressive results on many downstream tasks, reads text in both directions at once. During training, some of the words in a text are masked and the model tries to predict them from the words both before and after them. This yields context-aware text representations.
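As a rough illustration of masked-word training data, the sketch below hides a share of the tokens in a sentence and records the hidden words as prediction targets. It is deliberately simplified: the function name mask_tokens and the example sentence are made up, the 15% rate follows the original BERT setup, and real BERT training includes further details that are omitted here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random share of tokens with [MASK]; return the masked list and the hidden words."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the word the model should recover
        masked[pos] = MASK
    return masked, targets


tokens = "the bank raised interest rates again this year".split()
masked, targets = mask_tokens(tokens)
```

To recover a hidden word, a model can use the unmasked words on both sides of the gap, which is what makes the resulting representations context-aware.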

GloVe and Word2Vec represent individual words rather than whole texts, which can be useful for, among other things, classifying words and finding semantically similar words.
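Finding semantically similar words usually comes down to comparing word vectors, for example by cosine similarity. The sketch below uses tiny hand-made vectors purely for illustration. They are not output from GloVe or Word2Vec, and the helper names cosine and most_similar are hypothetical.

```python
import math

# Made-up 3-dimensional vectors; real word embeddings have a few hundred dimensions.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.2, 0.1, 0.8],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_n=1):
    """Rank all other words by cosine similarity to the given word."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


nearest_to_king = most_similar("king", toy_vectors)
nearest_to_apple = most_similar("apple", toy_vectors)
```

With these toy values, the nearest neighbour of "king" is "queen" and the nearest neighbour of "apple" is "pear".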

Step-by-step guide

1. Open the operation configuration window

Click the "Add operation" button at the top of the workspace.

Search for "Vectorize text" or find the operation under "Vector operations" and click it.

2. Specify the traversal

In the "Traversal path" drop-down, select the level of the dataset where the vectorization should take place. For example, if the root level contains articles and these articles have been split into segments, use the root level to vectorize articles and the segment level to vectorize segments.

3. Specify the token field

In the "Tokens source" drop-down, select the field containing the text tokens. If this field does not already exist in your dataset, first tokenize your input text.

4. Choose the vectorization method

In the "Vectorization method" drop-down, select the vectorization method you want to use. See above for information about the different vectorization methods.

5. Determine whether to train a model from your data

If you want to train a vectorizer from your data, make sure the "Train a new model" switch is turned on. This is the right choice if you want to represent the text in your dataset solely in light of the dataset itself, independently of any other text.

For example, if you train a new model from your data, the semantic and contextual connections discovered will reflect patterns in the dataset. If, on the other hand, you use a pre-trained model, the semantic similarity between documents or words will be based on the larger corpus, such as Wikipedia articles, that the model was trained on.

If you want to use a pre-trained model, which can be especially useful if your dataset is small, turn off the "Train a new model" switch and select one of the available models in the "ML model name" drop-down.

6. Name the output field

Name the field that the output vectors should be written to under "Output vector name".

7. Apply the operation

Click "Apply" to run the operation. The output vectors are written to the output field.