Keywords: machine learning, classification, neural networks, decision trees, train-test split, cross-validation

Classification is one of the most studied supervised learning topics in data science. Almost all classifiers are mathematical models that, after being trained on previous observations, map high-dimensional numeric values (vectors) to discrete labels (classes). They are also widely used in Natural Language Processing to solve text-based classification problems such as sentiment, emotion, stance, and spam analysis, thanks to unsupervised language embedding models that convert text values to vectors. If you haven't read about how to vectorize a text yet, we suggest reading this article.

The rest of the article explains how to apply the "Classify vectors" operation using the sample-amazon-movie-reviews.json starter file. Before getting started, we applied Tokenization to the "reviewText" field and generated vectors using the "Vectorize text" operation. Finally, we changed the type of the "overall" field from "decimal number" to "categorical text" so that it can be used as the target (class) during classifier training.
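
For readers who think in code, the type change on "overall" can be pictured roughly as follows. This is only a sketch: the two records are made-up illustrations (not rows from the starter file), and to_categorical_text is a hypothetical helper, not part of the product.

```python
# Illustrative records only; the values are made up, not taken from the starter file.
reviews = [
    {"reviewText": "Great movie", "overall": 5.0},
    {"reviewText": "Not for me", "overall": 2.0},
]


def to_categorical_text(records, field):
    """Return copies of the records with the given numeric field stored as text."""
    return [{**record, field: str(record[field])} for record in records]


labelled = to_categorical_text(reviews, "overall")

# Each distinct text value now acts as one class label for the classifier.
classes = sorted({record["overall"] for record in labelled})
print(classes)
```

Treating the ratings as text makes the classifier predict one of a fixed set of labels instead of a continuous number.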

Step-by-step guide

1. Select the vector field

Select the vector field you want to classify in the Schema View.

2. Open the operation configuration window

Click the "Add operations" button at the top of the workspace, then click the "Vector operations" tab, then click "Classify vectors".

3. Set the parameters of the operation

The following parameters can be set:

Vector source: Path to the input vector values used for training and for predicting classes.
Method: The classification algorithm to train. We offer both decision-tree-based classifiers and neural classifiers. The default method is "Multilayer Perceptron Classifier" with auto-adjusted neural network layers.
Train a new model?: Whether to train a new model or load a previously trained one.
- If this option is turned on, you'll see the following parameters:
  - Name of output ML model: The trained model will be saved with this name and accessible in the AI Models menu.
  - Target class: Path to the labels associated with each vector.
  - Train split ratio: The proportion of the data that the classifier uses for training; the rest is used for testing and self-evaluation. To keep the distribution of target values the same in the training and test sets, we apply stratified sampling instead of purely random sampling. The default value is 0.8 (80%).
- If this option is turned off, you'll see the following parameter:
  - ML model to load: Choose one of the previously trained classifier models to get class predictions on the input vectors.
Output field name: Class predictions will be stored as a new entity under this field. Step 4 explains the sub-fields in the output entity.
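
The stratified sampling behind the train split ratio can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the product's implementation: stratified_split, its seed argument, and the rounding rule are hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio=0.8, seed=0):
    """Tag each item "TRAIN" or "TEST" so every class keeps roughly the same ratio."""
    rng = random.Random(seed)

    # Group item positions by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    subset = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: round to the nearest whole item within each class.
        n_train = round(len(indices) * train_ratio)
        for position, index in enumerate(indices):
            subset[index] = "TRAIN" if position < n_train else "TEST"
    return subset


# An imbalanced toy target: 50 items of one class, 10 of another.
labels = ["5.0"] * 50 + ["1.0"] * 10
subset = stratified_split(labels, train_ratio=0.8)
train_minority = sum(
    1 for label, tag in zip(labels, subset) if label == "1.0" and tag == "TRAIN"
)
print(subset.count("TRAIN"), train_minority)
```

Splitting inside each class, rather than across the whole dataset, is what prevents a rare class from ending up entirely in one subset.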

4. Apply the operation

Click "Apply". A new entity field (named "classification_output" by default) has now been generated and added to the schema. This new entity has 3 sub-fields:

subset: A categorical text field that indicates whether the data point was in the TRAIN or TEST subset. This sub-field only appears in training mode.
prediction: The class predicted by the classifier. Because the predicted values are limited to the values in the target field, this is a categorical text field.
probability: The classifier's confidence in the predicted class. A probability of 1.0 means that the classifier is 100% confident in the prediction.
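
One common way to relate the prediction and probability sub-fields is to take the most probable class from a per-class probability distribution. The sketch below assumes that relationship; the guide does not state how the product derives these values, and predict_with_confidence and the scores are hypothetical.

```python
def predict_with_confidence(class_probabilities):
    """Pick the most probable class and report its probability as the confidence."""
    prediction = max(class_probabilities, key=class_probabilities.get)
    return {
        "prediction": prediction,
        "probability": class_probabilities[prediction],
    }


# Hypothetical per-class probabilities for a single review vector.
scores = {"1.0": 0.05, "4.0": 0.25, "5.0": 0.70}
result = predict_with_confidence(scores)
print(result)
```

Under this reading, a probability close to 1.0 means that the remaining classes received almost no probability mass.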

To view all prediction results, drag "classification_output" to the Table view, click the arrow button in the column header, and click Flatten. Flattening will expand all sub-fields into separate columns.
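
Conceptually, flattening turns a nested entity into one column per sub-field. The sketch below illustrates the idea on a plain Python dictionary; the dotted column names and the sample row are assumptions for illustration and may differ from what the Table view displays.

```python
def flatten(record, separator="."):
    """Expand nested dictionaries into a single level of prefixed keys."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Prefix each sub-field with the name of its parent entity.
            for sub_key, sub_value in flatten(value, separator).items():
                flat[f"{key}{separator}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Hypothetical row after the operation has been applied in training mode.
row = {
    "overall": "5.0",
    "classification_output": {
        "subset": "TEST",
        "prediction": "5.0",
        "probability": 0.93,
    },
}
flat_row = flatten(row)
print(list(flat_row))
```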

5. Check evaluation metrics

Once training is completed, you'll see a notification with accuracy, precision, recall, and F1 scores for the training and test datasets. If the classification problem has more than two classes (multi-class classification), the reported precision, recall, and F1 scores are the weighted averages of the per-class precision, recall, and F1 scores.
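
To make the weighted averaging concrete, here is a small sketch that computes per-class precision, recall, and F1 and then averages them. It assumes each class is weighted by its share of the true labels (its support), which is a common convention; the guide does not specify the weights, and weighted_scores is a hypothetical helper.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0

    for label, count in support.items():
        true_positives = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        predicted = sum(1 for p in y_pred if p == label)

        class_precision = true_positives / predicted if predicted else 0.0
        class_recall = true_positives / count
        denominator = class_precision + class_recall
        class_f1 = (
            2 * class_precision * class_recall / denominator if denominator else 0.0
        )

        # Assumption: weight each class by its share of the true labels.
        weight = count / total
        precision += weight * class_precision
        recall += weight * class_recall
        f1 += weight * class_f1

    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

With support weighting, the weighted recall always equals the accuracy, so seeing identical values for those two metrics is expected under this convention.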

This notification will disappear after a while. However, you can always access these evaluation results from the AI Models menu in the top navigation bar.

After opening the AI Models menu, hover over the Metrics icon displayed next to the trained model.