Keywords: machine learning, regression, regressors, neural networks, decision trees, train-test split, cross-validation
Regression is one of the most studied supervised learning topics in data science. Regressors are mathematical models that, after being trained on previous observations, map high-dimensional numeric values (vectors) to continuous values. Although they operate on numeric input, they are also widely used in Natural Language Processing to solve text-based regression problems, such as predicting a price or risk score from unstructured text, thanks to unsupervised language embedding models that convert text to vectors. If you haven't read about how to vectorize text yet, we suggest reading this article.
The rest of this article explains how to use the "Apply regression on vectors" operation with the sample-amazon-movie-reviews.json starter file. Before getting started, we applied Tokenization to the "reviewText" field and generated vectors using the Vectorize text operation. We'll use the "overall" field, which holds Amazon customers' movie ratings from 1 to 5, as the target score during regressor training.
Step-by-step guide
1. Select the vector field
Select the vector field on which you want to train a regressor in the Schema View.
2. Open the operation configuration window
Click the "Add operations" button at the top of the workspace, then click the "Vector operations" tab, then click "Apply regression on vectors".
3. Set the parameters of the operation
The following parameters can be set:
Vector source: Path to the input vector values which will be used in training and/or getting predictions.
Method: The regression algorithm to be trained. We offer both decision-tree-based regressors and neural regressors. The default method is "Multilayer Perceptron Regressor" with auto-adjusted neural network layers.
Train a new model?: Whether to train a new model or load a previously trained one.
If this option is turned on, you'll see the following parameters:
Name of output ML model: The trained model will be saved with this name and accessible in the AI Models menu.
Target class: Path to the continuous values associated with each vector.
Train split ratio: The regressor model will use the specified proportion of the data for training and the rest of the data will be used for testing and self-evaluation. The splitter randomly splits all available data points at the specified rate. The default value is 0.8 (80%).
If training is disabled, you'll see the following parameter instead:
ML model to load: Choose one of the previously trained regressor models to get predictions on the input vectors.
Output field name: Predictions will be stored as a new entity under this field. Step 4 explains the sub-fields in the output entity.
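To make the Train split ratio parameter more concrete, here is a minimal Python sketch of a random split at a given ratio. It is only an illustration of the idea, not the operation's actual implementation: the function name assign_subsets, the seed argument, and the rounding rule are all assumptions made for this example.

```python
import random

def assign_subsets(n_rows, train_ratio=0.8, seed=0):
    """Randomly label each of n_rows data points as TRAIN or TEST.

    Hypothetical helper for illustration only; the product's own
    splitter may round or shuffle differently.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1 (exclusive)")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    n_train = round(n_rows * train_ratio)  # size of the training subset
    train_ids = set(indices[:n_train])
    return ["TRAIN" if i in train_ids else "TEST" for i in range(n_rows)]

subsets = assign_subsets(10, train_ratio=0.8)
print(subsets.count("TRAIN"), subsets.count("TEST"))  # 8 2
```

With the default ratio of 0.8, eight of every ten data points are used for training and the remaining two are held out for testing.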
4. Apply the operation
Click "Apply". A new entity field (named "regression_output" by default) has now been generated and added to the schema. This new entity has 2 sub-fields:
subset: A categorical text field that indicates whether the data point was in the TRAIN or TEST subset. This sub-field only appears in training mode.
prediction: The regressor's estimate. Because the target values are numerical, this field holds decimal values.
To view all prediction results, drag "regression_output" to the Table view, click the arrow button in the column header, and click Flatten. Flattening will expand all sub-fields into separate columns.
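As a rough illustration of what flattening does to a nested entity, the Python sketch below expands sub-fields into separate top-level keys. The dotted column names, the flatten helper, and the sample values are assumptions made for this example; the Table view may name its columns differently.

```python
def flatten(record, parent_key="", sep="."):
    """Expand nested dicts into a single-level dict with joined key names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the entity so each sub-field becomes its own column
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

# Made-up row for illustration; not taken from the starter file
row = {
    "overall": 5,
    "regression_output": {"subset": "TEST", "prediction": 4.37},
}
print(flatten(row))
```

Each sub-field of "regression_output" ends up as its own entry, which mirrors how the flattened table shows subset and prediction side by side.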
5. Check evaluation metrics
Once training is complete, you'll see a notification with the mean squared error, root mean squared error, and mean absolute error scores for the training and test subsets.
This notification disappears after a while. However, you can always access these evaluation results from the AI Models menu in the top navigation bar.
After opening the AI Models menu, hover over the Metrics icon displayed next to the trained model.
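For readers who want to reproduce or sanity-check these scores outside the tool, the three metrics can be computed from the target values and predictions as in the following Python sketch. The function name regression_metrics and the sample numbers are assumptions made for this example; they are not values produced by the operation.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and MAE for paired target and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)   # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

# Made-up ratings and predictions for illustration
ratings = [5, 3, 4, 1]
predictions = [4.0, 3.0, 5.0, 3.0]
scores = regression_metrics(ratings, predictions)
print(scores["mse"], scores["mae"])  # 1.5 1.0
```

Computing the scores separately for the TRAIN and TEST rows gives the two sets of numbers; a test error much higher than the training error usually suggests overfitting.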