Training and deployment of custom text classifiers with Dcipher Analytics

Train custom text classifiers on unlabeled or partially labeled data using the Active Learning approach.

Written by Zafer Çavdar
Updated over 9 months ago

Keywords: classification, labeling, tagging, active learning

Active Learning is a subfield of machine learning in which a human (the teacher) and an AI model (the student) collaborate interactively to create training data while solving a classification problem. The AI model actively proposes labels, the teacher responds by accepting, rejecting, or ignoring each suggestion, and the model learns the semantic definitions of the labels or tags from that feedback.

Dcipher offers a dedicated workbench to interactively train classification models using the Active Learning approach and enables access to saved models through other operations.

Step by step guide

This tutorial uses the sample-amazon-movie-reviews.json dataset from Starter files.

1. Add a new Active Learning workbench

Create a new Active Learning workbench from the Add Workbench menu.

2. Initialize Active Learning workbench

Drag and drop the text field to be classified from the Schema workbench to the Active Learning workbench.

Configure the workbench settings by choosing a classification mode, naming your model, and providing the path to an existing labels field, if there is one.

Existing labels field: If an existing labels field is provided, the Active Learning model treats the labeled texts as ground truth and starts training on them. Labels do not need to exist for every text. If your dataset is partially labeled, with some values in the labels field empty, the model uses only the labeled texts for initial training and produces label suggestions for the unlabeled texts.

Model name: Your model will be saved under the name you provide once training is completed.

Model modes:

  • Categorization mode assigns a single label to each text and is useful for training text classifiers.

  • Tagging mode allows multiple labels per text and is useful for training multiple independent taggers.
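
The difference between the two modes can be illustrated with a minimal Python sketch. This is not Dcipher's implementation: the function names `categorize` and `tag`, the example scores, and the 0.5 cut-off are all hypothetical and chosen only for illustration.

```python
def categorize(scores):
    """Categorization: return the single best-scoring label."""
    return max(scores, key=scores.get)


def tag(scores, threshold=0.5):
    """Tagging: each label is judged independently against the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical per-label scores for one movie review.
scores = {"Comedy": 0.81, "Romance": 0.64, "Horror": 0.07}

print(categorize(scores))  # Comedy
print(tag(scores))         # ['Comedy', 'Romance']
```

In categorization the labels compete and only one wins, while in tagging every label is effectively its own yes/no classifier.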

Click on Continue to initialize the Active Learning workbench. Depending on your dataset size, it may take a few minutes to analyze text, initialize the AI model and run initial training.
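
The way a partially labeled dataset feeds the initial training can be pictured as a simple split into a labeled seed set and an unlabeled pool. The sketch below is an assumption about the general idea, not the workbench's actual code; the field names `text` and `label` and the helper `split_by_label` are hypothetical.

```python
def split_by_label(records, label_field="label"):
    """Split records into (labeled, unlabeled) by whether the label field is filled."""
    labeled, unlabeled = [], []
    for record in records:
        value = record.get(label_field)
        # Treat a missing field, None, an empty string, and an empty list as "no label".
        if value is None or value == "" or value == []:
            unlabeled.append(record)
        else:
            labeled.append(record)
    return labeled, unlabeled


reviews = [
    {"text": "Loved every minute of it.", "label": "Positive"},
    {"text": "The plot made no sense.", "label": ""},
    {"text": "A forgettable sequel."},
]

seed_set, pool = split_by_label(reviews)
print(len(seed_set), len(pool))  # 1 2
```

The seed set would be used for initial training, and the pool is where label suggestions would appear.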

3. Investigate workbench layout

You can control the number of rows to display from the upper left corner. 100 rows are displayed by default.

The number of labeled texts is displayed in the bottom-right corner. This number will increase after accepting suggestions or adding labels to unlabeled rows.

Each row displays text and associated labels. A label can be added by the teacher or come from the AI model as a suggestion. Accept/Reject buttons give feedback to the model and the "+" (add) button can be used to introduce a new label independent of model suggestions.

The next section of the tutorial explains the actions available in the workbench header: View, Score, Histogram, Add label, Fast assign, Update, Finalize.

4. Interact with the workbench and train model

The workbench header contains helper options that speed up the training process.

a) View options

This workbench offers 3 view options:

  • Unlabeled rows: Selecting this option displays rows without any human-introduced or approved labels. This option should be selected in order to receive and evaluate suggestions.

  • Labeled rows: Selecting this option displays rows with at least 1 label. Since each row gets at most 1 label in "categorization mode", we suggest turning this option off in that mode. You can keep it selected in "tagging mode", where each row may receive 1 or more label suggestions.

  • Ignored rows: Selecting this option displays ignored rows. Ignored rows are not included in the training process and the AI model does not produce any label suggestions for them. The steps below show how to ignore selected rows.

Select the rows you would like to ignore. Click on the icon next to the Filter button and choose "Ignore selected".

b) Scoring options

Rows in the Active Learning workbench are sorted in descending order by the selected scoring option. There are 3 sorting options:

  • Sort by uncertainty on a selected label: Rows with texts on which the AI model cannot form a strong opinion to either accept or reject the label get higher scores. Sorting by this option and resolving the ambiguous cases with your feedback helps the AI model learn faster.

  • Sort by certainty on a selected label: Rows with texts on which the AI model is highly confident in its decision get higher scores. Sorting by this option and giving feedback on confident cases speeds up the labeling process and helps verify the AI model, but contributes little to the learning process.

  • Sort by relevance on provided keywords: Rows with texts that have high semantic similarity to the provided keywords get higher scores. This option is especially useful in the initial stages, when the AI model has very little information about the classification task and you need to find texts relevant to particular topics. For example, you can provide the keywords "love", "great", and "excellent" to this sorting option to find a sample of positive reviews and label them "Positive".
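
One common way to derive uncertainty and certainty scores is from the model's predicted probability for the selected label. The sketch below assumes that formulation; the article does not state how Dcipher computes its scores, so the formulas, function names, and example probabilities are illustrative assumptions. Relevance scoring is left out because it depends on a semantic similarity model.

```python
def certainty(probability):
    """Distance of the probability from 0.5, rescaled to the 0-1 range."""
    return abs(2.0 * probability - 1.0)


def uncertainty(probability):
    """Highest (1.0) when the model is torn at 0.5, lowest (0.0) at 0 or 1."""
    return 1.0 - certainty(probability)


# Hypothetical probabilities that each review carries the label "Positive".
rows = {
    "A masterpiece.": 0.97,
    "It was fine, I guess.": 0.52,
    "Utter waste of time.": 0.04,
}

by_uncertainty = sorted(rows, key=lambda text: uncertainty(rows[text]), reverse=True)
by_certainty = sorted(rows, key=lambda text: certainty(rows[text]), reverse=True)

print(by_uncertainty[0])  # It was fine, I guess.
print(by_certainty[0])    # A masterpiece.
```

Note that a confident rejection (probability near 0) counts as a certain case just like a confident acceptance.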

c) Histogram

The Histogram menu displays the distribution of the selected scoring metric. All uncertainty, certainty, and relevance scores are normalized to the 0-1 range. Selecting a range in the histogram and clicking the Apply button filters the table so that only rows with a score in the selected range are displayed. Histogram-based filtering is useful for focusing on a specific sample of the data, such as texts with low certainty scores or very high relevance scores.
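
The histogram and range filter amount to binning normalized scores and keeping rows inside a chosen interval. The following is a small sketch of that idea with hypothetical helper names (`histogram`, `filter_by_range`) and made-up scores; it is not the workbench's own code.

```python
def histogram(scores, bins=10):
    """Count how many scores in the 0-1 range fall into each equal-width bin."""
    counts = [0] * bins
    for score in scores:
        # Clamp the index so a score of exactly 1.0 lands in the last bin.
        index = min(int(score * bins), bins - 1)
        counts[index] += 1
    return counts


def filter_by_range(scored_rows, low, high):
    """Keep only the (text, score) pairs whose score lies in [low, high]."""
    return [(text, score) for text, score in scored_rows if low <= score <= high]


scored_rows = [
    ("row 1", 0.05), ("row 2", 0.12), ("row 3", 0.48),
    ("row 4", 0.51), ("row 5", 0.93), ("row 6", 1.0),
]

counts = histogram([score for _, score in scored_rows], bins=5)
selected = filter_by_range(scored_rows, 0.4, 0.6)

print(counts)    # [2, 0, 2, 0, 2]
print(selected)  # [('row 3', 0.48), ('row 4', 0.51)]
```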

d) Add label

While the "+" button in each row's Labels column adds new labels one row at a time, the Add label menu adds labels to multiple rows at once. Selecting at least 1 row enables this option. Pick a previously introduced label or type a new one to add it to all selected rows.

e) Fast assign

"Fast assign" enables bulk approval or rejection of suggestions on the selected rows. Upon selecting 1 or more rows, this menu shows all suggested labels with an Approve or Reject option. It's also possible to Accept or Reject each suggestion row by row.

f) Update

After you add new labels or accept or reject suggestions from the AI model, clicking the Update button sends all new labels and feedback to the AI model. The AI model retrains on the latest information and returns new suggestions, updating the certainty and uncertainty scores for all rows. It's the key button that transfers information from the human to the AI model.
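
The behavior described above, where feedback accumulates until Update is clicked, can be modeled as a buffer that is flushed to the model in one batch. The sketch below is a conceptual illustration only: the `FeedbackSession` class, its methods, and the `retrain` callback are hypothetical and do not correspond to any Dcipher API.

```python
class FeedbackSession:
    """Collects teacher feedback and hands it to the model only on update()."""

    def __init__(self, retrain):
        self.retrain = retrain  # callback standing in for model retraining
        self.pending = []       # feedback given since the last update
        self.accepted = {}      # row id -> set of confirmed labels
        self.rejected = {}      # row id -> set of rejected labels

    def accept(self, row_id, label):
        self.pending.append((row_id, label, True))

    def reject(self, row_id, label):
        self.pending.append((row_id, label, False))

    def update(self):
        """Apply all pending feedback, retrain once, and return the batch size."""
        for row_id, label, is_accept in self.pending:
            target = self.accepted if is_accept else self.rejected
            target.setdefault(row_id, set()).add(label)
        sent = len(self.pending)
        self.pending = []
        self.retrain(self.accepted, self.rejected)
        return sent


retrain_calls = []
session = FeedbackSession(
    retrain=lambda accepted, rejected: retrain_calls.append((len(accepted), len(rejected)))
)
session.accept(1, "Positive")
session.accept(2, "Negative")
session.reject(3, "Positive")
sent = session.update()

print(sent, retrain_calls)  # 3 [(2, 1)]
```

Batching feedback this way means the model retrains once per Update rather than after every single click.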

g) Finalize

After running a number of labeling iterations with the AI model and ensuring that most of its suggestions are accurate enough for your use case, it's time to finalize the training. The Finalize button saves the AI model to your private model storage on Dcipher Cloud, where it is accessible via the Classify text (AL models) operation (also known as Predict Labels). Additionally, all labels that were introduced by a human or approved from suggestions are transferred to the source dataset as a Tag Collection under the provided field name.

Saved models are available in the AI models menu which is opened from the top-left bar.

5. Use trained models in your pipeline

The Classify text (AL models) operation runs classifier models trained in the Active Learning workbench on other datasets. It is available in the Add operation menu under the Text enrichment category and lets you select any long text field, choose one of the AI models trained in the Active Learning workbench, and set a custom confidence (likelihood) threshold for classification.
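
The effect of the confidence threshold can be seen in a short sketch: raising it leaves more texts unlabeled but keeps only the predictions the model is more sure of. The predictions, the `apply_threshold` helper, and the threshold values below are hypothetical and serve only to illustrate the trade-off.

```python
def apply_threshold(predictions, threshold):
    """Keep a predicted label only if its confidence reaches the threshold."""
    return [
        label if confidence >= threshold else None
        for label, confidence in predictions
    ]


# Hypothetical (label, confidence) pairs for four texts.
predictions = [
    ("Positive", 0.92),
    ("Negative", 0.55),
    ("Positive", 0.71),
    ("Negative", 0.88),
]

labeled_counts = {}
for threshold in (0.5, 0.7, 0.9):
    kept = [label for label in apply_threshold(predictions, threshold) if label is not None]
    labeled_counts[threshold] = len(kept)

print(labeled_counts)  # {0.5: 4, 0.7: 3, 0.9: 1}
```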
