Keywords: overrepresentation, characterization

It is often useful to understand what distinguishes a certain subset of data from the whole. For example, what words are over-represented in free-form text responses from one segment of respondents? What news sources are over-represented in news articles expressing a certain narrative?

To get this information, Dcipher has an operation for calculating the degree of over-representation of values in a selection of rows compared to all the rows in the dataset.

Step-by-step guide

1. Display the relevant subset of data

Filter your dataset in the Table View so that it contains the subset of data that you want to compare to the entire dataset.

👉 Example. If you want to know what words are over-represented in posts from India in a dataset with tweets, start by applying a filter on the country column in the Table View to only include rows where the location is India.



2. Display the relevant values

Drag the field containing the values whose degree of over-representation you want to measure to the "Display as bubbles" drop zone in the Bubble workbench. This aggregates the values and uses bubble radius to represent count, so that the biggest bubbles correspond to the most frequent values.

👉 Example. In the example where we want to find over-represented words in Indian tweets, drag the field containing words to be displayed in the Bubble workbench.
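
For readers who think in code, steps 1 and 2 amount to a filter followed by a count. The sketch below is only an illustration in plain Python, not Dcipher's implementation: the toy rows, the field names "country" and "tokens", the helper count_values, and the choice to count each value once per row are all assumptions.

```python
from collections import Counter

# Hypothetical toy rows; the field names "country" and "tokens" are assumptions.
rows = [
    {"country": "India", "tokens": ["antibody", "vaccine", "covaxin"]},
    {"country": "India", "tokens": ["antibodies", "covaxin"]},
    {"country": "USA", "tokens": ["antibody", "test"]},
    {"country": "UK", "tokens": ["antibodies", "vaccine"]},
]


def count_values(rows, field):
    """Count in how many rows each value of `field` occurs."""
    counts = Counter()
    for row in rows:
        # set(): count a value once per row (an assumption of this sketch)
        counts.update(set(row[field]))
    return counts


# Step 1: keep only the subset of interest (the Table View filter).
subset = [row for row in rows if row["country"] == "India"]

# Step 2: aggregate the values; the most frequent ones would be the biggest bubbles.
subset_counts = count_values(subset, "tokens")
total_counts = count_values(rows, "tokens")

print(subset_counts.most_common(1))
```

The two sets of counts, one for the subset and one for the whole dataset, are what the next step compares.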

3. Calculate the degree of over-representation for values

Now that the relevant subset of data is displayed in the Table View and the relevant values are displayed in the Bubble workbench, we are ready to trigger the operation that calculates the degree of over-representation for values.

Select the subset of data displayed in the Table View and drag it to the "Find overrepresented values" drop zone in the Bubble workbench. The degree of over-representation is now calculated for each value by comparing its occurrence in the dropped rows with its occurrence in the entire dataset. Bubble radius now shows the degree of over-representation, so that the biggest bubbles correspond to the most over-represented values.

☝️ Technical note: The over-representation calculation uses a generative probabilistic model to calculate the probability of a value's observed occurrence in the subset of data, given its occurrence in the entire dataset. The lower this probability, the higher the degree of over-representation. Values that occur less frequently in the subset than in the entire dataset are excluded. This approach avoids the issues associated with other measures such as absolute difference (which favors frequent values) and relative difference (which favors rare values).
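
To make the idea in the technical note concrete, here is a minimal sketch in Python. Dcipher's actual model is not specified here, so the sketch substitutes a hypergeometric tail probability as a stand-in: the chance of seeing a value in at least k of the n subset rows if those rows were drawn at random from a dataset of N rows in which the value occurs K times. The function name overrepresentation_score and the example counts are hypothetical.

```python
from math import comb, log


def overrepresentation_score(k, n, K, N):
    """Score a value seen in k of n subset rows and K of N rows overall.

    Stand-in model (an assumption, not Dcipher's documented method):
    the hypergeometric upper tail P(X >= k). The score is -log(p),
    so a lower probability gives a higher score.
    Values that are not over-represented get a score of 0.0.
    """
    if n == 0 or k * N <= K * n:  # share in subset <= share overall
        return 0.0
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    p = tail / comb(N, n)
    return float("inf") if p == 0.0 else -log(p)


# Hypothetical counts: (k in subset, n subset rows, K overall, N total rows)
examples = {
    "frequent": (70, 100, 500, 1000),  # biggest absolute difference (0.70 vs 0.50)
    "rare": (1, 100, 1, 1000),         # biggest relative difference (10x)
    "typical": (20, 100, 40, 1000),    # clearly concentrated in the subset
}

scores = {name: overrepresentation_score(*c) for name, c in examples.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

With these counts, absolute difference would rank "frequent" first and relative difference would rank "rare" first, while the probability-based score ranks "typical" first, because a single occurrence of a rare value is weak evidence. Exact integer arithmetic keeps the sketch simple but is only practical for small datasets.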

👉 Example. In the previous example with tweets, the biggest bubbles represent the words that are most characteristic of Indian tweets compared to all the tweets in the dataset.

Example: Overrepresented words in tweets from India

In this example, the dataset contains 10,000 tweets containing the keywords "antibody" or "antibodies". We would like to know what words are overrepresented in tweets from India.

We drag the entire dataset to the Table View. After locating the "country" column, we click the "filter" icon and set the filter to only include tweets from India. After applying the filter, the Table View is now displaying the subset of tweets that are posted from India.

We then drag the "tokens" field to the Bubble workbench. We do this a second time so that we can compare the overrepresented tokens with the overall token frequencies. Switching from "bubble" to "word" mode displays the tokens as words.

To drag-and-drop the tweets from the Table View, we click a column header and drag the selection to the "Find overrepresented tokens" drop zone. This triggers the operation.

The words that are most over-represented in tweets from India are now shown, with word size reflecting the degree of over-representation.
