Keywords: data pivoting, descriptive statistics, grouping, group by field, aggregation, Table View
Data pivoting rearranges data through groupings and aggregations, enabling viewing the data from a different perspective. It's useful for generating descriptive statistics, for example by counting, summarizing, or calculating averages, minimums, maximums, and so on.
A few examples of what data pivoting can be used for:
To count researchers' articles and summarizing their citations in a dataset of academic articles (see Example 1 below).
To count the number of sentences, paragraphs, and articles that words are used in, as well as the total number of times they are used (see Example 2 below).
To calculate the average sentiment of tweets by author and year (see Example 3 below).
If you are used to using GROUP BY statements in SQL, data pivoting in Dcipher Analytics will be intuitive.
☝️ Note: While this guide describes how to group and aggregate data locally through drop zones in Table View, you can use the Aggregates operation to generate an aggregated dataset in the global pipeline.
1. Select grouping field(s)
Drag your primary grouping field from the Schema workbench to the "Group by field" drop zone in the Table View. If you want to include a second (or third, etc) grouping field, drag it to the "Add field to grouping" drop zone in the Table View.
The values in the grouping field(s) now appear in the Table View, along with the corresponding aggregated data, which by default uses
count over "id" as the aggregation function.
☝️ Note: Unlike a typical pivot table (where one grouping field is displayed across columns, the second across rows, and the aggregated data is displayed in the cells), here each grouping field and each aggregate field is assigned a column, meaning that each row represents a combination of grouped values and their aggregated data. The benefit of this structure is that (1) more than two grouping fields can be used to generate more granular combinations (i.e. a multi-dimensional pivot table; see Example 3 below); and (2) more than one aggregation function can be applied (see Example 1 and Example 2 below).
2. Specify or change the aggregation function(s)
To change the aggregation function, double-click the column header of the column with the aggregated data. In the settings window appearing, specify the function in the "Function" drop-down and the field that the function should be applied on in the "Field" drop-down.
💡 Quick tip: To count the total number of occurrences, for example of words in the text, select the asterisk (*) in the "Field" drop-down.
To add additional aggregations, drag the field you want to aggregate over from the Schema workbench to the "Apply function" drop zone in the Table View. After dropping the field, click the drop-down menu inside the drop zone and specify the function. A new aggregation column will be added with the aggregated data.
Example 1: Counting researchers' articles and summarizing their citations
We have a dataset with scholarly articles, where each row corresponds to an article. The dataset contains a numeric field, "Citations", with each article's number of citations, as well as an entity collection field, "Authors", with each paper's authors (created by splitting the original authors' text field where author names were separated by semicolon). We want to create a table with researchers as rows, displaying each researcher's number of articles and citations.
To do this, we first drag the "Authors.value" field to the "Group by field" drop zone in the Table View. This gives author names in the first column and the number of articles (based on the default function
count(id)) in the second column.
To add a third column with the sum of citations for each author, we drag the "Citations" field to the "Apply function" drop zone in the Table View and select "Sum" as the function. This results in the desired table, showing authors, their number of articles, and their number of citations across these articles. The authors can be sorted by the number of articles or the number of citations, in ascending or descending order, by clicking the arrows in the column headers.
Example 2: Counting words by different units of text
Our dataset contains articles, which have been split into paragraphs (in the "paragraphs" field), which have in turn been split into sentences (in the "sentences" field), which have finally been tokenized into words (in the "tokens" field). The dataset also contains the field "date" with the publication date of each article. We want to count the number of times each word has occurred in total, the number of sentences, paragraphs, and articles it has occurred in, and the number of dates it occurred during.
We start by dragging "tokens.value" to the "Group by field" drop zone in the Table View. To count the total number of occurrences each word, we double-click the aggregation column header and change the default function (
count(*). We have now grouped our words and counted their number of mentions across the articles.
We then drag "id", "paragraphs.id", and "sentences.id", and "date" one-by-one from the Schema workbench to the "Apply function" drop zone in the Table View and select "count" as the function. This adds the four additional aggregation columns to the table so that we have now arrived at the desired outcome.
Example 3: Calculating the average sentiment of tweets by author and year
We have a dataset with tweets, containing each tweet's author (in the "author" field), publication date (in the "created_at" field, which we got by converting the date format of the original date field), and sentiment (in the "sentiment" field, which we got by running sentiment analysis on the tweets). We want to calculate the average sentiment score of posts posted by each account, each year. This would give us a picture of how the sentiment has changed over time for each account.
To do this, we drag "author.name" to the "Group by field" drop zone and "created_at" to the "Add field to grouping" drop zone in the Table View. We then drag "sentiment.score" to the "Apply function" drop zone and select "Average" as the function. This gives the desired table with authors, publication dates, and the average sentiment score for each author-date pair.