The prolific Hadley Wickham did it again! This time he came up, together with Romain Francois, with an amazing library for data manipulation that turns the task of making Pivot Tables in R a real breeze. Enter dplyr. Along the lines of ggplot2, also from the same main author, dplyr implements a grammar of data manipulation and also introduces a new syntax using “pipe” operators.
What is a “pipe” operator? Well, the best way to learn about it is to use it. Ready to experiment with dplyr? Let’s get started.
1. Get the sample data file
We will be using the same Sales data file from my previous posts Pivot Tables in R with melt and cast and Pivot Tables in R with aggregate. Please refer to those posts for detailed instructions on how to download and prepare the data file for use in R.
2. Load the data file into R
This can be done simply with the command:
1 |
data <– read.csv(“sales_data.csv”, header=TRUE) |
Here the first 6 rows from the resulting data frame:
1 2 3 4 5 6 7 8 |
> head(data) Country Salesperson Order.Date OrderID Order.Amount 1 UK Suyama 37812 10249 1863.40 2 USA Peacock 37813 10252 3597.90 3 USA Peacock 37814 10250 1552.60 4 UK Dodsworth 37817 10255 2490.50 5 USA Leverling 37817 10251 654.06 6 UK Buchanan 37818 10248 440.00 |
3. Install and load the dplyr library
dplyr is available directly from CRAN (the latest version is 0.2 at the time of this writing). It can be installed and made available to R with the following commands:
1 2 |
> install.packages(“dplyr”) > library(dplyr) |
After having loaded the dplyr library, you may get some warnings that some objects are masked by other packages. This is ok and it is not to worry about at this time.
4. Getting familiar with dplyr
dplyr comes with a set of functions that can be used to perform the most common manipulation on data. They all work on data frames and table data frames, a new “smarter” version of data frames supported by dplyr.
You can convert a data frame to a table data frame through the tbl_df() function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
data_tdf <– tbl_df(data) > data_tdf Source: local data frame [799 x 5] Country Salesperson Order.Date OrderID Order.Amount 1 UK Suyama 37812 10249 1863.40 2 USA Peacock 37813 10252 3597.90 3 USA Peacock 37814 10250 1552.60 4 UK Dodsworth 37817 10255 2490.50 5 USA Leverling 37817 10251 654.06 6 UK Buchanan 37818 10248 440.00 7 USA Leverling 37818 10253 1444.80 8 USA Leverling 37819 10256 517.80 9 USA Peacock 37824 10257 1119.90 10 UK Buchanan 37825 10254 556.62 .. ... ... ... ... ... > |
While converting to a table data frame is strictly not necessary to use dplyr, the new format provides a smarter way to display large data sets with many variables (columns). Since our data set is pretty limited, we will keep working with our data in a data frame format. Note however that the output of most dplyr functions will be a table data frame, then converted back transparently to a data frame.
5. Data manipulation verbs
dplyr introduces five data manipulation verbs, namely filter(), arrange(), select(), mutate() and summarise(). When it comes to the task of producing Pivot Tables, summarise() is our working horse.
Note to my american friends: you can use summarize() in place of summarise() and it will work as well! 🙂
summarise() takes multiple rows of a data frame and summarises (collapses) them into a single one by applying a function, for example sum() to sum their values or mean() to calculate the average.
summarise() alone is not very useful though because it does not allow any grouping of the results before applying the summarizing function to the data. Since Pivot Tables are obtained by first grouping the rows according to the value of a variable (column) and then applying a summarizing function to each group, we need a way to group rows in dplyr first. Enter group_by().
6. Group_by()
group_by() enables data manipulation verbs to be applied to each subgroup of data, bringing then back the result of each group in a single data frame.
Back to our sample data, we want to obtain the total amount each Sales Person has sold. Using dplyr, first we group the data by Salesperson with group_by(), then apply summarise() to each group to find the total sum.
Since group_by() should be called first and the results passed to summarise(), we end up with the following fully working but quite convoluted syntax:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
> summarise(group_by(data, Salesperson), Order.Amount=sum(Order.Amount)) Source: local data frame [9 x 2] Salesperson Order.Amount 1 Buchanan 68792.25 2 Callahan 123032.67 3 Davolio 182500.09 4 Dodsworth 75048.04 5 Fuller 162503.78 6 King 116962.99 7 Leverling 201196.27 8 Peacock 225763.68 9 Suyama 72527.63 |
While the results are exactly as expected (please compare with the previous articles), the nested functions we had to use are pretty hard to read. Not surprisingly, the authors of dplyr have come up with a brilliant idea to simplify this complexity and make chained data operation extremely easy.
7. The pipe operator
What we are doing with the previous function calls is basically a chain of operations like this:
- take data
- group data by Salesperson using group_by()
- take the result from the previous step and calculate the total for each group (aka Salesperson) it using summarise()
In other words, data goes into group_by() and the result of group_by() goes into summarise() producing the final pivot table.
The “pipe” operator introduced by dplyr does exactly this. It sends a piece of data as input to a function and then allows the output from the function to go into another function and so on. Using the pipe operator, we can produce the exact same pivot tables with the following much more readable syntax:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
> data %>% group_by(Salesperson) %>% summarise(Order.Amount=sum(Order.Amount)) Source: local data frame [9 x 2] Salesperson Order.Amount 1 Buchanan 68792.25 2 Callahan 123032.67 3 Davolio 182500.09 4 Dodsworth 75048.04 5 Fuller 162503.78 6 King 116962.99 7 Leverling 201196.27 8 Peacock 225763.68 9 Suyama 72527.63 |
Where %>% is the pipe operator. Basically we have taken our data frame data, we have sent it to group_by() telling it to group by Salesperson, then we have taken the output from group_by() and sent it to summarise() telling it to summarise each group by calculating the total of Order. Amount within each group. Isn’t it great? 🙂
No unnecessary repetition of data, no need to use temporary variable to store the results, no need to nest functions. Everything happens automatically behind the scenes and is amazingly fast on large data sets thanks to the use of optimized C/C++ code. Great job Handy and Romain!!!
8. Filtering
How about doing some filtering too? Say that, like we did in a previous article, we want to product a Pivot Tables only for the Sales person in USA. We can achieve this easily with dplyr using the filter() verb. We filter the data before grouping and summarising them. Here is the syntax using again the pipe operator:
1 2 3 4 5 6 7 8 9 |
> data %>% filter(Country==“USA”) %>% group_by(Salesperson) %>% summarise(Order.Amount=sum(Order.Amount)) Source: local data frame [5 x 2] Salesperson Order.Amount 1 Callahan 123032.7 2 Davolio 182500.1 3 Fuller 162503.8 4 Leverling 201196.3 5 Peacock 225763.7 |
Fast, clean and easy to read!
9. Unleashing the power of dplyr
dplyr is particularly powerful also because it also allows multiple operations to be performed at once on the same data.
For the sake of this example, let’s add to our data set a fictitious Quantity column, representing the quantity sold for each order. We do this by generating a normally distributed Unit Sales Price with mean = 10 and standard deviation = 1, then calculating the Quantity as Order.Amount / Unit.Sales.Price.
Even if calculating the Quantity requires the Unit Sales Price to be defined first, with dplyr we can do everything at once, that is in one single pass, using mutate(). Here is the syntax and the outcome:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
> data <– data %>% mutate(Unit.Sales.Price = rnorm(n=n(), mean=10, sd=1), Quantity = Order.Amount / Unit.Sales.Price) > head(data) Country Salesperson Order.Date OrderID Order.Amount Quantity 1 UK Suyama 37812 10249 1863.40 201.77949 2 USA Peacock 37813 10252 3597.90 307.72409 3 USA Peacock 37814 10250 1552.60 173.32053 4 UK Dodsworth 37817 10255 2490.50 266.73936 5 USA Leverling 37817 10251 654.06 67.22134 6 UK Buchanan 37818 10248 440.00 40.88416 Unit.Sales.Price 1 9.234834 2 11.691967 3 8.957970 4 9.336830 5 9.729946 6 10.762114 |
Incidentally, note that the visualization of our modified data set wraps-around. Not very user friendly. Converting it to a table data frame makes its visualization nicer:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
> tbl_df(data) Source: local data frame [799 x 7] Country Salesperson Order.Date OrderID Order.Amount Quantity 1 UK Suyama 37812 10249 1863.40 201.77949 2 USA Peacock 37813 10252 3597.90 307.72409 3 USA Peacock 37814 10250 1552.60 173.32053 4 UK Dodsworth 37817 10255 2490.50 266.73936 5 USA Leverling 37817 10251 654.06 67.22134 6 UK Buchanan 37818 10248 440.00 40.88416 7 USA Leverling 37818 10253 1444.80 142.07045 8 USA Leverling 37819 10256 517.80 56.71080 9 USA Peacock 37824 10257 1119.90 115.06074 10 UK Buchanan 37825 10254 556.62 60.60626 .. ... ... ... ... ... ... Variables not shown: Unit.Sales.Price (dbl) |
Not only the new class table data frame shows the dimensions of the data frame as [799 x 7], but it prints nicely without wrapping around and specifying that one variable, Unite.Sales.Price, was not displayed.
Ok, now that we have our Quantities in place, we want to produce a Pivot Tables that has, for each sales person, the total amount sold and the average sold quantity. This is very easy too to achieve with dplyr. Here is the syntax:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
> data %>% group_by(Salesperson) %>% summarise(Total.Order.Amount=sum(Order.Amount), Average.Quantity=mean(Quantity)) Source: local data frame [9 x 3] Salesperson Total.Order.Amount Average.Quantity 1 Buchanan 68792.25 167.7801 2 Callahan 123032.67 125.1866 3 Davolio 182500.09 156.8355 4 Dodsworth 75048.04 188.9096 5 Fuller 162503.78 176.6365 6 King 116962.99 183.9184 7 Leverling 201196.27 159.4946 8 Peacock 225763.68 150.8685 9 Suyama 72527.63 112.2168 |
In fact summarise() can process multiple variables at once, each one with a different summarisation function, or apply different summarisation functions to the same variable.
Say, for example, we want to calculate for each sales person the total amount sold AND the maximum order size. This is very simple with dplyr compared to Excel, where you can only calculate one summary per column (unless you duplicate a column):
1 2 3 4 5 6 7 8 9 10 11 12 13 |
> data %>% group_by(Salesperson) %>% summarise(Total.Order.Amount=sum(Order.Amount), Max.Order.Amount=max(Order.Amount)) Source: local data frame [9 x 3] Salesperson Total.Order.Amount Max.Order.Amount 1 Buchanan 68792.25 9210.90 2 Callahan 123032.67 4825.00 3 Davolio 182500.09 15810.00 4 Dodsworth 75048.04 11380.00 5 Fuller 162503.78 16387.50 6 King 116962.99 12615.05 7 Leverling 201196.27 10952.84 8 Peacock 225763.68 11188.40 9 Suyama 72527.63 4707.54 |
Well done! This is it for today. I plan to look again at dplyr and its handy pipe syntax in other upcoming posts. As always, your comments and remarks are more than welcome.
Till next time!
Ciao,
Marco.
9 responses to “Pivot Tables in R with dplyr”
Great article, thanks!
How do we get multiple column labels in the pivot table? I would like to have the year or the month as a column. If there are orders in three different year, you would have 3 columns; 2012,2013,2014 with order totals for each row.
Hi Marcel,
the most elegant way to do what you ask for is to use the new tidyr package, also from Hadley Wickham (the same author of dplyr). tidyr is an evolution of reshape and reshape2 and uses the same pipe syntax as dplyr. Once you have your pivot table with the Total.Order amount for each Year, Country and Salesperson, you can spread it over the years using spread() from tidyr with the following syntax:
pivot2 <- pivot %>% spread(Order.Year, Total.Order)
This will turn pivot which looks like this:
> pivot
Source: local data frame [27 x 4]
Groups: Order.Year, Country
Order.Year Country Salesperson Total.Order
1 2003 UK Buchanan 17667.20
2 2003 UK Dodsworth 9894.51
3 2003 UK King 15232.16
4 2003 UK Suyama 14519.68
5 2003 USA Callahan 19024.70
6 2003 USA Davolio 30861.76
7 2003 USA Fuller 17811.46
8 2003 USA Leverling 18137.56
9 2003 USA Peacock 49945.11
10 2004 UK Buchanan 31433.16
...
Into pivot2 which looks like this:
> pivot2
Source: local data frame [9 x 5]
Country Salesperson 2003 2004 2005
1 UK Buchanan 17667.20 31433.16 19691.89
2 UK Dodsworth 9894.51 24756.89 40396.64
3 UK King 15232.16 59827.19 41903.64
4 UK Suyama 14519.68 40826.37 17181.58
5 USA Callahan 19024.70 57090.02 46917.95
6 USA Davolio 30861.76 91763.51 59874.82
7 USA Fuller 17811.46 71168.14 73524.18
8 USA Leverling 18137.56 103805.47 79253.24
9 USA Peacock 49945.11 121628.71 54189.86
Hope this helps, otherwise please feel free to comment back.
Cheers,
Marco.
Very usefull, thank you.
I was trying to add a column with SUM (as Excel does by default).
I met two problems:
1> when adding something like
mutate(Total=sum(2003:2005))
to the pipe, I got 6012 value for every row (so it’s treated as values, not column names),2> looking for solution, I tried to change column names to y2003, y2004, y2005, but could not achieve it in pipe
Hi Milos,
you can use colSums() for that. If your data frame is a:
mutate(a, Total=colSum(a))
will add a Total column to it with totals by row. sum(2003:2005) would simply return 2003+2004+2005 as 2003:2005 is equivalent to the vector c(2003,2004,2005)
Hope this helps.
Ciao,
Marco.
Hi Marco,
Thanks for sharing this. Found it pretty useful.
Is it possible to pass a dynamic variable (let’s call it grp_var) to group_by(), which may take the value ‘Salesperson’ or ‘Country’ depending on certain condition(s)? In effect, I would like to set the dimension for ‘Rows’ of the pivot dynamically. Any pointers to this? Thanks in advance!
Regroup function does that for me 🙂
http://stackoverflow.com/questions/21815060/dplyr-how-to-use-group-by-inside-a-function
Sorry for the churn!
Excellent article! could you shed light on how one can add a row for grand total? Thx, MD.
Just figured it out! Sry.
No worries, glad you did!