If you are making the transition from Excel to R and still can’t figure out how to quickly obtain pivot tables like Excel has, this article is for you!

Actually it is pretty easy to produce Pivot Tables in R. All you need is a package called *reshape* by Hadley Wickham (yes, the same prolific author of *plyr* and *ggplot2*) and some understanding of how *reshape* “thinks” and works.

If you are interested in an alternative, easier but less powerful, method to create Pivot Tables in R using *aggregate*, you can read my other post here.

For this tutorial, we will be using a sample data set called *Salespeople PivotTable report*, which is used in many articles by Microsoft teaching the basics of Pivot Tables in Excel. The link to the sample data file can be found at the bottom of this article.

We will first obtain each result in Excel and then reproduce it with R. I recommend that you download the sample data file and follow along.

### 1. Download the sample data file

Use the link at the bottom of the article and save the file somewhere on your hard drive. So far so good!

### 2. Open the file in Excel and save a CSV copy of the first sheet

Double-click the file from the location where you saved it to open it inside Excel. The Excel worksheet has 3 tabs. We will be using the first one, containing the source data, and the last one, which contains a ready-made Pivot Table to play with. If you do not know how to make Pivot Tables in Excel, I recommend you read the Microsoft tutorial. In this article we will focus on how to obtain the same results with R and we will use the ready-made pivot table on sheet 3.

Once you are on the first sheet of the sample data set (the one with the source data), save a copy of it as CSV (Comma Separated Values) file. This will generate a file which is easier to import into R. R has packages which allow you to read directly Excel files without converting them to CSV, but for the sake of simplicity we will stick to CSV.

Before saving as CSV, change the format of column E (Order Amount) to General so that the orders amounts appear as numbers with 2 decimals and not with a $ sign in front and comma separated thousands. If you export to CSV without doing this first, there will be extra steps required in R to convert the sales amounts to numeric (see the note at the end of the article).

### 3. Load the CSV file in R

This is an easy step accomplished with the *read.CSV* function as follows.

1 |
data <- read.CSV("file_name.txt", header=TRUE) |

file_name.txt is the name of the CSV file you created from Excel. If it is not in the current working directory in R, you can either change R’s working directory with *setwd* or move the file to the current working directory or add a path to the file before the file name.

At this point you should have the same base data open in Excel and loaded into R. In R the first lines will look like this:

1 2 3 4 5 6 7 |
Country Salesperson Order.Date OrderID Order.Amount 1 UK Suyama 10/07/2003 10249 1863.40 2 USA Peacock 11/07/2003 10252 3597.90 3 USA Peacock 12/07/2003 10250 1552.60 4 UK Dodsworth 15/07/2003 10255 2490.50 5 USA Leverling 15/07/2003 10251 654.06 6 UK Buchanan 16/07/2003 10248 440.00 |

### 4. Your first Pivot Table in R with melt and cast

Here we go. This is what we have been waiting for, the rest was just preparation to get to this point. Ready?

We will re-produce in R a pivot table like the default one that appears in the 3rd tab of the sample Excel file and which looks like this.

Basically it shows how much each sales person has sold in total across all orders and and countries (note that the Country filter is set to All). It also shows a Grand Total which is the sum of all sales.

To get the job done in R we will use the *reshape* package. There are other packages available which can achieve the same results, but *reshape* is particularly versatile and easy to use once you have grasped some basic concepts. (And, hey, it has been written by Hadley Wickham, so we _have_ to use it!)

We can install (if not already installed) and load the *reshape* package with:

1 2 |
install.packages("reshape") library(reshape) |

If you have *ggplot2* installed and loaded, *reshape* will already be available.

The *reshape* package contains two functions that are key to easily generate pivot tables in R. They are *melt* and *cast*.

*melt* transforms a data frame from the original format to a so called long format, where all the observed variables (called measures) appear, together with their respective value, in two adjacent columns named **variable** and **value**. Each row of this new data format is identified by a unique combination of the **id** variables, also part of the original data frame.

An example will clarify how *melt* works. Let’s take the first 6 rows of our data set.

1 2 3 4 5 6 7 |
Country Salesperson Order.Date OrderID Order.Amount 1 UK Suyama 10/07/2003 10249 1863.40 2 USA Peacock 11/07/2003 10252 3597.90 3 USA Peacock 12/07/2003 10250 1552.60 4 UK Dodsworth 15/07/2003 10255 2490.50 5 USA Leverling 15/07/2003 10251 654.06 6 UK Buchanan 16/07/2003 10248 440.00 |

It contains 5 columns. The first 4 identify each order by a combination of Country, Salesperson, Order.Date and OrderID. These are all non-numeric and there are no calculations we can do on them except, maybe, counting their frequency. Using *melt*‘s terminology, Country, Salesperson, Order.Date, OrderID are **id** variables, while Order.Amount, which is a numeric and which is the one we would like to sum up in our pivot table, is a **measure**.

When melting your data, you can indicate multiple **id** variables and also multiple **measure** variables. The **id** variable will appear as-is in the resulting melted format, while the **measure** variable will be stacked in the **variable** column with their respective value in the **value** column.

Note that *melt* preserves all data. Nothing is lost or modified, just the way the data are collected within the data frame. It is therefore possible to “un-*melt”* the melted data and go back to the original format at any time.

Let’s melt our sales data frame and see what the result looks like.

1 |
data.m <- melt(data, id=c(1:4), measure=c(5)) |

Pretty simple. Instead of listing the column names we have used their numerical identifiers. We have indicated to *melt* that columns 1 to 4 ( Country, Salesperson, Order.Date, OrderID) are **id** variables while column 5 (Order.Amount) is a **measure** variable.

Here is how the first 6 rows of data.m look like at this point:

1 2 3 4 5 6 7 |
Country Salesperson Order.Date OrderID variable value 1 UK Suyama 10/07/2003 10249 Order.Amount 1863.40 2 USA Peacock 11/07/2003 10252 Order.Amount 3597.90 3 USA Peacock 12/07/2003 10250 Order.Amount 1552.60 4 UK Dodsworth 15/07/2003 10255 Order.Amount 2490.50 5 USA Leverling 15/07/2003 10251 Order.Amount 654.06 6 UK Buchanan 16/07/2003 10248 Order.Amount 440.00 |

As said the column Order.Amount has been “melted” into a variable column with the respective values in the value column.

Now that our data have been melted (and you have to do this only once, unless you change idea on what you want as id and as measure), we are ready to *cast* them to build the pivot table.

*cast* requires us to indicate, beside a reference to the melted data, how we want to re-aggregate the values.

The basic syntax is (and, yes, we are omitting some parameters that we won’t need in this tutorial):

1 |
cast(data, formula = ... ~ variable, fun.aggregate=NULL, ..., margins=FALSE, subset=TRUE) |

*formula* and* fun.aggregate* are the most important two because they indicate how we want to reshape the data and which functions to use for aggregating the values.

Think of * formula* as saying:

1 |
what do you want as rows ~ what do you want as columns |

While *fun.aggregate* says how you want to aggregate the values of the variable(s) in order to reshape them as described by *formula*. This is the equivalent of selecting “Count of”, or “Sum of” etc. in Excel.

Going back to our example will make this point clearer.

We want to produce a pivot table which contains the total sales for each sales person. Therefore we chose Salesperson as row and variable (which is Sales.Amount) as column. The aggregate function will be *sum* to obtain the sum of the variable (Sales.Amount) for each sales person. Here the command:

1 |
data.c <- cast(data.m, Salesperson ~ variable, sum) |

And here is the resulting pivot table:

1 2 3 4 5 6 7 8 9 10 |
Salesperson Order.Amount 1 Buchanan 68792.25 2 Callahan 123032.67 3 Davolio 182500.09 4 Dodsworth 75048.04 5 Fuller 162503.78 6 King 116962.99 7 Leverling 201196.27 8 Peacock 225763.68 9 Suyama 72527.63 |

As expected (or not?!?) is identical to the Excel one. Congratulations on your first pivot table in R!

But wait before you pat yourself on the shoulder. In Excel there is an extra row showing the Grand Total and we don’t have it in R. How can we fix it?

### 5. Adding Grand Totals

That’s easy actually. Look back at the syntax for *cast*. There is an argument called *margins* which comes to rescue. *margins* accepts a vector that can contain the strings “grand_col” and “grand_row”. This adds an extra column or extra row (or both) with a grand total to the resulting data frame. The grand total is obtained through the same *fun.aggregate.*

Let’s add a row with the grand total.

1 |
data.c <- cast(data.m, Salesperson ~ variable, sum, margins=c("grand_row")) |

And here it is, identified by (all):

1 2 3 4 5 6 7 8 9 10 11 |
Salesperson Order.Amount 1 Buchanan 68792.25 2 Callahan 123032.67 3 Davolio 182500.09 4 Dodsworth 75048.04 5 Fuller 162503.78 6 King 116962.99 7 Leverling 201196.27 8 Peacock 225763.68 9 Suyama 72527.63 10 (all) 1228327.40 |

Well done!

### 6. And how to Filter

Ok, ok. You are a picky one! I know you noted that our pivot table lacks one more feature respect to the Excel one. It has in fact no ability to Filter by Country. Did you get it yet? No!?!

Yes, the solution is in the other parameter for *cast* which we did not consider so far (beside those we omitted): *subset*.

*subset* allows us to limit the casting to only a subset of the melted data selected according to certain criteria. It requires a vector of TRUE or FALSE corresponding to the rows of the melted data we want to select (TRUE) or exclude (FALSE) before the casting is performed.

Say that we want to calculate the performance only for the American sales people. We need to select only those rows for which Country==”USA”. This is exactly the additional condition we need to specify within *cast* and it corresponds to selecting the Filter Country = USA a the top of the Excel pivot table.

1 |
data.c <- cast(data.m, Salesperson ~ variable, sum, margins=c("grand_row"), subset=(Country=="USA")) |

And here is the filtered pivot. The grand total has also been adjusted accordingly.

1 2 3 4 5 6 7 |
Salesperson Order.Amount 1 Callahan 123032.7 2 Davolio 182500.1 3 Fuller 162503.8 4 Leverling 201196.3 5 Peacock 225763.7 6 (all) 894996.5 |

Congratulations, now we are really done! 🙂

I hope you have enjoyed this tutorial on how to make simple pivot tables in R. Please leave a comment below if you would like to see more tutorials on the same topic.

Till next time!

### Note: Stripping $ and commas and converting to numeric

Ok, so you decided not to change the format of Column E (Amount) to General before saving the table as CSV. I had warned you. Now your punishment will be terrible and will include… regular expressions! Read on.

Once you open the CSV file in R, you can verify that the column Order.Amount has indeed been imported as a Factor. In order to be able to do calculations on it, we need to convert it to numeric first. This can be easily achieved using the function *as.numeric*, however the $ sign at the beginning and the commas separating the thousands will cause it to generate unexpected results. Before running through *as.numeric* we need therefore to strip both the $ sign and the commas. This can be quickly done using *gsub* and regular expressions. Teaching you regular expressions goes beyond the scope of this article, but the following code does the job.

1 2 |
data$Order.Amount <- gsub("([/$,])", "", data$Order.Amount) data$Order.Amount <- as.numeric(data$Order.Amount) |

Obviously the two steps above can be combined into one by nesting *gsub* into *as.numeric*.

### Source data:

Update: it seems that the *Salespeople PivotTable report* sample data file is no longer available on Microsoft’s web site. You can still get a copy from the link below.

Originally the file was located here:

http://office.microsoft.com/en-001/templates/salespeople-pivottable-report-TC001091651.aspx?CTT=5&origin=HA001087565

## 29 Comments

Super useful tutorial, thanks!

Thanks, glad you liked it! Did you also take a look at the alternative with aggregate?

Cheers,

Marco.

Thanks a bunch! This was very helpful. I’m an advanced EXCEL user that is now learning and transitioning to R. I have had it rough going to the non-GUI world. I want to read your other articles.

You are welcome Mike! Good luck with R, I am sure you will find out soon how powerful it can be.

Thanks Marco, very helpful.

I’d like to aggregate on more than one variable,. Also, I need to aggregate some variables by sum, and some by max value.

Do you have any ideas on how to do it?

Thanks

Hi Arianna,

the easiest way to do this is to use the new dplyr package by Hadley Wickham. Please check out the article I just published on Pivot Tables in R with dplyr. It contains examples about aggregating multiple variables with different summary statistics. including sum, average and max. I believe it should help you to solve your problem, otherwise please feel free to ask again.

Cheers,

Marco.

How can I pivot a data frame with value as a customized field? Unlike cast’s FUN as sum or mean, can I do field1 / field2?

Hi Edward,

you can always add extra calculated columns (aka variables) in the usual way (data$new.column = data$column.1 / data$column.2) or you can take advantage of libraries like dplyr that make the task easy. I just wrote a new article about using dplyr to make Pivot Tables, maybe you can take a look and see if it helps to solve your problem. I added an example on calculating new columns from existing ones and then pivoting them.

Cheers,

Marco.

It is a great article and I loved it. But instead of sum, i want to calculate: Average, count, percentage count etc. I tried to replace sum with average etc but it is giving me an error stating: object ‘count’ not found. Could you please help me out.

Thanks

Hi Rupinder,

It is difficult to help without seeing your code. I just wrote an article on using dplyr to make Pivot Tables. dplyr makes it easier to calculate multiple statistics at once for the same variable. Please have a look and let me know if it works for you.

Ciao,

Marco.

thanks Marco…a minor tip: after importing the file into R and in the step when I did the cast operation, I got the error message

“Error in Summary.factor(integer(0), na.rm = FALSE) :

sum not meaningful for factors”

…this drove me around the bend till I realized that the Amount column, though it looked like numbers was in text form. The solution is to go back to the data.csv file in Excel, select the Amount column, and change its format from General t Number. Hope this helps

ajit

Hi Ajit,

thanks for your comment. Indeed, one should always check with the str() command with which types the variables are being imported into R. This is known to be a source of headaches, especially when variables get or do not get converted to factors. Good that you were able to spot the problem and solve it.

Ciao,

Marco.

Just a big THANK YOU for this concise real-life example.

After seeing the results of your little example (and realizing the flexibility of the coding vs. the traditional Excel “point&click” orgies plus the data outpout format which is better suitable for further analysis than the default Excel pivot result), I am determinded to dare the transition to R (…daunting as the task my seem; the learning curve is really steep for a start – but the reward is bound to come….).

One very positive & motivating factor is the amount of information and help that is available for R from people like you – so you may take this compliment literally personal; but you can also take this as a “pars-pro-toto” acknowledgement for the entire R-community 🙂

Mille grazie !

Hi Ulf,

sehr gerne!

It is nice to receive comments like yours as they provide the energy to keep sharing. I am glad you are making the transition from Excel to R. When I introduced it into my company for a special financial analysis project which used to be done with Excel, we cut the total time from many days to few hours. The quality and our confidence in the results also greatly improved to everyone’s satisfaction.

By the way, I also started to make a series of YouTube videos for anyone wanting to make the move from Excel to R. They are called R for Excel Users and show side by side how the same tasks are done in Excel and in R. You may want to check them out, maybe they are useful to you too. I just need to find the time to continue the series. 🙂

For any questions, please feel free to contact me, I will be glad to help if I can.

MfG,

Marco.

Great Post my friend !

Thank you Nikhil, glad you liked it! You may want to check out also my other posts about Pivot Tables in R with aggregate and with dplyr.

Cheers,

Marco.

Excellent tutorial!

Thanks Carlos! 🙂

Hi Marco, very helpful tutorial, thanks! I have a question:

I start with 3 columns: TCR, Tcell (which are my id’s) and Seq_count which has my measured data.

I used melt as follows: md =5))

or variations thereof, I continually get “Error in cast(data, formula, fun.aggregate, …, subset = subset, fill = fill, :

object ‘variable’ not found” as my message.

Can you give me an idea of what I’m doing wrong?

Thanks!

Hi Zoe,

not easy to guess what could be wrong without seeing an example of your data and the exact syntax of the cast command you are issuing (if it was in your original comment it didn’t come through unfortunately).

I would recommend that you first use str() on your data frame to check you have the right data types in it, then check your cast syntax. Are you referring a variable which is not defined anywhere?

Cheers,

Marco.

Hi Marco,

Very comprehensive explanation indeed. Thank You very much for sharing.

Regards,

Sharad Naidu.

You are welcome Sharad!

Hi Marco,

What is the code i should use if I want the count instead of the sum?

Eg Column 1 = Salesperson, Column2 = Count the no of times the particular salesperson appeared in the data set

Best Regards,

LH

Hi LH,

the plyr package has a count() function which does what you need. Have a look at it.

Cheers,

Marco.

Hi Marc,

I am trying to migrate from Excel to R and find your articles very easy to comprehend.

In this context, I am stuck with an issue. I have a “Price” variable with $ appended and converted it into numeric using the gsub and as.numeric functions mentioned by you. However, when I am melting it, it is reconverting it into factor. Consequently, when I am trying to aggregate it in the Cast function, I am getting a factor error. Please suggest a way out.

Thanks

Prashanth

Hi Prashanth,

melt and cast should not be changing the type of the variables they work with, so I am wondering whether your Price variable has the right type in the first place. Did you check with str() that your conversion to numeric has been successful? Please post some sample data if you like, so I can simulate what you are doing.

Cheers,

Marco.

Great tutorial.Awesome!