Pivot Tables in R with melt and cast

[Total: 3    Average: 2/5]

If you are making the transition from Excel to R and still can’t figure out how to quickly obtain pivot tables like Excel has, this article is for you!

Actually it is pretty easy to produce Pivot Tables in R. All you need is a package called reshape by Hadley Wickham (yes, the same prolific author of plyr and ggplot2) and some understanding of how reshape “thinks” and works.

If you are interested in an alternative, easier but less powerful, method to create Pivot Tables in R using aggregate, you can read my other post here.

For this tutorial, we will be using a sample data set called Salespeople PivotTable report, which is used in many articles by Microsoft teaching the basics of Pivot Tables in Excel. The link to the sample data file can be found at the bottom of this article.

We will first obtain each result in Excel and then reproduce it with R. I recommend that you download the sample data file and follow along.

1. Download the sample data file

Use the link at the bottom of the article and save the file somewhere on your hard drive. So far so good!

2. Open the file in Excel and save a CSV copy of the first sheet

Double-click the file from the location where you saved it to open it inside Excel. The Excel worksheet has 3 tabs. We will be using the first one, containing the source data, and the last one, which contains a ready-made Pivot Table to play with. If you do not know how to make Pivot Tables in Excel, I recommend you read the Microsoft tutorial. In this article we will focus on how to obtain the same results with R and we will use the ready-made pivot table on sheet 3.

Once you are on the first sheet of the sample data set (the one with the source data), save a copy of it as CSV (Comma Separated Values) file. This will generate a file which is easier to import into R. R has packages which allow you to read directly Excel files without converting them to CSV, but for the sake of simplicity we will stick to CSV.

Before saving as CSV, change the format of column E (Order Amount) to General so that the orders amounts appear as numbers with 2 decimals and not with a $ sign in front and comma separated thousands. If you export to CSV without doing this first, there will be extra steps required in R to convert the sales amounts to numeric (see the note at the end of the article).

3. Load the CSV file in R

This is an easy step accomplished with the read.CSV function as follows.

file_name.txt is the name of the CSV file you created from Excel. If it is not in the current working directory in R, you can either change R’s working directory with setwd or move the file to the current working directory or add a path to the file before the file name.

At this point you should have the same base data open in Excel and loaded into R. In R the first lines will look like this:

4. Your first Pivot Table in R with melt and cast

Here we go. This is what we have been waiting for, the rest was just preparation to get to this point. Ready?

We will re-produce in R a pivot table like the default one that appears in the 3rd tab of the sample Excel file and which looks like this.

Capture_1

Basically it shows how much each sales person has sold in total across all orders and and countries (note that the Country filter is set to All). It also shows a Grand Total which is the sum of all sales.

To get the job done in R we will use the reshape package. There are other packages available which can achieve the same results, but reshape is particularly versatile and easy to use once you have grasped some basic concepts. (And, hey, it has been written by Hadley Wickham, so we _have_ to use it!)

We can install (if not already installed) and load the reshape package with:

If you have ggplot2 installed and loaded, reshape will already be available.

The reshape package contains two functions that are key to easily generate pivot tables in R. They are melt and cast.

melt transforms a data frame from the original format to a so called long format, where all the observed variables (called measures) appear, together with their respective value, in two adjacent columns named variable and value. Each row of this new data format is identified by a unique combination of the id variables, also part of the original data frame.

An example will clarify how melt works. Let’s take the first 6 rows of our data set.

It contains 5 columns. The first 4 identify each order by a combination of Country, Salesperson, Order.Date and OrderID. These are all non-numeric and there are no calculations we can do on them except, maybe, counting their frequency. Using melt‘s terminology, Country, Salesperson, Order.Date, OrderID are id variables, while Order.Amount, which is a numeric and which is the one we would like to sum up in our pivot table, is a measure.

When melting your data, you can indicate multiple id variables and also multiple measure variables. The id variable will appear as-is in the resulting melted format, while the measure variable will be stacked in the variable column with their respective value in the value column.

Note that melt preserves all data. Nothing is lost or modified, just the way the data are collected within the data frame. It is therefore possible to “un-melt” the melted data and go back to the original format at any time.

Let’s melt our sales data frame and see what the result looks like.

Pretty simple. Instead of listing the column names we have used their numerical identifiers. We have indicated to melt that columns 1 to 4 ( Country, Salesperson, Order.Date, OrderID) are id variables while column 5 (Order.Amount) is a measure variable.

Here is how the first 6 rows of data.m look like at this point:

As said the column Order.Amount has been “melted” into a variable column with the respective values in the value column.

Now that our data have been melted (and you have to do this only once, unless you change idea on what you want as id and as measure), we are ready to cast them to build the pivot table.

cast requires us to indicate, beside a reference to the melted data, how we want to re-aggregate the values.

The basic syntax is (and, yes, we are omitting some parameters that we won’t need in this tutorial):

formula and fun.aggregate are the most important two because they indicate how we want to reshape the data and which functions to use for aggregating the values.

Think of  formula as saying:

While fun.aggregate says how you want to aggregate the values of the variable(s) in order to reshape them as described by formula. This is the equivalent of selecting “Count of”, or “Sum of” etc. in Excel.

Going back to our example will make this point clearer.

We want to produce a pivot table which contains the total sales for each sales person. Therefore we chose Salesperson as row and variable (which is Sales.Amount) as column. The aggregate function will be sum to obtain the sum of the variable (Sales.Amount) for each sales person. Here the command:

And here is the resulting pivot table:

As expected (or not?!?) is identical to the Excel one. Congratulations on your first pivot table in R!

But wait before you pat yourself on the shoulder. In Excel there is an extra row showing the Grand Total and we don’t have it in R. How can we fix it?

5. Adding Grand Totals

That’s easy actually. Look back at the syntax for cast. There is an argument called margins which comes to rescue. margins accepts a vector that can contain the strings “grand_col” and “grand_row”. This adds an extra column or extra row (or both) with a grand total to the resulting data frame. The grand total is obtained through the same fun.aggregate.

Let’s add a row with the grand total.

And here it is, identified by (all):

Well done!

6. And how to Filter

Ok, ok. You are a picky one! I know you noted that our pivot table lacks one more feature respect to the Excel one. It has in fact no ability to Filter by Country. Did you get it yet? No!?!

Yes, the solution is in the other parameter for cast which we did not consider so far (beside those we omitted): subset.

subset allows us to limit the casting to only a subset of the melted data selected according to certain criteria. It requires a vector of TRUE or FALSE corresponding to the rows of the melted data we want to select (TRUE) or exclude (FALSE) before the casting is performed.

Say that we want to calculate the performance only for the American sales people. We need to select only those rows for which Country==”USA”. This is exactly the additional condition we need to specify within cast and it corresponds to selecting the Filter Country = USA a the top of the Excel pivot table.

And here is the filtered pivot. The grand total has also been adjusted accordingly.

Congratulations, now we are really done! :-)

I hope you have enjoyed this tutorial on how to make simple pivot tables in R. Please leave a comment below if you would like to see more tutorials on the same topic.

Till next time!

 

Note: Stripping $ and commas and converting to numeric

Ok, so you decided not to change the format of Column E (Amount) to General before saving the table as CSV. I had warned you. Now your punishment will be terrible and will include… regular expressions! Read on.

Once you open the CSV file in R, you can verify that the column Order.Amount has indeed been imported as a Factor. In order to be able to do calculations on it, we need to convert it to numeric first. This can be easily achieved using the function as.numeric, however the $ sign at the beginning and the commas separating the thousands will cause it to generate unexpected results. Before running through as.numeric we need therefore to strip both the $ sign and the commas. This can be quickly done using gsub and regular expressions. Teaching you regular expressions goes beyond the scope of this article, but the following code does the job.

Obviously the two steps above can be combined into one by nesting gsub into as.numeric.

 

Source data:

Link to the Salespeople PivotTable report sample data file:
http://office.microsoft.com/en-001/templates/salespeople-pivottable-report-TC001091651.aspx?CTT=5&origin=HA001087565

18 thoughts on “Pivot Tables in R with melt and cast

    • Thanks, glad you liked it! Did you also take a look at the alternative with aggregate?
      Cheers,
      Marco.

  1. Thanks a bunch! This was very helpful. I’m an advanced EXCEL user that is now learning and transitioning to R. I have had it rough going to the non-GUI world. I want to read your other articles.

    • You are welcome Mike! Good luck with R, I am sure you will find out soon how powerful it can be.

  2. Thanks Marco, very helpful.
    I’d like to aggregate on more than one variable,. Also, I need to aggregate some variables by sum, and some by max value.
    Do you have any ideas on how to do it?
    Thanks

    • Hi Arianna,
      the easiest way to do this is to use the new dplyr package by Hadley Wickham. Please check out the article I just published on Pivot Tables in R with dplyr. It contains examples about aggregating multiple variables with different summary statistics. including sum, average and max. I believe it should help you to solve your problem, otherwise please feel free to ask again.
      Cheers,
      Marco.

  3. How can I pivot a data frame with value as a customized field? Unlike cast’s FUN as sum or mean, can I do field1 / field2?

    • Hi Edward,
      you can always add extra calculated columns (aka variables) in the usual way (data$new.column = data$column.1 / data$column.2) or you can take advantage of libraries like dplyr that make the task easy. I just wrote a new article about using dplyr to make Pivot Tables, maybe you can take a look and see if it helps to solve your problem. I added an example on calculating new columns from existing ones and then pivoting them.
      Cheers,
      Marco.

  4. It is a great article and I loved it. But instead of sum, i want to calculate: Average, count, percentage count etc. I tried to replace sum with average etc but it is giving me an error stating: object ‘count’ not found. Could you please help me out.

    Thanks

    • Hi Rupinder,
      It is difficult to help without seeing your code. I just wrote an article on using dplyr to make Pivot Tables. dplyr makes it easier to calculate multiple statistics at once for the same variable. Please have a look and let me know if it works for you.
      Ciao,
      Marco.

  5. thanks Marco…a minor tip: after importing the file into R and in the step when I did the cast operation, I got the error message
    “Error in Summary.factor(integer(0), na.rm = FALSE) :
    sum not meaningful for factors”
    …this drove me around the bend till I realized that the Amount column, though it looked like numbers was in text form. The solution is to go back to the data.csv file in Excel, select the Amount column, and change its format from General t Number. Hope this helps
    ajit

    • Hi Ajit,
      thanks for your comment. Indeed, one should always check with the str() command with which types the variables are being imported into R. This is known to be a source of headaches, especially when variables get or do not get converted to factors. Good that you were able to spot the problem and solve it.
      Ciao,
      Marco.

  6. Just a big THANK YOU for this concise real-life example.

    After seeing the results of your little example (and realizing the flexibility of the coding vs. the traditional Excel “point&click” orgies plus the data outpout format which is better suitable for further analysis than the default Excel pivot result), I am determinded to dare the transition to R (…daunting as the task my seem; the learning curve is really steep for a start – but the reward is bound to come….).

    One very positive & motivating factor is the amount of information and help that is available for R from people like you – so you may take this compliment literally personal; but you can also take this as a “pars-pro-toto” acknowledgement for the entire R-community :-)

    Mille grazie !

    • Hi Ulf,
      sehr gerne!
      It is nice to receive comments like yours as they provide the energy to keep sharing. I am glad you are making the transition from Excel to R. When I introduced it into my company for a special financial analysis project which used to be done with Excel, we cut the total time from many days to few hours. The quality and our confidence in the results also greatly improved to everyone’s satisfaction.
      By the way, I also started to make a series of YouTube videos for anyone wanting to make the move from Excel to R. They are called R for Excel Users and show side by side how the same tasks are done in Excel and in R. You may want to check them out, maybe they are useful to you too. I just need to find the time to continue the series. :-)
      For any questions, please feel free to contact me, I will be glad to help if I can.
      MfG,
      Marco.

    • Thank you Nikhil, glad you liked it! You may want to check out also my other posts about Pivot Tables in R with aggregate and with dplyr.
      Cheers,
      Marco.

Leave a Reply

Your email address will not be published. Required fields are marked *

*
*
Website

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">