Identifying outliers in R with ggplot2

One of the first steps when working with a fresh data set is to plot its values to identify patterns and outliers. When outliers appear, it is often useful to know which data point corresponds to them to check whether they are generated by data entry errors, data anomalies or other causes.

Unfortunately, ggplot2 does not have an interactive mode for identifying a point on a chart, so one has to look for other solutions such as GGobi (package rggobi) or iPlots.

However, if all that is needed is to give a “name” to the outliers, ggplot2's labeling capabilities are enough. While labeling all points would usually produce a crowded, hard-to-read plot, we can limit the labels to the points that meet certain conditions, namely our outliers.

Here is an example to illustrate this useful technique. We will be using the following data set consisting of 100 observations. The data set has been generated using rnorm for x and y. The label column provides an identifier for each observation in the form of “Data N” where N is the number of the observation.

To generate an outlier in the data set, the x value for observation number 87 has been changed to 100.
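The data set described above can be rebuilt along these lines. This is a minimal sketch: the seed and the data frame name `data` are assumptions made here for reproducibility, so the exact values will differ from those behind the plots in this post.

```r
# Assumption: a fixed seed so the example is reproducible.
set.seed(42)

# 100 observations, x and y drawn from a standard normal distribution,
# plus a "Data N" identifier for each row.
data <- data.frame(
  x = rnorm(100),
  y = rnorm(100),
  label = paste("Data", 1:100),
  stringsAsFactors = FALSE
)

# Turn observation 87 into an outlier on the x axis.
data$x[87] <- 100
```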

Let’s use qplot to plot the data.
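A sketch of the plotting call, with the data rebuilt under the same assumed seed and names. Because qplot has been deprecated since ggplot2 3.4.0, the block uses the equivalent ggplot() form and keeps the qplot shorthand in a comment.

```r
library(ggplot2)

# Rebuild the example data (seed and names are assumptions).
set.seed(42)
data <- data.frame(x = rnorm(100), y = rnorm(100),
                   label = paste("Data", 1:100),
                   stringsAsFactors = FALSE)
data$x[87] <- 100

# Scatter plot of y against x.
# Shorthand referred to in the text: qplot(x, y, data = data)
p <- ggplot(data, aes(x, y)) + geom_point()
print(p)
```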

And here is the resulting plot.

Rplot

The next step is to label the outlier, and only the outlier (the point with x = 100, observation number 87), with its name. This is as easy as adding a geom_text call to the plot and specifying the condition under which a label is added.
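One way to write that call is sketched below. The condition is the four-times-IQR rule discussed in this post; the `hjust` value of 1.1, the seed and the variable names are assumptions.

```r
library(ggplot2)

# Rebuild the example data (seed and names are assumptions).
set.seed(42)
data <- data.frame(x = rnorm(100), y = rnorm(100),
                   label = paste("Data", 1:100),
                   stringsAsFactors = FALSE)
data$x[87] <- 100

# Every point gets a label, but ifelse() returns an empty string
# unless x or y exceeds four times its interquartile range.
p <- ggplot(data, aes(x, y)) +
  geom_point() +
  geom_text(
    aes(label = ifelse(x > 4 * IQR(x) | y > 4 * IQR(y), label, "")),
    hjust = 1.1  # assumed value: pushes the text just left of the point
  )
print(p)
```

Inside `aes()`, `IQR(x)` is evaluated on the whole x column, so the threshold is the same for every row.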

A geom_text call written this way adds a label to every point, but only the points for which either x is greater than four times the interquartile range (IQR) of all x in data, or y is greater than four times the IQR of all y in data, receive a non-empty label (equal to the corresponding name in the label column). All the other points, those that are not outliers according to the condition we have set, receive an empty label, so nothing is displayed for them.

The hjust parameter offsets the label slightly in the horizontal direction relative to the point, so that the two do not overlap.

Here is the graphical result, correctly identifying the outlier as “Data 87”.

Rplot2

The right condition to put inside the ifelse statement to select the outliers to label depends largely on the data set. Often it is a matter of trial and error (trying 1.5 * IQR, 2 * IQR, 3 * IQR, …) until only the “right” outliers are labeled.
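To shorten the trial and error, it can help to count how many points each multiplier would label before drawing anything. The following is a rough sketch on the example data; the seed, the variable names and the particular multipliers are assumptions.

```r
# Rebuild the example data (seed and names are assumptions).
set.seed(42)
data <- data.frame(x = rnorm(100), y = rnorm(100),
                   label = paste("Data", 1:100),
                   stringsAsFactors = FALSE)
data$x[87] <- 100

# Candidate multipliers for the IQR-based condition.
multipliers <- c(1.5, 2, 3, 4)

# Number of points that would receive a label for each multiplier.
flagged <- sapply(multipliers, function(k) {
  sum(data$x > k * IQR(data$x) | data$y > k * IQR(data$y))
})
names(flagged) <- multipliers
print(flagged)
```

A multiplier that flags many points is probably too loose for the plot to stay readable.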

A small addition to the labeling code also lets us show each outlier's x and y values in its label.
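A possible version of that addition builds the label text with paste0(). The label layout (name followed by the rounded coordinates in parentheses) is an assumption, as are the seed, the names and the `hjust` value.

```r
library(ggplot2)

# Rebuild the example data (seed and names are assumptions).
set.seed(42)
data <- data.frame(x = rnorm(100), y = rnorm(100),
                   label = paste("Data", 1:100),
                   stringsAsFactors = FALSE)
data$x[87] <- 100

# Same condition as before, but the label now carries the coordinates,
# e.g. "Data 87 (100, <y value>)". Rounding keeps the text short.
p <- ggplot(data, aes(x, y)) +
  geom_point() +
  geom_text(
    aes(label = ifelse(
      x > 4 * IQR(x) | y > 4 * IQR(y),
      paste0(label, " (", round(x, 2), ", ", round(y, 2), ")"),
      ""
    )),
    hjust = 1.1  # assumed value
  )
print(p)
```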

Rplot3

I hope you found this quick trick useful. If so, please let me know by leaving a comment below.

Till next time!