Get the lesson R script: Viz_and_Graphics.R
Get the lesson data: download zip
This section will introduce you to concepts of data visualization and graphics in R. The entire workflow of data exploration is enhanced through looking at your data, whether you’re exploring a dataset for the first time or creating publication-ready figures. Viewing your data provides insight into patterns that can help you explore different hypotheses. No analysis is complete without a solid graphic.
Graphics capabilities in R have improved tremendously in the last ten years. The de facto plotting library is
ggplot2 as part of the
tidyverse, but there are many other packages available that enhance or build on existing functionality. The base R installation also comes with its own set of plotting functions. These are useful for quick and dirty exploration but you’ll quickly find that these methods are tedious. We’ll start this lesson with a cursory look at some of the base R plotting functions. Later, we’ll focus on
After this lesson you should be able to answer (or be able to find answers to) the following:
In theory, all of your plotting needs can be accomplished with functions in base R. However, these graphics are not easy to customize and it’s very tedious to create publication-quality graphics. For these reasons, base R graphics are rarely taught but I believe they still serve a purpose in the data exploration workflow. Base R graphics are extremely fast and simple to use for exploring data. I caution you in using these functions for anything beyond simple exploration.
My most frequent use for base R graphics include:
We’ll explore some of these simple tools in base R with the
mpg dataset that comes with R. The workhorse plotting function is
plot. Let’s make a simple plot of highway mileage vs engine size.
library(tidyverse) data(mpg) plot(hwy ~ displ, data = mpg)
We created this plot using the
formula notation. We can also create the plot by calling the columns directly.
Confusing right? Let’s have a look at the help file for
This help file lays out some of the basics of plot formatting. We can change the type of plot using the
type argument (here, we can only use points), give it a title with
main, and change labels on the x and y axes with
plot(hwy ~ displ, data = mpg, main = 'Highway mileage by engine size', xlab = 'engine size (l)', ylab = 'highway (mpg)')
The help file for the plot function also has this ambiguous
... argument. You’ll see this argument for many functions in the help files. This means that the function accepts arguments from other functions. In this case, the plot function accepts arguments from the
par function which is used for setting graphical parameters in base R. Let’s look at the help file for
The most useful section in this help file is the description of the actual graphical parameters you can set. Some useful ones are the
cex family of arguments for sizing,
col family of arguments for color, the
pch argument for point types, and the
family argument for font. All of these arguments can be passed directly to the
plot function or separately with the
plot(hwy ~ displ, data = mpg, main = 'Highway mileage by engine size', xlab = 'engine size (l)', ylab = 'highway (mpg)', col = 'blue', pch = 18, cex = 0.75, family = 'serif')
Or like this…
par(col = 'blue', pch = 18, cex = 0.75, family = 'serif') plot(hwy ~ displ, data = mpg, main = 'Highway mileage by engine size', xlab = 'engine size (l)', ylab = 'highway (mpg)')
Notice how the plots are different depending on where the additional arguments are used. Confusing, right? You’ll waste a lot of time trying to tweak the base graphics.
hist (histogram) function is the only other plotting function from base R that is worth showing. This gets you info on the basic distribution of a variable and is a graphical way of assessing the spread and central tendency (i.e., variance and mean) of a variable.
The only useful argument in
breaks that controls the number of bins.
hist(mpg$hwy, breaks = 20)
Let’s make a simple graphic of our fisheries dataset using base R. We’ll make a time series plot of catch using the
plot function. We’ll also add a title and better axis labels to the plot to make it slightly more palatable.
First, start by loading the fish data as you’ve done before. Use
read_csv() from the tidyverse.
Make a time series plot of
Pinfish catch over time using the
plot function. Try creating the plot using the
formula notation (hint:
Pinfish ~ Sampling_Date). Remember you have to use the
data argument when you’re using the formula (i.e.,
data = fishdat).
Give the plot an informative title using the
main argument directly in the plot function, something like
main = "Pinfish catch for all stations over time".
# load the fish data fishdat <- read_csv('data/fishdat.csv') # time series plot plot(Pinfish ~ Sampling_Date, data = fishdat) # time series plot with title plot(Pinfish ~ Sampling_Date, data = fishdat, main = 'Pinfish catch for all stations over time')
ggplot2 package (full reference) is a huge improvement over base R because it was developed following a strict philosophy known as the grammar of graphics. This philosophy was designed to make thinking, reasoning, and communicating about graphs easier by following a few simple rules. Like building a sentence in speech (aka grammar), all graphs start with a foundational component that is used for building other graph pieces.
With ggplot2, you begin a plot with the function
ggplot() creates a coordinate system that you can add layers to. The first argument of
ggplot() is the dataset to use in the graph. So
ggplot(data = mpg) creates an empty base graph.
ggplot(data = mpg)
The next step is to add one or more layers (aka
ggplot(). The function
geom_point() adds a layer of points to your plot, which creates a scatterplot. Ggplot2 comes with many geom functions that each add a different type of layer to a plot.
ggplot(data = mpg) + geom_point()
Each geom function in ggplot2 takes a
mapping argument. This defines how variables in your dataset are mapped to visual properties. The
mapping argument is defined with
aes(), and the
y arguments of
aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the
data argument, in this case,
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
Just remember these requirements:
The core unit of every ggplot looks like this:
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Applied to the data:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
More commonly, the
aes function that defines the mapping is included in the initial call to
ggplot. This will globally define the mapping to all geoms in a plot, instead of for only one geom. There may be different reasons to globally apply the aesthetics or separately for each geom depending on the data.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()
Why the need for this complicated structure? The syntax of mapping aesthetics to a dataset lets you easily modify components of an existing plot. Additional datasets and geoms can easily be added to the plot with
Let’s explore some of the other geoms.
ggplot(mpg, aes(x = displ, y = hwy)) + geom_line()
ggplot(mpg, aes(x = displ, y = hwy)) + geom_count()
ggplot(mpg, aes(x = displ, y = hwy)) + geom_density2d()
As a line range…
ggplot(mpg, aes(x = displ, y = hwy)) + geom_linerange()
Error: geom_linerange requires the following missing aesthetics: ymin, ymax Execution halted
Oh snap, what happened? This error is telling us that the geom we just tried is missing some required aesthetics in the plot. Here we’ve only used the
y aesthetics but it looks like it requires
ymax. Lets look at the help file for
Looks like we don’t have the required aeshetics, nor does it make sense to use this geom because it’s not appropriate for the data. What about the requirements for
We’ve got the required aesthetics in our plot. Let’s add some others that we can use with
Changing the color:
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point()
Changing the size:
ggplot(mpg, aes(x = displ, y = hwy, size = displ)) + geom_point()
Changing the shapes:
ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + geom_point()
In all of the above examples we’ve mapped an aesthetic to a variable in our dataset. We can just as easily modify the plot without mapping it to a variable (i.e., changing a part of the plot independent of the data). For example, maybe we want to change the color of the points using a single color for everything. Notice the placement of
colour outside of the
aes mapping function.
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(colour = 'red')
Let’s reproduce the time series plot we created in exercise 9 using ggplot. This requires us to map the
Pinfish variables to the x and y aesthetics for
Setup your initial plot with
ggplot. Map the x variable to
Sampling_Date, the y variable to
Pinfish. The setup should look like this:
ggplot(fishdat, aes(x = Sampling_Date, y = Pinfish))). What happens if you run only this code?
geom_point() geom to your plot using the
+ operator. Remember the correct placement of
+, it always occurs in the line preceding the layer that is being added to the plot.
The catch data are heavily skewed. We can easily transform the y-axis by adding a scale. Ggplot has multiple scale functions that accomplish different tasks, all of which relate to setting limits or characteristics of measured variables in your data. Add the following after
geom_point() (remember to put a
scale_y_continuous('log-Catch, Pinfish', trans = 'log10'). How does the plot look now?
ggplot(fishdat, aes(x = Sampling_Date, y = Pinfish)) + geom_point() + scale_y_continuous('log-Catch, Pinfish', trans = 'log10')
There are countless ways we can modify a ggplot, either by manipulating the appearance or adding information that improves our understanding of what we see. In the previous exercise, we used a scale to transform an axis. In this next section we’ll cover three additional ways to modify a plot:
themefunction for individual parts or by using a pre-packaged theme that modifies many parts at once.
First I’ll show you how to modify the appearance to your liking. Individual components can be modified with
theme. Here the legend is moved to the top, the minor grid lines between axis ticks are removed, and the gray panel background is changed to light blue. Check the help file for
theme to see all the options.
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + theme( legend.position = 'top', panel.grid.minor = element_blank(), panel.background = element_rect(fill = 'lightblue') )
Changing plot elements with theme can take some practice because there’s lots to modify. In some respects, this is how
ggplot is similar to base R. With flexibility comes tedium. Fortunately, there are several pre-packaged themes that modify several components at once. See here for the full documentation. There are also additional packages available that supplement the existing themes in
ggplot (see here).
Black and white:
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + theme_bw()
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + theme_minimal()
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + theme_classic()
The pre-packaged themes with ggplot have the added benefit of easily changing the global font types and sizes in the plot. These are modified by including the arguments
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + theme_bw(base_family = 'serif', base_size = 16)
In addition to its functional syntax, the real power of
ggplot is the ability to add components to the plot that let you quickly evaluate relationships or trends in the data. Statistical relationships can be added with
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + stat_smooth()
Notice that we get one smooth for each type of drive train. This is because we’ve mapped colour to the drive train. Both the
stat_smooth functions use colour as an aesthetic so the global
aes function applies to both. We can change this behavior by moving the location of the
colour aesthetic. Here color is only mapped to the points.
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(colour = drv)) + stat_smooth()
By default the
stat_smooth function uses a non-linear smooth, either a locally estimated polynomial or generalized additive model depending on size of the dataset. We can change this using the
method argument, as for a linear model.
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + stat_smooth(method = 'lm')
ggplot provides some simple functions to create multi-panelled plots. This is useful for viewing different groups of the data as they relate to a variable in the dataset. The
facet_wrap function is one of two functions in ggplot that can be used for multi-panel plots (the other being
facet_grid, which is similar but different). To create facets, we have to specify which variable you’re using that acts as a grouping variable for each subplot. This is defined using the
~ sign followed by the variable name within
nrow) argument also indicates how many columns (or rows) in the multi-panel to create.
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + stat_smooth(method = 'lm') + facet_wrap(~ drv, ncol = 3)
scales argument is also a useful part of
facet_wrap. By default, the x and y axes are fixed between the panels. You can change this behavior by using
scales = "free". Be careful using this feature because it can lead to different interpretations of the magnitude of trends.
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + stat_smooth(method = 'lm') + facet_wrap(~ drv, ncol = 3, scales = 'free')
Let’s modify the time series plot we created in exercise 10 by adding a theme, a statistical summary, and facets. We’ll map the
stat_smooth function to the Gear type and also use this variable to create a three-panel plot.
Setup the initial plot as before. Map the x variable to
Sampling_Date and the y variable to
Pinfish. The intiial
ggplot line will look like this
ggplot(fishdata, aes(x = Sampling_Date, y = Pinfish). Add the
geom_point geom and the
scale_y_continuous scale with
trans = "log10" to transform catch.
Add one of these thems to the plot:
stat_smooth function with the argument
method = "lm" to add linear smooths between catch and time for each gear type.
facet_wrap to create a three-panel plot by depth category (hint:
facet_wrap(~ Gear, ncol = 3)).
ggplot(fishdat, aes(x = Sampling_Date, y = Pinfish)) + geom_point() + scale_y_continuous('log-Catch, Pinfish', trans = 'log10') + theme_minimal() + stat_smooth(method = 'lm') + facet_wrap(~ Gear, ncol = 3)
A quality plot deserves to be saved and shared. As you can imagine, there are several ways to save a plot in RStudio. The easiest way (which I don’t recommend) is to use the export feature from the plot viewing pane.
Although this is convenient, you don’t have fine control over many options that can really make your graphic pop. I recommend using either the
ggsave function that comes with the tidyverse, or preferably, one of the available graphics devices from base R (
wmf, and my favorite,
ggsave function works only with ggplot objects. You can contol where the plot is saved, the file type, plotting dimensions, resolution, and a few other minor options. By default, it will save the last ggplot that you made.
ggsave('figure/myfig.jpg', device = 'jpeg', width = 5, height = 4, units = 'in', dpi = 300)
ggsave functions uses the graphics devices from base R behind the scenes. You can always use these directly to save any type of plot (ggplot, base, etc.). These work a bit differently from regular functions. First, you “open” a graphics device by executing the function (e.g.,
jpeg(), etc.) at the command line. Then you execute your plot command and finish by “closing” the device with
dev.off(). In a script, it will look something like this.
# save a plot as png file png('figure/myfig.png', width = 5, height = 4, units = 'in', res = 300) plot dev.off()
The value of this approach is the ability to fine tune the output for your graphics. You can save as many figures as you like when the graphics device is open, just make sure to use
dev.off() when you’re done.
Now you should be able to answer (or be able to find answers to) the following:
In the next lesson we’ll learn about the basics of spatial analysis in R.