This is an extremely condensed introduction to R’s base graphics and, more importantly, the powerful data-visualization package ggplot2, developed by Hadley Wickham. Run the code shown and study the output to learn about these tools. When questions are posed, do your best to answer them.
For your convenience, the R code for this document is provided in a script which you can download, edit, and run.
Let’s load the data on transgenic mosquito survival time.
dat <- read.csv("https://kingaa.github.io/R_Tutorial/data/mosquitoes.csv")
Let’s compare the average lifespan of transgenic vs wildtype mosquitoes from this experiment. The following commands split the data into two subsets, one for each genetic type.
wt <- subset(dat,type=="wildtype",select=lifespan)
tg <- subset(dat,type=="transgenic",select=-type)
Let’s try and visualize the data.
dat$type <- factor(dat$type)
plot(dat)
op <- par(mfrow=c(1,2))
hist(tg$lifespan,breaks=seq(0,55,by=5),ylim=c(0,40))
hist(wt$lifespan,breaks=seq(0,55,by=5),ylim=c(0,40))
par(op)
Question: What does the second par command accomplish?
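The comparison of average lifespans mentioned earlier can also be made numerically, alongside the histograms. The following is a minimal sketch, assuming a toy data frame (here called toy, a hypothetical name) with the same column names as dat; the values are invented for illustration and are not the experimental data.

```r
# Toy stand-in for dat: same column names, invented values
toy <- data.frame(
  type = c("wildtype", "wildtype", "wildtype",
           "transgenic", "transgenic", "transgenic"),
  lifespan = c(20, 30, 40, 10, 20, 30)
)

# Mean lifespan within each genetic type
means <- tapply(toy$lifespan, toy$type, mean)
means

# Difference in means (wildtype minus transgenic)
diff_means <- means[["wildtype"]] - means[["transgenic"]]
diff_means
```

With the real data, the analogous call would be tapply(dat$lifespan, dat$type, mean).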
Another way to visualize a distribution is via the empirical cumulative distribution plot.
plot(sort(dat$lifespan),seq(1,nrow(dat))/nrow(dat),type='n')
lines(sort(wt$lifespan),seq(1,nrow(wt))/nrow(wt),type='s',col='blue')
lines(sort(tg$lifespan),seq(1,nrow(tg))/nrow(tg),type='s',col='red')
Question: What does type="n" do in the first line above?
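The empirical cumulative distribution drawn by hand above can be cross-checked against the ecdf function from the stats package. This is a minimal sketch on a made-up vector without ties (x is a hypothetical stand-in for a lifespan column), not a replacement for the construction shown above.

```r
x <- c(12, 3, 7, 25, 18)   # made-up values, no ties

# Manual construction: sorted values against i/n, as in the plot above
xs <- sort(x)
manual <- seq_along(xs) / length(xs)

# Built-in construction: ecdf() returns a step function
Fhat <- ecdf(x)
builtin <- Fhat(xs)

# The two agree at the observed values
all.equal(manual, builtin)

# ecdf objects have their own plot method
plot(Fhat, verticals = TRUE, do.points = FALSE)
```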
The data on mammal body and brain sizes is included in the MASS package:
library(MASS)
plot(mammals)
plot(mammals,log='x')
plot(mammals,log='xy')
plot(mammals$body,mammals$brain,log='xy')
plot(brain~body,data=mammals,log='xy')
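As a small aside on the log argument used above: it changes the axes, not the data. A hedged sketch of the contrast, using only the mammals data from MASS, is to draw the raw values on logarithmic axes and then the log10-transformed values on linear axes.

```r
library(MASS)  # provides the mammals data frame (columns body and brain)

# Raw values on logarithmic axes: tick labels are in the original units
plot(brain ~ body, data = mammals, log = "xy")

# Transformed values on linear axes: same point pattern,
# but tick labels are now powers of ten
plot(log10(brain) ~ log10(body), data = mammals)
```

Which form reads better depends on whether the audience thinks in original units or in orders of magnitude.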
read.csv(
"https://kingaa.github.io/R_Tutorial/data/oil_production.csv",
comment.char="#"
) -> oil
head(oil)
 year                    region         Gbbl
 1900          Asia.and.Oceania 0.0040996064
 1900 Central.and.South.America 0.0002737874
 1900                   Eurasia 0.0745134084
 1900                    Europe 0.0046760010
 1900             North.America 0.0619912363
 1901          Asia.and.Oceania 0.0064123896
summary(oil)
      year         region               Gbbl
 Min.   :1900   Length:784         Min.   : 0.000007
 1st Qu.:1931   Class :character   1st Qu.: 0.061123
 Median :1958   Mode  :character   Median : 0.884031
 Mean   :1958                      Mean   : 1.716531
 3rd Qu.:1986                      3rd Qu.: 2.666378
 Max.   :2014                      Max.   :10.190196
plot(oil)
plot(Gbbl~year,data=oil,subset=region=="North.America",type='l')
lines(Gbbl~year,data=oil,subset=region=="Eurasia",type="l",col='red')
library(tidyr)
library(dplyr)
oil |>
group_by(year) |>
summarize(Gbbl=sum(Gbbl)) -> total
plot(Gbbl~year,data=total,type='l')
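The group_by and summarize pipeline above sums production over regions within each year. For comparison, here is a minimal base R sketch of the same kind of aggregation using aggregate, run on a made-up data frame (toy_oil and its values are hypothetical, chosen so the sums are easy to verify by hand).

```r
# Made-up data with the same column names as oil
toy_oil <- data.frame(
  year   = c(1900, 1900, 1901, 1901),
  region = c("Europe", "Eurasia", "Europe", "Eurasia"),
  Gbbl   = c(1, 2, 3, 4)
)

# Sum Gbbl over regions within each year
total_base <- aggregate(Gbbl ~ year, data = toy_oil, FUN = sum)
total_base

# The result can be plotted exactly as before
plot(Gbbl ~ year, data = total_base, type = "l")
```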
Parts of a graphic:
You construct a graphical visualization by choosing the constituent parts. This is implemented in the ggplot2 package.
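As a rough illustration of what choosing the constituent parts can look like, the sketch below builds a plot one piece at a time. The data frame toy_df and its columns are invented for this example; the split into data, aesthetic mapping, geometric object, facets, and labels is one reasonable way to read the code, not an official list.

```r
library(ggplot2)

# Invented data: two groups measured at three x values
toy_df <- data.frame(
  x = c(1, 2, 3, 1, 2, 3),
  y = c(1, 4, 9, 2, 3, 5),
  g = c("a", "a", "a", "b", "b", "b")
)

# Data and aesthetic mapping: which variables drive which visual attributes
p <- ggplot(data = toy_df, mapping = aes(x = x, y = y, color = g))

# Geometric object: how each observation is drawn
p <- p + geom_line()

# Facets: one panel per group
p <- p + facet_wrap(~g)

# Labels
p <- p + labs(x = "x value", y = "y value", color = "group")

print(p)
```

Each `+` adds one part, so parts can be swapped independently: for example, replacing geom_line() with geom_point() leaves the mapping and facets untouched.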
library(readr)
read_csv(
"https://kingaa.github.io/R_Tutorial/data/energy_production.csv",
comment="#"
) -> energy
library(ggplot2)
ggplot(data=energy,mapping=aes(x=year,y=TJ,color=region,linetype=source))+
geom_line()
ggplot(data=energy,mapping=aes(x=year,y=TJ,color=region))+
geom_line()+
facet_wrap(~source)
ggplot(data=energy,mapping=aes(x=year,y=TJ,color=source))+
geom_line()+
facet_wrap(~region,ncol=2)
What can you conclude from the above? Try plotting these data on the log scale (scale_y_log10()). How does your interpretation change?
ggplot(data=energy,mapping=aes(x=year,y=TJ))+
geom_line()
ggplot(data=energy,mapping=aes(x=year,y=TJ,group=source))+
geom_line()
Question: How do you account for the appearance of the two plots immediately above?
ggplot(data=energy,mapping=aes(x=year,y=TJ,group=interaction(source,region)))+
geom_line()
Question: What does the group aesthetic do?
Let’s aggregate across regions by year and source of energy.
energy |>
group_by(year,source) |>
summarize(TJ=sum(TJ)) |>
ungroup() -> tot
tot |>
ggplot(aes(x=year,y=TJ,color=source))+
geom_line()
tot |>
ggplot(aes(x=year,y=TJ,fill=source))+
geom_area()
Now let’s aggregate across years by region and source.
See the data munging tutorial for more information on manipulating and reshaping data frames.
energy |>
group_by(region,source) |>
summarize(TJ=mean(TJ)) |>
ungroup() -> reg
reg |>
ggplot(aes(x=region,y=TJ,fill=source))+
geom_bar(stat="identity")+
coord_flip()
reg |>
group_by(region) |>
mutate(frac = TJ/sum(TJ)) |>
ungroup() -> reg
reg |>
ggplot(aes(x=region,y=frac,fill=source))+
geom_bar(stat="identity")+
coord_flip()+
labs(y="fraction of production",x="region")
In the above, we first average across years for every region and source. Then, for each region, we compute the fraction of the total production due to each source. Finally, we plot the fractions using a barplot. The coord_flip coordinate specification gives us horizontal bars instead of the default vertical bars. Fancy!
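The fraction step described above can be checked on a small example. This is a minimal base R sketch, assuming an invented data frame (toy_reg, with made-up values) in place of reg; it uses ave to compute each region's total and confirms that the fractions within every region sum to one.

```r
# Invented averages by region and source
toy_reg <- data.frame(
  region = c("A", "A", "B", "B"),
  source = c("Coal", "Hydro", "Coal", "Hydro"),
  TJ     = c(30, 10, 5, 15)
)

# Divide each value by the total for its region
toy_reg$frac <- toy_reg$TJ / ave(toy_reg$TJ, toy_reg$region, FUN = sum)
toy_reg

# Sanity check: fractions sum to 1 within each region
region_sums <- tapply(toy_reg$frac, toy_reg$region, sum)
region_sums
```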
Let’s compare fossil fuel production to renewable. We divide the sources into three types: Carbon-based, Nuclear, and Renewable. We accomplish this using a “crosswalk” table:
data.frame(
source=c("Coal","Gas","Oil","Nuclear","Hydro","Other Renewables"),
source1=c("Carbon","Carbon","Carbon","Nuclear","Renewable","Renewable")
) |>
right_join(energy,by="source") -> energy
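One thing worth checking after a crosswalk join is whether every row found a match. The sketch below uses base R's merge with all.y=TRUE as an analogue of the right join above; the data frames and values are invented, and "Wind" is deliberately left out of the toy crosswalk to show what an unmatched source looks like.

```r
# Toy crosswalk: maps detailed sources to broad categories
toy_crosswalk <- data.frame(
  source  = c("Coal", "Hydro"),
  source1 = c("Carbon", "Renewable")
)

# Toy data containing one source that the crosswalk does not cover
toy_energy <- data.frame(
  source = c("Coal", "Hydro", "Wind"),
  TJ     = c(5, 2, 1)
)

# Keep every row of toy_energy, attaching source1 where a match exists
joined <- merge(toy_crosswalk, toy_energy, by = "source", all.y = TRUE)
joined

# Unmatched sources show up as NA in the new column
n_unmatched <- sum(is.na(joined$source1))
n_unmatched
```

On the real data, a nonzero count from sum(is.na(energy$source1)) would indicate a source missing from the crosswalk.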
energy |>
group_by(source1,region,year) |>
summarize(TJ = sum(TJ)) |>
ungroup() -> x
x |>
ggplot(aes(x=year,y=TJ,fill=source1))+
geom_area()+
facet_wrap(~region,ncol=2)+
labs(fill="source")
x |>
ggplot(aes(x=year,y=TJ,fill=source1))+
geom_area()+
facet_wrap(~region,scales="free_y",ncol=2)+
labs(fill="source")
x |>
group_by(source1,year) |>
summarize(TJ = sum(TJ)) |>
ungroup() -> y
y |>
ggplot(aes(x=year,y=TJ,fill=source1))+
geom_area()+
labs(fill="source")
Ask a question regarding one of the datasets shown here and devise a visualization to answer it.
Produced with R version 4.3.1.
Licensed under the Creative Commons Attribution-NonCommercial license. Please share and remix noncommercially, mentioning its origin.