This is an extremely condensed introduction to R’s base graphics and, more importantly, to the powerful data-visualization package ggplot2, developed by Hadley Wickham. Run the code shown and study the output to learn about these tools. When questions are posed, do your best to answer them.
For your convenience, the R code for this document is provided in a script which you can download, edit, and run.
Let’s load the data on transgenic mosquito survival time.
dat <- read.csv("https://kingaa.github.io/R_Tutorial/data/mosquitoes.csv")
Let’s compare the average lifespan of transgenic vs wildtype mosquitoes from this experiment. The following commands split the data into two subsets, one for each genetic type.
wt <- subset(dat,type=="wildtype",select=lifespan)
tg <- subset(dat,type=="transgenic",select=-type)
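Once the data are split, the averages can be compared directly. The following is a minimal sketch on a made-up data frame (`toy`, `wt_toy`, and `tg_toy` are hypothetical names, and the values are not from the experiment); the same calls should work on `wt` and `tg`.

```r
# Made-up data with the same two columns as the mosquito data
toy <- data.frame(
  type     = c("wildtype", "wildtype", "transgenic", "transgenic"),
  lifespan = c(10, 20, 5, 15)
)

# Split into one subset per genetic type, as in the text
wt_toy <- subset(toy, type == "wildtype", select = lifespan)
tg_toy <- subset(toy, type == "transgenic", select = -type)

# Average lifespan within each subset
mean(wt_toy$lifespan)
mean(tg_toy$lifespan)

# The same comparison without splitting: one mean per level of type
tapply(toy$lifespan, toy$type, mean)
```

Note that `select=lifespan` keeps one column while `select=-type` drops one; with only two columns, both leave a data frame containing just `lifespan`.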
Let’s try and visualize the data.
dat$type <- factor(dat$type)
plot(dat)
op <- par(mfrow=c(1,2))
hist(tg$lifespan,breaks=seq(0,55,by=5),ylim=c(0,40))
hist(wt$lifespan,breaks=seq(0,55,by=5),ylim=c(0,40))
par(op)
Question: What does the second par command accomplish?
Another way to visualize a distribution is via the empirical cumulative distribution plot.
plot(sort(dat$lifespan),seq(1,nrow(dat))/nrow(dat),type='n')
lines(sort(wt$lifespan),seq(1,nrow(wt))/nrow(wt),type='s',col='blue')
lines(sort(tg$lifespan),seq(1,nrow(tg))/nrow(tg),type='s',col='red')
Question: What does type="n" do in the first line above?
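Base R also provides `ecdf()`, which builds the empirical cumulative distribution as a step function. The sketch below uses made-up values without ties (`x`, `manual`, and `Fhat` are hypothetical names) to check that it agrees with the sort-and-sequence construction used in the plotting code.

```r
# Made-up lifespans with no repeated values
x <- c(3, 7, 12, 20, 31)

# Manual construction: the i-th smallest value sits at height i/n
manual <- seq(1, length(x)) / length(x)

# Built-in construction: ecdf() returns a function that can be evaluated
Fhat <- ecdf(x)
Fhat(sort(x))

# When run as a script, send graphics to a null device
if (!interactive()) pdf(NULL)

# Plot the step function directly
plot(Fhat, verticals = TRUE, do.points = FALSE, main = "ECDF of toy data")
```

With tied values, `ecdf()` reports only the top of each jump, so its values at the data points can differ from the manual sequence even though the step plots look the same.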
The data on mammal body and brain sizes is included in the MASS package:
library(MASS)
plot(mammals)
plot(mammals,log='x')
plot(mammals,log='xy')
plot(mammals$body,mammals$brain,log='xy')
plot(brain~body,data=mammals,log='xy')
read.csv(
  "https://kingaa.github.io/R_Tutorial/data/oil_production.csv",
  comment.char="#"
) -> oil
head(oil)
year region Gbbl
1900 Asia.and.Oceania 0.0040996064
1900 Central.and.South.America 0.0002737874
1900 Eurasia 0.0745134084
1900 Europe 0.0046760010
1900 North.America 0.0619912363
1901 Asia.and.Oceania 0.0064123896
summary(oil)
year region Gbbl
Min. :1900 Length:784 Min. : 0.000007
1st Qu.:1931 Class :character 1st Qu.: 0.061123
Median :1958 Mode :character Median : 0.884031
Mean :1958 Mean : 1.716531
3rd Qu.:1986 3rd Qu.: 2.666378
Max. :2014 Max. :10.190196
plot(oil)
plot(Gbbl~year,data=oil,subset=region=="North.America",type='l')
lines(Gbbl~year,data=oil,subset=region=="Eurasia",type="l",col='red')
library(tidyr)
library(dplyr)
oil |>
  group_by(year) |>
  summarize(Gbbl=sum(Gbbl)) -> total
plot(Gbbl~year,data=total,type='l')
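To see what the `group_by()` and `summarize()` steps do, it can help to run them on a table small enough to check by hand. This is a sketch on made-up numbers (`toy` and `toy_total` are hypothetical names), not the oil data themselves.

```r
library(dplyr)

# Made-up long-format data: one row per year and region
toy <- data.frame(
  year   = c(1900, 1900, 1901, 1901),
  region = c("A", "B", "A", "B"),
  Gbbl   = c(1, 2, 3, 4)
)

# Collapse over regions: one row per year, holding the yearly total
toy |>
  group_by(year) |>
  summarize(Gbbl = sum(Gbbl)) -> toy_total

toy_total

# A base-R call that should give the same totals
aggregate(Gbbl ~ year, data = toy, FUN = sum)
```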
Parts of a graphic:
You construct a graphical visualization by choosing the constituent parts. This is implemented in the ggplot2 package.
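As a rough illustration of building a plot from its parts, the sketch below adds one component at a time to a plot of made-up data (`toy` and `p` are hypothetical names, and the values are invented).

```r
library(ggplot2)

# Made-up data: two sources observed over three years
toy <- data.frame(
  year   = rep(2000:2002, times = 2),
  TJ     = c(1, 2, 4, 3, 3, 5),
  source = rep(c("Coal", "Hydro"), each = 3)
)

# Data plus a mapping from variables to aesthetics
p <- ggplot(data = toy, mapping = aes(x = year, y = TJ, color = source))

# A geometric object that draws the mapped data
p <- p + geom_line()

# A scale that controls how values map to the y axis
p <- p + scale_y_log10()

# Labels
p <- p + labs(y = "production (TJ)")

# When run as a script, send graphics to a null device
if (!interactive()) pdf(NULL)
print(p)
```

Each `+` returns a new plot object, so any intermediate `p` can be printed to see the effect of the piece just added.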
library(readr)
read_csv(
  "https://kingaa.github.io/R_Tutorial/data/energy_production.csv",
  comment="#"
) -> energy
library(ggplot2)
ggplot(data=energy,mapping=aes(x=year,y=TJ,color=region,linetype=source))+
geom_line()
ggplot(data=energy,mapping=aes(x=year,y=TJ,color=region))+
geom_line()+
facet_wrap(~source)
ggplot(data=energy,mapping=aes(x=year,y=TJ,color=source))+
geom_line()+
facet_wrap(~region,ncol=2)
What can you conclude from the above? Try plotting these data on the log scale (scale_y_log10()). How does your interpretation change?
ggplot(data=energy,mapping=aes(x=year,y=TJ))+
geom_line()
ggplot(data=energy,mapping=aes(x=year,y=TJ,group=source))+
geom_line()
Question: How do you account for the appearance of the two plots immediately above?
ggplot(data=energy,mapping=aes(x=year,y=TJ,group=interaction(source,region)))+
geom_line()
Question: What does the group aesthetic do?
Let’s aggregate across regions by year and source of energy.
energy |>
  group_by(year,source) |>
  summarize(TJ=sum(TJ)) |>
  ungroup() -> tot
tot |>
  ggplot(aes(x=year,y=TJ,color=source))+
  geom_line()
tot |>
  ggplot(aes(x=year,y=TJ,fill=source))+
  geom_area()
Now let’s aggregate across years by region and source.
See the data munging tutorial for more information on manipulating and reshaping data frames.
energy |>
  group_by(region,source) |>
  summarize(TJ=mean(TJ)) |>
  ungroup() -> reg
reg |>
  ggplot(aes(x=region,y=TJ,fill=source))+
  geom_bar(stat="identity")+
  coord_flip()
reg |>
  group_by(region) |>
  mutate(frac = TJ/sum(TJ)) |>
  ungroup() -> reg
reg |>
  ggplot(aes(x=region,y=frac,fill=source))+
  geom_bar(stat="identity")+
  coord_flip()+
  labs(y="fraction of production",x="region")
In the above, we first average across years for every region and source. Then, for each region, we compute the fraction of the total production due to each source. Finally, we plot the fractions using a barplot. The coord_flip coordinate specification gives us horizontal bars instead of the default vertical bars. Fancy!
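The fraction step is easy to check on a small example. Here is a sketch with made-up averages (`toy` and `toy_frac` are hypothetical names) that computes the per-region fractions and confirms they sum to one within each region.

```r
library(dplyr)

# Made-up averages: one row per region and source
toy <- data.frame(
  region = c("North", "North", "South", "South"),
  source = c("Coal", "Hydro", "Coal", "Hydro"),
  TJ     = c(30, 10, 5, 15)
)

# Within each region, divide each source by the regional total
toy |>
  group_by(region) |>
  mutate(frac = TJ / sum(TJ)) |>
  ungroup() -> toy_frac

toy_frac

# Sanity check: the fractions in each region should add up to 1
toy_frac |>
  group_by(region) |>
  summarize(total = sum(frac))
```

Because `mutate()` is applied to grouped data, `sum(TJ)` is the total for the current region only, not for the whole table.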
Let’s compare fossil fuel production to renewable. We divide the sources into three types: Carbon-based, Nuclear, and Renewable. We accomplish this using a “crosswalk” table:
data.frame(
  source=c("Coal","Gas","Oil","Nuclear","Hydro","Other Renewables"),
  source1=c("Carbon","Carbon","Carbon","Nuclear","Renewable","Renewable")
) |>
  right_join(energy,by="source") -> energy
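The crosswalk join can be tried on a small table first. In this sketch (`xwalk`, `toy`, and `joined` are hypothetical names, and the values are made up), one source is deliberately left out of the crosswalk to show what `right_join()` does with unmatched rows.

```r
library(dplyr)

# A small crosswalk from detailed source to broad type
xwalk <- data.frame(
  source  = c("Coal", "Hydro"),
  source1 = c("Carbon", "Renewable")
)

# Made-up data; "Wind" does not appear in the crosswalk
toy <- data.frame(
  source = c("Coal", "Hydro", "Coal", "Wind"),
  TJ     = c(1, 2, 3, 4)
)

# right_join() keeps every row of the right-hand table (toy);
# rows with no match in the crosswalk get NA in source1
xwalk |>
  right_join(toy, by = "source") -> joined

joined
```

Checking for `NA` in the new column after such a join is a quick way to spot categories the crosswalk has missed.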
energy |>
  group_by(source1,region,year) |>
  summarize(TJ = sum(TJ)) |>
  ungroup() -> x
x |>
  ggplot(aes(x=year,y=TJ,fill=source1))+
  geom_area()+
  facet_wrap(~region,ncol=2)+
  labs(fill="source")
x |>
  ggplot(aes(x=year,y=TJ,fill=source1))+
  geom_area()+
  facet_wrap(~region,scales="free_y",ncol=2)+
  labs(fill="source")
x |>
  group_by(source1,year) |>
  summarize(TJ = sum(TJ)) |>
  ungroup() -> y
y |>
  ggplot(aes(x=year,y=TJ,fill=source1))+
  geom_area()+
  labs(fill="source")
Ask a question regarding one of the datasets shown here and devise a visualization to answer it.
Produced with R version 4.3.1.
Licensed under the Creative Commons Attribution-NonCommercial license. Please share and remix noncommercially, mentioning its origin.