Examples of Graphs

Last Updated: 2020-04-22

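Data visualization is one of the cores of data science. Alongside expressive and communication skills, it also demands basic technical skills, and it is an important part of programming in R. Examples are accumulated here while learning step by step. References are listed at the end.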

# message = FALSE
library(tidyverse)

## ─ Attaching packages ───────────────────────────── tidyverse 1.3.0 ─

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ─ Conflicts ─────────────────────────────── tidyverse_conflicts() ─
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

1. R Base Plot: Graphics with the Plot Functions

Before studying ggplot2, this section outlines plotting with R Base. Plots made with ggplot2 are shown alongside for reference.
The examples that accompany each Help page are also examined.

For details of R Base plotting, see R Study Group, Week 4.

1.1 Scatter Plots

Two-dimensional plots. In base R, two vectors are enough. In ggplot2, you normally use two columns of a data frame.

1.1.1 Generic X-Y Plotting

1.1.1.a Scatter Plot: Drawing Points

Data: mtcars
Usage: plot(x, y, ...)

Assign the two vectors to x and y.

plot(mtcars$wt, mtcars$mpg)

# ggplot2
ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point()

The code below, which passes the vectors directly as arguments, also works, but it is not recommended.

ggplot(data = NULL, aes(x = mtcars$wt, y = mtcars$mpg)) + 
  geom_point()

1.1.1.b Line Graph: Lines and Points, Adding a Second Graph

Data: pressure
Usage: plot(x, y, type = "l")

Add a second graph to the first one.

plot(pressure$temperature, pressure$pressure, type = "l")
points(pressure$temperature, pressure$pressure)

lines(pressure$temperature, pressure$pressure/2, col = "red")
points(pressure$temperature, pressure$pressure/2, col = "red")

In ggplot2, you add layers one by one.

ggplot(pressure) + 
  geom_line(aes(x = temperature, y = pressure), color = "black") + 
  geom_point(aes(x = temperature, y = pressure), color = "black") +
  geom_line(aes(x = temperature, y = pressure/2), color = "red") + 
  geom_point(aes(x = temperature, y = pressure/2), color = "red")

Another approach is to reshape the data into tidy form and use group.

pressure1 <- pressure %>% mutate(type = "A")
pressure2 <- pressure %>% mutate(pressure2 = pressure/2, type = "B") %>% select(temperature, pressure2, type) 
colnames(pressure2) <- colnames(pressure1)
pressure0 <- bind_rows(pressure1, pressure2)
# ggplot2
ggplot(pressure0, aes(x = temperature, y = pressure, group = type)) + 
  geom_line(aes(color = type)) + 
  geom_point(aes(color = type))
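
The same long-format table can also be built in a single pipeline. The following is a minimal sketch, assuming tidyr 1.0.0 or later for pivot_longer(); the names half, value, and pressure_long are illustrative choices and do not appear in the original.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Add the halved series as a new column, then stack both series
# into long (tidy) form: one row per temperature and series.
pressure_long <- pressure %>%
  mutate(half = pressure / 2) %>%
  pivot_longer(c(pressure, half), names_to = "type", values_to = "value")

# Mapping color to type also groups the lines by series.
p <- ggplot(pressure_long, aes(x = temperature, y = value, color = type)) +
  geom_line() +
  geom_point()
print(p)
```

Here the series labels are the original column names rather than "A" and "B", which may make the legend easier to read.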

1.1.8 Help of Generic X-Y Plotting

Description: Generic function for plotting of R objects. For more details about the graphical parameter arguments, see par.
For simple scatter plots, plot.default will be used. However, there are plot methods for many R objects, including functions, data.frames, density objects, etc. Use methods(plot) and the documentation for these.

methods(plot)

##  [1] plot,ANY-method        plot,color-method      plot.acf*             
##  [4] plot.ACF*              plot.augPred*          plot.compareFits*     
##  [7] plot.data.frame*       plot.decomposed.ts*    plot.default          
## [10] plot.dendrogram*       plot.density*          plot.ecdf             
## [13] plot.factor*           plot.formula*          plot.function         
## [16] plot.ggplot*           plot.gls*              plot.gtable*          
## [19] plot.hcl_palettes*     plot.hclust*           plot.histogram*       
## [22] plot.HoltWinters*      plot.intervals.lmList* plot.isoreg*          
## [25] plot.lm*               plot.lme*              plot.lmList*          
## [28] plot.medpolish*        plot.mlm*              plot.nffGroupedData*  
## [31] plot.nfnGroupedData*   plot.nls*              plot.nmGroupedData*   
## [34] plot.pdMat*            plot.ppr*              plot.prcomp*          
## [37] plot.princomp*         plot.profile.nls*      plot.R6*              
## [40] plot.ranef.lme*        plot.ranef.lmList*     plot.raster*          
## [43] plot.shingle*          plot.simulate.lme*     plot.spec*            
## [46] plot.stepfun           plot.stl*              plot.table*           
## [49] plot.trans*            plot.trellis*          plot.ts               
## [52] plot.tskernel*         plot.TukeyHSD*         plot.Variogram*       
## see '?methods' for accessing help and source code

Usage: plot(x, y, …)

Arguments
x   
the coordinates of points in the plot. Alternatively, a single plotting structure, function or any R object with a plot method can be provided.

y   
the y coordinates of points in the plot, optional if x is an appropriate structure.

... 
Arguments to be passed to methods, such as graphical parameters (see par). Many methods will accept the following arguments:

type
what type of plot should be drawn. Possible types are

"p" for points,

"l" for lines,

"b" for both,

"c" for the lines part alone of "b",

"o" for both ‘overplotted’,

"h" for ‘histogram’ like (or ‘high-density’) vertical lines,

"s" for stair steps,

"S" for other steps, see ‘Details’ below,

"n" for no plotting.

All other types give a warning or an error; using, e.g., type = "punkte" being equivalent to type = "p" for S compatibility. Note that some methods, e.g. plot.factor, do not accept this.

main
an overall title for the plot: see title.

sub
a sub title for the plot: see title.

xlab
a title for the x axis: see title.

ylab
a title for the y axis: see title.

asp
the y/x aspect ratio, see plot.window.

Details
The two step types differ in their x-y preference: Going from (x1,y1) to (x2,y2) with x1 < x2, type = "s" moves first horizontal, then vertical, whereas type = "S" moves the other way around.
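
The difference between the two step types is easier to see side by side. This is a small sketch with made-up data (the vectors x and y are illustrative only), drawing the same three points with each type.

```r
# Three points are enough to see the difference between the step types.
x <- c(1, 2, 3)
y <- c(1, 3, 2)

op <- par(mfrow = c(1, 2))                    # two panels side by side
plot(x, y, type = "s", main = 'type = "s"')   # horizontal first, then vertical
points(x, y, col = "red")
plot(x, y, type = "S", main = 'type = "S"')   # vertical first, then horizontal
points(x, y, col = "red")
par(op)                                       # restore the previous layout
```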

See Also
plot.default, plot.formula and other methods; points, lines, par. For thousands of points, consider using smoothScatter() instead of plot().

For X-Y-Z plotting see contour, persp and image.

Note:

As methods(plot) shows, there is a vast number of plot methods.

1.1.9 Example in Help

example(plot)

## 
## plot> require(stats) # for lowess, rpois, rnorm
## 
## plot> plot(cars)

## 
## plot> lines(lowess(cars))
## 
## plot> plot(sin, -pi, 2*pi) # see ?plot.function

## 
## plot> ## Discrete Distribution Plot:
## plot> plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
## plot+      main = "rpois(100, lambda = 5)")

## 
## plot> ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
## plot> plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")

## 
## plot> points(x, cex = .5, col = "dark red")

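Data: plot cars, then add a smoothed curve computed by lowess (a locally weighted regression scatter plot smoother).
Draw the sine curve from -pi to 2pi.
Draw 100 samples from a Poisson distribution (rpois) with lambda = 5, tabulate them, plot the counts as red histogram-like vertical lines, and add a main title.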

type = “h”: ‘histogram’ like (or ‘high-density’) vertical lines
lwd The line width, a positive number, defaulting to 1. The interpretation is device-specific, and some devices do not implement line widths less than one. (par)

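plot(x, type = "s"): sort 47 samples from the standard normal distribution (mean 0, standard deviation 1) in ascending order and plot them as stair steps with a main title; points() then adds the points in dark red at half the default size.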

type = “s”: stair steps
cex A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. This starts as 1 when a device is opened, and is reset when the layout is changed, e.g. by setting mfrow. Note that some graphics functions such as plot.default have an argument of this name which multiplies this graphical parameter, and some functions such as points and text accept a vector of values which are recycled. Because cex = .5 here, the points are drawn at half the normal size.

1.2 Bar Plots

1.2.1 Creates a bar plot with vertical or horizontal bars.

1.2.1.a

Data: BOD
Usage: barplot(height, ...)

Supply a vector giving the bar heights. To label the bars, use names.arg.

barplot(BOD$demand)

barplot(BOD$demand, names.arg = BOD$Time)

Use table to count the occurrences of each value in a vector, and plot the counts as a bar graph.

Data: mtcars

# cyl = number of cylinders
table(mtcars$cyl)

## 
##  4  6  8 
## 11  7 14

barplot(table(mtcars$cyl))

In ggplot2, use geom_col.

Data: BOD

# ggplot2
ggplot(BOD, aes(x = Time, y = demand)) + 
  geom_col()

To treat the variable x as a discrete value, convert it to a factor.

# ggplot2
ggplot(BOD, aes(x = factor(Time), y = demand)) + 
  geom_col()

geom_bar plots the number of cases in each category. Here x is continuous.

Data: mtcars

ggplot(mtcars, aes(x = cyl)) + 
  geom_bar()

Bar graph of counts. Here x is a factor (category).

ggplot(mtcars, aes(x = factor(cyl))) + 
  geom_bar()
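
The relationship between geom_bar and geom_col can be checked by counting first and then plotting the counts as heights. The following is a minimal sketch; the object name cyl_counts is an illustrative choice, and n is the column name that dplyr's count() assigns by default.

```r
library(dplyr)
library(ggplot2)

# Count the cars per cylinder class; count() names the count column n.
cyl_counts <- count(mtcars, cyl)

# geom_col() draws the precomputed counts as bar heights,
# which should give the same picture as geom_bar() on the raw data.
p <- ggplot(cyl_counts, aes(x = factor(cyl), y = n)) +
  geom_col()
print(p)
```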

1.2.8 Help of Bar Plots

Description: Creates a bar plot with vertical or horizontal bars.
Usage: barplot(height, …)

## Default S3 method:
barplot(height, width = 1, space = NULL,
        names.arg = NULL, legend.text = NULL, beside = FALSE,
        horiz = FALSE, density = NULL, angle = 45,
        col = NULL, border = par("fg"),
        main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
        xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
        axes = TRUE, axisnames = TRUE,
        cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
        inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
        add = FALSE, ann = !add && par("ann"), args.legend = NULL, ...)

## S3 method for class 'formula'
barplot(formula, data, subset, na.action,
        horiz = FALSE, xlab = NULL, ylab = NULL, ...)
Arguments
height  
either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.

width   
optional vector of bar widths. Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless xlim is specified.

space   
the amount of space (as a fraction of the average bar width) left before each bar. May be given as a single number or one number per bar. If height is a matrix and beside is TRUE, space may be specified by two numbers, where the first is the space between bars in the same group, and the second the space between the groups. If not given explicitly, it defaults to c(0,1) if height is a matrix and beside is TRUE, and to 0.2 otherwise.

names.arg   
a vector of names to be plotted below each bar or group of bars. If this argument is omitted, then the names are taken from the names attribute of height if this is a vector, or the column names if it is a matrix.

legend.text 
a vector of text used to construct a legend for the plot, or a logical indicating whether a legend should be included. This is only useful when height is a matrix. In that case given legend labels should correspond to the rows of height; if legend.text is true, the row names of height will be used as labels if they are non-null.

beside  
a logical value. If FALSE, the columns of height are portrayed as stacked bars, and if TRUE the columns are portrayed as juxtaposed bars.

horiz   
a logical value. If FALSE, the bars are drawn vertically with the first bar to the left. If TRUE, the bars are drawn horizontally with the first at the bottom.

density 
a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of NULL means that no shading lines are drawn. Non-positive values of density also inhibit the drawing of shading lines.

angle   
the slope of shading lines, given as an angle in degrees (counter-clockwise), for the bars or bar components.

col 
a vector of colors for the bars or bar components. By default, grey is used if height is a vector, and a gamma-corrected grey palette if height is a matrix.

border  
the color to be used for the border of the bars. Use border = NA to omit borders. If there are shading lines, border = TRUE means use the same colour for the border as for the shading lines.

main,sub    
overall and sub title for the plot.

xlab    
a label for the x axis.

ylab    
a label for the y axis.

xlim    
limits for the x axis.

ylim    
limits for the y axis.

xpd 
logical. Should bars be allowed to go outside region?

log 
string specifying if axis scales should be logarithmic; see plot.default.

axes    
logical. If TRUE, a vertical (or horizontal, if horiz is true) axis is drawn.

axisnames   
logical. If TRUE, and if there are names.arg (see above), the other axis is drawn (with lty = 0) and labeled.

cex.axis    
expansion factor for numeric axis labels.

cex.names   
expansion factor for axis names (bar labels).

inside  
logical. If TRUE, the lines which divide adjacent (non-stacked!) bars will be drawn. Only applies when space = 0 (which it partly is when beside = TRUE).

plot    
logical. If FALSE, nothing is plotted.

axis.lty    
the graphics parameter lty applied to the axis and tick marks of the categorical (default horizontal) axis. Note that by default the axis is suppressed.

offset  
a vector indicating how much the bars should be shifted relative to the x axis.

add 
logical specifying if bars should be added to an already existing plot; defaults to FALSE.

ann 
logical specifying if the default annotation (main, sub, xlab, ylab) should appear on the plot, see title.

args.legend 
list of additional arguments to pass to legend(); names of the list are used as argument names. Only used if legend.text is supplied.

formula 
a formula where the y variables are numeric data to plot against the categorical x variables. The formula can have one of three forms:

      y ~ x
      y ~ x1 + x2
      cbind(y1, y2) ~ x
, see the examples.

data    
a data frame (or list) from which the variables in formula should be taken.

subset  
an optional vector specifying a subset of observations to be used.

na.action   
a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.

... 
arguments to be passed to/from other methods. For the default method these can include further arguments (such as axes, asp and main) and graphical parameters (see par) which are passed to plot.window(), title() and axis.

Value
A numeric vector (or matrix, when beside = TRUE), say mp, giving the coordinates of all the bar midpoints drawn, useful for adding to the graph.

If beside is true, use colMeans(mp) for the midpoints of each group of bars, see example.
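
The returned midpoints are what make it possible to annotate the bars. As a small sketch using the BOD data from section 1.2.1.a (the ylim headroom factor of 1.1 is an arbitrary choice made here so the labels fit):

```r
# barplot() returns the bar midpoints invisibly; keep them in mp.
mp <- barplot(BOD$demand, names.arg = BOD$Time,
              ylim = c(0, max(BOD$demand) * 1.1))

# Write each value just above its bar, using mp as the x coordinates.
text(mp, BOD$demand, labels = BOD$demand, pos = 3)
```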

Author(s)
R Core, with a contribution by Arni Magnusson.

1.2.9 Example in Help

example(barplot)

## 
## barplt> # Formula method
## barplt> barplot(GNP ~ Year, data = longley)

## 
## barplt> barplot(cbind(Employed, Unemployed) ~ Year, data = longley)

## 
## barplt> ## 3rd form of formula - 2 categories :
## barplt> op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))
## 
## barplt> summary(d.Titanic <- as.data.frame(Titanic))
##   Class       Sex        Age     Survived      Freq       
##  1st :8   Male  :16   Child:16   No :16   Min.   :  0.00  
##  2nd :8   Female:16   Adult:16   Yes:16   1st Qu.:  0.75  
##  3rd :8                                   Median : 13.50  
##  Crew:8                                   Mean   : 68.78  
##                                           3rd Qu.: 77.00  
##                                           Max.   :670.00  
## 
## barplt> barplot(Freq ~ Class + Survived, data = d.Titanic,
## barplt+         subset = Age == "Adult" & Sex == "Male",
## barplt+         main = "barplot(Freq ~ Class + Survived, *)", ylab = "# {passengers}", legend = TRUE)

## 
## barplt> # Corresponding table :
## barplt> (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age=="Adult"))
## , , Sex = Male
## 
##         Class
## Survived 1st 2nd 3rd Crew
##      No  118 154 387  670
##      Yes  57  14  75  192
## 
## , , Sex = Female
## 
##         Class
## Survived 1st 2nd 3rd Crew
##      No    4  13  89    3
##      Yes 140  80  76   20
## 
## 
## barplt> # Alternatively, a mosaic plot :
## barplt> mosaicplot(xt[,,"Male"], main = "mosaicplot(Freq ~ Class + Survived, *)", color=TRUE)

## 
## barplt> par(op)
## 
## barplt> # Default method
## barplt> require(grDevices) # for colours
## 
## barplt> tN <- table(Ni <- stats::rpois(100, lambda = 5))
## 
## barplt> r <- barplot(tN, col = rainbow(20))

## 
## barplt> #- type = "h" plotting *is* 'bar'plot
## barplt> lines(r, tN, type = "h", col = "red", lwd = 2)
## 
## barplt> barplot(tN, space = 1.5, axisnames = FALSE,
## barplt+         sub = "barplot(..., space= 1.5, axisnames = FALSE)")

## 
## barplt> barplot(VADeaths, plot = FALSE)
## [1] 0.7 1.9 3.1 4.3
## 
## barplt> barplot(VADeaths, plot = FALSE, beside = TRUE)
##      [,1] [,2] [,3] [,4]
## [1,]  1.5  7.5 13.5 19.5
## [2,]  2.5  8.5 14.5 20.5
## [3,]  3.5  9.5 15.5 21.5
## [4,]  4.5 10.5 16.5 22.5
## [5,]  5.5 11.5 17.5 23.5
## 
## barplt> mp <- barplot(VADeaths) # default

## 
## barplt> tot <- colMeans(VADeaths)
## 
## barplt> text(mp, tot + 3, format(tot), xpd = TRUE, col = "blue")
## 
## barplt> barplot(VADeaths, beside = TRUE,
## barplt+         col = c("lightblue", "mistyrose", "lightcyan",
## barplt+                 "lavender", "cornsilk"),
## barplt+         legend = rownames(VADeaths), ylim = c(0, 100))

## 
## barplt> title(main = "Death Rates in Virginia", font.main = 4)
## 
## barplt> hh <- t(VADeaths)[, 5:1]
## 
## barplt> mybarcol <- "gray20"
## 
## barplt> mp <- barplot(hh, beside = TRUE,
## barplt+         col = c("lightblue", "mistyrose",
## barplt+                 "lightcyan", "lavender"),
## barplt+         legend = colnames(VADeaths), ylim = c(0,100),
## barplt+         main = "Death Rates in Virginia", font.main = 4,
## barplt+         sub = "Faked upper 2*sigma error bars", col.sub = mybarcol,
## barplt+         cex.names = 1.5)

## 
## barplt> segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)
## 
## barplt> stopifnot(dim(mp) == dim(hh))  # corresponding matrices
## 
## barplt> mtext(side = 1, at = colMeans(mp), line = -2,
## barplt+       text = paste("Mean", formatC(colMeans(hh))), col = "red")
## 
## barplt> # Bar shading example
## barplt> barplot(VADeaths, angle = 15+10*1:5, density = 20, col = "black",
## barplt+         legend = rownames(VADeaths))

## 
## barplt> title(main = list("Death Rates in Virginia", font = 4))
## 
## barplt> # Border color
## barplt> barplot(VADeaths, border = "dark blue")

## 
## barplt> # Log scales (not much sense here)
## barplt> barplot(tN, col = heat.colors(12), log = "y")

## 
## barplt> barplot(tN, col = gray.colors(20), log = "xy")

## 
## barplt> # Legend location
## barplt> barplot(height = cbind(x = c(465, 91) / 465 * 100,
## barplt+                        y = c(840, 200) / 840 * 100,
## barplt+                        z = c(37, 17) / 37 * 100),
## barplt+         beside = FALSE,
## barplt+         width = c(465, 840, 37),
## barplt+         col = c(1, 2),
## barplt+         legend.text = c("A", "B"),
## barplt+         args.legend = list(x = "topleft"))

Bar plot of GNP by year from the longley dataset
Bar plot of Employed and Unemployed by year from the longley dataset, in different colors
Titanic

1.3 Histogram

Data: mtcars
Usage: hist(x, …)

hist takes a single vector as its argument.

1.3.1 Simple Example

hist(mtcars$mpg)

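Specify the number of bins with breaks. A single number is treated as a suggestion only, because the breakpoints are set to pretty values.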

hist(mtcars$mpg, breaks = 10)
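
According to the Help text for breaks, a single number is only a suggestion. One way to see what was actually used is to inspect the returned object without drawing it; this is a minimal sketch, and the object name h is illustrative.

```r
# plot = FALSE returns the histogram object without drawing it.
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

h$breaks           # the breakpoints actually used
length(h$counts)   # the resulting number of bins, which need not equal 10
```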

In ggplot2, specify x and use geom_histogram(). The default is bins = 30.

ggplot(mtcars, aes(x = mpg)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adjust binwidth.

ggplot(mtcars, aes(x = mpg)) + 
  geom_histogram(binwidth = 1)

ggplot(mtcars, aes(x = mpg)) + 
  geom_histogram(binwidth = 4)

1.3.8 Help of Histograms

Description: The generic function hist computes a histogram of the given data values. If plot = TRUE, the resulting object of class “histogram” is plotted by plot.histogram, before it is returned.
Usage: hist(x, …)

## Default S3 method:
hist(x, breaks = "Sturges",
     freq = NULL, probability = !freq,
     include.lowest = TRUE, right = TRUE,
     density = NULL, angle = 45, col = NULL, border = NULL,
     main = paste("Histogram of" , xname),
     xlim = range(breaks), ylim = NULL,
     xlab = xname, ylab,
     axes = TRUE, plot = TRUE, labels = FALSE,
     nclass = NULL, warn.unused = TRUE, ...)
Arguments
x   
a vector of values for which the histogram is desired.

breaks  
one of:

a vector giving the breakpoints between histogram cells,

a function to compute the vector of breakpoints,

a single number giving the number of cells for the histogram,

a character string naming an algorithm to compute the number of cells (see ‘Details’),

a function to compute the number of cells.

In the last three cases the number is a suggestion only; as the breakpoints will be set to pretty values, the number is limited to 1e6 (with a warning if it was larger). If breaks is a function, the x vector is supplied to it as the only argument (and the number of breaks is only limited by the amount of available memory).

freq    
logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).

probability 
an alias for !freq, for S compatibility.

include.lowest  
logical; if TRUE, an x[i] equal to the breaks value will be included in the first (or last, for right = FALSE) bar. This will be ignored (with a warning) unless breaks is a vector.

right   
logical; if TRUE, the histogram cells are right-closed (left open) intervals.

density 
the density of shading lines, in lines per inch. The default value of NULL means that no shading lines are drawn. Non-positive values of density also inhibit the drawing of shading lines.

angle   
the slope of shading lines, given as an angle in degrees (counter-clockwise).

col 
a colour to be used to fill the bars. The default of NULL yields unfilled bars.

border  
the color of the border around the bars. The default is to use the standard foreground color.

main, xlab, ylab    
main title and axis labels: these arguments to title() get “smart” defaults here, e.g., the default ylab is "Frequency" iff freq is true.

xlim, ylim  
the range of x and y values with sensible defaults. Note that xlim is not used to define the histogram (breaks), but only for plotting (when plot = TRUE).

axes    
logical. If TRUE (default), axes are drawn if the plot is drawn.

plot    
logical. If TRUE (default), a histogram is plotted, otherwise a list of breaks and counts is returned. In the latter case, a warning is used if (typically graphical) arguments are specified that only apply to the plot = TRUE case.

labels  
logical or character string. Additionally draw labels on top of bars, if not FALSE; see plot.histogram.

nclass  
numeric (integer). For S(-PLUS) compatibility only, nclass is equivalent to breaks for a scalar or character argument.

warn.unused 
logical. If plot = FALSE and warn.unused = TRUE, a warning will be issued when graphical parameters are passed to hist.default().

... 
further arguments and graphical parameters passed to plot.histogram and thence to title and axis (if plot = TRUE).

Details
The definition of histogram differs by source (with country-specific biases). R's default with equi-spaced breaks (also the default) is to plot the counts in the cells defined by breaks. Thus the height of a rectangle is proportional to the number of points falling into the cell, as is the area provided the breaks are equally-spaced.

The default with non-equi-spaced breaks is to give a plot of area one, in which the area of the rectangles is the fraction of the data points falling in the cells.

If right = TRUE (default), the histogram cells are intervals of the form (a, b], i.e., they include their right-hand endpoint, but not their left one, with the exception of the first cell when include.lowest is TRUE.

For right = FALSE, the intervals are of the form [a, b), and include.lowest means ‘include highest’.
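
The effect of right on values that fall exactly on a breakpoint can be checked with a tiny made-up vector (x, h_right, and h_left are illustrative names):

```r
x <- c(1, 2, 2, 3)

# right = TRUE (default): cells are [0,1], (1,2], (2,3]
h_right <- hist(x, breaks = 0:3, plot = FALSE)
h_right$counts   # 1 2 1

# right = FALSE: cells are [0,1), [1,2), [2,3]
h_left <- hist(x, breaks = 0:3, right = FALSE, plot = FALSE)
h_left$counts    # 0 1 3
```

The same four values land in different cells purely because of which endpoint each interval includes.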

A numerical tolerance of 1e-7 times the median bin size (for more than four bins, otherwise the median is substituted) is applied when counting entries on the edges of bins. This is not included in the reported breaks nor in the calculation of density.

The default for breaks is "Sturges": see nclass.Sturges. Other names for which algorithms are supplied are "Scott" and "FD" / "Freedman-Diaconis" (with corresponding functions nclass.scott and nclass.FD). Case is ignored and partial matching is used. Alternatively, a function can be supplied which will compute the intended number of breaks or the actual breakpoints as a function of x.
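
As a rough illustration of how the named rules differ, the three nclass functions can be called directly on the same data. This sketch uses mtcars$mpg; the object names x and h_fd are illustrative.

```r
x <- mtcars$mpg

# Suggested number of cells under each rule.
c(Sturges = nclass.Sturges(x),
  Scott   = nclass.scott(x),
  FD      = nclass.FD(x))

# The rule can be passed to hist() by name; case is ignored.
h_fd <- hist(x, breaks = "fd", plot = FALSE)
h_fd$breaks
```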

Value
an object of class "histogram" which is a list with components:

breaks  
the n+1 cell boundaries (= breaks if that was a vector). These are the nominal breaks, not with the boundary fuzz.

counts  
n integers; for each cell, the number of x[] inside.

density 
values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].

mids    
the n cell midpoints.

xname   
a character string with the actual x argument name.

equidist    
logical, indicating if the distances between breaks are all the same.

References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Springer.

See Also
nclass.Sturges, stem, density, truehist in package MASS.

Typical plots with vertical bars are not histograms. Consider barplot or plot(*, type = "h") for such bar plots.

1.3.9 Example in Help

example(hist)

## 
## hist> op <- par(mfrow = c(2, 2))
## 
## hist> hist(islands)

## 
## hist> utils::str(hist(islands, col = "gray", labels = TRUE))

## List of 6
##  $ breaks  : num [1:10] 0 2000 4000 6000 8000 10000 12000 14000 16000 18000
##  $ counts  : int [1:9] 41 2 1 1 1 1 0 0 1
##  $ density : num [1:9] 4.27e-04 2.08e-05 1.04e-05 1.04e-05 1.04e-05 ...
##  $ mids    : num [1:9] 1000 3000 5000 7000 9000 11000 13000 15000 17000
##  $ xname   : chr "islands"
##  $ equidist: logi TRUE
##  - attr(*, "class")= chr "histogram"
## 
## hist> hist(sqrt(islands), breaks = 12, col = "lightblue", border = "pink")

## 
## hist> ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:
## hist> r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),
## hist+           col = "blue1")

## 
## hist> text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = "blue3")
## 
## hist> sapply(r[2:3], sum)
##    counts   density 
## 48.000000  0.215625 
## 
## hist> sum(r$density * diff(r$breaks)) # == 1
## [1] 1
## 
## hist> lines(r, lty = 3, border = "purple") # -> lines.histogram(*)
## 
## hist> par(op)
## 
## hist> require(utils) # for str
## 
## hist> str(hist(islands, breaks = 12, plot =  FALSE)) #-> 10 (~= 12) breaks
## List of 6
##  $ breaks  : num [1:10] 0 2000 4000 6000 8000 10000 12000 14000 16000 18000
##  $ counts  : int [1:9] 41 2 1 1 1 1 0 0 1
##  $ density : num [1:9] 4.27e-04 2.08e-05 1.04e-05 1.04e-05 1.04e-05 ...
##  $ mids    : num [1:9] 1000 3000 5000 7000 9000 11000 13000 15000 17000
##  $ xname   : chr "islands"
##  $ equidist: logi TRUE
##  - attr(*, "class")= chr "histogram"
## 
## hist> str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))
## List of 6
##  $ breaks  : num [1:7] 12 20 36 80 200 1000 17000
##  $ counts  : int [1:6] 12 11 8 6 4 7
##  $ density : num [1:6] 0.03125 0.014323 0.003788 0.001042 0.000104 ...
##  $ mids    : num [1:6] 16 28 58 140 600 9000
##  $ xname   : chr "islands"
##  $ equidist: logi FALSE
##  - attr(*, "class")= chr "histogram"
## 
## hist> hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,
## hist+      main = "WRONG histogram") # and warning

## Warning in plot.histogram(r, freq = freq1, col = col, border = border, angle =
## angle, : the AREAS in the plot are wrong -- rather use 'freq = FALSE'

## 
## hist> ## No test: ##D 
## hist> ##D ## Extreme outliers; the "FD" rule would take very large number of 'breaks':
## hist> ##D XXL <- c(1:9, c(-1,1)*1e300)
## hist> ##D hh <- hist(XXL, "FD") # did not work in R <= 3.4.1; now gives warning
## hist> ##D ## pretty() determines how many counts are used (platform dependently!):
## hist> ##D length(hh$breaks) ## typically 1 million -- though 1e6 was "a suggestion only"
## hist> ## End(No test)
## hist> require(stats)
## 
## hist> set.seed(14)
## 
## hist> x <- rchisq(100, df = 4)
## 
## hist> ## Don't show: 
## hist> op <- par(mfrow = 2:1, mgp = c(1.5, 0.6, 0), mar = .1 + c(3,3:1))
## 
## hist> ## End(Don't show)
## hist> ## Comparing data with a model distribution should be done with qqplot()!
## hist> qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)

## 
## hist> ## if you really insist on using hist() ... :
## hist> hist(x, freq = FALSE, ylim = c(0, 0.2))

## 
## hist> curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)
## 
## hist> ## Don't show: 
## hist> par(op)
## 
## hist> ## End(Don't show)
## hist> 
## hist> 
## hist>

1.4 Box Plot

1.4.1 When x Is a Factor Rather Than a Numeric Vector

Data: ToothGrowth

head(ToothGrowth)

plot(ToothGrowth$supp, ToothGrowth$len)

boxplot(len ~ supp, data = ToothGrowth)

# ggplot2
ggplot(ToothGrowth, aes(x = supp, y = len)) + 
  geom_boxplot()

# ggplot2
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) + 
  geom_boxplot()
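
A base R counterpart of the interaction plot can be sketched with the formula interface, where two grouping variables on the right-hand side produce one box per combination. The object name b is illustrative.

```r
# One box per combination of supp and dose, as in the ggplot2 version.
b <- boxplot(len ~ supp + dose, data = ToothGrowth)

b$names   # group labels, joined with sep = "."
```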

1.4.8 Help of Box Plots

Description: Produce box-and-whisker plot(s) of the given (grouped) values.
Usage: boxplot(x, …)

## S3 method for class 'formula'
boxplot(formula, data = NULL, ..., subset, na.action = NULL,
        xlab = mklab(y_var = horizontal),
        ylab = mklab(y_var =!horizontal),
        add = FALSE, ann = !add, horizontal = FALSE,
        drop = FALSE, sep = ".", lex.order = FALSE)

## Default S3 method:
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
        notch = FALSE, outline = TRUE, names, plot = TRUE,
        border = par("fg"), col = NULL, log = "",
        pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
        ann = !add, horizontal = FALSE, add = FALSE, at = NULL)
Arguments
formula 
a formula, such as y ~ grp, where y is a numeric vector of data values to be split into groups according to the grouping variable grp (usually a factor). Note that ~ g1 + g2 is equivalent to g1:g2.

data    
a data.frame (or list) from which the variables in formula should be taken.

subset  
an optional vector specifying a subset of observations to be used for plotting.

na.action   
a function which indicates what should happen when the data contain NAs. The default is to ignore missing values in either the response or the group.

xlab, ylab  
x- and y-axis annotation, since R 3.6.0 with a non-empty default. Can be suppressed by ann=FALSE.

ann 
logical indicating if axes should be annotated (by xlab and ylab).

drop, sep, lex.order    
passed to split.default, see there.

x   
for specifying data from which the boxplots are to be produced. Either a numeric vector, or a single list containing such vectors. Additional unnamed arguments specify further data as separate vectors (each corresponding to a component boxplot). NAs are allowed in the data.

... 
For the formula method, named arguments to be passed to the default method.

For the default method, unnamed arguments are additional data vectors (unless x is a list when they are ignored), and named arguments are arguments and graphical parameters to be passed to bxp in addition to the ones given by argument pars (and override those in pars). Note that bxp may or may not make use of graphical parameters it is passed: see its documentation.

range   
this determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.

width   
a vector giving the relative widths of the boxes making up the plot.

varwidth    
if varwidth is TRUE, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.

notch   
if notch is TRUE, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al, 1983, p. 62). See boxplot.stats for the calculations used.

outline 
if outline is not true, the outliers are not drawn (as points whereas S+ uses lines).

names   
group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).

boxwex  
a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.

staplewex   
staple line width expansion, proportional to box width.

outwex  
outlier line width expansion, proportional to box width.

plot    
if TRUE (the default) then a boxplot is produced. If not, the summaries which the boxplots are based on are returned.

border  
an optional vector of colors for the outlines of the boxplots. The values in border are recycled if the length of border is less than the number of plots.

col 
if col is non-null it is assumed to contain colors to be used to colour the bodies of the box plots. By default they are in the background colour.

log 
character indicating if x or y or both coordinates should be plotted in log scale.

pars    
a list of (potentially many) more graphical parameters, e.g., boxwex or outpch; these are passed to bxp (if plot is true); for details, see there.

horizontal  
logical indicating if the boxplots should be horizontal; default FALSE means vertical boxes.

add 
logical, if true add boxplot to current plot.

at  
numeric vector giving the locations where the boxplots should be drawn, particularly when add = TRUE; defaults to 1:n where n is the number of boxes.

Details
The generic function boxplot currently has a default method (boxplot.default) and a formula interface (boxplot.formula).

If multiple groups are supplied either as multiple arguments or via a formula, parallel boxplots will be plotted, in the order of the arguments or the order of the levels of the factor (see factor).

Missing values are ignored when forming boxplots.

Value
List with the following components:

stats   
a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.

n   
a vector with the number of observations in each group.

conf    
a matrix where each column contains the lower and upper extremes of the notch.

out 
the values of any data points which lie beyond the extremes of the whiskers.

group   
a vector of the same length as out whose elements indicate to which group the outlier belongs.

names   
a vector of names for the groups.
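As a hedged illustration of the Value section, the sketch below calls boxplot() with plot = FALSE on the InsectSprays data used in the help example and inspects the returned list. The object names res and res0 are ours and do not come from the help page.

```r
# InsectSprays is in the datasets package that ships with R.
# plot = FALSE returns the summaries without drawing anything.
res <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

names(res)      # the components described under "Value"
dim(res$stats)  # five summary rows, one column per spray
res$n           # number of observations in each group
res$out         # points beyond the whiskers
res$group       # group index of each point in res$out

# range = 0 extends the whiskers to the data extremes,
# so no points are reported as outliers.
res0 <- boxplot(count ~ spray, data = InsectSprays, range = 0, plot = FALSE)
length(res0$out)
```

Because the summaries are available without plotting, the same numbers can be reused for custom annotations or tables.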

References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole.

Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole.

Murrell, P. (2005). R Graphics. Chapman & Hall/CRC Press.

See also boxplot.stats.

See Also
boxplot.stats which does the computation, bxp for the plotting and more examples; and stripchart for an alternative (with small data sets).

1.4.9 Example in Help

example(boxplot)

## 
## boxplt> ## boxplot on a formula:
## boxplt> boxplot(count ~ spray, data = InsectSprays, col = "lightgray")

## 
## boxplt> # *add* notches (somewhat funny here <--> warning "notches .. outside hinges"):
## boxplt> boxplot(count ~ spray, data = InsectSprays,
## boxplt+         notch = TRUE, add = TRUE, col = "blue")

## Warning in bxp(list(stats = structure(c(7, 11, 14, 18.5, 23, 7, 12, 16.5, : some
## notches went outside hinges ('box'): maybe set notch=FALSE

## 
## boxplt> boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque",
## boxplt+         log = "y")

## 
## boxplt> ## horizontal=TRUE, switching  y <--> x :
## boxplt> boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque",
## boxplt+         log = "x", horizontal=TRUE)

## 
## boxplt> rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque")

## 
## boxplt> title("Comparing boxplot()s and non-robust mean +/- SD")
## 
## boxplt> mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)
## 
## boxplt> sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)
## 
## boxplt> xi <- 0.3 + seq(rb$n)
## 
## boxplt> points(xi, mn.t, col = "orange", pch = 18)
## 
## boxplt> arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,
## boxplt+        code = 3, col = "pink", angle = 75, length = .1)
## 
## boxplt> ## boxplot on a matrix:
## boxplt> mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),
## boxplt+              `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))
## 
## boxplt> boxplot(mat) # directly, calling boxplot.matrix()

## 
## boxplt> ## boxplot on a data frame:
## boxplt> df. <- as.data.frame(mat)
## 
## boxplt> par(las = 1) # all axis labels horizontal
## 
## boxplt> boxplot(df., main = "boxplot(*, horizontal = TRUE)", horizontal = TRUE)

## 
## boxplt> ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :
## boxplt> boxplot(len ~ dose, data = ToothGrowth,
## boxplt+         boxwex = 0.25, at = 1:3 - 0.2,
## boxplt+         subset = supp == "VC", col = "yellow",
## boxplt+         main = "Guinea Pigs' Tooth Growth",
## boxplt+         xlab = "Vitamin C dose mg",
## boxplt+         ylab = "tooth length",
## boxplt+         xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")

## 
## boxplt> boxplot(len ~ dose, data = ToothGrowth, add = TRUE,
## boxplt+         boxwex = 0.25, at = 1:3 + 0.2,
## boxplt+         subset = supp == "OJ", col = "orange")
## 
## boxplt> legend(2, 9, c("Ascorbic acid", "Orange juice"),
## boxplt+        fill = c("yellow", "orange"))
## 
## boxplt> ## With less effort (slightly different) using factor *interaction*:
## boxplt> boxplot(len ~ dose:supp, data = ToothGrowth,
## boxplt+         boxwex = 0.5, col = c("orange", "yellow"),
## boxplt+         main = "Guinea Pigs' Tooth Growth",
## boxplt+         xlab = "Vitamin C dose mg", ylab = "tooth length",
## boxplt+         sep = ":", lex.order = TRUE, ylim = c(0, 35), yaxs = "i")

## 
## boxplt> ## more examples in  help(bxp)
## boxplt> 
## boxplt> 
## boxplt>

1.5 Plot Curves

1.5.1 Simple Example

curve(x^3 - 5*x, from = -4, to = 4)

myfun <- function(xvar){
  1/(1+exp(-xvar + 10))
}

curve(myfun(x), from = 0, to = 20)
curve(1 - myfun(x), add = TRUE, col = "red")

p <- ggplot(data.frame(x = c(0,20)), aes(x = x)) 
p <- p +  stat_function(fun = myfun, geom = "line", color = "black") 
p +  stat_function(fun = function(t) 1 - myfun(t), geom = "line", color = "red")

1.5.9 Help of Draw Function Plots

Description: Draws a curve corresponding to a function over the interval [from, to]. curve can plot also an expression in the variable xname, default x.
Usage:
curve(expr, from = NULL, to = NULL, n = 101, add = FALSE, type = "l", xname = "x", xlab = xname, ylab = NULL, log = NULL, xlim = NULL, ...)

## S3 method for class 'function'
plot(x, y = 0, to = 1, from = y, xlim = NULL, ylab = NULL, ...)
Arguments
expr    
The name of a function, or a call or an expression written as a function of x which will evaluate to an object of the same length as x.

x   
a ‘vectorizing’ numeric R function.

y   
alias for from for compatibility with plot

from, to    
the range over which the function will be plotted.

n   
integer; the number of x values at which to evaluate.

add 
logical; if TRUE add to an already existing plot; if NA start a new plot taking the defaults for the limits and log-scaling of the x-axis from the previous plot. Taken as FALSE (with a warning if a different value is supplied) if no graphics device is open.

xlim    
NULL or a numeric vector of length 2; if non-NULL it provides the defaults for c(from, to) and, unless add = TRUE, selects the x-limits of the plot – see plot.window.

type    
plot type: see plot.default.

xname   
character string giving the name to be used for the x axis.

xlab, ylab, log, ...    
labels and graphical parameters can also be specified as arguments. See ‘Details’ for the interpretation of the default for log.

For the "function" method of plot, ... can include any of the other arguments of curve, except expr.

Details
The function or expression expr (for curve) or function x (for plot) is evaluated at n points equally spaced over the range [from, to]. The points determined in this way are then plotted.

If either from or to is NULL, it defaults to the corresponding element of xlim if that is not NULL.

What happens when neither from/to nor xlim specifies both x-limits is a complex story. For plot(<function>) and for curve(add = FALSE) the defaults are (0, 1). For curve(add = NA) and curve(add = TRUE) the defaults are taken from the x-limits used for the previous plot. (This differs from versions of R prior to 2.14.0.)

The value of log is used both to specify the plot axes (unless add = TRUE) and how ‘equally spaced’ is interpreted: if the x component indicates log-scaling, the points at which the expression or function is plotted are equally spaced on log scale.

The default value of log is taken from the current plot when add = TRUE, whereas if add = NA the x component is taken from the existing plot (if any) and the y component defaults to linear. For add = FALSE the default is ""

This used to be a quick hack which now seems to serve a useful purpose, but can give bad results for functions which are not smooth.

For expensive-to-compute expressions, you should use smarter tools.

The way curve handles expr has caused confusion. It first looks to see if expr is a name (also known as a symbol), in which case it is taken to be the name of a function, and expr is replaced by a call to expr with a single argument with name given by xname. Otherwise it checks that expr is either a call or an expression, and that it contains a reference to the variable given by xname (using all.vars): anything else is an error. Then expr is evaluated in an environment which supplies a vector of name given by xname of length n, and should evaluate to an object of length n. Note that this means that curve(x, ...) is taken as a request to plot a function named x (and it is used as such in the function method for plot).

The plot method can be called directly as plot.function.

Value
A list with components x and y of the points that were drawn is returned invisibly.
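The invisible return value can be captured and inspected. This is a minimal sketch reusing the cubic from 1.5.1; the name pts is ours.

```r
# curve() returns the evaluated points invisibly; assign them to inspect.
pts <- curve(x^3 - 5*x, from = -4, to = 4, n = 101)

str(pts)       # a list with numeric vectors x and y
range(pts$x)   # the requested interval [from, to]
head(pts$y)    # function values at the first few x positions
```

The x values are the n equally spaced evaluation points described under Details, so the list can be passed on to lines() or to a data frame for ggplot2.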

Warning
For historical reasons, add is allowed as an argument to the "function" method of plot, but its behaviour may surprise you. It is recommended to use add only with curve.

See Also
splinefun for spline interpolation, lines.

1.5.9 Example in Help

example(curve)

## 
## curve> plot(qnorm) # default range c(0, 1) is appropriate here,

## 
## curve>             # but end values are -/+Inf and so are omitted.
## curve> plot(qlogis, main = "The Inverse Logit : qlogis()")

## 
## curve> abline(h = 0, v = 0:2/2, lty = 3, col = "gray")
## 
## curve> curve(sin, -2*pi, 2*pi, xname = "t")

## 
## curve> curve(tan, xname = "t", add = NA,
## curve+       main = "curve(tan)  --> same x-scale as previous plot")

## 
## curve> op <- par(mfrow = c(2, 2))
## 
## curve> curve(x^3 - 3*x, -2, 2)

## 
## curve> curve(x^2 - 2, add = TRUE, col = "violet")
## 
## curve> ## simple and advanced versions, quite similar:
## curve> plot(cos, -pi,  3*pi)

## 
## curve> curve(cos, xlim = c(-pi, 3*pi), n = 1001, col = "blue", add = TRUE)
## 
## curve> chippy <- function(x) sin(cos(x)*exp(-x/2))
## 
## curve> curve(chippy, -8, 7, n = 2001)

## 
## curve> plot (chippy, -8, -5)

## 
## curve> for(ll in c("", "x", "y", "xy"))
## curve+    curve(log(1+x), 1, 100, log = ll, sub = paste0("log = '", ll, "'"))

## 
## curve> par(op)

2. Quick Plot: `qplot`

qplot in the ggplot2 package seems to have served as a bridge from plot() to ggplot2, but these days learning ggplot() from the start is recommended.

2.1 Overview

Description: qplot is a shortcut designed to be familiar if you’re used to base plot(). It’s a convenient wrapper for creating a number of different types of plots using a consistent calling scheme. It’s great for allowing you to produce plots quickly, but I highly recommend learning ggplot() as it makes it easier to create complex graphics.
Usage: requires ggplot2, which is included in the tidyverse package.

qplot(
  x,
  y,
  ...,
  data,
  facets = NULL,
  margins = FALSE,
  geom = "auto",
  xlim = c(NA, NA),
  ylim = c(NA, NA),
  log = "",
  main = NULL,
  xlab = NULL,
  ylab = NULL,
  asp = NA,
  stat = NULL,
  position = NULL
)

2.2 Examples

The mpg dataset bundled with ggplot2 is used.

Fuel economy data from 1999 to 2008 for 38 popular models of cars

Description: This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
Data Format:
- A data frame with 234 rows and 11 variables:
- manufacturer: manufacturer name
- model: model name
- displ: engine displacement, in litres
- year: year of manufacture
- cyl: number of cylinders
- trans: type of transmission (manual or automatic)
- drv: the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd
- cty: city miles per gallon
- hwy: highway miles per gallon
- fl: fuel type
- class: "type" of car

str(mpg)

## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

head(mpg)

Example Attached to `qplot`

example(qplot)

## 
## qplot> # Use data from data.frame
## qplot> qplot(mpg, wt, data = mtcars)

## 
## qplot> qplot(mpg, wt, data = mtcars, colour = cyl)

## 
## qplot> qplot(mpg, wt, data = mtcars, size = cyl)

## 
## qplot> qplot(mpg, wt, data = mtcars, facets = vs ~ am)

## 
## qplot> ## No test: 
## qplot> ##D qplot(1:10, rnorm(10), colour = runif(10))
## qplot> ##D qplot(1:10, letters[1:10])
## qplot> ##D mod <- lm(mpg ~ wt, data = mtcars)
## qplot> ##D qplot(resid(mod), fitted(mod))
## qplot> ##D 
## qplot> ##D f <- function() {
## qplot> ##D    a <- 1:10
## qplot> ##D    b <- a ^ 2
## qplot> ##D    qplot(a, b)
## qplot> ##D }
## qplot> ##D f()
## qplot> ##D 
## qplot> ##D # To set aesthetics, wrap in I()
## qplot> ##D qplot(mpg, wt, data = mtcars, colour = I("red"))
## qplot> ##D 
## qplot> ##D # qplot will attempt to guess what geom you want depending on the input
## qplot> ##D # both x and y supplied = scatterplot
## qplot> ##D qplot(mpg, wt, data = mtcars)
## qplot> ##D # just x supplied = histogram
## qplot> ##D qplot(mpg, data = mtcars)
## qplot> ##D # just y supplied = scatterplot, with x = seq_along(y)
## qplot> ##D qplot(y = mpg, data = mtcars)
## qplot> ##D 
## qplot> ##D # Use different geoms
## qplot> ##D qplot(mpg, wt, data = mtcars, geom = "path")
## qplot> ##D qplot(factor(cyl), wt, data = mtcars, geom = c("boxplot", "jitter"))
## qplot> ##D qplot(mpg, data = mtcars, geom = "dotplot")
## qplot> ## End(No test)
## qplot> 
## qplot> 
## qplot>

The following code appears in the "No test" part of the example above, so it is run here explicitly.

qplot(1:10, rnorm(10), colour = runif(10))

qplot(1:10, letters[1:10])

mod <- lm(mpg ~ wt, data = mtcars)
qplot(resid(mod), fitted(mod))

f <- function() {
   a <- 1:10
   b <- a ^ 2
   qplot(a, b)
}
f()

# To set aesthetics, wrap in I()
qplot(mpg, wt, data = mtcars, colour = I("red"))

# qplot will attempt to guess what geom you want depending on the input
# both x and y supplied = scatterplot
qplot(mpg, wt, data = mtcars)

# just x supplied = histogram
qplot(mpg, data = mtcars)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# just y supplied = scatterplot, with x = seq_along(y)
qplot(y = mpg, data = mtcars)

# Use different geoms
qplot(mpg, wt, data = mtcars, geom = "path")

qplot(factor(cyl), wt, data = mtcars, geom = c("boxplot", "jitter"))

qplot(mpg, data = mtcars, geom = "dotplot")

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

2.3 Other Examples

Examples from the ggplot2 book and other sources.

2.3.1 Color, Shape, Transparency

Data: ggplot2::diamonds: A dataset containing the prices and other attributes of almost 54,000 diamonds.
Data Format:
- A data frame with 53940 rows and 10 variables:
- price: price in US dollars ($326–$18,823)
- carat: weight of the diamond (0.2–5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from D (best) to J (worst)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0–10.74)
- y: width in mm (0–58.9)
- z: depth in mm (0–31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
- table: width of top of diamond relative to widest point (43–95)

# diamond data attached to ggplot2
str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

# small subset of diamonds
dsmall <- diamonds[sample(nrow(diamonds), 100),]

Color and shape are assigned automatically from categorical data.

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

qc <- qplot(carat, price, data = dsmall, color = color)
qs <- qplot(carat, price, data = dsmall, shape = cut)
grid.arrange(qc, qs, ncol=2)

## Warning: Using shapes for an ordinal variable is not advised

When there is a lot of data, set the transparency to make the plot easier to read.

#library(gridExtra)
qa10 <- qplot(carat, price, data = diamonds, alpha = I(1/10))
qa100 <- qplot(carat, price, data = diamonds, alpha = I(1/100))
qa200 <- qplot(carat, price, data = diamonds, alpha = I(1/200))
grid.arrange(qa10, qa100, qa200, ncol=3)

2.3.2 geom: point, smooth, boxplot, path, line, histogram, freqpoly, density, bar

#library(gridExtra)
qp <- qplot(carat, price, data = dsmall, geom = "point")
qps <- qplot(carat, price, data = dsmall, geom = c("point", "smooth"))
qpss <- qplot(carat, price, data = dsmall, geom = c("point", "smooth"), span = 0.2)

## Warning: Ignoring unknown parameters: span

grid.arrange(qp, qps, qpss, ncol=3)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

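Geoms are drawn in the order given in the geom vector, so later geoms are drawn over earlier ones.
For smooth, a method can be specified. With loess, the wiggliness is given by span (as the warning above shows, this version of qplot does not pass span on to the smoother).
The mgcv package provides an extra method: gam.
The splines package can be used with method lm for spline fits.
The MASS package provides an extra method: rlm.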

#library(gridExtra)
qj <- qplot(color, price / carat, data = diamonds, geom = "jitter")
qja <- qplot(color, price / carat, data = diamonds, geom = "jitter", alpha = I(1/50))
qb <- qplot(color, price / carat, data = diamonds, geom = "boxplot")
grid.arrange(qj, qja, qb, ncol=3)

A categorical variable (color) against a continuous numeric variable (price / carat).

#library(gridExtra)
qh <- qplot(carat, data = diamonds, geom = "histogram", main = "ヒストグラム") + theme_gray(base_family = "HiraKakuPro-W3")
qd <- qplot(carat, data = diamonds, geom = "density", main = "密度") + theme_gray(base_family = "HiraKakuPro-W3")
qb <- qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1, main = "Histogram: binwidth = 0.1")
qbk <- qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.5, main = "Histogram: binwidth = 0.5")
qhc <- qplot(carat, data = diamonds, geom = "histogram", fill = color, main = "Histogram: fill = color")
qdc <- qplot(carat, data = diamonds, geom = "density", color = color, main = "Density: color = color")
grid.arrange(qh, qd, qb, qbk, qhc, qdc, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram and density function (top row)
Effect of binwidth (middle row)
Histogram with fill and density with color (bottom row)

#library(gridExtra)
qb <- qplot(color, data = diamonds, geom = "bar", ylab = "count")
qbw <- qplot(color, data = diamonds, geom = "bar", weight = carat, ylab = "carat as weight")
qbws <- qplot(color, data = diamonds, geom = "bar", weight = carat) + 
  scale_y_continuous("carat")
grid.arrange(qb, qbw, qbws, ncol=3)

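bar: with only color supplied, the bar height is the count (frequency).
bar: with color and weight = carat, the bar height is the total carat for each color.
bar: with color, weight = carat, and scale_y_continuous("carat"), the bar height is again the total carat for each color, with the y axis labelled "carat".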
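The claim that weight = carat turns the bar height into a per-color carat total can be checked against a direct computation. This is a sketch written with ggplot() in place of qplot(); the names totals, p, and bars are ours.

```r
library(ggplot2)

# Total carat per colour, computed directly.
totals <- tapply(diamonds$carat, diamonds$color, sum)

# The same quantity as bar heights: with weight = carat each row
# contributes its carat value instead of 1 to the count.
p <- ggplot(diamonds, aes(color, weight = carat)) +
  geom_bar() +
  ylab("carat")

bars <- layer_data(p)   # the data used to draw the bars
all.equal(bars$y, as.numeric(totals))
```

layer_data() is used here only to read back what the stat computed; printing p draws the chart itself.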

2.3.3 Time Series Data

Data: ggplot2::economics: US economic time series
Description: This dataset was produced from US economic time series data available from http://research.stlouisfed.org/fred2. economics is in "wide" format, economics_long is in "long" format.
Data Format:
- A data frame with 574 rows and 6 variables:
- date: month of data collection
- pce: personal consumption expenditures, in billions of dollars, http://research.stlouisfed.org/fred2/series/PCE
- pop: total population, in thousands, http://research.stlouisfed.org/fred2/series/POP
- psavert: personal savings rate, http://research.stlouisfed.org/fred2/series/PSAVERT/
- uempmed: median duration of unemployment, in weeks, http://research.stlouisfed.org/fred2/series/UEMPMED
- unemploy: number of unemployed in thousands, http://research.stlouisfed.org/fred2/series/UNEMPLOY

# economy data attached to ggplot2
str(economics)

## tibble [574 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date    : Date[1:574], format: "1967-07-01" "1967-08-01" ...
##  $ pce     : num [1:574] 507 510 516 512 517 ...
##  $ pop     : num [1:574] 198712 198911 199113 199311 199498 ...
##  $ psavert : num [1:574] 12.6 12.6 11.9 12.9 12.8 11.8 11.7 12.3 11.7 12.3 ...
##  $ uempmed : num [1:574] 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
##  $ unemploy: num [1:574] 2944 2945 2958 3143 3066 ...

#library(gridExtra)
ql <- qplot(date, unemploy / pop, data = economics, geom = "line")
ql2 <- qplot(date, uempmed, data = economics, geom = "line")
grid.arrange(ql, ql2, ncol=2)

year <- function(x) as.POSIXlt(x)$year + 1900
#library(gridExtra)
qpp <- qplot(unemploy / pop, uempmed, data = economics, geom = c("point", "path"))
qpc <- qplot(unemploy / pop, uempmed, data = economics, geom = c("point", "path"), color = year(date))
grid.arrange(qpp, qpc, ncol=2)

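Left: the unemployed share of the population (unemploy / pop) against the median duration of unemployment in weeks, traced month by month.
Right: the same plot, colored by year.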
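The year() helper defined above relies on POSIXlt storing the year as an offset from 1900. A small check, with two sample dates chosen by us:

```r
# Helper from the text: POSIXlt stores the year as years since 1900.
year <- function(x) as.POSIXlt(x)$year + 1900

d <- as.Date(c("1967-07-01", "2008-12-01"))
as.POSIXlt(d)$year   # years since 1900
year(d)              # calendar years

# An equivalent base R spelling, shown only for comparison.
as.numeric(format(d, "%Y"))
```

Because the result is numeric, mapping it to color gives a continuous colour gradient over time and not one discrete colour per year.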

2.3.4 Facets

#library(gridExtra)
qf <- qplot(carat, data = diamonds, facets = color ~ .,
             geom = "histogram", binwidth = 0.1, xlim = c(0,3))
qfd <- qplot(carat, ..density.., data = diamonds, facets = color ~ .,
             geom = "histogram", binwidth = 0.1, xlim = c(0,3))
grid.arrange(qf, qfd, ncol=2)
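The ..density.. notation used above belongs to older ggplot2 releases. The sketch below assumes ggplot2 3.3.0 or later, where after_stat() is available, and writes the density version with ggplot(). It zooms with coord_cartesian() so that no rows are dropped, which differs slightly from xlim = c(0, 3) in the qplot call.

```r
library(ggplot2)

# after_stat(density) asks for the density computed by stat_bin()
# in place of the raw count.
p <- ggplot(diamonds, aes(carat, after_stat(density))) +
  geom_histogram(binwidth = 0.1) +
  facet_grid(color ~ .) +
  coord_cartesian(xlim = c(0, 3))  # zoom without dropping data

# Within each facet the bar areas (density * bin width) sum to 1.
d <- layer_data(p)
areas <- tapply(d$density * (d$xmax - d$xmin), d$PANEL, sum)
areas
```

Since every panel is normalised separately, the density version compares the shapes of the distributions across colors, while the count version also shows how many diamonds each color has.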

2.9 Translation

2.9.1 Translating between qplot and ggplot2

Conversion between qplot and ggplot2
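As a rough guide, several qplot() calls from section 2.2 can be rewritten with ggplot() as follows. This is a sketch of one possible translation, not an official mapping; the names p1 to p4 are ours.

```r
library(ggplot2)

# qplot(mpg, wt, data = mtcars)
p1 <- ggplot(mtcars, aes(mpg, wt)) + geom_point()

# qplot(mpg, wt, data = mtcars, colour = I("red"))
# I("red") in qplot corresponds to setting colour outside aes().
p2 <- ggplot(mtcars, aes(mpg, wt)) + geom_point(colour = "red")

# qplot(mpg, wt, data = mtcars, facets = vs ~ am)
p3 <- ggplot(mtcars, aes(mpg, wt)) + geom_point() + facet_grid(vs ~ am)

# qplot(factor(cyl), wt, data = mtcars, geom = c("boxplot", "jitter"))
# Each element of the geom vector becomes one layer, in the same order.
p4 <- ggplot(mtcars, aes(factor(cyl), wt)) + geom_boxplot() + geom_jitter()
```

The pattern is the same in each case: the data and the variables go into ggplot() and aes(), and every geom becomes an explicit layer.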

2.9.2 Translating between R base plot and ggplot2

Comparing ggplot2 and R Base Graphics

3. Grammar of Graphics

Reference: ggplot2

3.1 Basic Structure

The ggplot2 book, in "What is the grammar of graphics?", explains it as follows.

Data and Variables: the data to visualise, and which variables are used and how. (The book describes this as mapping variables to aesthetic attributes; below these are simply called "aesthetics".)
Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes that you can perceive.
Layers: the type of graph (geom), statistical processing, and details of shape.
Layers made up of geometric elements and statistical transformation. Geometric objects, geoms for short, represent what you actually see on the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise data in many useful ways. For example, binning and counting observations to create a histogram, or summarising a 2d relationship with a linear model.
Scales: how values become colour, size, shape, transformed numbers, and so on.
The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape. Scales draw a legend or axes, which provide an inverse mapping to make it possible to read the original data values from the plot.
Coordinate System: the coordinate system, axes, and guide lines.
A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to make it possible to read the graph. We normally use a Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
Faceting: adding information group by group.
A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.
Theme: details of display.
A theme which controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot. A good starting place is Tufte’s early works (Tufte 1990, 1997, 2001).
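The components listed above can be seen side by side in a single plot specification. The sketch below is only an illustration; the particular scale, facet variable, and theme are arbitrary choices of ours.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +  # data and aesthetic mappings
  geom_point() +                            # layer: point geom, identity stat
  scale_colour_brewer(palette = "Set1") +   # scale: how drv values become colours
  coord_cartesian() +                       # coordinate system (the default)
  facet_wrap(~ year) +                      # faceting: one panel per model year
  theme_minimal()                           # theme: non-data appearance
p
```

Leaving out the scale, coord, and theme lines gives a plot with the defaults, which is why most examples in these notes only state the data, the aesthetics, and the layers.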

Two things that the grammar of graphics does not do are noted.

It doesn’t suggest what graphics you should use to answer the questions you are interested in.
It does not describe interactivity: the grammar of graphics describes only static graphics and there is essentially no benefit to displaying them on a computer screen as opposed to a piece of paper.

3.2 Examples

A brief illustration using mpg.

mpg %>% select(displ, hwy, cyl, year) %>% head(10)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()

Data: mpg.
Aesthetic mapping: engine size (displ) mapped to x position, highway fuel economy (hwy) to y position.
Layer: points, which gives a scatter plot.

Omitting the names x and y, as below, gives exactly the same plot: the first aesthetic is x and the second is y.

ggplot(mpg, aes(displ, hwy)) +
  geom_point()

Exercise

What can you read from the graph above?
How could the plot below be made easier to read?

ggplot(mpg, aes(model, manufacturer)) + geom_point()

What are the data, aesthetics, and layers of the plots below?

ggplot(mpg, aes(cty, hwy)) + geom_point()

ggplot(diamonds, aes(carat, price)) + geom_point()

ggplot(economics, aes(date, unemploy)) + geom_line()

ggplot(mpg, aes(cty)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4. ggplot2 fundamentals

ggplot2 book: Introduction

4.1 Colour, size, shape and other aesthetic attributes

To add other aesthetics, write the following.

aes(displ, hwy, colour = class)
aes(displ, hwy, shape = drv)
aes(displ, hwy, size = cyl)

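Which colours, shapes, and sizes are actually used is handled by scales; if no scale is given, defaults are assigned. Colour and shape suit categorical variables, and size suits numerical variables. In practice, as the number of categories grows they become hard to tell apart.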

ggplot(mpg, aes(displ, cty, colour = class)) + 
  geom_point()

#library(gridExtra)
g1c <- ggplot(mpg, aes(displ, hwy, colour = class)) + 
  geom_point()
g1s <- ggplot(mpg, aes(displ, hwy, shape = drv)) + 
  geom_point()
g1z <- ggplot(mpg, aes(displ, hwy, size = cyl)) + 
  geom_point()
grid.arrange(g1c, g1s, g1z, ncol=3)

Instead of mapping a variable through a scale, a specific colour can also be set directly. Compare the two plots below.

#library(gridExtra)
g1r <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
g1b <- ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
grid.arrange(g1r, g1b, ncol=2)

For details, see Aesthetic specifications.
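The difference between the two plots can be read back from the layer data. In this sketch the names mapped and fixed are ours.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as data (a variable with one value),
# so the colour scale picks the colour and a legend appears.
mapped <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))

# Outside aes(): the points are simply drawn in blue, with no legend.
fixed <- ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

unique(layer_data(mapped)$colour)  # a colour chosen by the default scale
unique(layer_data(fixed)$colour)   # "blue"
```

In short, aes() is for mapping variables, and arguments outside aes() are for setting constant values.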

Exercise

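What happens if a categorical variable is mapped to size, or a numeric variable is mapped to shape?
What happens if two or more variables are added as aesthetics?
Which variables are related to fuel economy, and why?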

4.2 Facetting

Faceting is another way to bring a categorical variable into a plot, as an alternative to an aesthetic. There are grid and wrap variants; only a wrap example is given here.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  facet_wrap(~class)

When is it better to use facets, and when is it better to add an aesthetic?
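For comparison with the wrap example, a grid layout might look like the sketch below. The choice of drv for rows and cyl for columns is ours.

```r
library(ggplot2)

# facet_grid() lays panels out as rows ~ columns:
# one row per drive train, one column per number of cylinders.
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
p
```

facet_wrap() wraps a one-dimensional sequence of panels into rows and columns, while facet_grid() uses one variable for each direction, so some cells of the grid may be empty.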

Exercise

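What happens if the numeric variable hwy is used for faceting? What about cyl?
How can faceting be used to look at the relationship among three variables (fuel economy, engine size, and number of cylinders)?
From the help for facet_wrap, find out how to control the number of rows and columns.
In facet_wrap, what does the scales argument control?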

4.3 Plot geoms

What happens if geom_point is replaced with another geom?

geom_smooth() fits a smoother to the data and displays the smooth and its standard error.
geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of points.
geom_histogram() and geom_freqpoly() show the distribution of continuous variables.
geom_bar() shows the distribution of categorical variables.
geom_path() and geom_line() draw lines between the data points. A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction. Lines are typically used to explore how things change over time.

4.3.1 Adding a smoother to a plot

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The confidence interval is drawn as a band; when it is not wanted, use geom_smooth(se = FALSE).

method = "loess", the default for small n (see ?loess for details), uses a smooth local regression. The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).

#library(gridExtra)
gs02 <- ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(span = 0.2)

gs10 <- ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(span = 1)
  
grid.arrange(gs02, gs10, ncol=2)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

method = "gam" fits a generalised additive model provided by the mgcv package.
method = "lm" fits a linear model, giving the line of best fit.
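A sketch of the linear option, with lm() called directly for comparison. The formula is stated explicitly, and the names p and fit are ours.

```r
library(ggplot2)

# Straight line of best fit, without the confidence band.
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p

# The same model fitted directly; the drawn line follows these coefficients.
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)
```

Comparing the plot with coef(fit) is a quick way to confirm what the smoother is doing before moving on to more flexible methods.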

4.3.2 Boxplots and jittered points

Given a continuous numeric variable and a categorical variable, we look at whether the numeric variable varies with the categorical one.

ggplot(mpg, aes(drv, hwy)) + 
  geom_point()

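A plain scatterplot suffers from overplotting here, because many points fall on the same position. Three techniques help.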

Jittering, geom_jitter(), adds a little random noise to the data which can help avoid overplotting.

ggplot(mpg, aes(drv, hwy)) + geom_jitter()

Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of summary statistics.

ggplot(mpg, aes(drv, hwy)) + geom_boxplot()

Violin plots, geom_violin(), show a compact representation of the "density" of the distribution, highlighting the areas where more points are found.

ggplot(mpg, aes(drv, hwy)) + geom_violin()

geom_jitter() offers the same control over aesthetics as geom_point(): size, colour, and shape. For geom_boxplot() and geom_violin(), you can set the outline with colour and the interior with fill.
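
A minimal sketch of the difference between colour and fill on these two geoms. The colour names and object names are arbitrary choices for illustration.

```r
library(ggplot2)

# colour sets the outline, fill sets the interior.
p_box <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot(colour = "grey30", fill = "lightblue")

p_violin <- ggplot(mpg, aes(drv, hwy)) +
  geom_violin(colour = "grey30", fill = "lightblue")

p_box
p_violin
```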

4.3.3 Histograms and frequency polygons

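To look at a single numeric variable in detail, use a histogram or a frequency polygon. Both split the data into bins and count the observations in each bin; a histogram draws bars and a frequency polygon draws lines. The bin width is set with binwidth. By default the range is split into 30 bins (bins = 30).

Histogram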

ggplot(mpg, aes(hwy)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Frequency polygon

ggplot(mpg, aes(hwy)) + geom_freqpoly()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
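
The message above asks for a better binwidth. The following sketch compares two candidate values; the values 1 and 2.5 are assumptions chosen for illustration, and it is worth trying several before settling on one.

```r
library(ggplot2)

# hwy is recorded in whole miles per gallon, so a bin width of 1 shows
# every distinct value, while 2.5 gives a coarser summary.
p_fine   <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 1)
p_coarse <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2.5)

p_fine
p_coarse
```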

To compare the distributions of subgroups, map a categorical variable to colour or fill, or use faceting.

ggplot(mpg, aes(displ, colour = drv)) + 
  geom_freqpoly(binwidth = 0.5)

ggplot(mpg, aes(displ, fill = drv)) + 
  geom_histogram(binwidth = 0.5) + 
  facet_wrap(~drv, ncol = 1)

4.3.4 Bar charts

ggplot(mpg, aes(manufacturer)) + 
  geom_bar()

drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
)
#library(gridExtra)
gbi <- ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity")
gdrug <- ggplot(drugs, aes(drug, effect)) + geom_point()
  
grid.arrange(gbi, gdrug, ncol=2)

4.3.5 Time series with line and path plots

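Time series data are drawn as a line graph or a path graph. A line graph joins the points from left to right, in order of x; a path graph joins them in the order in which they appear in the data. In other words, a line graph is a path graph whose rows have first been sorted by x.

For the data used here, see the Help for economics (?economics).

Line graph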
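
The difference between the two geoms is easiest to see on data that are not sorted by x. The data frame zigzag below is a hypothetical toy example made up for this sketch.

```r
library(ggplot2)

# Hypothetical toy data: the rows are deliberately not sorted by x.
zigzag <- data.frame(x = c(1, 3, 2, 4), y = c(1, 2, 3, 4))

p_line <- ggplot(zigzag, aes(x, y)) + geom_line()  # connects in order of x
p_path <- ggplot(zigzag, aes(x, y)) + geom_path()  # connects in row order

p_line
p_path
```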

#library(gridExtra)
geup <- ggplot(economics, aes(date, unemploy / pop)) +
  geom_line()
geu <- ggplot(economics, aes(date, uempmed)) +
  geom_line()
grid.arrange(geup, geu, ncol=2)

Path graph

#library(gridExtra)
geupp <- ggplot(economics, aes(unemploy / pop, uempmed)) + 
  geom_path() +
  geom_point()
year <- function(x) as.POSIXlt(x)$year + 1900
geuppp <- ggplot(economics, aes(unemploy / pop, uempmed)) + 
  geom_path(colour = "grey50") +
  geom_point(aes(colour = year(date)))
grid.arrange(geupp, geuppp, ncol=2)

Exercises

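What is the problem with the plot created by ggplot(mpg, aes(cty, hwy)) + geom_point()? Which geom is more suitable?
ggplot(mpg, aes(class, hwy)) + geom_boxplot() orders the classes alphabetically. How could the order be made more informative? Rather than reordering the factor by hand, you can use ggplot(mpg, aes(reorder(class, hwy), hwy)) + geom_boxplot(). See the Help for reorder.
Explore the distribution of carat in the diamonds data. Which binwidth is most appropriate?
How can the price of diamonds be compared across cut?
You have learned several ways to compare the distributions of subgroups: geom_violin(), geom_freqpoly() with the colour aesthetic, and geom_histogram() with faceting. What are the strengths and weaknesses of each?
Read the documentation for geom_bar(). What does the weight aesthetic do?
Using what you have learned, visualise each of the following pairs of variables: model and manufacturer, trans and class, cyl and trans.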

4.3.6 Modifying the axes

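Axis labels can be handled in three basic ways, shown below: left to the default, set explicitly with xlab() and ylab(), or suppressed with NULL.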

#library(gridExtra)
glabs0 <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3)

glabsxy <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3) + 
  xlab("city driving (mpg)") + 
  ylab("highway driving (mpg)")

glabsnull <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3) + 
  xlab(NULL) + 
  ylab(NULL)
  
grid.arrange(glabs0, glabsxy, glabsnull, ncol=3)

xlim() and ylim() set the limits of the axes. Values outside the limits are converted to NA and dropped with a warning. na.rm = TRUE suppresses that warning, so use it with care.

#library(gridExtra)
gjitter <-ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25)

gjitterlim <- ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25) + 
  xlim("f", "r") + 
  ylim(20, 30)

gjitterlimnull <- ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25, na.rm = TRUE) + 
  ylim(NA, 30)
  
grid.arrange(gjitter, gjitterlim, gjitterlimnull, ncol=3)

## Warning: Removed 140 rows containing missing values (geom_point).

4.3.7 Output

A plot can be stored in a variable and drawn later.

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point()

print(p)

It can also be saved to disk as an image file.

# Save png to disk
ggsave("plot.png", p, width = 5, height = 5)

summary() gives a brief description of its structure.

summary(p)

## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
##   class [234x11]
## mapping:  x = ~displ, y = ~hwy, colour = ~factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet, gg>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity

A plot object can also be saved to disk and read back later.

saveRDS(p, "plot.rds")
q <- readRDS("plot.rds")

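5. Toolbox

5.1 Individual geoms

ggplot2-book: Individual geoms

Each of these geoms has a name and is useful on its own, and each is also a building block for more complex geoms. All of them are two-dimensional and require both x and y aesthetics, and all understand the colour and size aesthetics. The filled geoms (bar, tile, and polygon) also understand fill.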

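geom_area(): draws an area plot, which is a line plot filled down to the y-axis. Multiple groups are stacked on top of each other.
geom_bar(stat = "identity"): draws a bar chart. stat = "identity" is needed because the default stat counts the observations in each group; the identity stat leaves the data unchanged. Multiple bars at the same position are stacked on top of one another.
geom_line(): draws a line plot, joining the points of each group. geom_line() connects points from left to right; geom_path() connects them in the order in which they appear in the data. Both understand the linetype aesthetic (for example, dotted lines).
geom_point(): draws a scatterplot and understands the shape aesthetic.
geom_polygon(): draws polygons, which are filled paths. Each vertex of the polygon needs its own row in the data. It is often useful to merge a data frame of polygon coordinates with the data just before plotting.
geom_rect(), geom_tile(), geom_raster(): all draw rectangles. geom_rect() is parameterised by the four corners of the rectangle, xmin, ymin, xmax, and ymax; geom_tile() by the centre of the rectangle and its width and height; geom_raster() is a special case of geom_tile() for tiles that are all the same size.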
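
The two parameterisations of a rectangle can be compared with a small sketch. The data frame boxes and its columns w and h are hypothetical names introduced only for this example.

```r
library(ggplot2)

# Hypothetical rectangles given by centre (x, y) and size (w, h).
boxes <- data.frame(x = c(1, 3), y = c(1, 2), w = c(1, 2), h = c(1, 0.5))

# geom_tile(): centre plus width and height.
p_tile <- ggplot(boxes, aes(x, y, width = w, height = h)) +
  geom_tile()

# geom_rect(): the same rectangles given by their corner coordinates.
p_rect <- ggplot(boxes) +
  geom_rect(aes(xmin = x - w / 2, xmax = x + w / 2,
                ymin = y - h / 2, ymax = y + h / 2))

p_tile
p_rect
```

Both plots should show the same two rectangles; which form is more convenient depends on how the data are stored.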

Note that the axis ranges differ from plot to plot.

All the examples below use the three points $(x,y) = (3,2), (1,4), (5,6)$.

#library(gridExtra)
df <- data.frame(
  x = c(3, 1, 5), 
  y = c(2, 4, 6), 
  label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) + 
  labs(x = NULL, y = NULL) + # Hide axis label
  theme(plot.title = element_text(size = 12)) # Shrink plot title
gp1 <- p + geom_point() + ggtitle("point")
gp2 <- p + geom_text() + ggtitle("text")
gp3 <- p + geom_bar(stat = "identity") + ggtitle("bar")
gp4 <- p + geom_tile() + ggtitle("raster")
grid.arrange(gp1, gp2, gp3, gp4, ncol=4)

gp5 <- p + geom_line() + ggtitle("line")
gp6 <- p + geom_area() + ggtitle("area")
gp7 <- p + geom_path() + ggtitle("path")
gp8 <- p + geom_polygon() + ggtitle("polygon")
grid.arrange(gp5, gp6, gp7, gp8, ncol=4)

Exercises

Which geom would you use to draw each of the following plots?

Scatterplot
Line chart
Histogram
Bar chart
Pie chart

What is the difference between geom_path() and geom_polygon()? What is the difference between geom_path() and geom_line()?
What low-level geoms are used to draw geom_smooth(), geom_boxplot(), and geom_violin()?

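5.2 Collective geoms

ggplot2-book: Collective geoms

Geoms can be roughly divided into individual and collective geoms. An individual geom draws a distinct graphical object for each observation (row), for example one point per row. A collective geom displays several observations with one geometric object. This may be the result of a statistical summary, like a boxplot, or it may be fundamental to the geom, like a polygon. Lines and paths fall somewhere in between: each line is made up of straight segments, and each segment represents two points. Which observations are drawn together is controlled by the group aesthetic.

By default, the groups are formed from the discrete variables in the plot. This often partitions the data correctly, but when it does not, the grouping must be specified explicitly.

The three typical cases are discussed below, using longitudinal data on heights.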

nlme Package

Description: Fit and compare Gaussian linear and nonlinear mixed-effects models.
web: nlme: Linear and Nonlinear Mixed Effects Models
Manual
Homepage

Oxboys dataset in nlme: Heights of Boys in Oxford

The heights (height) and centered ages (age) of 26 boys (Subject), measured on nine occasions (Occasion)

Description: The Oxboys data frame has 234 rows and 4 columns.
Format: this data frame contains the following columns:
- Subject: an ordered factor giving a unique identifier for each boy in the experiment
- age: a numeric vector giving the standardized age (dimensionless)
- height: a numeric vector giving the height of the boy (cm)
- Occasion: an ordered factor, the result of converting age from a continuous variable to a count so these slightly unbalanced data can be analyzed as balanced.

# library(nlme)
data(Oxboys, package = "nlme")
str(Oxboys)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   234 obs. of  4 variables:
##  $ Subject : Ord.factor w/ 26 levels "10"<"26"<"25"<..: 13 13 13 13 13 13 13 13 13 5 ...
##  $ age     : num  -1 -0.7479 -0.463 -0.1643 -0.0027 ...
##  $ height  : num  140 143 145 147 148 ...
##  $ Occasion: Ord.factor w/ 9 levels "1"<"2"<"3"<"4"<..: 1 2 3 4 5 6 7 8 9 1 ...
##  - attr(*, "formula")=Class 'formula'  language height ~ age | Subject
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ y: chr "Height"
##   ..$ x: chr "Centered age"
##  - attr(*, "units")=List of 1
##   ..$ y: chr "(cm)"
##  - attr(*, "FUN")=function (x)  
##   ..- attr(*, "source")= chr "function (x) max(x, na.rm = TRUE)"
##  - attr(*, "order.groups")= logi TRUE

summary(Oxboys)

##     Subject         age               height         Occasion 
##  10     :  9   Min.   :-1.00000   Min.   :126.2   1      :26  
##  26     :  9   1st Qu.:-0.46300   1st Qu.:143.8   2      :26  
##  25     :  9   Median :-0.00270   Median :149.5   3      :26  
##  9      :  9   Mean   : 0.02263   Mean   :149.5   4      :26  
##  2      :  9   3rd Qu.: 0.55620   3rd Qu.:155.5   5      :26  
##  6      :  9   Max.   : 1.00550   Max.   :174.8   6      :26  
##  (Other):180                                      (Other):78

head(Oxboys)

5.2.1 Multiple groups, one aesthetic

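We often want to split the data into groups and draw every group in the same way, so that individual subjects can be told apart without being identified. The plot below shows the growth of each boy as a separate line. With many subjects the lines become tangled, which is why such plots are often called spaghetti plots.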

ggplot(Oxboys, aes(age, height, group = Subject)) + 
  geom_point() + 
  geom_line()

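If the group is not specified, the result is a useless plot like the following, in which a single line tries to connect every observation.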

ggplot(Oxboys, aes(age, height)) + 
  geom_point() + 
  geom_line()

To group by more than one variable, use interaction(), for example aes(group = interaction(school_id, student_id)).
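
A minimal sketch of grouping by two variables at once. The data frame scores and all of its columns and values are hypothetical, invented only to give interaction() something to work on.

```r
library(ggplot2)

# Hypothetical data: student_id repeats across schools, so neither
# variable alone identifies one student's series of scores.
scores <- data.frame(
  school_id  = rep(c("s1", "s2"), each = 6),
  student_id = rep(rep(c("a", "b"), each = 3), times = 2),
  term       = rep(1:3, times = 4),
  score      = c(60, 65, 70, 55, 58, 64, 72, 70, 75, 50, 57, 61)
)

# One line per (school, student) combination.
p_groups <- ggplot(scores, aes(term, score,
                               group = interaction(school_id, student_id))) +
  geom_line()

p_groups
```

Grouping by student_id alone would join student "a" of one school to student "a" of the other, which is rarely what is wanted.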

5.2.2 Different groups on different layers

With the grouping set in ggplot(), a trend line is fitted to each boy separately:

ggplot(Oxboys, aes(age, height, group = Subject)) + 
  geom_line() + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

This is not what we wanted. Setting the grouping aesthetic in geom_line() instead of in ggplot() gives a single trend line fitted to all the boys:

ggplot(Oxboys, aes(age, height)) + 
  geom_line(aes(group = Subject)) + 
  geom_smooth(method = "lm", size = 2, se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

5.2.3 Overriding the default grouping

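Sometimes the x axis is discrete but we still want to draw lines that connect observations across groups. This is the approach taken in interaction plots, profile plots, and parallel coordinate plots. For example, a boxplot of height at each measurement occasion looks like this: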

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot()

Here the only discrete variable is Occasion, so one boxplot is drawn for each value of x. Simply adding geom_line() does not work, because the lines are drawn within each Occasion rather than across Occasions.

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot() +
  geom_line(colour = "#3366FF", alpha = 0.5)

In this case, override the grouping for that layer so that one line is drawn for each boy.

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot() +
  geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)

5.2.4 Matching aesthetics to graphic objects

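A final issue with collective geoms is how the aesthetics of the individual observations are mapped to the aesthetics of the whole object. What happens when different aesthetic values are mapped to a single geometric element?

Each collective geom handles this in its own way. Lines and paths work on a "first value" principle: each segment is defined by two observations, and when colour is mapped, the segment is drawn with the value of its first observation.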

df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))

ggplot(df, aes(x, y, colour = factor(colour))) + 
  geom_line(aes(group = 1), size = 2) +
  geom_point(size = 5)

ggplot(df, aes(x, y, colour = colour)) + 
  geom_line(aes(group = 1), size = 2) +
  geom_point(size = 5)

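Check this in the two examples above: the first maps colour to a discrete variable, the second to a continuous variable. For a continuous variable, a smooth gradient along the line can be approximated by interpolating extra points, as in the next example.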

xgrid <- with(df, seq(min(x), max(x), length = 50))
interp <- data.frame(
  x = xgrid,
  y = approx(df$x, df$y, xout = xgrid)$y,
  colour = approx(df$x, df$colour, xout = xgrid)$y  
)
ggplot(interp, aes(x, y, colour = colour)) + 
  geom_line(size = 2) +
  geom_point(data = df, size = 5)

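A further limitation of lines and paths is that the line type must be constant over each individual line. For all other collective geoms, which are more complex, the aesthetics of the individual components are used only when they are all the same within a group; otherwise the default value is used. This matters most for continuous variables. A discrete variable becomes part of the grouping by default, which splits the collective geom into smaller pieces. That works particularly well for bar and area plots, because stacking the pieces reproduces the shape of the ungrouped data.

Mapping fill to a continuous variable does not work in the same way, as the following examples show, so some care is needed.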

ggplot(mpg, aes(class)) + 
  geom_bar()

ggplot(mpg, aes(class, fill = drv)) + 
  geom_bar()

ggplot(mpg, aes(class, fill = hwy)) + 
  geom_bar()

ggplot(mpg, aes(class, fill = hwy, group = hwy)) + 
  geom_bar()

To control the colours in plots like these properly, split the continuous values into intervals explicitly where needed.
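
One way to bin a continuous variable explicitly is ggplot2's cut_interval(). The sketch below assumes three intervals of equal range, which is an arbitrary choice; cut_width() or cut_number() may suit other data better.

```r
library(ggplot2)

# Split hwy into three intervals of equal range; each interval becomes
# one discrete fill level, so the bars stack cleanly.
p_binned <- ggplot(mpg, aes(class, fill = cut_interval(hwy, n = 3))) +
  geom_bar() +
  labs(fill = "hwy")

p_binned
```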

5.2.5 Exercises

What aesthetic must be added to draw a boxplot of hwy for each value of cyl without converting cyl to a factor?
Modify the following plot so that one boxplot is drawn for each integer value of displ.

ggplot(mpg, aes(displ, cty)) + 
  geom_boxplot()

## Warning: Continuous y aesthetic -- did you forget aes(group=...)?

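In the comparison of continuous and discrete colours on a line, the discrete example needed aes(group = 1). Why? What happens if it is omitted? What is the difference between aes(group = 1) and aes(group = 2)? Explain.
How many bars does each of the following plots show?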

ggplot(mpg, aes(drv)) + 
  geom_bar()

ggplot(mpg, aes(drv, fill = hwy, group = hwy)) + 
  geom_bar()

library(dplyr)  
mpg2 <- mpg %>% arrange(hwy) %>% mutate(id = seq_along(hwy)) 
ggplot(mpg2, aes(drv, fill = hwy, group = id)) + 
  geom_bar()

(Hint: try adding an outline around each bar with colour = “white”)

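Install the babynames package, which contains data on popular baby names in the United States. Run the following code and fix the resulting graph. Why is the graph unsatisfactory as it stands?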

library(babynames)
hadley <- dplyr::filter(babynames, name == "Hadley")
ggplot(hadley, aes(year, n)) + 
  geom_line()

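5.3 Statistical summaries

ggplot2-book: Statistical summaries

5.3.1 Revealing uncertainty

To display uncertainty, that is, the spread of the data, there are several geoms. Which one to use depends on whether x is discrete or continuous, and on whether you want to show only the range or also the centre of the interval.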

Discrete x, range: geom_errorbar(), geom_linerange()
Discrete x, range & center: geom_crossbar(), geom_pointrange()
Continuous x, range: geom_ribbon()
Continuous x, range & center: geom_smooth(stat = “identity”)

Each of these geoms takes the range of y for each value of x through the ymin and ymax aesthetics.

#library(gridExtra)
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))

base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
gbcb <- base + geom_crossbar()
gbpr <- base + geom_pointrange()
gbgs <- base + geom_smooth(stat = "identity")
grid.arrange(gbcb,gbpr, gbgs, ncol=3)

#library(gridExtra)
gbeb <- base + geom_errorbar()
gblr <- base + geom_linerange()
gbr <- base + geom_ribbon()
grid.arrange(gbeb, gblr, gbr, ncol=3)

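There are many measures of spread (standard deviation, standard error, and so on), so you must decide which one is appropriate and compute it yourself.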
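
As one possible choice, the sketch below computes the mean of hwy and its standard error for each drv with dplyr and passes them to geom_pointrange(). The use of the standard error (rather than, say, the standard deviation) and the names hwy_by_drv and p_se are assumptions made for illustration.

```r
library(ggplot2)
library(dplyr)

# Mean and standard error of hwy for each drive type.
hwy_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(
    mean = mean(hwy),
    se   = sd(hwy) / sqrt(n())   # standard error of the mean
  )

p_se <- ggplot(hwy_by_drv, aes(drv, mean, ymin = mean - se, ymax = mean + se)) +
  geom_pointrange() +
  ylab("mean hwy, plus or minus 1 SE")

p_se
```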

See: R for Data Science

5.3.2 Weighted data

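When the data are aggregated, so that each row represents several observations, we need a way to take a weighting variable into account.

The examples use midwest, a data set bundled with ggplot2 that holds demographic information from the 2000 US census for counties in the Midwest. It consists mostly of percentages for various groups (percent white, percent below the poverty line, percent with a college degree) together with the area, total population, and population density of each county.

The data can be weighted in several ways, for example: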

Nothing, to look at numbers of counties.
Total population, to work with absolute numbers.
Area, to investigate geographic effects. (This isn’t useful for midwest, but would be if we had variables like percentage of farmland.)

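The choice of weighting variable changes the result considerably. There are two ways to express weights through aesthetics. For simple geoms such as lines and points, use the size aesthetic.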

#library(gridExtra)
# Unweighted
gwp <- ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point()

# Weight by population
gwps <- ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point(aes(size = poptotal / 1e6)) + 
  scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))

grid.arrange(gwp, gwps, ncol=2)

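For more complicated geoms that involve a statistical transformation, specify the weights with the weight aesthetic. The weights are passed on to the statistical summary function. Weights are supported wherever they make sense: smoothers, quantile regressions, boxplots, histograms, and density plots. The weighting variable is not shown in a legend. See the examples below.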

#library(gridExtra)
# Unweighted
gwpl <- ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point() + 
  geom_smooth(method = lm, size = 1)
#> `geom_smooth()` using formula 'y ~ x'

# Weighted by population
gwpp <- ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point(aes(size = poptotal / 1e6)) + 
  geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
  scale_size_area(guide = "none")
#> `geom_smooth()` using formula 'y ~ x'
grid.arrange(gwpl, gwpp, ncol=2)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Below, a histogram of the poverty rate is drawn twice: unweighted, so that it counts counties, and weighted by population.

#library(gridExtra)
gphc <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1) + 
  ylab("Counties")

gppp <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal / 1e3), binwidth = 1) +  # weight in thousands, to match the axis label
  ylab("Population (1000s)")
grid.arrange(gphc, gppp, ncol=2)

5.3.3 Diamonds data

See: Diamonds Data

5.3.4 Displaying distributions

#library(gridExtra)
gddh <- ggplot(diamonds, aes(depth)) + 
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gddhx <- ggplot(diamonds, aes(depth)) + 
  geom_histogram(binwidth = 0.1) + 
  xlim(55, 70)
grid.arrange(gddh, gddhx, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 45 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

#library(gridExtra)
gdf <- ggplot(diamonds, aes(depth)) + 
  geom_freqpoly(aes(colour = cut), binwidth = 0.1, na.rm = TRUE) +
  xlim(58, 68) + 
  theme(legend.position = "none")
gdhxt <- ggplot(diamonds, aes(depth)) + 
  geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill",
    na.rm = TRUE) +
  xlim(58, 68) + 
  theme(legend.position = "none")
grid.arrange(gdf, gdhxt, ncol=2)

#library(gridExtra)
gdbp <- ggplot(diamonds, aes(clarity, depth)) + 
  geom_boxplot()
gdbpx <- ggplot(diamonds, aes(carat, depth)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1))) + 
  xlim(NA, 2.05)
grid.arrange(gdbp, gdbpx, ncol=2)

## Warning: Removed 997 rows containing missing values (stat_boxplot).

#library(gridExtra)
dcdv <- ggplot(diamonds, aes(clarity, depth)) + 
  geom_violin()
dcdvx <- ggplot(diamonds, aes(carat, depth)) + 
  geom_violin(aes(group = cut_width(carat, 0.1))) + 
  xlim(NA, 2.05)
grid.arrange(dcdv, dcdvx, ncol=2)

## Warning: Removed 997 rows containing non-finite values (stat_ydensity).

Exercises

What does changing binwidth reveal about the distribution of carat?
What does a histogram of price show?
What can be said about the relationship between clarity and price?
Overlay a frequency polygon and density plot of depth. What computed variable do you need to map to y to make the two plots comparable? (You can either modify geom_freqpoly() or geom_density().)

5.3.5 Dealing with overplotting

#library(gridExtra)
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
gnp <- norm + geom_point()
gnps <- norm + geom_point(shape = 1) # Hollow circles
gnpsp <- norm + geom_point(shape = ".") # Pixel sized
grid.arrange(gnp, gnps, gnpsp, ncol=3)

#library(gridExtra)
gnpa3 <- norm + geom_point(alpha = 1 / 3)
gnpa5 <- norm + geom_point(alpha = 1 / 5)
gnpa10 <- norm + geom_point(alpha = 1 / 10)
grid.arrange(gnpa3, gnpa5, gnpa10, ncol=3)

#library(gridExtra)
gnb2 <- norm + geom_bin2d()
gnb10 <- norm + geom_bin2d(bins = 10)
grid.arrange(gnb2, gnb10, ncol=2)

#library(gridExtra)
gnh <- norm + geom_hex()
gnh10 <- norm + geom_hex(bins = 10)
grid.arrange(gnh, gnh10, ncol=2)

5.3.6 Statistical summaries

#library(gridExtra)
gdcb <- ggplot(diamonds, aes(color)) + 
  geom_bar()

gdbs <- ggplot(diamonds, aes(color, price)) + 
  geom_bar(stat = "summary_bin", fun = mean)
grid.arrange(gdcb, gdbs, ncol=2)

#library(gridExtra)
gtd <- ggplot(diamonds, aes(table, depth)) + 
  geom_bin2d(binwidth = 1, na.rm = TRUE) + 
  xlim(50, 70) + 
  ylim(50, 70)

gtdz <- ggplot(diamonds, aes(table, depth, z = price)) + 
  geom_raster(binwidth = 1, stat = "summary_2d", fun = mean, 
    na.rm = TRUE) + 
  xlim(50, 70) + 
  ylim(50, 70)
grid.arrange(gtd, gtdz, ncol=2)

## Warning: Raster pixels are placed at uneven horizontal intervals and will be
## shifted. Consider using geom_tile() instead.

## Warning: Raster pixels are placed at uneven vertical intervals and will be
## shifted. Consider using geom_tile() instead.

5.3.7 Surfaces

ggplot(faithfuld, aes(eruptions, waiting)) + 
  geom_contour(aes(z = density, colour = ..level..))

ggplot(faithfuld, aes(eruptions, waiting)) + 
  geom_raster(aes(fill = density))

small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
ggplot(small, aes(eruptions, waiting)) + 
  geom_point(aes(size = density), alpha = 1/3) + 
  scale_size_area()

5.4 Maps

ggplot2-book: Maps

5.4.1 Polygon maps

5.4.2 Simple features maps

5.4.3 Map projections

5.4.4 Working with sf data

5.4.5 Raster maps

5.4.6 Data sources

5.5 Annotations

ggplot2-book: Annotations

5.5.1 Titles

This chapter of the book had no content as of April 22, 2020.

5.5.2 Labels

5.5.3 Text labels

5.5.4 Building custom annotations

5.5.5 Direct labelling

5.6 Arranging plots

ggplot2-book: Arranging plots

This chapter of the book had almost no content as of April 22, 2020.

See "Arrangements" (77.1) on this page.

6. Grammar

66. Interactive Graphics

66.1 htmlwidgets

htmlwidgets: web visualization tools providing graphics honed for specific purposes
- leaflet for maps
- dygraph for time series
- networkD3 for networks.

66.2 ggvis

ggvis : in progress.
Work on ggvis, the successor to ggplot2, started in 2014.
ggvis extends the idea of ggplot2 to the web and interactive graphics.
ggvis is interactive and dynamic, so plots automatically re-draw themselves when the underlying data or plot specification changes.

77. Miscellaneous

77.1 Arrangements

With base::plot and similar functions, plots are arranged in m rows and n columns with par(mfrow = c(m, n)).

par(family= "HiraKakuPro-W3")
op <- par(mfrow = c(1,2))
barplot(BOD$demand, names.arg = BOD$Time, main = "棒グラフのラベルを指定", xlab = "時間経過（日）", ylab = "酸素の必要量 （mg/l）")
hist(mtcars$mpg, breaks = 10)

par(op)

gridExtra

gridExtra: Miscellaneous Functions for “Grid” Graphics
Provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
Manual
vignette

ggpubr

ggpubr: ‘ggplot2’ Based Publication Ready Plots
The ‘ggplot2’ package is excellent and flexible for elegant data visualization in R. However the default generated plots requires some formatting before we can send them for publication. Furthermore, to customize a ‘ggplot’, the syntax is opaque and this raises the level of difficulty for researchers with no advanced R programming skills. ‘ggpubr’ provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.
Manual

cowplot

cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’ (https://cran.r-project.org/web/packages/cowplot/index.html)
Provides various features that help with creating publication-quality figures with ‘ggplot2’, such as a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots and or mix plots with images. The package was originally written for internal use in the Wilke lab, hence the name (Claus O. Wilke’s plot package). It has also been used extensively in the book Fundamentals of Data Visualization.
Manual
vignette
Book: Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures (Amazon)
- Online

See: Using Facets

Laying out multiple plots on a page

qplot(displ, hwy, data = mpg, facets = . ~year) + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

References

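77.2 Japanese Environments

The details depend on the operating system, so no single recipe can be given at present. The notes below are kept as a memo.

Showtext Package

I recommend RStudio.cloud to beginners, but Japanese labels are not displayed there, so a shortcut to a solution is recorded here.
If the Japanese labels of a graph are not displayed, use the following.

showtext Package
showtext Manual
showtext: Using Fonts More Easily in R Graphs (vignette)
Using system fonts in R graphs
Prof. Haruhiko Okumura's page (奥村晴彦)
To use showtext in R Markdown:
- Install the showtext package: install.packages("showtext")
- Add html_document: fig_retina: 1 to the YAML header (without it the text is too small)
- Add fig.showtext=TRUE to the options of the code chunk: {r fig.showtext=TRUE}
The CJK (Chinese, Japanese, and Korean) font WenQuanYi Micro Hei is bundled with the package and is used.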

macOS

Base Plot: par(family = "HiraKakuProN-W3")
ggplot2: theme_gray(base_family = "HiraKakuPro-W3")

88. DATA

88.1 Built-In Data

BOD

Biochemical Oxygen Demand: The BOD data frame has 6 rows and 2 columns giving the biochemical oxygen demand versus time in an evaluation of water quality.
Time A numeric vector giving the time of the measurement (days).
demand A numeric vector giving the biochemical oxygen demand (mg/l).

str(BOD)

## 'data.frame':    6 obs. of  2 variables:
##  $ Time  : num  1 2 3 4 5 7
##  $ demand: num  8.3 10.3 19 16 15.6 19.8
##  - attr(*, "reference")= chr "A1.4, p. 270"

head(BOD)

cars

str(cars)

## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

head(cars)

longley

str(longley)

## 'data.frame':    16 obs. of  7 variables:
##  $ GNP.deflator: num  83 88.5 88.2 89.5 96.2 ...
##  $ GNP         : num  234 259 258 285 329 ...
##  $ Unemployed  : num  236 232 368 335 210 ...
##  $ Armed.Forces: num  159 146 162 165 310 ...
##  $ Population  : num  108 109 110 111 112 ...
##  $ Year        : int  1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
##  $ Employed    : num  60.3 61.1 60.2 61.2 63.2 ...

head(longley)

midwest

str(midwest)

## tibble [437 × 28] (S3: tbl_df/tbl/data.frame)
##  $ PID                 : int [1:437] 561 562 563 564 565 566 567 568 569 570 ...
##  $ county              : chr [1:437] "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
##  $ state               : chr [1:437] "IL" "IL" "IL" "IL" ...
##  $ area                : num [1:437] 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
##  $ poptotal            : int [1:437] 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
##  $ popdensity          : num [1:437] 1271 759 681 1812 324 ...
##  $ popwhite            : int [1:437] 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
##  $ popblack            : int [1:437] 1702 3496 429 127 547 50 1 111 16 16559 ...
##  $ popamerindian       : int [1:437] 98 19 35 46 14 65 8 30 8 331 ...
##  $ popasian            : int [1:437] 249 48 16 150 5 195 15 61 23 8033 ...
##  $ popother            : int [1:437] 124 9 34 1139 6 221 0 84 6 1596 ...
##  $ percwhite           : num [1:437] 96.7 66.4 96.6 95.3 90.2 ...
##  $ percblack           : num [1:437] 2.575 32.9 2.862 0.412 9.373 ...
##  $ percamerindan       : num [1:437] 0.148 0.179 0.233 0.149 0.24 ...
##  $ percasian           : num [1:437] 0.3768 0.4517 0.1067 0.4869 0.0857 ...
##  $ percother           : num [1:437] 0.1876 0.0847 0.2268 3.6973 0.1028 ...
##  $ popadults           : int [1:437] 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
##  $ perchsd             : num [1:437] 75.1 59.7 69.3 75.5 68.9 ...
##  $ percollege          : num [1:437] 19.6 11.2 17 17.3 14.5 ...
##  $ percprof            : num [1:437] 4.36 2.87 4.49 4.2 3.37 ...
##  $ poppovertyknown     : int [1:437] 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
##  $ percpovertyknown    : num [1:437] 96.3 99.1 95 98.5 82.5 ...
##  $ percbelowpoverty    : num [1:437] 13.15 32.24 12.07 7.21 13.52 ...
##  $ percchildbelowpovert: num [1:437] 18 45.8 14 11.2 13 ...
##  $ percadultpoverty    : num [1:437] 11.01 27.39 10.85 5.54 11.14 ...
##  $ percelderlypoverty  : num [1:437] 12.44 25.23 12.7 6.22 19.2 ...
##  $ inmetro             : int [1:437] 0 0 0 1 0 0 0 0 0 1 ...
##  $ category            : chr [1:437] "AAR" "LHR" "AAR" "ALU" ...

head(midwest)

mtcars

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

head(mtcars)

pressure

str(pressure)

## 'data.frame':    19 obs. of  2 variables:
##  $ temperature: num  0 20 40 60 80 100 120 140 160 180 ...
##  $ pressure   : num  0.0002 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...

head(pressure)

Titanic

str(Titanic)

##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"

head(as.data.frame(Titanic))

ToothGrowth

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

head(ToothGrowth)

99. References

99.1 Base Graphics

Website: R Study Group
- Week 4: a detailed explanation of the options of base graphics.
- Base graphics have largely been superseded by ggplot2 and similar packages, but remain important as a reference.

99.2 R for Data Science

Website: r4ds
Github: hadley/r4ds
Book: “R for Data Science : Import, Tidy, Transform, Visualize, and Model Data”, Hadley Wickham & Garrett Grolemund, O’Reilly Media, 2016
Japanese translation: 「Rではじめるデータサイエンス」, O’Reilly Japan, 2017

99.3 R Graphics Cookbook, 2nd edition

Website: R Graphics Cookbook, 2nd edition by Winston Chang
Github: wch/rgcookbook
Book: “R Graphics Cookbook, 2nd edition - Practical Recipes for Visualizing Data”, O’Reilly, November 2018, pp.444
R Package: gcookbook: Data for “R Graphics Cookbook”.
- Manual: Package ‘gcookbook’
Japanese translation: 「R グラフィックスクックブック - ggplot2 によるグラフ作成のレシピ集（第二版）」, O’Reilly Japan

99.4 ggplot2

Website: ggplot2: Elegant Graphics for Data Analysis: The online version of work-in-progress 3rd edition of “ggplot2: elegant graphics for data analysis”
Github: hadley/ggplot2-book
Book: “ggplot2 - Elegant Graphics for Data Analysis”, by Hadley Wickham, Springer, 2016
Japanese translation of the 2009 edition: 「ggplot2 グラフィックスのためのプログラミング - ggplot2入門」, Maruzen Publishing. It is dated, and many of its examples no longer run as written.

ggplot2 as a part of `tidyverse`

Website: An implementation of the Grammar of Graphics in R
github: tidyverse/ggplot2

ggplot2 Package

ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics
Manual:
vignette: Extending ggplot2
vignette: Using ggplot2 in packages
vignette: Aesthetic specifications: Colour and fill, Lines, Polygons, Point, Text.

99.5 R for Everyone, 2nd edition

Book: “R for Everyone, 2nd edition - Advanced Analytics and Graphics”, by Jared P. Lander, Addison-Wesley, 2017
Japanese translation: 「みんなのR 第2版」, Mynavi Publishing, 2018
Website of Jared Lander: https://www.jaredlander.com

99.6 Exploratory Data Analysis with R

Website: Exploratory Data Analysis with R, Roger Peng

99.7 Data Visualization with R

Website: Data Visualization with R, Claudia A Engel, 2019
Github: MingChen0919/data-visualization-with-r
