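Data visualization is one of the cores of data science. It calls for expressive and communication skills as well as basic technique, and it is an important part of programming in R. We learn it step by step and accumulate examples as we go. References are listed at the end.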
# message = FALSE
library(tidyverse)
## ─ Attaching packages ───────────────────────────── tidyverse 1.3.0 ─
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ─ Conflicts ─────────────────────────────── tidyverse_conflicts() ─
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Before studying ggplot2, we outline plotting in base R. For reference, the equivalent plots made with ggplot2 are shown alongside.
We also examine the examples that ship with the help pages.
For details on base R plotting, see R Study Group, Week 4.
Two-dimensional plots. In base R, two vectors are enough. In ggplot2, you normally use two columns of a data frame.
plot(x, y, ...)
Assign the two vectors to x and y.
plot(mtcars$wt, mtcars$mpg)
# ggplot2
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
The code below, which passes vectors as arguments, also works, but it is not recommended.
ggplot(data = NULL, aes(x = mtcars$wt, y = mtcars$mpg)) +
geom_point()
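If the data start out as separate vectors, one option (a sketch, not the only way) is to collect them in a data frame first and then map columns by name. The names wt_vec, mpg_vec, and df are made up for this illustration.

```r
library(ggplot2)

# Hypothetical stand-alone vectors (copied out of mtcars for illustration)
wt_vec  <- mtcars$wt
mpg_vec <- mtcars$mpg

# Collect the vectors in a data frame so aes() can refer to column names
df <- data.frame(wt = wt_vec, mpg = mpg_vec)

p <- ggplot(df, aes(x = wt, y = mpg)) +
  geom_point()
print(p)
```

Keeping the data in a data frame means later layers and facets can reuse the same columns.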
plot(x, y, type = "l")
Add a second series to the graph.
plot(pressure$temperature, pressure$pressure, type = "l")
points(pressure$temperature, pressure$pressure)
lines(pressure$temperature, pressure$pressure/2, col = "red")
points(pressure$temperature, pressure$pressure/2, col = "red")
In ggplot2, you add layers one at a time.
ggplot(pressure) +
geom_line(aes(x = temperature, y = pressure), color = "black") +
geom_point(aes(x = temperature, y = pressure), color = "black") +
geom_line(aes(x = temperature, y = pressure/2), color = "red") +
geom_point(aes(x = temperature, y = pressure/2), color = "red")
Another approach is to reshape the data into tidy form and use group.
pressure1 <- pressure %>% mutate(type = "A")
pressure2 <- pressure %>% mutate(pressure2 = pressure/2, type = "B") %>% select(temperature, pressure2, type)
colnames(pressure2) <- colnames(pressure1)
pressure0 <- bind_rows(pressure1, pressure2)
# ggplot2
ggplot(pressure0, aes(x = temperature, y = pressure, group = type)) +
geom_line(aes(color = type)) +
geom_point(aes(color = type))
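The same tidy data frame can probably be built more compactly with tidyr::pivot_longer(). The sketch below assumes tidyr 1.0.0 or later; the name pressure_long and the series labels A and B are chosen here for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

pressure_long <- pressure %>%
  mutate(A = pressure, B = pressure / 2) %>%   # the two series as two columns
  select(temperature, A, B) %>%
  pivot_longer(cols = c(A, B), names_to = "type", values_to = "pressure")

# Mapping a discrete variable to color also groups the lines,
# so an explicit group aesthetic is not needed here
p <- ggplot(pressure_long, aes(x = temperature, y = pressure, color = type)) +
  geom_line() +
  geom_point()
print(p)
```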
methods(plot)
## [1] plot,ANY-method plot,color-method plot.acf*
## [4] plot.ACF* plot.augPred* plot.compareFits*
## [7] plot.data.frame* plot.decomposed.ts* plot.default
## [10] plot.dendrogram* plot.density* plot.ecdf
## [13] plot.factor* plot.formula* plot.function
## [16] plot.ggplot* plot.gls* plot.gtable*
## [19] plot.hcl_palettes* plot.hclust* plot.histogram*
## [22] plot.HoltWinters* plot.intervals.lmList* plot.isoreg*
## [25] plot.lm* plot.lme* plot.lmList*
## [28] plot.medpolish* plot.mlm* plot.nffGroupedData*
## [31] plot.nfnGroupedData* plot.nls* plot.nmGroupedData*
## [34] plot.pdMat* plot.ppr* plot.prcomp*
## [37] plot.princomp* plot.profile.nls* plot.R6*
## [40] plot.ranef.lme* plot.ranef.lmList* plot.raster*
## [43] plot.shingle* plot.simulate.lme* plot.spec*
## [46] plot.stepfun plot.stl* plot.table*
## [49] plot.trans* plot.trellis* plot.ts
## [52] plot.tskernel* plot.TukeyHSD* plot.Variogram*
## see '?methods' for accessing help and source code
Arguments
x
the coordinates of points in the plot. Alternatively, a single plotting structure, function or any R object with a plot method can be provided.
y
the y coordinates of points in the plot, optional if x is an appropriate structure.
...
Arguments to be passed to methods, such as graphical parameters (see par). Many methods will accept the following arguments:
type
what type of plot should be drawn. Possible types are
"p" for points,
"l" for lines,
"b" for both,
"c" for the lines part alone of "b",
"o" for both ‘overplotted’,
"h" for ‘histogram’ like (or ‘high-density’) vertical lines,
"s" for stair steps,
"S" for other steps, see ‘Details’ below,
"n" for no plotting.
All other types give a warning or an error; using, e.g., type = "punkte" being equivalent to type = "p" for S compatibility. Note that some methods, e.g. plot.factor, do not accept this.
main
an overall title for the plot: see title.
sub
a sub title for the plot: see title.
xlab
a title for the x axis: see title.
ylab
a title for the y axis: see title.
asp
the y/x aspect ratio, see plot.window.
Details
The two step types differ in their x-y preference: Going from (x1,y1) to (x2,y2) with x1 < x2, type = "s" moves first horizontal, then vertical, whereas type = "S" moves the other way around.
See Also
plot.default, plot.formula and other methods; points, lines, par. For thousands of points, consider using smoothScatter() instead of plot().
For X-Y-Z plotting see contour, persp and image.
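To see what the type values listed above look like, a small sketch can draw the pressure data once per type in a 3 x 3 grid. The layout and margins are arbitrary choices made for this illustration.

```r
# The nine plot types listed in the help page
types <- c("p", "l", "b", "c", "o", "h", "s", "S", "n")

op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))  # 3 x 3 grid, tight margins
for (t in types) {
  plot(pressure$temperature, pressure$pressure,
       type = t, main = paste("type =", t))
}
par(op)  # restore the previous graphical parameters
```

The "n" panel stays empty apart from the axes, which is useful when points or lines are added afterwards.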
Note: As the output of methods(plot) shows, plot has a very large number of methods.
example(plot)
##
## plot> require(stats) # for lowess, rpois, rnorm
##
## plot> plot(cars)
##
## plot> lines(lowess(cars))
##
## plot> plot(sin, -pi, 2*pi) # see ?plot.function
##
## plot> ## Discrete Distribution Plot:
## plot> plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
## plot+ main = "rpois(100, lambda = 5)")
##
## plot> ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
## plot> plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
##
## plot> points(x, cex = .5, col = "dark red")
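The lowess() step in the example output above can be examined on its own. This sketch assumes only base R; the second call uses a smaller span f purely to show how the argument changes the curve.

```r
fit <- lowess(cars)   # returns a list with components x and y
str(fit)

plot(cars)
lines(fit, col = "red")                      # default span, f = 2/3
lines(lowess(cars, f = 0.2), col = "blue")   # smaller span follows the data more closely
```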
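lowess: a scatter plot smoother based on locally weighted polynomial regression (not to be confused with loess, whose help page is titled Local Polynomial Regression Fitting). lines() draws the smoothed values as connected line segments.
plot(x, type = "s"): sorts 47 samples from the standard normal distribution (mean 0, standard deviation 1) in ascending order and plots them as stair steps with a main title. points() then overlays the points in dark red. Because cex = .5, they are drawn at half the default size.
barplot(height, ...)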
Supply a vector of bar heights. To label the bars, use names.arg.
barplot(BOD$demand)
barplot(BOD$demand, names.arg = BOD$Time)
Use table to count how often each value occurs in a vector, then draw the counts as a bar chart.
# cyl = number of cylinders
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
barplot(table(mtcars$cyl))
In ggplot2, use geom_col.
# ggplot2
ggplot(BOD, aes(x = Time, y = demand)) +
geom_col()
To treat the x variable as discrete, convert it to a factor.
# ggplot2
ggplot(BOD, aes(x = factor(Time), y = demand)) +
geom_col()
geom_bar plots the number of cases in each category. Here x is continuous, so the axis is numeric.
ggplot(mtcars, aes(x = cyl)) +
geom_bar()
A bar chart of counts where x is a factor (categorical).
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar()
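The relation between geom_bar and geom_col can be sketched by counting first and then plotting the counts. This assumes dplyr is attached (it is part of the tidyverse loaded above); the name cyl_counts is made up here.

```r
library(dplyr)
library(ggplot2)

# Count the rows per cylinder value; count() stores the result in column n
cyl_counts <- mtcars %>% count(cyl)
cyl_counts

# geom_col() draws the pre-computed counts as bar heights,
# which should match geom_bar() applied to the raw data
p <- ggplot(cyl_counts, aes(x = factor(cyl), y = n)) +
  geom_col()
print(p)
```

The counts agree with the table(mtcars$cyl) output shown earlier.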
## Default S3 method:
barplot(height, width = 1, space = NULL,
names.arg = NULL, legend.text = NULL, beside = FALSE,
horiz = FALSE, density = NULL, angle = 45,
col = NULL, border = par("fg"),
main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
axes = TRUE, axisnames = TRUE,
cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
add = FALSE, ann = !add && par("ann"), args.legend = NULL, ...)
## S3 method for class 'formula'
barplot(formula, data, subset, na.action,
horiz = FALSE, xlab = NULL, ylab = NULL, ...)
Arguments
height
either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.
width
optional vector of bar widths. Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless xlim is specified.
space
the amount of space (as a fraction of the average bar width) left before each bar. May be given as a single number or one number per bar. If height is a matrix and beside is TRUE, space may be specified by two numbers, where the first is the space between bars in the same group, and the second the space between the groups. If not given explicitly, it defaults to c(0,1) if height is a matrix and beside is TRUE, and to 0.2 otherwise.
names.arg
a vector of names to be plotted below each bar or group of bars. If this argument is omitted, then the names are taken from the names attribute of height if this is a vector, or the column names if it is a matrix.
legend.text
a vector of text used to construct a legend for the plot, or a logical indicating whether a legend should be included. This is only useful when height is a matrix. In that case given legend labels should correspond to the rows of height; if legend.text is true, the row names of height will be used as labels if they are non-null.
beside
a logical value. If FALSE, the columns of height are portrayed as stacked bars, and if TRUE the columns are portrayed as juxtaposed bars.
horiz
a logical value. If FALSE, the bars are drawn vertically with the first bar to the left. If TRUE, the bars are drawn horizontally with the first at the bottom.
density
a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of NULL means that no shading lines are drawn. Non-positive values of density also inhibit the drawing of shading lines.
angle
the slope of shading lines, given as an angle in degrees (counter-clockwise), for the bars or bar components.
col
a vector of colors for the bars or bar components. By default, grey is used if height is a vector, and a gamma-corrected grey palette if height is a matrix.
border
the color to be used for the border of the bars. Use border = NA to omit borders. If there are shading lines, border = TRUE means use the same colour for the border as for the shading lines.
main,sub
overall and sub title for the plot.
xlab
a label for the x axis.
ylab
a label for the y axis.
xlim
limits for the x axis.
ylim
limits for the y axis.
xpd
logical. Should bars be allowed to go outside region?
log
string specifying if axis scales should be logarithmic; see plot.default.
axes
logical. If TRUE, a vertical (or horizontal, if horiz is true) axis is drawn.
axisnames
logical. If TRUE, and if there are names.arg (see above), the other axis is drawn (with lty = 0) and labeled.
cex.axis
expansion factor for numeric axis labels.
cex.names
expansion factor for axis names (bar labels).
inside
logical. If TRUE, the lines which divide adjacent (non-stacked!) bars will be drawn. Only applies when space = 0 (which it partly is when beside = TRUE).
plot
logical. If FALSE, nothing is plotted.
axis.lty
the graphics parameter lty applied to the axis and tick marks of the categorical (default horizontal) axis. Note that by default the axis is suppressed.
offset
a vector indicating how much the bars should be shifted relative to the x axis.
add
logical specifying if bars should be added to an already existing plot; defaults to FALSE.
ann
logical specifying if the default annotation (main, sub, xlab, ylab) should appear on the plot, see title.
args.legend
list of additional arguments to pass to legend(); names of the list are used as argument names. Only used if legend.text is supplied.
formula
a formula where the y variables are numeric data to plot against the categorical x variables. The formula can have one of three forms:
y ~ x
y ~ x1 + x2
cbind(y1, y2) ~ x
, see the examples.
data
a data frame (or list) from which the variables in formula should be taken.
subset
an optional vector specifying a subset of observations to be used.
na.action
a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
...
arguments to be passed to/from other methods. For the default method these can include further arguments (such as axes, asp and main) and graphical parameters (see par) which are passed to plot.window(), title() and axis.
Value
A numeric vector (or matrix, when beside = TRUE), say mp, giving the coordinates of all the bar midpoints drawn, useful for adding to the graph.
If beside is true, use colMeans(mp) for the midpoints of each group of bars, see example.
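The returned midpoints are handy for annotating bars. The sketch below writes each value above its bar; the ylim range and pos = 3 (place the text above the point) are choices made for this illustration.

```r
# barplot() invisibly returns the x coordinates of the bar midpoints
mp <- barplot(BOD$demand, names.arg = BOD$Time, ylim = c(0, 25))

# Write each demand value just above its bar
text(mp, BOD$demand, labels = BOD$demand, pos = 3)
```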
Author(s)
R Core, with a contribution by Arni Magnusson.
example(barplot)
##
## barplt> # Formula method
## barplt> barplot(GNP ~ Year, data = longley)
##
## barplt> barplot(cbind(Employed, Unemployed) ~ Year, data = longley)
##
## barplt> ## 3rd form of formula - 2 categories :
## barplt> op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))
##
## barplt> summary(d.Titanic <- as.data.frame(Titanic))
## Class Sex Age Survived Freq
## 1st :8 Male :16 Child:16 No :16 Min. : 0.00
## 2nd :8 Female:16 Adult:16 Yes:16 1st Qu.: 0.75
## 3rd :8 Median : 13.50
## Crew:8 Mean : 68.78
## 3rd Qu.: 77.00
## Max. :670.00
##
## barplt> barplot(Freq ~ Class + Survived, data = d.Titanic,
## barplt+ subset = Age == "Adult" & Sex == "Male",
## barplt+ main = "barplot(Freq ~ Class + Survived, *)", ylab = "# {passengers}", legend = TRUE)
##
## barplt> # Corresponding table :
## barplt> (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age=="Adult"))
## , , Sex = Male
##
## Class
## Survived 1st 2nd 3rd Crew
## No 118 154 387 670
## Yes 57 14 75 192
##
## , , Sex = Female
##
## Class
## Survived 1st 2nd 3rd Crew
## No 4 13 89 3
## Yes 140 80 76 20
##
##
## barplt> # Alternatively, a mosaic plot :
## barplt> mosaicplot(xt[,,"Male"], main = "mosaicplot(Freq ~ Class + Survived, *)", color=TRUE)
##
## barplt> par(op)
##
## barplt> # Default method
## barplt> require(grDevices) # for colours
##
## barplt> tN <- table(Ni <- stats::rpois(100, lambda = 5))
##
## barplt> r <- barplot(tN, col = rainbow(20))
##
## barplt> #- type = "h" plotting *is* 'bar'plot
## barplt> lines(r, tN, type = "h", col = "red", lwd = 2)
##
## barplt> barplot(tN, space = 1.5, axisnames = FALSE,
## barplt+ sub = "barplot(..., space= 1.5, axisnames = FALSE)")
##
## barplt> barplot(VADeaths, plot = FALSE)
## [1] 0.7 1.9 3.1 4.3
##
## barplt> barplot(VADeaths, plot = FALSE, beside = TRUE)
## [,1] [,2] [,3] [,4]
## [1,] 1.5 7.5 13.5 19.5
## [2,] 2.5 8.5 14.5 20.5
## [3,] 3.5 9.5 15.5 21.5
## [4,] 4.5 10.5 16.5 22.5
## [5,] 5.5 11.5 17.5 23.5
##
## barplt> mp <- barplot(VADeaths) # default
##
## barplt> tot <- colMeans(VADeaths)
##
## barplt> text(mp, tot + 3, format(tot), xpd = TRUE, col = "blue")
##
## barplt> barplot(VADeaths, beside = TRUE,
## barplt+ col = c("lightblue", "mistyrose", "lightcyan",
## barplt+ "lavender", "cornsilk"),
## barplt+ legend = rownames(VADeaths), ylim = c(0, 100))
##
## barplt> title(main = "Death Rates in Virginia", font.main = 4)
##
## barplt> hh <- t(VADeaths)[, 5:1]
##
## barplt> mybarcol <- "gray20"
##
## barplt> mp <- barplot(hh, beside = TRUE,
## barplt+ col = c("lightblue", "mistyrose",
## barplt+ "lightcyan", "lavender"),
## barplt+ legend = colnames(VADeaths), ylim = c(0,100),
## barplt+ main = "Death Rates in Virginia", font.main = 4,
## barplt+ sub = "Faked upper 2*sigma error bars", col.sub = mybarcol,
## barplt+ cex.names = 1.5)
##
## barplt> segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)
##
## barplt> stopifnot(dim(mp) == dim(hh)) # corresponding matrices
##
## barplt> mtext(side = 1, at = colMeans(mp), line = -2,
## barplt+ text = paste("Mean", formatC(colMeans(hh))), col = "red")
##
## barplt> # Bar shading example
## barplt> barplot(VADeaths, angle = 15+10*1:5, density = 20, col = "black",
## barplt+ legend = rownames(VADeaths))
##
## barplt> title(main = list("Death Rates in Virginia", font = 4))
##
## barplt> # Border color
## barplt> barplot(VADeaths, border = "dark blue")
##
## barplt> # Log scales (not much sense here)
## barplt> barplot(tN, col = heat.colors(12), log = "y")
##
## barplt> barplot(tN, col = gray.colors(20), log = "xy")
##
## barplt> # Legend location
## barplt> barplot(height = cbind(x = c(465, 91) / 465 * 100,
## barplt+ y = c(840, 200) / 840 * 100,
## barplt+ z = c(37, 17) / 37 * 100),
## barplt+ beside = FALSE,
## barplt+ width = c(465, 840, 37),
## barplt+ col = c(1, 2),
## barplt+ legend.text = c("A", "B"),
## barplt+ args.legend = list(x = "topleft"))
hist takes a single vector as its argument.
hist(mtcars$mpg)
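Specify the number of bins with breaks. A single number is only a suggestion, because the breakpoints are set to pretty values.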
hist(mtcars$mpg, breaks = 10)
In ggplot2, map x and use geom_histogram(). The default is bins = 30.
ggplot(mtcars, aes(x = mpg)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Adjust binwidth.
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 1)
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 4)
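To compare base R and ggplot2 histograms bin for bin, one approach is to give both the same explicit breakpoints. This is a sketch; the vector brks is an arbitrary choice that covers the range of mtcars$mpg.

```r
library(ggplot2)

brks <- seq(10, 35, by = 5)   # shared breakpoints (assumed, not from the original)

# Base R: explicit breaks are used exactly as given
h <- hist(mtcars$mpg, breaks = brks)
h$counts

# ggplot2: the same breaks, with right-closed bins like hist()'s default
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(breaks = brks, closed = "right", color = "white")
print(p)
```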
## Default S3 method:
hist(x, breaks = "Sturges",
freq = NULL, probability = !freq,
include.lowest = TRUE, right = TRUE,
density = NULL, angle = 45, col = NULL, border = NULL,
main = paste("Histogram of" , xname),
xlim = range(breaks), ylim = NULL,
xlab = xname, ylab,
axes = TRUE, plot = TRUE, labels = FALSE,
nclass = NULL, warn.unused = TRUE, ...)
Arguments
x
a vector of values for which the histogram is desired.
breaks
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only; as the breakpoints will be set to pretty values, the number is limited to 1e6 (with a warning if it was larger). If breaks is a function, the x vector is supplied to it as the only argument (and the number of breaks is only limited by the amount of available memory).
freq
logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).
probability
an alias for !freq, for S compatibility.
include.lowest
logical; if TRUE, an x[i] equal to the breaks value will be included in the first (or last, for right = FALSE) bar. This will be ignored (with a warning) unless breaks is a vector.
right
logical; if TRUE, the histogram cells are right-closed (left open) intervals.
density
the density of shading lines, in lines per inch. The default value of NULL means that no shading lines are drawn. Non-positive values of density also inhibit the drawing of shading lines.
angle
the slope of shading lines, given as an angle in degrees (counter-clockwise).
col
a colour to be used to fill the bars. The default of NULL yields unfilled bars.
border
the color of the border around the bars. The default is to use the standard foreground color.
main, xlab, ylab
main title and axis labels: these arguments to title() get “smart” defaults here, e.g., the default ylab is "Frequency" iff freq is true.
xlim, ylim
the range of x and y values with sensible defaults. Note that xlim is not used to define the histogram (breaks), but only for plotting (when plot = TRUE).
axes
logical. If TRUE (default), axes are drawn if the plot is drawn.
plot
logical. If TRUE (default), a histogram is plotted, otherwise a list of breaks and counts is returned. In the latter case, a warning is used if (typically graphical) arguments are specified that only apply to the plot = TRUE case.
labels
logical or character string. Additionally draw labels on top of bars, if not FALSE; see plot.histogram.
nclass
numeric (integer). For S(-PLUS) compatibility only, nclass is equivalent to breaks for a scalar or character argument.
warn.unused
logical. If plot = FALSE and warn.unused = TRUE, a warning will be issued when graphical parameters are passed to hist.default().
...
further arguments and graphical parameters passed to plot.histogram and thence to title and axis (if plot = TRUE).
Details
The definition of histogram differs by source (with country-specific biases). R's default with equi-spaced breaks (also the default) is to plot the counts in the cells defined by breaks. Thus the height of a rectangle is proportional to the number of points falling into the cell, as is the area provided the breaks are equally-spaced.
The default with non-equi-spaced breaks is to give a plot of area one, in which the area of the rectangles is the fraction of the data points falling in the cells.
If right = TRUE (default), the histogram cells are intervals of the form (a, b], i.e., they include their right-hand endpoint, but not their left one, with the exception of the first cell when include.lowest is TRUE.
For right = FALSE, the intervals are of the form [a, b), and include.lowest means ‘include highest’.
A numerical tolerance of 1e-7 times the median bin size (for more than four bins, otherwise the median is substituted) is applied when counting entries on the edges of bins. This is not included in the reported breaks nor in the calculation of density.
The default for breaks is "Sturges": see nclass.Sturges. Other names for which algorithms are supplied are "Scott" and "FD" / "Freedman-Diaconis" (with corresponding functions nclass.scott and nclass.FD). Case is ignored and partial matching is used. Alternatively, a function can be supplied which will compute the intended number of breaks or the actual breakpoints as a function of x.
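The three named rules can be compared directly. The sketch below prints the number of classes each rule suggests for mtcars$mpg and then draws one histogram per rule; the actual number of bars may differ from the suggestion because the breakpoints are set to pretty values.

```r
x <- mtcars$mpg

# Suggested number of classes under each rule
n_bins <- c(Sturges = nclass.Sturges(x),
            Scott   = nclass.scott(x),
            FD      = nclass.FD(x))
n_bins

op <- par(mfrow = c(1, 3))
for (rule in names(n_bins)) {
  hist(x, breaks = rule, main = rule)   # breaks accepts the rule name as a string
}
par(op)
```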
Value
an object of class "histogram" which is a list with components:
breaks
the n+1 cell boundaries (= breaks if that was a vector). These are the nominal breaks, not with the boundary fuzz.
counts
n integers; for each cell, the number of x[] inside.
density
values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].
mids
the n cell midpoints.
xname
a character string with the actual x argument name.
equidist
logical, indicating if the distances between breaks are all the same.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Venables, W. N. and Ripley. B. D. (2002) Modern Applied Statistics with S. Springer.
See Also
nclass.Sturges, stem, density, truehist in package MASS.
Typical plots with vertical bars are not histograms. Consider barplot or plot(*, type = "h") for such bar plots.
example(hist)
##
## hist> op <- par(mfrow = c(2, 2))
##
## hist> hist(islands)
##
## hist> utils::str(hist(islands, col = "gray", labels = TRUE))
## List of 6
## $ breaks : num [1:10] 0 2000 4000 6000 8000 10000 12000 14000 16000 18000
## $ counts : int [1:9] 41 2 1 1 1 1 0 0 1
## $ density : num [1:9] 4.27e-04 2.08e-05 1.04e-05 1.04e-05 1.04e-05 ...
## $ mids : num [1:9] 1000 3000 5000 7000 9000 11000 13000 15000 17000
## $ xname : chr "islands"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"
##
## hist> hist(sqrt(islands), breaks = 12, col = "lightblue", border = "pink")
##
## hist> ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:
## hist> r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),
## hist+ col = "blue1")
##
## hist> text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = "blue3")
##
## hist> sapply(r[2:3], sum)
## counts density
## 48.000000 0.215625
##
## hist> sum(r$density * diff(r$breaks)) # == 1
## [1] 1
##
## hist> lines(r, lty = 3, border = "purple") # -> lines.histogram(*)
##
## hist> par(op)
##
## hist> require(utils) # for str
##
## hist> str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks
## List of 6
## $ breaks : num [1:10] 0 2000 4000 6000 8000 10000 12000 14000 16000 18000
## $ counts : int [1:9] 41 2 1 1 1 1 0 0 1
## $ density : num [1:9] 4.27e-04 2.08e-05 1.04e-05 1.04e-05 1.04e-05 ...
## $ mids : num [1:9] 1000 3000 5000 7000 9000 11000 13000 15000 17000
## $ xname : chr "islands"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"
##
## hist> str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))
## List of 6
## $ breaks : num [1:7] 12 20 36 80 200 1000 17000
## $ counts : int [1:6] 12 11 8 6 4 7
## $ density : num [1:6] 0.03125 0.014323 0.003788 0.001042 0.000104 ...
## $ mids : num [1:6] 16 28 58 140 600 9000
## $ xname : chr "islands"
## $ equidist: logi FALSE
## - attr(*, "class")= chr "histogram"
##
## hist> hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,
## hist+ main = "WRONG histogram") # and warning
## Warning in plot.histogram(r, freq = freq1, col = col, border = border, angle =
## angle, : the AREAS in the plot are wrong -- rather use 'freq = FALSE'
##
## hist> ## No test: ##D
## hist> ##D ## Extreme outliers; the "FD" rule would take very large number of 'breaks':
## hist> ##D XXL <- c(1:9, c(-1,1)*1e300)
## hist> ##D hh <- hist(XXL, "FD") # did not work in R <= 3.4.1; now gives warning
## hist> ##D ## pretty() determines how many counts are used (platform dependently!):
## hist> ##D length(hh$breaks) ## typically 1 million -- though 1e6 was "a suggestion only"
## hist> ## End(No test)
## hist> require(stats)
##
## hist> set.seed(14)
##
## hist> x <- rchisq(100, df = 4)
##
## hist> ## Don't show:
## hist> op <- par(mfrow = 2:1, mgp = c(1.5, 0.6, 0), mar = .1 + c(3,3:1))
##
## hist> ## End(Don't show)
## hist> ## Comparing data with a model distribution should be done with qqplot()!
## hist> qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)
##
## hist> ## if you really insist on using hist() ... :
## hist> hist(x, freq = FALSE, ylim = c(0, 0.2))
##
## hist> curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)
##
## hist> ## Don't show:
## hist> par(op)
##
## hist> ## End(Don't show)
## hist>
## hist>
## hist>
head(ToothGrowth)
plot(ToothGrowth$supp, ToothGrowth$len)
boxplot(len ~ supp, data = ToothGrowth)
# ggplot2
ggplot(ToothGrowth, aes(x = supp, y = len)) +
geom_boxplot()
# ggplot2
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
geom_boxplot()
## S3 method for class 'formula'
boxplot(formula, data = NULL, ..., subset, na.action = NULL,
xlab = mklab(y_var = horizontal),
ylab = mklab(y_var =!horizontal),
add = FALSE, ann = !add, horizontal = FALSE,
drop = FALSE, sep = ".", lex.order = FALSE)
## Default S3 method:
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
notch = FALSE, outline = TRUE, names, plot = TRUE,
border = par("fg"), col = NULL, log = "",
pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
ann = !add, horizontal = FALSE, add = FALSE, at = NULL)
Arguments
formula
a formula, such as y ~ grp, where y is a numeric vector of data values to be split into groups according to the grouping variable grp (usually a factor). Note that ~ g1 + g2 is equivalent to g1:g2.
data
a data.frame (or list) from which the variables in formula should be taken.
subset
an optional vector specifying a subset of observations to be used for plotting.
na.action
a function which indicates what should happen when the data contain NAs. The default is to ignore missing values in either the response or the group.
xlab, ylab
x- and y-axis annotation, since R 3.6.0 with a non-empty default. Can be suppressed by ann=FALSE.
ann
logical indicating if axes should be annotated (by xlab and ylab).
drop, sep, lex.order
passed to split.default, see there.
x
for specifying data from which the boxplots are to be produced. Either a numeric vector, or a single list containing such vectors. Additional unnamed arguments specify further data as separate vectors (each corresponding to a component boxplot). NAs are allowed in the data.
...
For the formula method, named arguments to be passed to the default method.
For the default method, unnamed arguments are additional data vectors (unless x is a list when they are ignored), and named arguments are arguments and graphical parameters to be passed to bxp in addition to the ones given by argument pars (and override those in pars). Note that bxp may or may not make use of graphical parameters it is passed: see its documentation.
range
this determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
width
a vector giving the relative widths of the boxes making up the plot.
varwidth
if varwidth is TRUE, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.
notch
if notch is TRUE, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al, 1983, p. 62). See boxplot.stats for the calculations used.
outline
if outline is not true, the outliers are not drawn (as points whereas S+ uses lines).
names
group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).
boxwex
a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.
staplewex
staple line width expansion, proportional to box width.
outwex
outlier line width expansion, proportional to box width.
plot
if TRUE (the default) then a boxplot is produced. If not, the summaries which the boxplots are based on are returned.
border
an optional vector of colors for the outlines of the boxplots. The values in border are recycled if the length of border is less than the number of plots.
col
if col is non-null it is assumed to contain colors to be used to colour the bodies of the box plots. By default they are in the background colour.
log
character indicating if x or y or both coordinates should be plotted in log scale.
pars
a list of (potentially many) more graphical parameters, e.g., boxwex or outpch; these are passed to bxp (if plot is true); for details, see there.
horizontal
logical indicating if the boxplots should be horizontal; default FALSE means vertical boxes.
add
logical, if true add boxplot to current plot.
at
numeric vector giving the locations where the boxplots should be drawn, particularly when add = TRUE; defaults to 1:n where n is the number of boxes.
Details
The generic function boxplot currently has a default method (boxplot.default) and a formula interface (boxplot.formula).
If multiple groups are supplied either as multiple arguments or via a formula, parallel boxplots will be plotted, in the order of the arguments or the order of the levels of the factor (see factor).
Missing values are ignored when forming boxplots.
Value
List with the following components:
stats
a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.
n
a vector with the number of observations in each group.
conf
a matrix where each column contains the lower and upper extremes of the notch.
out
the values of any data points which lie beyond the extremes of the whiskers.
group
a vector of the same length as out whose elements indicate to which group the outlier belongs.
names
a vector of names for the groups.
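The components described above can be inspected without drawing anything by setting plot = FALSE. A minimal sketch using the ToothGrowth data from earlier in these notes:

```r
# Compute the boxplot summaries only; nothing is drawn
b <- boxplot(len ~ supp, data = ToothGrowth, plot = FALSE)

b$names   # group labels
b$n       # observations per group
b$stats   # one column per group: lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # values beyond the whiskers, if any
```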
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole.
Murrell, P. (2005). R Graphics. Chapman & Hall/CRC Press.
See also boxplot.stats.
See Also
boxplot.stats which does the computation, bxp for the plotting and more examples; and stripchart for an alternative (with small data sets).
example(boxplot)
##
## boxplt> ## boxplot on a formula:
## boxplt> boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
##
## boxplt> # *add* notches (somewhat funny here <--> warning "notches .. outside hinges"):
## boxplt> boxplot(count ~ spray, data = InsectSprays,
## boxplt+ notch = TRUE, add = TRUE, col = "blue")
## Warning in bxp(list(stats = structure(c(7, 11, 14, 18.5, 23, 7, 12, 16.5, : some
## notches went outside hinges ('box'): maybe set notch=FALSE
##
## boxplt> boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque",
## boxplt+ log = "y")
##
## boxplt> ## horizontal=TRUE, switching y <--> x :
## boxplt> boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque",
## boxplt+ log = "x", horizontal=TRUE)
##
## boxplt> rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque")
##
## boxplt> title("Comparing boxplot()s and non-robust mean +/- SD")
##
## boxplt> mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)
##
## boxplt> sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)
##
## boxplt> xi <- 0.3 + seq(rb$n)
##
## boxplt> points(xi, mn.t, col = "orange", pch = 18)
##
## boxplt> arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,
## boxplt+ code = 3, col = "pink", angle = 75, length = .1)
##
## boxplt> ## boxplot on a matrix:
## boxplt> mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),
## boxplt+ `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))
##
## boxplt> boxplot(mat) # directly, calling boxplot.matrix()
##
## boxplt> ## boxplot on a data frame:
## boxplt> df. <- as.data.frame(mat)
##
## boxplt> par(las = 1) # all axis labels horizontal
##
## boxplt> boxplot(df., main = "boxplot(*, horizontal = TRUE)", horizontal = TRUE)
##
## boxplt> ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :
## boxplt> boxplot(len ~ dose, data = ToothGrowth,
## boxplt+ boxwex = 0.25, at = 1:3 - 0.2,
## boxplt+ subset = supp == "VC", col = "yellow",
## boxplt+ main = "Guinea Pigs' Tooth Growth",
## boxplt+ xlab = "Vitamin C dose mg",
## boxplt+ ylab = "tooth length",
## boxplt+ xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")
##
## boxplt> boxplot(len ~ dose, data = ToothGrowth, add = TRUE,
## boxplt+ boxwex = 0.25, at = 1:3 + 0.2,
## boxplt+ subset = supp == "OJ", col = "orange")
##
## boxplt> legend(2, 9, c("Ascorbic acid", "Orange juice"),
## boxplt+ fill = c("yellow", "orange"))
##
## boxplt> ## With less effort (slightly different) using factor *interaction*:
## boxplt> boxplot(len ~ dose:supp, data = ToothGrowth,
## boxplt+ boxwex = 0.5, col = c("orange", "yellow"),
## boxplt+ main = "Guinea Pigs' Tooth Growth",
## boxplt+ xlab = "Vitamin C dose mg", ylab = "tooth length",
## boxplt+ sep = ":", lex.order = TRUE, ylim = c(0, 35), yaxs = "i")
##
## boxplt> ## more examples in help(bxp)
## boxplt>
## boxplt>
## boxplt>
curve(x^3 - 5*x, from = -4, to = 4)
myfun <- function(xvar){
1/(1+exp(-xvar + 10))
}
curve(myfun(x), from = 0, to = 20)
curve(1 - myfun(x), add = TRUE, col = "red")
p <- ggplot(data.frame(x = c(0,20)), aes(x = x))
p <- p + stat_function(fun = myfun, geom = "line", color = "black")
p + stat_function(fun = function(t) 1 - myfun(t), geom = "line", color = "red")
Description: Draws a curve corresponding to a function over the interval [from, to]. curve can plot also an expression in the variable xname, default x.
Usage:
curve(expr, from = NULL, to = NULL, n = 101, add = FALSE, type = "l", xname = "x", xlab = xname, ylab = NULL, log = NULL, xlim = NULL, ...)
## S3 method for class 'function'
plot(x, y = 0, to = 1, from = y, xlim = NULL, ylab = NULL, ...)
Arguments
expr
The name of a function, or a call or an expression written as a function of x which will evaluate to an object of the same length as x.
x
a ‘vectorizing’ numeric R function.
y
alias for from for compatibility with plot
from, to
the range over which the function will be plotted.
n
integer; the number of x values at which to evaluate.
add
logical; if TRUE add to an already existing plot; if NA start a new plot taking the defaults for the limits and log-scaling of the x-axis from the previous plot. Taken as FALSE (with a warning if a different value is supplied) if no graphics device is open.
xlim
NULL or a numeric vector of length 2; if non-NULL it provides the defaults for c(from, to) and, unless add = TRUE, selects the x-limits of the plot – see plot.window.
type
plot type: see plot.default.
xname
character string giving the name to be used for the x axis.
xlab, ylab, log, ...
labels and graphical parameters can also be specified as arguments. See ‘Details’ for the interpretation of the default for log.
For the "function" method of plot, ... can include any of the other arguments of curve, except expr.
Details
The function or expression expr (for curve) or function x (for plot) is evaluated at n points equally spaced over the range [from, to]. The points determined in this way are then plotted.
If either from or to is NULL, it defaults to the corresponding element of xlim if that is not NULL.
What happens when neither from/to nor xlim specifies both x-limits is a complex story. For plot(<function>) and for curve(add = FALSE) the defaults are (0, 1). For curve(add = NA) and curve(add = TRUE) the defaults are taken from the x-limits used for the previous plot. (This differs from versions of R prior to 2.14.0.)
The value of log is used both to specify the plot axes (unless add = TRUE) and how ‘equally spaced’ is interpreted: if the x component indicates log-scaling, the points at which the expression or function is plotted are equally spaced on log scale.
The default value of log is taken from the current plot when add = TRUE, whereas if add = NA the x component is taken from the existing plot (if any) and the y component defaults to linear. For add = FALSE the default is ""
This used to be a quick hack which now seems to serve a useful purpose, but can give bad results for functions which are not smooth.
For expensive-to-compute expressions, you should use smarter tools.
The way curve handles expr has caused confusion. It first looks to see if expr is a name (also known as a symbol), in which case it is taken to be the name of a function, and expr is replaced by a call to expr with a single argument with name given by xname. Otherwise it checks that expr is either a call or an expression, and that it contains a reference to the variable given by xname (using all.vars): anything else is an error. Then expr is evaluated in an environment which supplies a vector of name given by xname of length n, and should evaluate to an object of length n. Note that this means that curve(x, ...) is taken as a request to plot a function named x (and it is used as such in the function method for plot).
The plot method can be called directly as plot.function.
Value
A list with components x and y of the points that were drawn is returned invisibly.
Warning
For historical reasons, add is allowed as an argument to the "function" method of plot, but its behaviour may surprise you. It is recommended to use add only with curve.
See Also
splinefun for spline interpolation, lines.
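The two ways of supplying expr described in Details, and the invisibly returned value, can be sketched as follows. The names by_name and by_expr are illustrative only, and n = 5 is chosen just to keep the output short.

```r
# expr given as the name of a function
by_name <- curve(sin, from = 0, to = pi, n = 5)

# expr given as an expression in x, added to the same plot
by_expr <- curve(sin(x), from = 0, to = pi, n = 5, add = TRUE, col = "red")

# The invisibly returned list holds the points that were drawn
by_name$x
all.equal(by_name$y, by_expr$y)
```

Both calls evaluate the function at the same n equally spaced points, so the returned y values agree.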
example(curve)
##
## curve> plot(qnorm) # default range c(0, 1) is appropriate here,
##
## curve> # but end values are -/+Inf and so are omitted.
## curve> plot(qlogis, main = "The Inverse Logit : qlogis()")
##
## curve> abline(h = 0, v = 0:2/2, lty = 3, col = "gray")
##
## curve> curve(sin, -2*pi, 2*pi, xname = "t")
##
## curve> curve(tan, xname = "t", add = NA,
## curve+ main = "curve(tan) --> same x-scale as previous plot")
##
## curve> op <- par(mfrow = c(2, 2))
##
## curve> curve(x^3 - 3*x, -2, 2)
##
## curve> curve(x^2 - 2, add = TRUE, col = "violet")
##
## curve> ## simple and advanced versions, quite similar:
## curve> plot(cos, -pi, 3*pi)
##
## curve> curve(cos, xlim = c(-pi, 3*pi), n = 1001, col = "blue", add = TRUE)
##
## curve> chippy <- function(x) sin(cos(x)*exp(-x/2))
##
## curve> curve(chippy, -8, 7, n = 2001)
##
## curve> plot (chippy, -8, -5)
##
## curve> for(ll in c("", "x", "y", "xy"))
## curve+ curve(log(1+x), 1, 100, log = ll, sub = paste0("log = '", ll, "'"))
##
## curve> par(op)
qplot
qplot in the ggplot2 package seems to have served as a bridge from plot() to ggplot2, but these days it is recommended to learn ggplot2 from the start.
Description: qplot is a shortcut designed to be familiar if you’re used to base plot(). It’s a convenient wrapper for creating a number of different types of plots using a consistent calling scheme. It’s great for allowing you to produce plots quickly, but I highly recommend learning ggplot() as it makes it easier to create complex graphics.
Usage: requires ggplot2, which is included in the tidyverse package.
qplot(
x,
y,
...,
data,
facets = NULL,
margins = FALSE,
geom = "auto",
xlim = c(NA, NA),
ylim = c(NA, NA),
log = "",
main = NULL,
xlab = NULL,
ylab = NULL,
asp = NA,
stat = NULL,
position = NULL
)
The mpg dataset bundled with ggplot2 is used.
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
head(mpg)
qplot
example(qplot)
##
## qplot> # Use data from data.frame
## qplot> qplot(mpg, wt, data = mtcars)
##
## qplot> qplot(mpg, wt, data = mtcars, colour = cyl)
##
## qplot> qplot(mpg, wt, data = mtcars, size = cyl)
##
## qplot> qplot(mpg, wt, data = mtcars, facets = vs ~ am)
##
## qplot> ## No test:
## qplot> ##D qplot(1:10, rnorm(10), colour = runif(10))
## qplot> ##D qplot(1:10, letters[1:10])
## qplot> ##D mod <- lm(mpg ~ wt, data = mtcars)
## qplot> ##D qplot(resid(mod), fitted(mod))
## qplot> ##D
## qplot> ##D f <- function() {
## qplot> ##D a <- 1:10
## qplot> ##D b <- a ^ 2
## qplot> ##D qplot(a, b)
## qplot> ##D }
## qplot> ##D f()
## qplot> ##D
## qplot> ##D # To set aesthetics, wrap in I()
## qplot> ##D qplot(mpg, wt, data = mtcars, colour = I("red"))
## qplot> ##D
## qplot> ##D # qplot will attempt to guess what geom you want depending on the input
## qplot> ##D # both x and y supplied = scatterplot
## qplot> ##D qplot(mpg, wt, data = mtcars)
## qplot> ##D # just x supplied = histogram
## qplot> ##D qplot(mpg, data = mtcars)
## qplot> ##D # just y supplied = scatterplot, with x = seq_along(y)
## qplot> ##D qplot(y = mpg, data = mtcars)
## qplot> ##D
## qplot> ##D # Use different geoms
## qplot> ##D qplot(mpg, wt, data = mtcars, geom = "path")
## qplot> ##D qplot(factor(cyl), wt, data = mtcars, geom = c("boxplot", "jitter"))
## qplot> ##D qplot(mpg, data = mtcars, geom = "dotplot")
## qplot> ## End(No test)
## qplot>
## qplot>
## qplot>
The following code is from the example.
qplot(1:10, rnorm(10), colour = runif(10))
qplot(1:10, letters[1:10])
mod <- lm(mpg ~ wt, data = mtcars)
qplot(resid(mod), fitted(mod))
f <- function() {
a <- 1:10
b <- a ^ 2
qplot(a, b)
}
f()
# To set aesthetics, wrap in I()
qplot(mpg, wt, data = mtcars, colour = I("red"))
# qplot will attempt to guess what geom you want depending on the input
# both x and y supplied = scatterplot
qplot(mpg, wt, data = mtcars)
# just x supplied = histogram
qplot(mpg, data = mtcars)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# just y supplied = scatterplot, with x = seq_along(y)
qplot(y = mpg, data = mtcars)
# Use different geoms
qplot(mpg, wt, data = mtcars, geom = "path")
qplot(factor(cyl), wt, data = mtcars, geom = c("boxplot", "jitter"))
qplot(mpg, data = mtcars, geom = "dotplot")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
Examples from the book ggplot2 and other sources
# diamond data attached to ggplot2
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# small subset of diamonds
dsmall <- diamonds[sample(nrow(diamonds), 100),]
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
qc <- qplot(carat, price, data = dsmall, color = color)
qs <- qplot(carat, price, data = dsmall, shape = cut)
grid.arrange(qc, qs, ncol=2)
## Warning: Using shapes for an ordinal variable is not advised
#library(gridExtra)
qa10 <- qplot(carat, price, data = diamonds, alpha = I(1/10))
qa100 <- qplot(carat, price, data = diamonds, alpha = I(1/100))
qa200 <- qplot(carat, price, data = diamonds, alpha = I(1/200))
grid.arrange(qa10, qa100, qa200, ncol=3)
#library(gridExtra)
qp <- qplot(carat, price, data = dsmall, geom = "point")
qps <- qplot(carat, price, data = dsmall, geom = c("point", "smooth"))
qpss <- qplot(carat, price, data = dsmall, geom = c("point", "smooth"), span = 0.2)
## Warning: Ignoring unknown parameters: span
grid.arrange(qp, qps, qpss, ncol=3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
mgcv package for extra methods: gam
splines package for extra methods: lm
MASS package for extra methods: rlm
#library(gridExtra)
qj <- qplot(color, price / carat, data = diamonds, geom = "jitter")
qja <- qplot(color, price / carat, data = diamonds, geom = "jitter", alpha = I(1/50))
qb <- qplot(color, price / carat, data = diamonds, geom = "boxplot")
grid.arrange(qj, qja, qb, ncol=3)
#library(gridExtra)
qh <- qplot(carat, data = diamonds, geom = "histogram", main = "ヒストグラム") + theme_gray(base_family = "HiraKakuPro-W3")
qd <- qplot(carat, data = diamonds, geom = "density", main = "密度") + theme_gray(base_family = "HiraKakuPro-W3")
qb <- qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1, main = "Histogram: binwidth = 0.1")
qbk <- qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.5, main = "Histogram: binwidth = 0.5")
qhc <- qplot(carat, data = diamonds, geom = "histogram", fill = color, main = "Histogram: fill = color")
qdc <- qplot(carat, data = diamonds, geom = "density", color = color, main = "Density: color = color")
grid.arrange(qh, qd, qb, qbk, qhc, qdc, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#library(gridExtra)
qb <- qplot(color, data = diamonds, geom = "bar", ylab = "count")
qbw <- qplot(color, data = diamonds, geom = "bar", weight = carat, ylab = "carat as weight")
qbws <- qplot(color, data = diamonds, geom = "bar", weight = carat) +
scale_y_continuous("carat")
grid.arrange(qb, qbw, qbws, ncol=3)
# economy data attached to ggplot2
str(economics)
## tibble [574 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:574], format: "1967-07-01" "1967-08-01" ...
## $ pce : num [1:574] 507 510 516 512 517 ...
## $ pop : num [1:574] 198712 198911 199113 199311 199498 ...
## $ psavert : num [1:574] 12.6 12.6 11.9 12.9 12.8 11.8 11.7 12.3 11.7 12.3 ...
## $ uempmed : num [1:574] 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
## $ unemploy: num [1:574] 2944 2945 2958 3143 3066 ...
#library(gridExtra)
ql <- qplot(date, unemploy / pop, data = economics, geom = "line")
ql2 <- qplot(date, uempmed, data = economics, geom = "line")
grid.arrange(ql, ql2, ncol=2)
year <- function(x) as.POSIXlt(x)$year + 1900
#library(gridExtra)
qpp <- qplot(unemploy / pop, uempmed, data = economics, geom = c("point", "path"))
qpc <- qplot(unemploy / pop, uempmed, data = economics, geom = c("point", "path"), color = year(date))
grid.arrange(qpp, qpc, ncol=2)
#library(gridExtra)
qf <- qplot(carat, data = diamonds, facets = color ~ .,
geom = "histogram", binwidth = 0.1, xlim = c(0,3))
qfd <- qplot(carat, ..density.., data = diamonds, facets = color ~ .,
geom = "histogram", binwidth = 0.1, xlim = c(0,3))
grid.arrange(qf, qfd, ncol=2)
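Reference: ggplot2
ggplot2-book: What is the grammar of graphics? explains it as follows.
Data and Variables: the data you want to visualise, and which variables are used and how. (The book describes this as mapping variables to aesthetic attributes, that is, attributes that determine how the data are displayed. Below they are simply called "aesthetics".)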
Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes that you can perceive.
Layers: the type of graph (geom), statistical transformations, and details of shape
Layers made up of geometric elements and statistical transformation. Geometric objects, geoms for short, represent what you actually see on the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise data in many useful ways. For example, binning and counting observations to create a histogram, or summarising a 2d relationship with a linear model.
Scales: scales for colour, size, shape, transformed values, and so on
The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape. Scales draw a legend or axes, which provide an inverse mapping to make it possible to read the original data values from the plot.
Coordinate System: the coordinate system, axes, and gridlines
A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to make it possible to read the graph. We normally use a Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
Faceting: adding information for each group
A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.
Theme: details of the display
A theme which controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot. A good starting place is Tufte’s early works (Tufte 1990, 1997, 2001)
The book lists two things that the Grammar of Graphics does not do.
A brief explanation using mpg follows.
mpg %>% select(displ, hwy, cyl, year) %>% head(10)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
ggplot(mpg, aes(displ, hwy)) +
geom_point()
Exercise
What can you read from each plot?
How could the plots below be made easier to read?
ggplot(mpg, aes(model, manufacturer)) + geom_point()
ggplot(mpg, aes(cty, hwy)) + geom_point()
ggplot(diamonds, aes(carat, price)) + geom_point()
ggplot(economics, aes(date, unemploy)) + geom_line()
ggplot(mpg, aes(cty)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Other aesthetics can be added as follows.
aes(displ, hwy, colour = class)
aes(displ, hwy, shape = drv)
aes(displ, hwy, size = cyl)
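The actual colours, shapes, and sizes are handled by a scale; if none is specified, defaults are assigned. Colour and shape suit categorical variables, while size suits numerical variables. In practice, the values become hard to tell apart as the number of categories grows.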
ggplot(mpg, aes(displ, cty, colour = class)) +
geom_point()
#library(gridExtra)
g1c <- ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
g1s <- ggplot(mpg, aes(displ, hwy, shape = drv)) +
geom_point()
g1z <- ggplot(mpg, aes(displ, hwy, size = cyl)) +
geom_point()
grid.arrange(g1c, g1s, g1z, ncol=3)
Instead of leaving the choice to a scale, a specific colour can also be set directly. Compare the two plots below.
#library(gridExtra)
g1r <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
g1b <- ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
grid.arrange(g1r, g1b, ncol=2)
For details, see Aesthetic specifications.
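The difference between the two plots above can be checked without drawing them. The sketch below, assuming only that ggplot2 is loaded, uses layer_data() to look at the colour each point actually receives. Inside aes() the string "blue" is treated as a data value and passed through the default colour scale; outside aes() it is used as the colour itself. The object names mapped and fixed are illustrative.

```r
library(ggplot2)

# "blue" mapped as a data value: a discrete variable with a single level
mapped <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
# "blue" set as a fixed colour
fixed <- ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

unique(layer_data(mapped)$colour)  # a colour chosen by the scale
unique(layer_data(fixed)$colour)   # the literal value "blue"
```

This is also why the mapped version gets a legend with the single entry "blue" while the fixed version has none.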
Exercise
Faceting is another way to bring a categorical variable into a plot, instead of adding it as an aesthetic. There are two types, grid and wrap; only a wrap example is given here.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class)
When is it better to use faceting, and when is it better to add the variable as an aesthetic?
Exercise
From the Help for facet_wrap, find out how to control the number of rows and columns.
What does the scales argument of facet_wrap control?
What happens if geom_point is replaced with another geom?
geom_smooth() fits a smoother to the data and displays the smooth and its standard error. (smooth curve)
geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of points. (box plot)
geom_histogram() and geom_freqpoly() show the distribution of continuous variables. (histogram and frequency polygon)
geom_bar() shows the distribution of categorical variables. (distribution of a categorical variable)
geom_path() and geom_line() draw lines between the data points. A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction. Lines are typically used to explore how things change over time. (change of data points over time)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The confidence interval is drawn as a band; when it is not needed, use geom_smooth(se = FALSE).
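As a small sketch of the se argument, the two plots below differ only in whether the band is computed. method = "lm" with an explicit formula is used here for brevity, which is an assumption of this sketch; the text above uses the default loess smoother. The names with_band and no_band are illustrative.

```r
library(ggplot2)

with_band <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

no_band <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smoother; compare the computed columns of the two versions
names(layer_data(with_band, 2))
names(layer_data(no_band, 2))
```

With se = TRUE the computed data carry the band limits ymin and ymax; with se = FALSE only the fitted line is computed.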
#library(gridExtra)
gs02 <- ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span = 0.2)
gs10 <- ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span = 1)
grid.arrange(gs02, gs10, ncol=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
When there is a continuous numerical variable and a categorical variable, we look at whether the numerical variable varies with the categorical variable.
ggplot(mpg, aes(drv, hwy)) +
geom_point()
There are three ways to do this.
ggplot(mpg, aes(drv, hwy)) + geom_jitter()
ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
ggplot(mpg, aes(drv, hwy)) + geom_violin()
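geom_jitter(), which draws a jittered plot, controls size, colour, and shape through aesthetics in the same way as geom_point(). geom_boxplot() and geom_violin(), which draw box plots and violin plots, can set the outline with colour or fill the interior with fill.
To show the detail of a single numerical variable, use a histogram or a frequency polygon. Both divide the range into bins and show the count in each bin. The bin width is set with binwidth; by default the range is split into 30 bins (bins = 30).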
ggplot(mpg, aes(hwy)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
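The two ways of choosing the bins can be compared directly. The sketch below assumes ggplot2 is loaded: bins fixes the number of bins, binwidth fixes their width, and in either case every observation is counted once. The values 10 and 2 are arbitrary choices for illustration.

```r
library(ggplot2)

by_bins  <- ggplot(mpg, aes(hwy)) + geom_histogram(bins = 10)
by_width <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2)

# layer_data() returns the computed bins; the counts add up to nrow(mpg)
sum(layer_data(by_bins)$count)
sum(layer_data(by_width)$count)
```

Setting either argument explicitly also silences the `stat_bin()` message shown above.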
Faceting can also be used.
ggplot(mpg, aes(displ, colour = drv)) +
geom_freqpoly(binwidth = 0.5)
ggplot(mpg, aes(displ, fill = drv)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~drv, ncol = 1)
ggplot(mpg, aes(manufacturer)) +
geom_bar()
drugs <- data.frame(
drug = c("a", "b", "c"),
effect = c(4.2, 9.7, 6.1)
)
#library(gridExtra)
gbi <- ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity")
gdrug <- ggplot(drugs, aes(drug, effect)) + geom_point()
grid.arrange(gbi, gdrug, ncol=2)
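Time series data are shown with a line graph or a path graph. In a line graph the series runs (by default) from left to right; in a path graph the path from each point to the next is drawn in time order. In other words, a line graph can be regarded as a path drawn in order of the x values.
For the data used here, see economics.
Line graph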
#library(gridExtra)
geup <- ggplot(economics, aes(date, unemploy / pop)) +
geom_line()
geu <- ggplot(economics, aes(date, uempmed)) +
geom_line()
grid.arrange(geup, geu, ncol=2)
Path graph
#library(gridExtra)
geupp <- ggplot(economics, aes(unemploy / pop, uempmed)) +
geom_path() +
geom_point()
year <- function(x) as.POSIXlt(x)$year + 1900
geuppp <- ggplot(economics, aes(unemploy / pop, uempmed)) +
geom_path(colour = "grey50") +
geom_point(aes(colour = year(date)))
grid.arrange(geupp, geuppp, ncol=2)
Exercises
What is the problem with ggplot(mpg, aes(cty, hwy)) + geom_point()? Which geom would be more suitable?
ggplot(mpg, aes(class, hwy)) + geom_boxplot() orders the classes alphabetically. How could the order be made more informative? It can be done by hand, but ggplot(mpg, aes(reorder(class, hwy), hwy)) + geom_boxplot() also works. See the Help for reorder.
The basics of labels are as follows: automatic, explicit, and omitted.
#library(gridExtra)
glabs0 <- ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3)
glabsxy <- ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3) +
xlab("city driving (mpg)") +
ylab("highway driving (mpg)")
glabsnull <- ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3) +
xlab(NULL) +
ylab(NULL)
grid.arrange(glabs0, glabsxy, glabsnull, ncol=3)
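xlim() and ylim() restrict the range of the axes. When the limits leave some data outside the axis range, those values are converted to NA; na.rm = TRUE suppresses the resulting warnings, but use it with care.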
#library(gridExtra)
gjitter <-ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25)
gjitterlim <- ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25) +
xlim("f", "r") +
ylim(20, 30)
gjitterlimnull <- ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25, na.rm = TRUE) +
ylim(NA, 30)
grid.arrange(gjitter, gjitterlim, gjitterlimnull, ncol=3)
## Warning: Removed 140 rows containing missing values (geom_point).
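Why care is needed can be seen with a geom that computes statistics. The sketch below contrasts ylim() with coord_cartesian(); the latter is not used in the text above and is brought in here only as a point of comparison. The limits 10 and 30 and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(drv, hwy)) + geom_boxplot()

# ylim() drops observations outside the limits before the statistics are computed
clipped <- base + ylim(10, 30)
# coord_cartesian() only zooms; the statistics still use all observations
zoomed <- base + coord_cartesian(ylim = c(10, 30))

# Number of observations that ylim(10, 30) removes
sum(mpg$hwy < 10 | mpg$hwy > 30)
```

With ylim() the boxes summarise only the remaining rows, so suppressing the warning with na.rm = TRUE hides the fact that the summary has changed.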
p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point()
print(p)
# Save png to disk
ggsave("plot.png", p, width = 5, height = 5)
summary(p)
## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
## class [234x11]
## mapping: x = ~displ, y = ~hwy, colour = ~factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
saveRDS(p, "plot.rds")
q <- readRDS("plot.rds")
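Each geom has a name. Geoms are useful in their own right and are also the building blocks of more complex geoms. All of them are two-dimensional and require both x and y as aesthetics, and all accept colour and size as aesthetics. The filled geoms (bar, tile, and polygon) also accept fill.
geom_area(): fills the area under a line; multiple groups are stacked from the bottom up.
geom_bar(stat = "identity"): draws a bar chart. stat = "identity" is needed because the default stat counts the observations automatically; the identity stat leaves the data unchanged. Multiple bars in the same location are stacked on top of one another.
geom_line(): draws a line, connecting the points within each group. geom_line() connects points from left to right, while geom_path() connects them in the order in which they appear in the data. Both accept linetype as an aesthetic, for example to draw dotted lines.
geom_point(): draws a scatterplot and accepts the point shape as an aesthetic.
geom_polygon(): draws polygons. Each vertex of a polygon requires its own row in the data. It is often useful to merge a data frame of polygon coordinates with the data just before plotting.
geom_rect(), geom_tile(), geom_raster(): all draw rectangles. geom_rect() is parameterised by the four corners xmin, ymin, xmax, and ymax. geom_tile() draws the same kind of rectangle but is parameterised by its centre and its width and height. geom_raster() is a special case of geom_tile() for when all tiles are the same size.
Note that the axis ranges differ between the plots.
In the example below, every plot uses \((x,y) = (3,2), (1,4), (5,6)\).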
#library(gridExtra)
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) +
labs(x = NULL, y = NULL) + # Hide axis label
theme(plot.title = element_text(size = 12)) # Shrink plot title
gp1 <- p + geom_point() + ggtitle("point")
gp2 <- p + geom_text() + ggtitle("text")
gp3 <- p + geom_bar(stat = "identity") + ggtitle("bar")
gp4 <- p + geom_tile() + ggtitle("raster")
grid.arrange(gp1, gp2, gp3, gp4, ncol=4)
gp5 <- p + geom_line() + ggtitle("line")
gp6 <- p + geom_area() + ggtitle("area")
gp7 <- p + geom_path() + ggtitle("path")
gp8 <- p + geom_polygon() + ggtitle("polygon")
grid.arrange(gp5, gp6, gp7, gp8, ncol=4)
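The relation between geom_rect() and geom_tile() described in the list above can be sketched with two rectangles drawn both ways. The data frames tiles and rects and their column names w and h are hypothetical and exist only for this illustration.

```r
library(ggplot2)

# Rectangles described by centre (x, y), width w, and height h
tiles <- data.frame(x = c(1, 3), y = c(1, 2), w = c(1, 2), h = c(1, 1))

# The same rectangles described by their four corners
rects <- data.frame(
  xmin = tiles$x - tiles$w / 2, xmax = tiles$x + tiles$w / 2,
  ymin = tiles$y - tiles$h / 2, ymax = tiles$y + tiles$h / 2
)

p_tile <- ggplot(tiles, aes(x, y, width = w, height = h)) + geom_tile()
p_rect <- ggplot(rects, aes(xmin = xmin, xmax = xmax,
                            ymin = ymin, ymax = ymax)) + geom_rect()

# geom_tile() converts centre and size to corners internally
layer_data(p_tile)[, c("xmin", "xmax", "ymin", "ymax")]
```

Both plots show the same two rectangles; which geom is more convenient depends on how the data are stored.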
Exercises
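What is the difference between geom_path() and geom_polygon()? What is the difference between geom_path() and geom_line()?
Which basic building blocks do geom_smooth(), geom_boxplot(), and geom_violin() use?
Geoms can be roughly divided into individual and collective geoms. An individual geom draws a separate object, for example a point, for each observation (row), while a collective geom draws several observations as a single object. That object may be a statistical summary, such as a box plot, or a polygon. Lines and paths fall somewhere in between, since each line segment is determined by two points. To combine several observations, use grouping (adding group as an aesthetic).
Grouping by discrete variables is the default, but sometimes the grouping has to be specified explicitly to split the data properly.
Below, three typical cases are discussed using a longitudinal data set of heights.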
The heights (height) and centered ages (age) of 26 boys (Subject), measured on nine occasions (Occasion)
# library(nlme)
data(Oxboys, package = "nlme")
str(Oxboys)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 234 obs. of 4 variables:
## $ Subject : Ord.factor w/ 26 levels "10"<"26"<"25"<..: 13 13 13 13 13 13 13 13 13 5 ...
## $ age : num -1 -0.7479 -0.463 -0.1643 -0.0027 ...
## $ height : num 140 143 145 147 148 ...
## $ Occasion: Ord.factor w/ 9 levels "1"<"2"<"3"<"4"<..: 1 2 3 4 5 6 7 8 9 1 ...
## - attr(*, "formula")=Class 'formula' language height ~ age | Subject
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "labels")=List of 2
## ..$ y: chr "Height"
## ..$ x: chr "Centered age"
## - attr(*, "units")=List of 1
## ..$ y: chr "(cm)"
## - attr(*, "FUN")=function (x)
## ..- attr(*, "source")= chr "function (x) max(x, na.rm = TRUE)"
## - attr(*, "order.groups")= logi TRUE
summary(Oxboys)
## Subject age height Occasion
## 10 : 9 Min. :-1.00000 Min. :126.2 1 :26
## 26 : 9 1st Qu.:-0.46300 1st Qu.:143.8 2 :26
## 25 : 9 Median :-0.00270 Median :149.5 3 :26
## 9 : 9 Mean : 0.02263 Mean :149.5 4 :26
## 2 : 9 3rd Qu.: 0.55620 3rd Qu.:155.5 5 :26
## 6 : 9 Max. : 1.00550 Max. :174.8 6 :26
## (Other):180 (Other):78
head(Oxboys)
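We often want to split the data into groups and draw each group in the same way, in other words to see how the groups differ. However, unless the groups are clearly distinguished, the result is cluttered and hard to read, like the plot below showing the growth of each boy.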
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_point() +
geom_line()
If no group is specified, the result is a mess like the following plot.
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line()
To group by more than one variable, use interaction(), for example aes(group = interaction(school_id, student_id)).
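A minimal sketch of grouping with interaction(), using a small hypothetical data frame; school_id and student_id come from the text above, while scores, term, score, and all the values are invented for illustration. The student ids are reused across schools, so neither variable alone identifies a line.

```r
library(ggplot2)

# Hypothetical data: student ids "a" and "b" appear in both schools
scores <- data.frame(
  school_id  = rep(c("s1", "s2"), each = 6),
  student_id = rep(rep(c("a", "b"), each = 3), times = 2),
  term       = rep(1:3, times = 4),
  score      = c(60, 65, 70, 50, 55, 52, 80, 78, 85, 40, 48, 55)
)

p <- ggplot(scores, aes(term, score,
                        group = interaction(school_id, student_id))) +
  geom_line()

# One line per combination of school and student
nlevels(interaction(scores$school_id, scores$student_id))
```

Grouping by student_id alone would join the two students called "a" into a single zigzag line.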
Displaying a trend for each boy gives the following.
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
This is not what we wanted. If the grouping is added to geom_line() instead of to the aesthetics in ggplot(), the trend is drawn as follows.
ggplot(Oxboys, aes(age, height)) +
geom_line(aes(group = Subject)) +
geom_smooth(method = "lm", size = 2, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
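Sometimes the x axis takes discrete values and we want to draw lines that connect across groups. Such plots are known as interaction plots, profile plots, or parallel coordinate plots. For example, summarising by measurement occasion gives the following.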
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot()
In this case there is a single discrete variable, Occasion, so one box plot is drawn for each value of x. Simply adding geom_line() does not work, because the lines are drawn within each Occasion rather than across Occasions.
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(colour = "#3366FF", alpha = 0.5)
In this case, set the group separately so that one line is drawn for each boy.
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)
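In a collective geom, how are the aesthetics of the individual observations mapped to the aesthetics of the whole object? What happens when different aesthetic values are mapped to a single geometric element?
This is handled separately by each collective geom. Lines and paths use the first value: a line segment is determined by two observations, and when colour is used as an aesthetic, the value of the first observation is used.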
df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))
ggplot(df, aes(x, y, colour = factor(colour))) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)
ggplot(df, aes(x, y, colour = colour)) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)
Check this with the examples above: the first (left) uses a discrete variable, the second (right) a continuous variable.
xgrid <- with(df, seq(min(x), max(x), length = 50))
interp <- data.frame(
x = xgrid,
y = approx(df$x, df$y, xout = xgrid)$y,
colour = approx(df$x, df$colour, xout = xgrid)$y
)
ggplot(interp, aes(x, y, colour = colour)) +
geom_line(size = 2) +
geom_point(data = df, size = 5)
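Only one line type can be used for each line. For other collective geoms the situation is more complicated, so the rule is that an aesthetic is applied only when all observations in a group share the same value. The case of continuous variables may be the easiest to follow. For bar and area plots in particular, the pieces are stacked, so the result is easy to predict.
Take care when adding a continuous variable as the fill aesthetic, as the following examples show.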
ggplot(mpg, aes(class)) +
geom_bar()
ggplot(mpg, aes(class, fill = drv)) +
geom_bar()
ggplot(mpg, aes(class, fill = hwy)) +
geom_bar()
ggplot(mpg, aes(class, fill = hwy, group = hwy)) +
geom_bar()
To control the colours in plots like those above, bin the values into intervals explicitly where necessary.
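One way to bin a continuous variable explicitly is base R's cut(). The sketch below is only one possible approach; the break points 10, 20, 30, and 45 and the names mpg_binned and hwy_bin are assumptions made for illustration.

```r
library(ggplot2)

# Copy mpg and add a binned version of hwy (break points chosen arbitrarily)
mpg_binned <- as.data.frame(mpg)
mpg_binned$hwy_bin <- cut(mpg_binned$hwy, breaks = c(10, 20, 30, 45))

p <- ggplot(mpg_binned, aes(class, fill = hwy_bin)) +
  geom_bar()

# Number of cars in each interval
table(mpg_binned$hwy_bin)
```

Because hwy_bin is a factor, each bar is split into a small number of stacked pieces with a discrete legend.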
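Which aesthetic must be added to draw a box plot of hwy for each value of cyl without converting cyl to a factor?
Modify the following plot so that one box plot is drawn for each integer value of displ.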
ggplot(mpg, aes(displ, cty)) +
geom_boxplot()
## Warning: Continuous y aesthetic -- did you forget aes(group=...)?
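Unlike the continuous case, the discrete case needs aes(group = 1) to map colour to the line. Why? What happens if it is omitted? Explain the difference between aes(group = 1) and aes(group = 2).
How many bars are drawn in each of the following plots?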
ggplot(mpg, aes(drv)) +
geom_bar()
ggplot(mpg, aes(drv, fill = hwy, group = hwy)) +
geom_bar()
library(dplyr)
mpg2 <- mpg %>% arrange(hwy) %>% mutate(id = seq_along(hwy))
ggplot(mpg2, aes(drv, fill = hwy, group = id)) +
geom_bar()
(Hint: try adding an outline around each bar with colour = "white")
library(babynames)
hadley <- dplyr::filter(babynames, name == "Hadley")
ggplot(hadley, aes(year, n)) +
geom_line()
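ggplot2-book: Statistical summaries
There are several ways to display uncertainty or spread in the data, depending on whether the variable mapped to x is discrete or continuous, and on whether the centre of the interval should be shown.
For each value of x, the range of y is displayed by supplying ymin and ymax as aesthetics.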
#library(gridExtra)
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))
base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
gbcb <- base + geom_crossbar()
gbpr <- base + geom_pointrange()
gbgs <- base + geom_smooth(stat = "identity")
grid.arrange(gbcb,gbpr, gbgs, ncol=3)
#library(gridExtra)
gbeb <- base + geom_errorbar()
gblr <- base + geom_linerange()
gbr <- base + geom_ribbon()
grid.arrange(gbeb, gblr, gbr, ncol=3)
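There are many measures of spread and error (standard deviation, standard error, and so on), so you have to decide which one is appropriate and specify it yourself.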
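The sketch below illustrates how the choice of measure changes the interval. The data frame raw and its values are hypothetical, made up only for this illustration. Each group is summarised by its mean, its standard deviation, and its standard error, and the same geom is then drawn with either measure.

```r
library(ggplot2)

# Hypothetical raw measurements: three groups with five replicates each.
raw <- data.frame(
  x = rep(1:3, each = 5),
  y = c(17, 19, 18, 16, 20,  11, 12, 10, 11, 11,  15, 17, 16, 14, 18)
)

# Summarise each group by its mean and two candidate measures of spread.
summ <- do.call(rbind, lapply(split(raw, raw$x), function(g) {
  data.frame(
    x    = g$x[1],
    mean = mean(g$y),
    sd   = sd(g$y),                  # standard deviation of the observations
    se   = sd(g$y) / sqrt(nrow(g))   # standard error of the mean
  )
}))

# The same geom gives different intervals depending on the measure chosen.
p_se <- ggplot(summ, aes(x, mean, ymin = mean - se, ymax = mean + se)) +
  geom_pointrange()
p_sd <- ggplot(summ, aes(x, mean, ymin = mean - sd, ymax = mean + sd)) +
  geom_pointrange()
```

Printing p_se and p_sd side by side shows that the standard error intervals are narrower than the standard deviation intervals for the same data.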
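Consider the case where each row is an observation, each column is a variable, and the observations need to be weighted.
We use midwest, a data set bundled with ggplot2 that, according to its documentation, holds 2000 US census data for counties in the Midwestern United States. It contains mostly proportions for various groups, such as percent white, percent below the poverty line, and percent with a college degree, together with the area, population, and population density of each county.
Several weights are possible, for example no weight (each county counts once) or total population.
The choice of weight changes the result considerably. There are two ways to include a weight as an aesthetic. The first, for simple geoms such as lines and points, is to use size.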
#library(gridExtra)
# Unweighted
gwp <- ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point()
# Weight by population
gwps <- ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))
grid.arrange(gwp, gwps, ncol=2)
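For more complex grobs that involve a statistical transformation, specify the weight with the weight aesthetic. The values are passed on to the statistical summary function. Examples are smoothers, quantile regressions, boxplots, histograms, and density plots. The weight is not shown in the legend. See the examples below.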
#library(gridExtra)
# Unweighted
gwpl <- ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point() +
geom_smooth(method = lm, size = 1)
#> `geom_smooth()` using formula 'y ~ x'
# Weighted by population
gwpp <- ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
scale_size_area(guide = "none")
#> `geom_smooth()` using formula 'y ~ x'
grid.arrange(gwpl, gwpp, ncol=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
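To see what the weight aesthetic changes numerically, the two smoothers can be reproduced with lm() directly. This is a sketch under the assumption that geom_smooth(method = lm) hands the weight values to lm() as its weights argument. The names fit_unweighted and fit_weighted are illustrative.

```r
library(ggplot2)  # provides the midwest data set

# Ordinary least squares: every county counts equally.
fit_unweighted <- lm(percbelowpoverty ~ percwhite, data = midwest)

# Weighted least squares: counties count in proportion to their population,
# mirroring aes(weight = poptotal) in the smoothed plot.
fit_weighted <- lm(percbelowpoverty ~ percwhite, data = midwest,
                   weights = poptotal)

# Compare the fitted slopes of the two models.
coef(fit_unweighted)["percwhite"]
coef(fit_weighted)["percwhite"]
```

The two slopes differ, which is why the fitted line moves when the weight is added even though the points stay in the same positions.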
The histograms below show the effect of weighting by population: the first counts counties, the second sums population.
#library(gridExtra)
gphc <- ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(binwidth = 1) +
ylab("Counties")
gppp <- ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(aes(weight = poptotal / 1000), binwidth = 1) +
ylab("Population (1000s)")
grid.arrange(gphc, gppp, ncol=2)
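One way to check what the weight does to a histogram is to inspect the values computed by the stat. In the sketch below, layer_data() extracts them; the names p_count and p_pop are illustrative. Without a weight the bar heights add up to the number of counties, and with weight = poptotal they add up to the total population.

```r
library(ggplot2)

p_count <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1)
p_pop <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal), binwidth = 1)

# layer_data() returns the data frame computed by the stat for a layer.
sum(layer_data(p_count)$count)  # total number of counties
sum(layer_data(p_pop)$count)    # total population over all counties
```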
#library(gridExtra)
gddh <- ggplot(diamonds, aes(depth)) +
geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gddhx <- ggplot(diamonds, aes(depth)) +
geom_histogram(binwidth = 0.1) +
xlim(55, 70)
grid.arrange(gddh, gddhx, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 45 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
#library(gridExtra)
gdf <- ggplot(diamonds, aes(depth)) +
geom_freqpoly(aes(colour = cut), binwidth = 0.1, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
gdhxt <- ggplot(diamonds, aes(depth)) +
geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill",
na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
grid.arrange(gdf, gdhxt, ncol=2)
#library(gridExtra)
gdbp <- ggplot(diamonds, aes(clarity, depth)) +
geom_boxplot()
gdbpx <- ggplot(diamonds, aes(carat, depth)) +
geom_boxplot(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
grid.arrange(gdbp, gdbpx, ncol=2)
## Warning: Removed 997 rows containing missing values (stat_boxplot).
#library(gridExtra)
dcdv <- ggplot(diamonds, aes(clarity, depth)) +
geom_violin()
dcdvx <- ggplot(diamonds, aes(carat, depth)) +
geom_violin(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
grid.arrange(dcdv, dcdvx, ncol=2)
## Warning: Removed 997 rows containing non-finite values (stat_ydensity).
Exercises
What do you learn about the distribution of carat by varying binwidth?
What does a histogram of price tell you?
What can you say about the relationship between clarity and price?
Overlay a frequency polygon and a density plot of depth. What computed variable do you need to map to y to make the two plots comparable? (You can either modify geom_freqpoly() or geom_density().)
#library(gridExtra)
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
gnp <- norm + geom_point()
gnps <- norm + geom_point(shape = 1) # Hollow circles
gnpsp <- norm + geom_point(shape = ".") # Pixel sized
grid.arrange(gnp, gnps, gnpsp, ncol=3)
#library(gridExtra)
gnpa3 <- norm + geom_point(alpha = 1 / 3)
gnpa5 <- norm + geom_point(alpha = 1 / 5)
gnpa10 <- norm + geom_point(alpha = 1 / 10)
grid.arrange(gnpa3, gnpa5, gnpa10, ncol=3)
#library(gridExtra)
gnb2 <- norm + geom_bin2d()
gnb10 <- norm + geom_bin2d(bins = 10)
grid.arrange(gnb2, gnb10, ncol=2)
#library(gridExtra)
gnh <- norm + geom_hex()
gnh10 <- norm + geom_hex(bins = 10)
grid.arrange(gnh, gnh10, ncol=2)
#library(gridExtra)
gdcb <- ggplot(diamonds, aes(color)) +
geom_bar()
gdbs <- ggplot(diamonds, aes(color, price)) +
geom_bar(stat = "summary_bin", fun = mean)
grid.arrange(gdcb, gdbs, ncol=2)
#library(gridExtra)
gtd <- ggplot(diamonds, aes(table, depth)) +
geom_bin2d(binwidth = 1, na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
gtdz <- ggplot(diamonds, aes(table, depth, z = price)) +
geom_raster(binwidth = 1, stat = "summary_2d", fun = mean,
na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
grid.arrange(gtd, gtdz, ncol=2)
## Warning: Raster pixels are placed at uneven horizontal intervals and will be
## shifted. Consider using geom_tile() instead.
## Warning: Raster pixels are placed at uneven vertical intervals and will be
## shifted. Consider using geom_tile() instead.
ggplot(faithfuld, aes(eruptions, waiting)) +
geom_contour(aes(z = density, colour = after_stat(level)))
ggplot(faithfuld, aes(eruptions, waiting)) +
geom_raster(aes(fill = density))
small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
ggplot(small, aes(eruptions, waiting)) +
geom_point(aes(size = density), alpha = 1/3) +
scale_size_area()
No content has been written yet (as of April 22, 2020).
Almost no content has been written yet (as of April 22, 2020).
par(family= "HiraKakuPro-W3")
op <- par(mfrow = c(1,2))
barplot(BOD$demand, names.arg = BOD$Time, main = "棒グラフのラベルを指定", xlab = "時間経過(日)", ylab = "酸素の必要量 (mg/l)")
hist(mtcars$mpg, breaks = 10)
par(op)
qplot(displ, hwy, data = mpg, facets = . ~year) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
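The details depend on the operating system, so no uniform recipe can be given at this point; the following is recorded as a memo.
I recommend RStudio.cloud to beginners, but Japanese labels are not displayed there, so a quick workaround is noted here.
If Japanese text in plot labels is not displayed, use the following.
The CJK (Chinese, Japanese, and Korean) font WenQuanYi Micro Hei is included and is used.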
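The snippet itself is not preserved in these notes. The sketch below is one possible form of the workaround, assuming that the note refers to the showtext package, which bundles WenQuanYi Micro Hei under the family name wqy-microhei. Both the package and the family name are assumptions, not taken from the text above.

```r
library(ggplot2)

# Assumption: the showtext package is installed; it is not named in the notes.
if (requireNamespace("showtext", quietly = TRUE)) {
  showtext::showtext_auto()  # render plot text through showtext from now on
}

# Use the bundled CJK font family (assumed name) for all text in the plot.
p_jp <- ggplot(BOD, aes(Time, demand)) +
  geom_line() +
  labs(x = "時間経過(日)", y = "酸素の必要量 (mg/l)") +
  theme_gray(base_family = "wqy-microhei")
```

Printing p_jp should then show the Japanese axis labels. If showtext is not available, the plot is still built, but the labels may not render.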
str(BOD)
## 'data.frame': 6 obs. of 2 variables:
## $ Time : num 1 2 3 4 5 7
## $ demand: num 8.3 10.3 19 16 15.6 19.8
## - attr(*, "reference")= chr "A1.4, p. 270"
head(BOD)
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
head(cars)
str(longley)
## 'data.frame': 16 obs. of 7 variables:
## $ GNP.deflator: num 83 88.5 88.2 89.5 96.2 ...
## $ GNP : num 234 259 258 285 329 ...
## $ Unemployed : num 236 232 368 335 210 ...
## $ Armed.Forces: num 159 146 162 165 310 ...
## $ Population : num 108 109 110 111 112 ...
## $ Year : int 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
## $ Employed : num 60.3 61.1 60.2 61.2 63.2 ...
head(longley)
str(midwest)
## tibble [437 × 28] (S3: tbl_df/tbl/data.frame)
## $ PID : int [1:437] 561 562 563 564 565 566 567 568 569 570 ...
## $ county : chr [1:437] "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
## $ state : chr [1:437] "IL" "IL" "IL" "IL" ...
## $ area : num [1:437] 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
## $ poptotal : int [1:437] 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
## $ popdensity : num [1:437] 1271 759 681 1812 324 ...
## $ popwhite : int [1:437] 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
## $ popblack : int [1:437] 1702 3496 429 127 547 50 1 111 16 16559 ...
## $ popamerindian : int [1:437] 98 19 35 46 14 65 8 30 8 331 ...
## $ popasian : int [1:437] 249 48 16 150 5 195 15 61 23 8033 ...
## $ popother : int [1:437] 124 9 34 1139 6 221 0 84 6 1596 ...
## $ percwhite : num [1:437] 96.7 66.4 96.6 95.3 90.2 ...
## $ percblack : num [1:437] 2.575 32.9 2.862 0.412 9.373 ...
## $ percamerindan : num [1:437] 0.148 0.179 0.233 0.149 0.24 ...
## $ percasian : num [1:437] 0.3768 0.4517 0.1067 0.4869 0.0857 ...
## $ percother : num [1:437] 0.1876 0.0847 0.2268 3.6973 0.1028 ...
## $ popadults : int [1:437] 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
## $ perchsd : num [1:437] 75.1 59.7 69.3 75.5 68.9 ...
## $ percollege : num [1:437] 19.6 11.2 17 17.3 14.5 ...
## $ percprof : num [1:437] 4.36 2.87 4.49 4.2 3.37 ...
## $ poppovertyknown : int [1:437] 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
## $ percpovertyknown : num [1:437] 96.3 99.1 95 98.5 82.5 ...
## $ percbelowpoverty : num [1:437] 13.15 32.24 12.07 7.21 13.52 ...
## $ percchildbelowpovert: num [1:437] 18 45.8 14 11.2 13 ...
## $ percadultpoverty : num [1:437] 11.01 27.39 10.85 5.54 11.14 ...
## $ percelderlypoverty : num [1:437] 12.44 25.23 12.7 6.22 19.2 ...
## $ inmetro : int [1:437] 0 0 0 1 0 0 0 0 0 1 ...
## $ category : chr [1:437] "AAR" "LHR" "AAR" "ALU" ...
head(midwest)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
str(pressure)
## 'data.frame': 19 obs. of 2 variables:
## $ temperature: num 0 20 40 60 80 100 120 140 160 180 ...
## $ pressure : num 0.0002 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...
head(pressure)
str(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
head(as.data.frame(Titanic))
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth)
tidyverse