observation_id IgG_concentration age gender slum
1 5772 0.3176895 2 Female Non slum
2 8095 3.4368231 4 Female Non slum
3 9784 0.3000000 4 Male Non slum
After module 9, you should be able to…
Let’s read in our data (again) and take a quick look.
Create age_group
three level factor variable
Create seropos
binary variable representing seropositivity if antibody concentrations are >10 IU/mL.
aggregate()
function to do many analyses across groups.Registered S3 method overwritten by 'printr':
method from
knit_print.data.frame rmarkdown
Compute Summary Statistics of Data Subsets
Description:
Splits the data into subsets, computes summary statistics for
each, and returns the result in a convenient form.
Usage:
aggregate(x, ...)
## Default S3 method:
aggregate(x, ...)
## S3 method for class 'data.frame'
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
## S3 method for class 'formula'
aggregate(x, data, FUN, ...,
subset, na.action = na.omit)
## S3 method for class 'ts'
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,
ts.eps = getOption("ts.eps"), ...)
Arguments:
x: an R object. For the 'formula' method a 'formula', such as
'y ~ x' or 'cbind(y1, y2) ~ x1 + x2', where the 'y' variables
are numeric data to be split into groups according to the
grouping 'x' variables (usually factors).
by: a list of grouping elements, each as long as the variables in
the data frame 'x', or a formula. The elements are coerced
to factors before use.
FUN: a function to compute the summary statistics which can be
applied to all data subsets.
simplify: a logical indicating whether results should be simplified to
a vector or matrix if possible.
drop: a logical indicating whether to drop unused combinations of
grouping values. The non-default case 'drop=FALSE' has been
amended for R 3.5.0 to drop unused combinations.
data: a data frame (or list) from which the variables in the
formula should be taken.
subset: an optional vector specifying a subset of observations to be
used.
na.action: a function which indicates what should happen when the data
contain 'NA' values. The default is to ignore missing values
in the given variables.
nfrequency: new number of observations per unit of time; must be a
divisor of the frequency of 'x'.
ndeltat: new fraction of the sampling period between successive
observations; must be a divisor of the sampling interval of
'x'.
ts.eps: tolerance used to decide if 'nfrequency' is a sub-multiple of
the original frequency.
...: further arguments passed to or used by methods.
Details:
'aggregate' is a generic function with methods for data frames and
time series.
The default method, 'aggregate.default', uses the time series
method if 'x' is a time series, and otherwise coerces 'x' to a
data frame and calls the data frame method.
'aggregate.data.frame' is the data frame method. If 'x' is not a
data frame, it is coerced to one, which must have a non-zero
number of rows. Then, each of the variables (columns) in 'x' is
split into subsets of cases (rows) of identical combinations of
the components of 'by', and 'FUN' is applied to each such subset
with further arguments in '...' passed to it. The result is
reformatted into a data frame containing the variables in 'by' and
'x'. The ones arising from 'by' contain the unique combinations
of grouping values used for determining the subsets, and the ones
arising from 'x' the corresponding summaries for the subset of the
respective variables in 'x'. If 'simplify' is true, summaries are
simplified to vectors or matrices if they have a common length of
one or greater than one, respectively; otherwise, lists of summary
results according to subsets are obtained. Rows with missing
values in any of the 'by' variables will be omitted from the
result. (Note that versions of R prior to 2.11.0 required 'FUN'
to be a scalar function.)
The formula method provides a standard formula interface to
'aggregate.data.frame'. The latter invokes the formula method if
'by' is a formula, in which case 'aggregate(x, by, FUN)' is the
same as 'aggregate(by, x, FUN)' for a data frame 'x'.
'aggregate.ts' is the time series method, and requires 'FUN' to be
a scalar function. If 'x' is not a time series, it is coerced to
one. Then, the variables in 'x' are split into appropriate blocks
of length 'frequency(x) / nfrequency', and 'FUN' is applied to
each such block, with further (named) arguments in '...' passed to
it. The result returned is a time series with frequency
'nfrequency' holding the aggregated values. Note that this make
most sense for a quarterly or yearly result when the original
series covers a whole number of quarters or years: in particular
aggregating a monthly series to quarters starting in February does
not give a conventional quarterly series.
'FUN' is passed to 'match.fun', and hence it can be a function or
a symbol or character string naming a function.
Value:
For the time series method, a time series of class '"ts"' or class
'c("mts", "ts")'.
For the data frame method, a data frame with columns corresponding
to the grouping variables in 'by' followed by aggregated columns
from 'x'. If the 'by' has names, the non-empty times are used to
label the columns in the results, with unnamed grouping variables
being named 'Group.i' for 'by[[i]]'.
Warning:
The first argument of the '"formula"' method was named 'formula'
rather than 'x' prior to R 4.2.0. Portable uses should not name
that argument.
Author(s):
Kurt Hornik, with contributions by Arni Magnusson.
References:
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
Language_. Wadsworth & Brooks/Cole.
See Also:
'apply', 'lapply', 'tapply'.
Examples:
## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)
## Compute the averages according to region and the occurrence of more
## than 130 days of frost.
aggregate(state.x77,
list(Region = state.region,
Cold = state.x77[,"Frost"] > 130),
mean)
## (Note that no state in 'South' is THAT cold.)
## example with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
aggregate(x = testDF, by = list(by1, by2), FUN = "mean")
# and if you want to treat NAs as a group
fby1 <- factor(by1, exclude = "")
fby2 <- factor(by2, exclude = "")
aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")
## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
aggregate(weight ~ feed, data = chickwts, mean)
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
## Dot notation:
aggregate(. ~ Species, data = iris, mean)
aggregate(len ~ ., data = ToothGrowth, mean)
## Often followed by xtabs():
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
xtabs(len ~ ., data = ag)
## Formula interface via 'by' (for pipe operations)
ToothGrowth |> aggregate(len ~ ., FUN = mean)
## Compute the average annual approval ratings for American presidents.
aggregate(presidents, nfrequency = 1, FUN = mean)
## Give the summer less weight.
aggregate(presidents, nfrequency = 1,
FUN = weighted.mean, w = c(1, 1, 0.5, 1))
aggregate()
is with the formula option. The syntax is variable_of_intest ~ grouping_variables
.aggregate(
# Formula specifies we are calculating statistics on seropos, separately for
# each level of age_group
seropos ~ age_group,
data = df, # Data argument
FUN = mean # function for our calculation WITHOUT PARENTHESES
)
age_group | seropos |
---|---|
young | 0.1832797 |
middle | 0.6000000 |
old | 0.7945205 |
age_group | slum | IgG_concentration |
---|---|---|
young | Mixed | 174.89797 |
middle | Mixed | 162.08188 |
old | Mixed | 150.07063 |
young | Non slum | 114.68422 |
middle | Non slum | 177.62113 |
old | Non slum | 141.22330 |
young | Slum | 61.85705 |
middle | Slum | 202.42018 |
old | Slum | 74.75217 |
cbind()
syntax.We use table()
prior to look at one variable, now we can generate frequency tables for 2 plus variables. To get cell percentages, the prop.table()
is useful.
Express Table Entries as Fraction of Marginal Table
Description:
Returns conditional proportions given 'margins', i.e. entries of
'x', divided by the appropriate marginal sums.
Usage:
proportions(x, margin = NULL)
prop.table(x, margin = NULL)
Arguments:
x: table
margin: a vector giving the margins to split by. E.g., for a matrix
'1' indicates rows, '2' indicates columns, 'c(1, 2)'
indicates rows and columns. When 'x' has named dimnames, it
can be a character vector selecting dimension names.
Value:
Table like 'x' expressed relative to 'margin'
Note:
'prop.table' is an earlier name, retained for back-compatibility.
Author(s):
Peter Dalgaard
See Also:
'marginSums'. 'apply', 'sweep' are a more general mechanism for
sweeping out marginal statistics.
Examples:
m <- matrix(1:4, 2)
m
proportions(m, 1)
DF <- as.data.frame(UCBAdmissions)
tbl <- xtabs(Freq ~ Gender + Admit, DF)
proportions(tbl, "Gender")
Let’s practice
Now, lets move to percentages
/ | 0 | 1 |
---|---|---|
young | 0.4018987 | 0.0901899 |
middle | 0.1107595 | 0.1661392 |
old | 0.0474684 | 0.1835443 |
/ | 0 | 1 |
---|---|---|
young | 0.7175141 | 0.2050360 |
middle | 0.1977401 | 0.3776978 |
old | 0.0847458 | 0.4172662 |
The chisq.test()
function test of independence of factor variables from stats
package.
Pearson’s Chi-squared Test for Count Data
Description:
'chisq.test' performs chi-squared contingency table tests and
goodness-of-fit tests.
Usage:
chisq.test(x, y = NULL, correct = TRUE,
p = rep(1/length(x), length(x)), rescale.p = FALSE,
simulate.p.value = FALSE, B = 2000)
Arguments:
x: a numeric vector or matrix. 'x' and 'y' can also both be
factors.
y: a numeric vector; ignored if 'x' is a matrix. If 'x' is a
factor, 'y' should be a factor of the same length.
correct: a logical indicating whether to apply continuity correction when computing the test statistic for 2 by 2 tables: one half is subtracted from all |O - E| differences; however, the correction will not be bigger than the differences themselves. No correction is done if ‘simulate.p.value = TRUE’.
p: a vector of probabilities of the same length as 'x'. An
error is given if any entry of 'p' is negative.
rescale.p: a logical scalar; if TRUE then ‘p’ is rescaled (if necessary) to sum to 1. If ‘rescale.p’ is FALSE, and ‘p’ does not sum to 1, an error is given.
simulate.p.value: a logical indicating whether to compute p-values by Monte Carlo simulation.
B: an integer specifying the number of replicates used in the
Monte Carlo test.
Details:
If 'x' is a matrix with one row or column, or if 'x' is a vector
and 'y' is not given, then a _goodness-of-fit test_ is performed
('x' is treated as a one-dimensional contingency table). The
entries of 'x' must be non-negative integers. In this case, the
hypothesis tested is whether the population probabilities equal
those in 'p', or are all equal if 'p' is not given.
If 'x' is a matrix with at least two rows and columns, it is taken
as a two-dimensional contingency table: the entries of 'x' must be
non-negative integers. Otherwise, 'x' and 'y' must be vectors or
factors of the same length; cases with missing values are removed,
the objects are coerced to factors, and the contingency table is
computed from these. Then Pearson's chi-squared test is performed
of the null hypothesis that the joint distribution of the cell
counts in a 2-dimensional contingency table is the product of the
row and column marginals.
If 'simulate.p.value' is 'FALSE', the p-value is computed from the
asymptotic chi-squared distribution of the test statistic;
continuity correction is only used in the 2-by-2 case (if
'correct' is 'TRUE', the default). Otherwise the p-value is
computed for a Monte Carlo test (Hope, 1968) with 'B' replicates.
The default 'B = 2000' implies a minimum p-value of about 0.0005
(1/(B+1)).
In the contingency table case, simulation is done by random
sampling from the set of all contingency tables with given
marginals, and works only if the marginals are strictly positive.
Continuity correction is never used, and the statistic is quoted
without it. Note that this is not the usual sampling situation
assumed for the chi-squared test but rather that for Fisher's
exact test.
In the goodness-of-fit case simulation is done by random sampling
from the discrete distribution specified by 'p', each sample being
of size 'n = sum(x)'. This simulation is done in R and may be
slow.
Value:
A list with class '"htest"' containing the following components:
statistic: the value the chi-squared test statistic.
parameter: the degrees of freedom of the approximate chi-squared distribution of the test statistic, ‘NA’ if the p-value is computed by Monte Carlo simulation.
p.value: the p-value for the test.
method: a character string indicating the type of test performed, and whether Monte Carlo simulation or continuity correction was used.
data.name: a character string giving the name(s) of the data.
observed: the observed counts.
expected: the expected counts under the null hypothesis.
residuals: the Pearson residuals, ‘(observed - expected) / sqrt(expected)’.
stdres: standardized residuals, ‘(observed - expected) / sqrt(V)’, where ‘V’ is the residual cell variance (Agresti, 2007, section 2.4.5 for the case where ‘x’ is a matrix, ‘n * p * (1 - p)’ otherwise).
Source:
The code for Monte Carlo simulation is a C translation of the
Fortran algorithm of Patefield (1981).
References:
Hope, A. C. A. (1968). A simplified Monte Carlo significance test
procedure. _Journal of the Royal Statistical Society Series B_,
*30*, 582-598. doi:10.1111/j.2517-6161.1968.tb00759.x
<https://doi.org/10.1111/j.2517-6161.1968.tb00759.x>.
Patefield, W. M. (1981). Algorithm AS 159: An efficient method of
generating r x c tables with given row and column totals.
_Applied Statistics_, *30*, 91-97. doi:10.2307/2346669
<https://doi.org/10.2307/2346669>.
Agresti, A. (2007). _An Introduction to Categorical Data
Analysis_, 2nd ed. New York: John Wiley & Sons. Page 38.
See Also:
For goodness-of-fit testing, notably of continuous distributions,
'ks.test'.
Examples:
## From Agresti(2007) p.39
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
party = c("Democrat","Independent", "Republican"))
(Xsq <- chisq.test(M)) # Prints test summary
Xsq$observed # observed counts (same as M)
Xsq$expected # expected counts under the null
Xsq$residuals # Pearson residuals
Xsq$stdres # standardized residuals
## Effect of simulating p-values
x <- matrix(c(12, 5, 7, 7), ncol = 2)
chisq.test(x)$p.value # 0.4233
chisq.test(x, simulate.p.value = TRUE, B = 10000)$p.value
# around 0.29!
## Testing for population probabilities
## Case A. Tabulated data
x <- c(A = 20, B = 15, C = 25)
chisq.test(x)
chisq.test(as.table(x)) # the same
x <- c(89,37,30,28,2)
p <- c(40,20,20,15,5)
try(
chisq.test(x, p = p) # gives an error
)
chisq.test(x, p = p, rescale.p = TRUE)
# works
p <- c(0.40,0.20,0.20,0.19,0.01)
# Expected count in category 5
# is 1.86 < 5 ==> chi square approx.
chisq.test(x, p = p) # maybe doubtful, but is ok!
chisq.test(x, p = p, simulate.p.value = TRUE)
## Case B. Raw data
x <- trunc(5 * runif(100))
chisq.test(table(x)) # NOT 'chisq.test(x)'!
Pearson's Chi-squared test
data: freq
X-squared = 175.85, df = 2, p-value < 2.2e-16
We reject the null hypothesis that the proportion of seropositive individuals in the young, middle, and old age groups are the same.
First, we compute correlation by providing two vectors.
Like other functions, if there are NA
s, you get NA
as the result. But if you specify use only the complete observations, then it will give you correlation using the non-missing data.
[1] NA
[1] 0.2604783
Small positive correlation between IgG concentration and age.
The function cor.test()
also gives you the confidence interval of the correlation statistic. Note, it uses complete observations by default.
Pearson's product-moment correlation
data: df$age and df$IgG_concentration
t = 6.7717, df = 630, p-value = 2.921e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1862722 0.3317295
sample estimates:
cor
0.2604783
The commonly used are:
We can use the t.test()
function from the stats
package.
Student’s t-Test
Description:
Performs one and two sample t-tests on vectors of data.
Usage:
t.test(x, ...)
## Default S3 method:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
## S3 method for class 'formula'
t.test(formula, data, subset, na.action, ...)
Arguments:
x: a (non-empty) numeric vector of data values.
y: an optional (non-empty) numeric vector of data values.
alternative: a character string specifying the alternative hypothesis, must be one of ‘“two.sided”’ (default), ‘“greater”’ or ‘“less”’. You can specify just the initial letter.
mu: a number indicating the true value of the mean (or difference
in means if you are performing a two sample test).
paired: a logical indicating whether you want a paired t-test.
var.equal: a logical variable indicating whether to treat the two variances as being equal. If ‘TRUE’ then the pooled variance is used to estimate the variance otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used.
conf.level: confidence level of the interval.
formula: a formula of the form ‘lhs ~ rhs’ where ‘lhs’ is a numeric variable giving the data values and ‘rhs’ either ‘1’ for a one-sample or paired test or a factor with two levels giving the corresponding groups. If ‘lhs’ is of class ‘“Pair”’ and ‘rhs’ is ‘1’, a paired test is done.
data: an optional matrix or data frame (or similar: see
'model.frame') containing the variables in the formula
'formula'. By default the variables are taken from
'environment(formula)'.
subset: an optional vector specifying a subset of observations to be used.
na.action: a function which indicates what should happen when the data contain ‘NA’s. Defaults to ’getOption(“na.action”)’.
...: further arguments to be passed to or from methods.
Details:
'alternative = "greater"' is the alternative that 'x' has a larger
mean than 'y'. For the one-sample case: that the mean is positive.
If 'paired' is 'TRUE' then both 'x' and 'y' must be specified and
they must be the same length. Missing values are silently removed
(in pairs if 'paired' is 'TRUE'). If 'var.equal' is 'TRUE' then
the pooled estimate of the variance is used. By default, if
'var.equal' is 'FALSE' then the variance is estimated separately
for both groups and the Welch modification to the degrees of
freedom is used.
If the input data are effectively constant (compared to the larger
of the two means) an error is generated.
Value:
A list with class '"htest"' containing the following components:
statistic: the value of the t-statistic.
parameter: the degrees of freedom for the t-statistic.
p.value: the p-value for the test.
conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.
estimate: the estimated mean or difference in means depending on whether it was a one-sample test or a two-sample test.
null.value: the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test.
stderr: the standard error of the mean (difference), used as denominator in the t-statistic formula.
alternative: a character string describing the alternative hypothesis.
method: a character string indicating what type of t-test was performed.
data.name: a character string giving the name(s) of the data.
See Also:
'prop.test'
Examples:
require(graphics)
t.test(1:10, y = c(7:20)) # P = .00001855
t.test(1:10, y = c(7:20, 200)) # P = .1245 -- NOT significant anymore
## Classical example: Student's sleep data
plot(extra ~ group, data = sleep)
## Traditional interface
with(sleep, t.test(extra[group == 1], extra[group == 2]))
## Formula interface
t.test(extra ~ group, data = sleep)
## Formula interface to one-sample test
t.test(extra ~ 1, data = sleep)
## Formula interface to paired test
## The sleep data are actually paired, so could have been in wide format:
sleep2 <- reshape(sleep, direction = "wide",
idvar = "ID", timevar = "group")
t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)
The base R - t.test()
function from the stats
package. It tests test difference in means of a variable between two groups. By default:
mu=0
)alternative = "two.sided"
)conf.level = 0.95
)paired = FALSE
)var.equal = FALSE
)IgG_young <- df$IgG_concentration[df$age_group=="young"]
IgG_old <- df$IgG_concentration[df$age_group=="old"]
t.test(IgG_young, IgG_old)
Welch Two Sample t-test
data: IgG_young and IgG_old
t = -6.1969, df = 259.54, p-value = 2.25e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-111.09281 -57.51515
sample estimates:
mean of x mean of y
45.05056 129.35454
The mean IgG concenration of young and old is 45.05 and 129.35 IU/mL, respectively. We reject null hypothesis that the difference in the mean IgG concentration of young and old is 0 IU/mL.
To fit regression models in R, we use the function glm()
(Generalized Linear Model).
Fitting Generalized Linear Models
Description:
'glm' is used to fit generalized linear models, specified by
giving a symbolic description of the linear predictor and a
description of the error distribution.
Usage:
glm(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)
glm.fit(x, y, weights = rep.int(1, nobs),
start = NULL, etastart = NULL, mustart = NULL,
offset = rep.int(0, nobs), family = gaussian(),
control = list(), intercept = TRUE, singular.ok = TRUE)
## S3 method for class 'glm'
weights(object, type = c("prior", "working"), ...)
Arguments:
formula: an object of class ‘“formula”’ (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.
family: a description of the error distribution and link function to be used in the model. For ‘glm’ this can be a character string naming a family function, a family function or the result of a call to a family function. For ‘glm.fit’ only the third option is supported. (See ‘family’ for details of family functions.)
data: an optional data frame, list or environment (or object
coercible by 'as.data.frame' to a data frame) containing the
variables in the model. If not found in 'data', the
variables are taken from 'environment(formula)', typically
the environment from which 'glm' is called.
weights: an optional vector of ‘prior weights’ to be used in the fitting process. Should be ‘NULL’ or a numeric vector.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
na.action: a function which indicates what should happen when the data contain ‘NA’s. The default is set by the ’na.action’ setting of ‘options’, and is ‘na.fail’ if that is unset. The ‘factory-fresh’ default is ‘na.omit’. Another possible value is ‘NULL’, no action. Value ‘na.exclude’ can be useful.
start: starting values for the parameters in the linear predictor.
etastart: starting values for the linear predictor.
mustart: starting values for the vector of means.
offset: this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be ‘NULL’ or a numeric vector of length equal to the number of cases. One or more ‘offset’ terms can be included in the formula instead or as well, and if more than one is specified their sum is used. See ‘model.offset’.
control: a list of parameters for controlling the fitting process. For ‘glm.fit’ this is passed to ‘glm.control’.
model: a logical value indicating whether model frame should be included as a component of the returned value.
method: the method to be used in fitting the model. The default method ‘“glm.fit”’ uses iteratively reweighted least squares (IWLS): the alternative ‘“model.frame”’ returns the model frame and does no fitting.
User-supplied fitting functions can be supplied either as a
function or a character string naming a function, with a
function which takes the same arguments as 'glm.fit'. If
specified as a character string it is looked up from within
the 'stats' namespace.
x, y: For 'glm': logical values indicating whether the response
vector and model matrix used in the fitting process should be
returned as components of the returned value.
For 'glm.fit': 'x' is a design matrix of dimension 'n * p',
and 'y' is a vector of observations of length 'n'.
singular.ok: logical; if ‘FALSE’ a singular fit is an error.
contrasts: an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.
intercept: logical. Should an intercept be included in the null model?
object: an object inheriting from class ‘“glm”’.
type: character, partial matching allowed. Type of weights to
extract from the fitted model object. Can be abbreviated.
...: For 'glm': arguments to be used to form the default 'control'
argument if it is not supplied directly.
For 'weights': further arguments passed to or from other
methods.
Details:
A typical predictor has the form 'response ~ terms' where
'response' is the (numeric) response vector and 'terms' is a
series of terms which specifies a linear predictor for 'response'.
For 'binomial' and 'quasibinomial' families the response can also
be specified as a 'factor' (when the first level denotes failure
and all others success) or as a two-column matrix with the columns
giving the numbers of successes and failures. A terms
specification of the form 'first + second' indicates all the terms
in 'first' together with all the terms in 'second' with any
duplicates removed.
A specification of the form 'first:second' indicates the set of
terms obtained by taking the interactions of all terms in 'first'
with all terms in 'second'. The specification 'first*second'
indicates the _cross_ of 'first' and 'second'. This is the same
as 'first + second + first:second'.
The terms in the formula will be re-ordered so that main effects
come first, followed by the interactions, all second-order, all
third-order and so on: to avoid this pass a 'terms' object as the
formula.
Non-'NULL' 'weights' can be used to indicate that different
observations have different dispersions (with the values in
'weights' being inversely proportional to the dispersions); or
equivalently, when the elements of 'weights' are positive integers
w_i, that each response y_i is the mean of w_i unit-weight
observations. For a binomial GLM prior weights are used to give
the number of trials when the response is the proportion of
successes: they would rarely be used for a Poisson GLM.
'glm.fit' is the workhorse function: it is not normally called
directly but can be more efficient where the response vector,
design matrix and family have already been calculated.
If more than one of 'etastart', 'start' and 'mustart' is
specified, the first in the list will be used. It is often
advisable to supply starting values for a 'quasi' family, and also
for families with unusual links such as 'gaussian("log")'.
All of 'weights', 'subset', 'offset', 'etastart' and 'mustart' are
evaluated in the same way as variables in 'formula', that is first
in 'data' and then in the environment of 'formula'.
For the background to warning messages about 'fitted probabilities
numerically 0 or 1 occurred' for binomial GLMs, see Venables &
Ripley (2002, pp. 197-8).
Value:
'glm' returns an object of class inheriting from '"glm"' which
inherits from the class '"lm"'. See later in this section. If a
non-standard 'method' is used, the object will also inherit from
the class (if any) returned by that function.
The function 'summary' (i.e., 'summary.glm') can be used to obtain
or print a summary of the results and the function 'anova' (i.e.,
'anova.glm') to produce an analysis of variance table.
The generic accessor functions 'coefficients', 'effects',
'fitted.values' and 'residuals' can be used to extract various
useful features of the value returned by 'glm'.
'weights' extracts a vector of weights, one for each case in the
fit (after subsetting and 'na.action').
An object of class '"glm"' is a list containing at least the
following components:
coefficients: a named vector of coefficients
residuals: the working residuals, that is the residuals in the final iteration of the IWLS fit. Since cases with zero weights are omitted, their working residuals are ‘NA’.
fitted.values: the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.
rank: the numeric rank of the fitted linear model.
family: the ‘family’ object used.
linear.predictors: the linear fit on link scale.
deviance: up to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero.
aic: A version of Akaike's _An Information Criterion_, minus twice
the maximized log-likelihood plus twice the number of
parameters, computed via the 'aic' component of the family.
For binomial and Poison families the dispersion is fixed at
one and the number of parameters is the number of
coefficients. For gaussian, Gamma and inverse gaussian
families the dispersion is estimated from the residual
deviance, and the number of parameters is the number of
coefficients plus one. For a gaussian family the MLE of the
dispersion is used so this is a valid value of AIC, but for
Gamma and inverse gaussian families it is not. For families
fitted by quasi-likelihood the value is 'NA'.
null.deviance: The deviance for the null model, comparable with ‘deviance’. The null model will include the offset, and an intercept if there is one in the model. Note that this will be incorrect if the link function depends on the data other than through the fitted mean: specify a zero offset to force a correct calculation.
iter: the number of iterations of IWLS used.
weights: the working weights, that is the weights in the final iteration of the IWLS fit.
prior.weights: the weights initially supplied, a vector of ’1’s if none were.
df.residual: the residual degrees of freedom.
df.null: the residual degrees of freedom for the null model.
y: if requested (the default) the 'y' vector used. (It is a
vector even for a binomial model.)
x: if requested, the model matrix.
model: if requested (the default), the model frame.
converged: logical. Was the IWLS algorithm judged to have converged?
boundary: logical. Is the fitted value on the boundary of the attainable values?
call: the matched call.
formula: the formula supplied.
terms: the ‘terms’ object used.
data: the 'data argument'.
offset: the offset vector used.
control: the value of the ‘control’ argument used.
method: the name of the fitter function used (when provided as a ‘character’ string to ‘glm()’) or the fitter ‘function’ (when provided as that).
contrasts: (where relevant) the contrasts used.
xlevels: (where relevant) a record of the levels of the factors used in fitting.
na.action: (where relevant) information returned by ‘model.frame’ on the special handling of ’NA’s.
In addition, non-empty fits will have components 'qr', 'R' and
'effects' relating to the final weighted linear fit.
Objects of class '"glm"' are normally of class 'c("glm", "lm")',
that is inherit from class '"lm"', and well-designed methods for
class '"lm"' will be applied to the weighted linear model at the
final iteration of IWLS. However, care is needed, as extractor
functions for class '"glm"' such as 'residuals' and 'weights' do
*not* just pick out the component of the fit with the same name.
If a 'binomial' 'glm' model was specified by giving a two-column
response, the weights returned by 'prior.weights' are the total
numbers of cases (factored by the supplied case weights) and the
component 'y' of the result is the proportion of successes.
Fitting functions:
The argument 'method' serves two purposes. One is to allow the
model frame to be recreated with no fitting. The other is to
allow the default fitting function 'glm.fit' to be replaced by a
function which takes the same arguments and uses a different
fitting algorithm. If 'glm.fit' is supplied as a character string
it is used to search for a function of that name, starting in the
'stats' namespace.
The class of the object return by the fitter (if any) will be
prepended to the class returned by 'glm'.
Author(s):
The original R implementation of 'glm' was written by Simon Davies
working for Ross Ihaka at the University of Auckland, but has
since been extensively re-written by members of the R Core team.
The design was inspired by the S function of the same name
described in Hastie & Pregibon (1992).
References:
Dobson, A. J. (1990) _An Introduction to Generalized Linear
Models._ London: Chapman and Hall.
Hastie, T. J. and Pregibon, D. (1992) _Generalized linear models._
Chapter 6 of _Statistical Models in S_ eds J. M. Chambers and T.
J. Hastie, Wadsworth & Brooks/Cole.
McCullagh P. and Nelder, J. A. (1989) _Generalized Linear Models._
London: Chapman and Hall.
Venables, W. N. and Ripley, B. D. (2002) _Modern Applied
Statistics with S._ New York: Springer.
See Also:
'anova.glm', 'summary.glm', etc. for 'glm' methods, and the
generic functions 'anova', 'summary', 'effects', 'fitted.values',
and 'residuals'.
'lm' for non-generalized _linear_ models (which SAS calls GLMs,
for 'general' linear models).
'loglin' and 'loglm' (package 'MASS') for fitting log-linear
models (which binomial and Poisson GLMs are) to contingency
tables.
'bigglm' in package 'biglm' for an alternative way to fit GLMs to
large datasets (especially those with many cases).
'esoph', 'infert' and 'predict.glm' have examples of fitting
binomial glms.
Examples:
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
data.frame(treatment, outcome, counts) # showing data
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
anova(glm.D93)
summary(glm.D93)
## Computing AIC [in many ways]:
(A0 <- AIC(glm.D93))
(ll <- logLik(glm.D93))
A1 <- -2*c(ll) + 2*attr(ll, "df")
A2 <- glm.D93$family$aic(counts, mu=fitted(glm.D93), wt=1) +
2 * length(coef(glm.D93))
stopifnot(exprs = {
all.equal(A0, A1)
all.equal(A1, A2)
all.equal(A1, glm.D93$aic)
})
## an example with offsets from Venables & Ripley (2002, p.189)
utils::data(anorexia, package = "MASS")
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
family = gaussian, data = anorexia)
summary(anorex.1)
# A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)
clotting <- data.frame(
u = c(5,10,15,20,30,40,60,80,100),
lot1 = c(118,58,42,35,27,25,21,19,18),
lot2 = c(69,35,26,21,18,16,13,12,12))
summary(glm(lot1 ~ log(u), data = clotting, family = Gamma))
summary(glm(lot2 ~ log(u), data = clotting, family = Gamma))
## Aliased ("S"ingular) -> 1 NA coefficient
(fS <- glm(lot2 ~ log(u) + log(u^2), data = clotting, family = Gamma))
tools::assertError(update(fS, singular.ok=FALSE), verbose=interactive())
## -> .. "singular fit encountered"
## Not run:
## for an example of the use of a terms object as a formula
demo(glm.vr)
## End(Not run)
We tend to focus on three arguments:
formula
– model formula written using names of columns in our datadata
– our data framefamily
– error distribution and link functionsummary.glm()
The summary()
function when applied to a fit object based on a glm is technically the summary.glm()
function and produces details of the model fit. Note on object oriented code.
Summarizing Generalized Linear Model Fits
Description:
These functions are all 'methods' for class 'glm' or 'summary.glm'
objects.
Usage:
## S3 method for class 'glm'
summary(object, dispersion = NULL, correlation = FALSE,
symbolic.cor = FALSE, ...)
## S3 method for class 'summary.glm'
print(x, digits = max(3, getOption("digits") - 3),
symbolic.cor = x$symbolic.cor,
signif.stars = getOption("show.signif.stars"),
show.residuals = FALSE, ...)
Arguments:
object: an object of class ‘“glm”’, usually, a result of a call to ‘glm’.
x: an object of class '"summary.glm"', usually, a result of a
call to 'summary.glm'.
dispersion: the dispersion parameter for the family used. Either a single numerical value or ‘NULL’ (the default), when it is inferred from ‘object’ (see ‘Details’).
correlation: logical; if ‘TRUE’, the correlation matrix of the estimated parameters is returned and printed.
digits: the number of significant digits to use when printing.
symbolic.cor: logical. If ‘TRUE’, print the correlations in a symbolic form (see ‘symnum’) rather than as numbers.
signif.stars: logical. If ‘TRUE’, ‘significance stars’ are printed for each coefficient.
show.residuals: logical. If ‘TRUE’ then a summary of the deviance residuals is printed at the head of the output.
...: further arguments passed to or from other methods.
Details:
'print.summary.glm' tries to be smart about formatting the
coefficients, standard errors, etc. and additionally gives
'significance stars' if 'signif.stars' is 'TRUE'. The
'coefficients' component of the result gives the estimated
coefficients and their estimated standard errors, together with
their ratio. This third column is labelled 't ratio' if the
dispersion is estimated, and 'z ratio' if the dispersion is known
(or fixed by the family). A fourth column gives the two-tailed
p-value corresponding to the t or z ratio based on a Student t or
Normal reference distribution. (It is possible that the
dispersion is not known and there are no residual degrees of
freedom from which to estimate it. In that case the estimate is
'NaN'.)
Aliased coefficients are omitted in the returned object but
restored by the 'print' method.
Correlations are printed to two decimal places (or symbolically):
to see the actual correlations print 'summary(object)$correlation'
directly.
The dispersion of a GLM is not used in the fitting process, but it
is needed to find standard errors. If 'dispersion' is not
supplied or 'NULL', the dispersion is taken as '1' for the
'binomial' and 'Poisson' families, and otherwise estimated by the
residual Chisquared statistic (calculated from cases with non-zero
weights) divided by the residual degrees of freedom.
'summary' can be used with Gaussian 'glm' fits to handle the case
of a linear regression with known error variance, something not
handled by 'summary.lm'.
Value:
'summary.glm' returns an object of class '"summary.glm"', a list
with components
call: the component from 'object'.
family: the component from ‘object’.
deviance: the component from ‘object’.
contrasts: the component from ‘object’.
df.residual: the component from ‘object’.
null.deviance: the component from ‘object’.
df.null: the component from ‘object’.
deviance.resid: the deviance residuals: see ‘residuals.glm’.
coefficients: the matrix of coefficients, standard errors, z-values and p-values. Aliased coefficients are omitted.
aliased: named logical vector showing if the original coefficients are aliased.
dispersion: either the supplied argument or the inferred/estimated dispersion if the former is ‘NULL’.
df: a 3-vector of the rank of the model and the number of
residual degrees of freedom, plus number of coefficients
(including aliased ones).
cov.unscaled: the unscaled (‘dispersion = 1’) estimated covariance matrix of the estimated coefficients.
cov.scaled: ditto, scaled by ‘dispersion’.
correlation: (only if ‘correlation’ is true.) The estimated correlations of the estimated coefficients.
symbolic.cor: (only if ‘correlation’ is true.) The value of the argument ‘symbolic.cor’.
See Also:
'glm', 'summary'.
Examples:
## For examples see example(glm)
Lets look at the output…
Call:
glm(formula = IgG_concentration ~ age + gender + slum, family = gaussian(),
data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.132 16.774 2.750 0.00613 **
age 9.324 1.388 6.718 4.15e-11 ***
genderMale -9.655 11.543 -0.836 0.40321
slumNon slum -20.353 14.299 -1.423 0.15513
slumSlum -29.705 25.009 -1.188 0.23536
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 20918.39)
Null deviance: 14141483 on 631 degrees of freedom
Residual deviance: 13115831 on 627 degrees of freedom
(19 observations deleted due to missingness)
AIC: 8087.9
Number of Fisher Scoring iterations: 2
Call:
glm(formula = seropos ~ age_group + gender + slum, family = binomial(link = "logit"),
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3220 0.2516 -5.254 1.49e-07 ***
age_groupmiddle 1.9020 0.2133 8.916 < 2e-16 ***
age_groupold 2.8443 0.2522 11.278 < 2e-16 ***
genderMale -0.1725 0.1895 -0.910 0.363
slumNon slum -0.1099 0.2329 -0.472 0.637
slumSlum -0.1073 0.4118 -0.261 0.794
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 866.98 on 631 degrees of freedom
Residual deviance: 679.10 on 626 degrees of freedom
(19 observations deleted due to missingness)
AIC: 691.1
Number of Fisher Scoring iterations: 4
aggregate()
function can be used to conduct analyses across groups (i.e., categorical variables in the data(table()
function can generate frequency tables for 2 plus variables, but to get percentage tables, the prop.table()
is usefulchisq.test()
function tests independence of factor variablescor()
or cor.test()
functions can be used to calculate correlation between two numeric vectorst.test()
functions conducts one and two sample (paired or unpaired) t-testsglm()
fits generalized linear modules to data and returns a fit object that can be read with the summary()
functionfamily
argument in the glm()
function allows you to fit models with different link functionsThese are the materials we looked through, modified, or extracted to complete this module’s lecture.