Module 9: Data Analysis

Learning Objectives

After module 9, you should be able to…

  • Descriptively assess association between two variables
  • Compute basic statistics
  • Fit a generalized linear model

Import data for this module

Let’s read in our data (again) and take a quick look.

df <- read.csv(file = "data/serodata.csv") #relative path
head(x=df, n=3)
  observation_id IgG_concentration age gender     slum
1           5772         0.3176895   2 Female Non slum
2           8095         3.4368231   4 Female Non slum
3           9784         0.3000000   4   Male Non slum

Prep data

Create age_group, a three-level factor variable

df$age_group <- ifelse(df$age <= 5, "young", 
                       ifelse(df$age<=10 & df$age>5, "middle", "old"))
df$age_group <- factor(df$age_group, levels=c("young", "middle", "old"))
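
For reference, the same variable can be built in one step with cut(); a minimal sketch, with breaks chosen to match the nested ifelse() above:

# equivalent construction: right-closed intervals (-Inf,5], (5,10], (10,Inf]
df$age_group <- cut(df$age, breaks = c(-Inf, 5, 10, Inf),
                    labels = c("young", "middle", "old"))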

Create seropos, a binary variable indicating seropositivity, defined as an IgG antibody concentration of at least 10 IU/mL.

df$seropos <- ifelse(df$IgG_concentration<10, 0, 1)
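
It is good practice to sanity-check derived variables; a quick sketch (output not shown), keeping any missing values visible:

# cross-tabulate the two new variables, including NAs
table(df$age_group, df$seropos, useNA = "ifany")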

Grouped analyses

  • Most of this module covers statistical analyses, but first we’ll show how to apply the univariate summaries we’ve already used to multiple groups at once.
  • We can use the aggregate() function to compute many of these summaries across groups.
?aggregate
Compute Summary Statistics of Data Subsets

Description:

     Splits the data into subsets, computes summary statistics for
     each, and returns the result in a convenient form.

Usage:

     aggregate(x, ...)
     
     ## Default S3 method:
     aggregate(x, ...)
     
     ## S3 method for class 'data.frame'
     aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
     
     ## S3 method for class 'formula'
     aggregate(x, data, FUN, ...,
               subset, na.action = na.omit)
     
     ## S3 method for class 'ts'
     aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,
               ts.eps = getOption("ts.eps"), ...)
     
Arguments:

       x: an R object.  For the 'formula' method a 'formula', such as
          'y ~ x' or 'cbind(y1, y2) ~ x1 + x2', where the 'y' variables
          are numeric data to be split into groups according to the
          grouping 'x' variables (usually factors).

      by: a list of grouping elements, each as long as the variables in
          the data frame 'x', or a formula.  The elements are coerced
          to factors before use.

     FUN: a function to compute the summary statistics which can be
          applied to all data subsets.

simplify: a logical indicating whether results should be simplified to
          a vector or matrix if possible.

    drop: a logical indicating whether to drop unused combinations of
          grouping values.  The non-default case 'drop=FALSE' has been
          amended for R 3.5.0 to drop unused combinations.

    data: a data frame (or list) from which the variables in the
          formula should be taken.

  subset: an optional vector specifying a subset of observations to be
          used.

na.action: a function which indicates what should happen when the data
          contain 'NA' values. The default is to ignore missing values
          in the given variables.

nfrequency: new number of observations per unit of time; must be a
          divisor of the frequency of 'x'.

 ndeltat: new fraction of the sampling period between successive
          observations; must be a divisor of the sampling interval of
          'x'.

  ts.eps: tolerance used to decide if 'nfrequency' is a sub-multiple of
          the original frequency.

     ...: further arguments passed to or used by methods.

Details:

     'aggregate' is a generic function with methods for data frames and
     time series.

     The default method, 'aggregate.default', uses the time series
     method if 'x' is a time series, and otherwise coerces 'x' to a
     data frame and calls the data frame method.

     'aggregate.data.frame' is the data frame method.  If 'x' is not a
     data frame, it is coerced to one, which must have a non-zero
     number of rows.  Then, each of the variables (columns) in 'x' is
     split into subsets of cases (rows) of identical combinations of
     the components of 'by', and 'FUN' is applied to each such subset
     with further arguments in '...' passed to it.  The result is
     reformatted into a data frame containing the variables in 'by' and
     'x'.  The ones arising from 'by' contain the unique combinations
     of grouping values used for determining the subsets, and the ones
     arising from 'x' the corresponding summaries for the subset of the
     respective variables in 'x'.  If 'simplify' is true, summaries are
     simplified to vectors or matrices if they have a common length of
     one or greater than one, respectively; otherwise, lists of summary
     results according to subsets are obtained.  Rows with missing
     values in any of the 'by' variables will be omitted from the
     result.  (Note that versions of R prior to 2.11.0 required 'FUN'
     to be a scalar function.)

     The formula method provides a standard formula interface to
     'aggregate.data.frame'.  The latter invokes the formula method if
     'by' is a formula, in which case 'aggregate(x, by, FUN)' is the
     same as 'aggregate(by, x, FUN)' for a data frame 'x'.

     'aggregate.ts' is the time series method, and requires 'FUN' to be
     a scalar function.  If 'x' is not a time series, it is coerced to
     one.  Then, the variables in 'x' are split into appropriate blocks
     of length 'frequency(x) / nfrequency', and 'FUN' is applied to
     each such block, with further (named) arguments in '...' passed to
     it.  The result returned is a time series with frequency
     'nfrequency' holding the aggregated values.  Note that this makes
     most sense for a quarterly or yearly result when the original
     series covers a whole number of quarters or years: in particular
     aggregating a monthly series to quarters starting in February does
     not give a conventional quarterly series.

     'FUN' is passed to 'match.fun', and hence it can be a function or
     a symbol or character string naming a function.

Value:

     For the time series method, a time series of class '"ts"' or class
     'c("mts", "ts")'.

     For the data frame method, a data frame with columns corresponding
     to the grouping variables in 'by' followed by aggregated columns
     from 'x'.  If the 'by' has names, the non-empty names are used to
     label the columns in the results, with unnamed grouping variables
     being named 'Group.i' for 'by[[i]]'.

Warning:

     The first argument of the '"formula"' method was named 'formula'
     rather than 'x' prior to R 4.2.0.  Portable uses should not name
     that argument.

Author(s):

     Kurt Hornik, with contributions by Arni Magnusson.

References:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_.  Wadsworth & Brooks/Cole.

See Also:

     'apply', 'lapply', 'tapply'.

Examples:

     ## Compute the averages for the variables in 'state.x77', grouped
     ## according to the region (Northeast, South, North Central, West) that
     ## each state belongs to.
     aggregate(state.x77, list(Region = state.region), mean)
     
     ## Compute the averages according to region and the occurrence of more
     ## than 130 days of frost.
     aggregate(state.x77,
               list(Region = state.region,
                    Cold = state.x77[,"Frost"] > 130),
               mean)
     ## (Note that no state in 'South' is THAT cold.)
     
     
     ## example with character variables and NAs
     testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                          v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
     by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
     by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
     aggregate(x = testDF, by = list(by1, by2), FUN = "mean")
     
     # and if you want to treat NAs as a group
     fby1 <- factor(by1, exclude = "")
     fby2 <- factor(by2, exclude = "")
     aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")
     
     
     ## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
     aggregate(weight ~ feed, data = chickwts, mean)
     aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
     aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
     aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
     
     ## Dot notation:
     aggregate(. ~ Species, data = iris, mean)
     aggregate(len ~ ., data = ToothGrowth, mean)
     
     ## Often followed by xtabs():
     ag <- aggregate(len ~ ., data = ToothGrowth, mean)
     xtabs(len ~ ., data = ag)
     
     ## Formula interface via 'by' (for pipe operations)
     ToothGrowth |> aggregate(len ~ ., FUN = mean)
     
     ## Compute the average annual approval ratings for American presidents.
     aggregate(presidents, nfrequency = 1, FUN = mean)
     ## Give the summer less weight.
     aggregate(presidents, nfrequency = 1,
               FUN = weighted.mean, w = c(1, 1, 0.5, 1))

Grouped analyses

  • Let’s calculate seropositivity rate across age groups using the variables we just created.
  • The easiest way to use aggregate() is with the formula option. The syntax is variable_of_interest ~ grouping_variables.
aggregate(
    # Formula specifies we are calculating statistics on seropos, separately for
    # each level of age_group
    seropos ~ age_group,
    data = df, # Data argument
    FUN = mean # function for our calculation WITHOUT PARENTHESES
)
  age_group   seropos
  young     0.1832797
  middle    0.6000000
  old       0.7945205
  • We can add as many grouping variables as we want on the RHS of the formula.
aggregate(
    IgG_concentration ~ age_group + slum,
    data = df,
    FUN = sd # standard deviation
)
  age_group  slum      IgG_concentration
  young      Mixed             174.89797
  middle     Mixed             162.08188
  old        Mixed             150.07063
  young      Non slum          114.68422
  middle     Non slum          177.62113
  old        Non slum          141.22330
  young      Slum               61.85705
  middle     Slum              202.42018
  old        Slum               74.75217
  • We can also add multiple variables on the LHS at the same time using cbind() syntax.
aggregate(
    cbind(age, IgG_concentration) ~ gender + slum,
    data = df,
    FUN = median
)
  gender  slum      age  IgG_concentration
  Female  Mixed     5.0          2.0117423
  Male    Mixed     6.0          2.2082192
  Female  Non slum  6.0          2.5040431
  Male    Non slum  5.0          1.1245846
  Female  Slum      3.0          5.1482480
  Male    Slum      5.5          0.7753834
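
FUN does not have to be a built-in function; an anonymous function can return several statistics at once. A minimal sketch (output not shown):

aggregate(
    IgG_concentration ~ age_group,
    data = df,
    # anonymous function returning the mean, SD, and group size together
    FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))
)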

2 variable contingency tables

We used table() previously to look at one variable; now we can generate frequency tables for two or more variables. To get cell proportions, prop.table() is useful.

?prop.table
Express Table Entries as Fraction of Marginal Table

Description:

     Returns conditional proportions given 'margins', i.e. entries of
     'x', divided by the appropriate marginal sums.

Usage:

     proportions(x, margin = NULL)
     prop.table(x, margin = NULL)
     
Arguments:

       x: table

  margin: a vector giving the margins to split by.  E.g., for a matrix
          '1' indicates rows, '2' indicates columns, 'c(1, 2)'
          indicates rows and columns.  When 'x' has named dimnames, it
          can be a character vector selecting dimension names.

Value:

     Table like 'x' expressed relative to 'margin'

Note:

     'prop.table' is an earlier name, retained for back-compatibility.

Author(s):

     Peter Dalgaard

See Also:

     'marginSums'. 'apply', 'sweep' are a more general mechanism for
     sweeping out marginal statistics.

Examples:

     m <- matrix(1:4, 2)
     m
     proportions(m, 1)
     
     DF <- as.data.frame(UCBAdmissions)
     tbl <- xtabs(Freq ~ Gender + Admit, DF)
     
     proportions(tbl, "Gender")

2 variable contingency tables

Let’s practice

freq <- table(df$age_group, df$seropos)
freq
             0     1
  young    254    57
  middle    70   105
  old       30   116

Now, let’s convert the counts to proportions.

prop.cell.percentages <- prop.table(freq)
prop.cell.percentages
                 0         1
  young  0.4018987 0.0901899
  middle 0.1107595 0.1661392
  old    0.0474684 0.1835443
prop.column.percentages <- prop.table(freq, margin=2)
prop.column.percentages
                 0         1
  young  0.7175141 0.2050360
  middle 0.1977401 0.3776978
  old    0.0847458 0.4172662
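
Setting margin=1 instead conditions on rows, i.e., the proportion seronegative/seropositive within each age group; note that the seropositive column reproduces the aggregate() means we computed earlier.

prop.row.percentages <- prop.table(freq, margin=1) # rows sum to 1
prop.row.percentages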

Chi-Square test

The chisq.test() function from the stats package tests independence of factor variables.

?chisq.test

Pearson’s Chi-squared Test for Count Data

Description:

 'chisq.test' performs chi-squared contingency table tests and
 goodness-of-fit tests.

Usage:

 chisq.test(x, y = NULL, correct = TRUE,
            p = rep(1/length(x), length(x)), rescale.p = FALSE,
            simulate.p.value = FALSE, B = 2000)
 

Arguments:

   x: a numeric vector or matrix. 'x' and 'y' can also both be
      factors.

   y: a numeric vector; ignored if 'x' is a matrix.  If 'x' is a
      factor, 'y' should be a factor of the same length.

correct: a logical indicating whether to apply continuity correction when computing the test statistic for 2 by 2 tables: one half is subtracted from all |O - E| differences; however, the correction will not be bigger than the differences themselves. No correction is done if ‘simulate.p.value = TRUE’.

   p: a vector of probabilities of the same length as 'x'.  An
      error is given if any entry of 'p' is negative.

rescale.p: a logical scalar; if TRUE then ‘p’ is rescaled (if necessary) to sum to 1. If ‘rescale.p’ is FALSE, and ‘p’ does not sum to 1, an error is given.

simulate.p.value: a logical indicating whether to compute p-values by Monte Carlo simulation.

   B: an integer specifying the number of replicates used in the
      Monte Carlo test.

Details:

 If 'x' is a matrix with one row or column, or if 'x' is a vector
 and 'y' is not given, then a _goodness-of-fit test_ is performed
 ('x' is treated as a one-dimensional contingency table).  The
 entries of 'x' must be non-negative integers.  In this case, the
 hypothesis tested is whether the population probabilities equal
 those in 'p', or are all equal if 'p' is not given.

 If 'x' is a matrix with at least two rows and columns, it is taken
 as a two-dimensional contingency table: the entries of 'x' must be
 non-negative integers.  Otherwise, 'x' and 'y' must be vectors or
 factors of the same length; cases with missing values are removed,
 the objects are coerced to factors, and the contingency table is
 computed from these.  Then Pearson's chi-squared test is performed
 of the null hypothesis that the joint distribution of the cell
 counts in a 2-dimensional contingency table is the product of the
 row and column marginals.

 If 'simulate.p.value' is 'FALSE', the p-value is computed from the
 asymptotic chi-squared distribution of the test statistic;
 continuity correction is only used in the 2-by-2 case (if
 'correct' is 'TRUE', the default).  Otherwise the p-value is
 computed for a Monte Carlo test (Hope, 1968) with 'B' replicates.
 The default 'B = 2000' implies a minimum p-value of about 0.0005
 (1/(B+1)).

 In the contingency table case, simulation is done by random
 sampling from the set of all contingency tables with given
 marginals, and works only if the marginals are strictly positive.
 Continuity correction is never used, and the statistic is quoted
 without it.  Note that this is not the usual sampling situation
 assumed for the chi-squared test but rather that for Fisher's
 exact test.

 In the goodness-of-fit case simulation is done by random sampling
 from the discrete distribution specified by 'p', each sample being
 of size 'n = sum(x)'.  This simulation is done in R and may be
 slow.

Value:

 A list with class '"htest"' containing the following components:

statistic: the value the chi-squared test statistic.

parameter: the degrees of freedom of the approximate chi-squared distribution of the test statistic, ‘NA’ if the p-value is computed by Monte Carlo simulation.

p.value: the p-value for the test.

method: a character string indicating the type of test performed, and whether Monte Carlo simulation or continuity correction was used.

data.name: a character string giving the name(s) of the data.

observed: the observed counts.

expected: the expected counts under the null hypothesis.

residuals: the Pearson residuals, ‘(observed - expected) / sqrt(expected)’.

stdres: standardized residuals, ‘(observed - expected) / sqrt(V)’, where ‘V’ is the residual cell variance (Agresti, 2007, section 2.4.5 for the case where ‘x’ is a matrix, ‘n * p * (1 - p)’ otherwise).

Source:

 The code for Monte Carlo simulation is a C translation of the
 Fortran algorithm of Patefield (1981).

References:

 Hope, A. C. A. (1968).  A simplified Monte Carlo significance test
 procedure.  _Journal of the Royal Statistical Society Series B_,
 *30*, 582-598.  doi:10.1111/j.2517-6161.1968.tb00759.x
 <https://doi.org/10.1111/j.2517-6161.1968.tb00759.x>.

 Patefield, W. M. (1981).  Algorithm AS 159: An efficient method of
 generating r x c tables with given row and column totals.
 _Applied Statistics_, *30*, 91-97.  doi:10.2307/2346669
 <https://doi.org/10.2307/2346669>.

 Agresti, A. (2007).  _An Introduction to Categorical Data
 Analysis_, 2nd ed.  New York: John Wiley & Sons.  Page 38.

See Also:

 For goodness-of-fit testing, notably of continuous distributions,
 'ks.test'.

Examples:

 ## From Agresti(2007) p.39
 M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
 dimnames(M) <- list(gender = c("F", "M"),
                     party = c("Democrat","Independent", "Republican"))
 (Xsq <- chisq.test(M))  # Prints test summary
 Xsq$observed   # observed counts (same as M)
 Xsq$expected   # expected counts under the null
 Xsq$residuals  # Pearson residuals
 Xsq$stdres     # standardized residuals
 
 
 ## Effect of simulating p-values
 x <- matrix(c(12, 5, 7, 7), ncol = 2)
 chisq.test(x)$p.value           # 0.4233
 chisq.test(x, simulate.p.value = TRUE, B = 10000)$p.value
                                 # around 0.29!
 
 ## Testing for population probabilities
 ## Case A. Tabulated data
 x <- c(A = 20, B = 15, C = 25)
 chisq.test(x)
 chisq.test(as.table(x))             # the same
 x <- c(89,37,30,28,2)
 p <- c(40,20,20,15,5)
 try(
 chisq.test(x, p = p)                # gives an error
 )
 chisq.test(x, p = p, rescale.p = TRUE)
                                 # works
 p <- c(0.40,0.20,0.20,0.19,0.01)
                                 # Expected count in category 5
                                 # is 1.86 < 5 ==> chi square approx.
 chisq.test(x, p = p)            #               maybe doubtful, but is ok!
 chisq.test(x, p = p, simulate.p.value = TRUE)
 
 ## Case B. Raw data
 x <- trunc(5 * runif(100))
 chisq.test(table(x))            # NOT 'chisq.test(x)'!

Chi-Square test

chisq.test(freq)

    Pearson's Chi-squared test

data:  freq
X-squared = 175.85, df = 2, p-value < 2.2e-16

We reject the null hypothesis that the proportions of seropositive individuals in the young, middle, and old age groups are the same.
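
The returned "htest" object also stores the pieces behind the statistic (see the Value section of the help page above); a quick sketch (output not shown):

chisq_result <- chisq.test(freq)
chisq_result$expected  # expected counts under independence
chisq_result$residuals # Pearson residuals: (observed - expected)/sqrt(expected)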

Correlation

First, we compute the correlation by providing two vectors.

Like many other functions, if either vector contains NAs, the result is NA. If you specify use = "complete.obs", the correlation is computed using only the non-missing pairs of observations.

cor(df$age, df$IgG_concentration, method="pearson")
[1] NA
cor(df$age, df$IgG_concentration, method="pearson", use = "complete.obs") #IF have missing data
[1] 0.2604783

There is a small positive correlation between IgG concentration and age.
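
If the concentrations are heavily skewed, a rank-based correlation can be a useful robustness check; a minimal sketch (output not shown):

# Spearman's rank correlation is less sensitive to skew and outliers
cor(df$age, df$IgG_concentration, method="spearman", use = "complete.obs")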

Correlation confidence interval

The function cor.test() also gives you the confidence interval of the correlation statistic. Note, it uses complete observations by default.

cor.test(df$age, df$IgG_concentration, method="pearson")

    Pearson's product-moment correlation

data:  df$age and df$IgG_concentration
t = 6.7717, df = 630, p-value = 2.921e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1862722 0.3317295
sample estimates:
      cor 
0.2604783 
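
Because cor.test() returns an "htest" object, individual components can be extracted for reuse; a small sketch (output not shown):

ct <- cor.test(df$age, df$IgG_concentration, method="pearson")
ct$estimate # the correlation coefficient
ct$conf.int # the 95 percent confidence interval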

T-test

The most commonly used t-tests are:

  • one-sample t-test – used to test the mean of a variable in one group against a hypothesized (null) mean
  • two-sample t-test – used to test the difference in means of a variable between two groups (null hypothesis: the group means are the same)

T-test

We can use the t.test() function from the stats package.

?t.test

Student’s t-Test

Description:

 Performs one and two sample t-tests on vectors of data.

Usage:

 t.test(x, ...)
 
 ## Default S3 method:
 t.test(x, y = NULL,
        alternative = c("two.sided", "less", "greater"),
        mu = 0, paired = FALSE, var.equal = FALSE,
        conf.level = 0.95, ...)
 
 ## S3 method for class 'formula'
 t.test(formula, data, subset, na.action, ...)
 

Arguments:

   x: a (non-empty) numeric vector of data values.

   y: an optional (non-empty) numeric vector of data values.

alternative: a character string specifying the alternative hypothesis, must be one of ‘“two.sided”’ (default), ‘“greater”’ or ‘“less”’. You can specify just the initial letter.

  mu: a number indicating the true value of the mean (or difference
      in means if you are performing a two sample test).

paired: a logical indicating whether you want a paired t-test.

var.equal: a logical variable indicating whether to treat the two variances as being equal. If ‘TRUE’ then the pooled variance is used to estimate the variance otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used.

conf.level: confidence level of the interval.

formula: a formula of the form ‘lhs ~ rhs’ where ‘lhs’ is a numeric variable giving the data values and ‘rhs’ either ‘1’ for a one-sample or paired test or a factor with two levels giving the corresponding groups. If ‘lhs’ is of class ‘“Pair”’ and ‘rhs’ is ‘1’, a paired test is done.

data: an optional matrix or data frame (or similar: see
      'model.frame') containing the variables in the formula
      'formula'.  By default the variables are taken from
      'environment(formula)'.

subset: an optional vector specifying a subset of observations to be used.

na.action: a function which indicates what should happen when the data contain ‘NA’s. Defaults to ‘getOption("na.action")’.

 ...: further arguments to be passed to or from methods.

Details:

 'alternative = "greater"' is the alternative that 'x' has a larger
 mean than 'y'. For the one-sample case: that the mean is positive.

 If 'paired' is 'TRUE' then both 'x' and 'y' must be specified and
 they must be the same length.  Missing values are silently removed
 (in pairs if 'paired' is 'TRUE').  If 'var.equal' is 'TRUE' then
 the pooled estimate of the variance is used.  By default, if
 'var.equal' is 'FALSE' then the variance is estimated separately
 for both groups and the Welch modification to the degrees of
 freedom is used.

 If the input data are effectively constant (compared to the larger
 of the two means) an error is generated.

Value:

 A list with class '"htest"' containing the following components:

statistic: the value of the t-statistic.

parameter: the degrees of freedom for the t-statistic.

p.value: the p-value for the test.

conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.

estimate: the estimated mean or difference in means depending on whether it was a one-sample test or a two-sample test.

null.value: the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test.

stderr: the standard error of the mean (difference), used as denominator in the t-statistic formula.

alternative: a character string describing the alternative hypothesis.

method: a character string indicating what type of t-test was performed.

data.name: a character string giving the name(s) of the data.

See Also:

 'prop.test'

Examples:

 require(graphics)
 
 t.test(1:10, y = c(7:20))      # P = .00001855
 t.test(1:10, y = c(7:20, 200)) # P = .1245    -- NOT significant anymore
 
 ## Classical example: Student's sleep data
 plot(extra ~ group, data = sleep)
 ## Traditional interface
 with(sleep, t.test(extra[group == 1], extra[group == 2]))
 
 ## Formula interface
 t.test(extra ~ group, data = sleep)
 
 ## Formula interface to one-sample test
 t.test(extra ~ 1, data = sleep)
 
 ## Formula interface to paired test
 ## The sleep data are actually paired, so could have been in wide format:
 sleep2 <- reshape(sleep, direction = "wide", 
                   idvar = "ID", timevar = "group")
 t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)

Running two-sample t-test

We use the base R t.test() function from the stats package, which tests the difference in means of a variable between two groups. By default, it:

  • tests whether the difference in means of a variable is equal to 0 (mu = 0); a one-sample sketch that overrides this default follows this list
  • uses a “two sided” alternative (alternative = "two.sided")
  • returns results assuming a confidence level of 0.95 (conf.level = 0.95)
  • assumes the data are not paired (paired = FALSE)
  • assumes the true variance in the two groups is not equal (var.equal = FALSE)
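
As a hedged illustration of the one-sample version, we can test whether the mean IgG concentration differs from 10 IU/mL (our seropositivity cutoff, used here purely as an illustrative null value):

# one-sample t-test against mu = 10 rather than the default mu = 0
t.test(df$IgG_concentration, mu = 10)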

Running two-sample t-test

IgG_young <- df$IgG_concentration[df$age_group=="young"]
IgG_old <- df$IgG_concentration[df$age_group=="old"]

t.test(IgG_young, IgG_old)

    Welch Two Sample t-test

data:  IgG_young and IgG_old
t = -6.1969, df = 259.54, p-value = 2.25e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -111.09281  -57.51515
sample estimates:
mean of x mean of y 
 45.05056 129.35454 

The mean IgG concentration of the young and old groups is 45.05 and 129.35 IU/mL, respectively. We reject the null hypothesis that the difference in the mean IgG concentration of young and old is 0 IU/mL.
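
The same comparison can be written with the formula interface from the help page; a minimal sketch, assuming we first drop the "middle" group so the grouping factor has exactly two levels:

# droplevels() removes the unused "middle" level after subsetting
t.test(IgG_concentration ~ age_group,
       data = droplevels(subset(df, age_group != "middle")))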

Linear regression fit in R

To fit regression models in R, we use the function glm() (Generalized Linear Model).

?glm

Fitting Generalized Linear Models

Description:

 'glm' is used to fit generalized linear models, specified by
 giving a symbolic description of the linear predictor and a
 description of the error distribution.

Usage:

 glm(formula, family = gaussian, data, weights, subset,
     na.action, start = NULL, etastart, mustart, offset,
     control = list(...), model = TRUE, method = "glm.fit",
     x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)
 
 glm.fit(x, y, weights = rep.int(1, nobs),
         start = NULL, etastart = NULL, mustart = NULL,
         offset = rep.int(0, nobs), family = gaussian(),
         control = list(), intercept = TRUE, singular.ok = TRUE)
 
 ## S3 method for class 'glm'
 weights(object, type = c("prior", "working"), ...)
 

Arguments:

formula: an object of class ‘“formula”’ (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

family: a description of the error distribution and link function to be used in the model. For ‘glm’ this can be a character string naming a family function, a family function or the result of a call to a family function. For ‘glm.fit’ only the third option is supported. (See ‘family’ for details of family functions.)

data: an optional data frame, list or environment (or object
      coercible by 'as.data.frame' to a data frame) containing the
      variables in the model.  If not found in 'data', the
      variables are taken from 'environment(formula)', typically
      the environment from which 'glm' is called.

weights: an optional vector of ‘prior weights’ to be used in the fitting process. Should be ‘NULL’ or a numeric vector.

subset: an optional vector specifying a subset of observations to be used in the fitting process.

na.action: a function which indicates what should happen when the data contain ‘NA’s. The default is set by the ‘na.action’ setting of ‘options’, and is ‘na.fail’ if that is unset. The ‘factory-fresh’ default is ‘na.omit’. Another possible value is ‘NULL’, no action. Value ‘na.exclude’ can be useful.

start: starting values for the parameters in the linear predictor.

etastart: starting values for the linear predictor.

mustart: starting values for the vector of means.

offset: this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be ‘NULL’ or a numeric vector of length equal to the number of cases. One or more ‘offset’ terms can be included in the formula instead or as well, and if more than one is specified their sum is used. See ‘model.offset’.

control: a list of parameters for controlling the fitting process. For ‘glm.fit’ this is passed to ‘glm.control’.

model: a logical value indicating whether model frame should be included as a component of the returned value.

method: the method to be used in fitting the model. The default method ‘“glm.fit”’ uses iteratively reweighted least squares (IWLS): the alternative ‘“model.frame”’ returns the model frame and does no fitting.

      User-supplied fitting functions can be supplied either as a
      function or a character string naming a function, with a
      function which takes the same arguments as 'glm.fit'.  If
      specified as a character string it is looked up from within
      the 'stats' namespace.

x, y: For 'glm': logical values indicating whether the response
      vector and model matrix used in the fitting process should be
      returned as components of the returned value.

      For 'glm.fit': 'x' is a design matrix of dimension 'n * p',
      and 'y' is a vector of observations of length 'n'.

singular.ok: logical; if ‘FALSE’ a singular fit is an error.

contrasts: an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.

intercept: logical. Should an intercept be included in the null model?

object: an object inheriting from class ‘“glm”’.

type: character, partial matching allowed.  Type of weights to
      extract from the fitted model object.  Can be abbreviated.

 ...: For 'glm': arguments to be used to form the default 'control'
      argument if it is not supplied directly.

      For 'weights': further arguments passed to or from other
      methods.

Details:

 A typical predictor has the form 'response ~ terms' where
 'response' is the (numeric) response vector and 'terms' is a
 series of terms which specifies a linear predictor for 'response'.
 For 'binomial' and 'quasibinomial' families the response can also
 be specified as a 'factor' (when the first level denotes failure
 and all others success) or as a two-column matrix with the columns
 giving the numbers of successes and failures.  A terms
 specification of the form 'first + second' indicates all the terms
 in 'first' together with all the terms in 'second' with any
 duplicates removed.

 A specification of the form 'first:second' indicates the set of
 terms obtained by taking the interactions of all terms in 'first'
 with all terms in 'second'.  The specification 'first*second'
 indicates the _cross_ of 'first' and 'second'.  This is the same
 as 'first + second + first:second'.

 The terms in the formula will be re-ordered so that main effects
 come first, followed by the interactions, all second-order, all
 third-order and so on: to avoid this pass a 'terms' object as the
 formula.

 Non-'NULL' 'weights' can be used to indicate that different
 observations have different dispersions (with the values in
 'weights' being inversely proportional to the dispersions); or
 equivalently, when the elements of 'weights' are positive integers
 w_i, that each response y_i is the mean of w_i unit-weight
 observations.  For a binomial GLM prior weights are used to give
 the number of trials when the response is the proportion of
 successes: they would rarely be used for a Poisson GLM.

 'glm.fit' is the workhorse function: it is not normally called
 directly but can be more efficient where the response vector,
 design matrix and family have already been calculated.

 If more than one of 'etastart', 'start' and 'mustart' is
 specified, the first in the list will be used.  It is often
 advisable to supply starting values for a 'quasi' family, and also
 for families with unusual links such as 'gaussian("log")'.

 All of 'weights', 'subset', 'offset', 'etastart' and 'mustart' are
 evaluated in the same way as variables in 'formula', that is first
 in 'data' and then in the environment of 'formula'.

 For the background to warning messages about 'fitted probabilities
 numerically 0 or 1 occurred' for binomial GLMs, see Venables &
 Ripley (2002, pp. 197-8).

Value:

 'glm' returns an object of class inheriting from '"glm"' which
 inherits from the class '"lm"'. See later in this section.  If a
 non-standard 'method' is used, the object will also inherit from
 the class (if any) returned by that function.

 The function 'summary' (i.e., 'summary.glm') can be used to obtain
 or print a summary of the results and the function 'anova' (i.e.,
 'anova.glm') to produce an analysis of variance table.

 The generic accessor functions 'coefficients', 'effects',
 'fitted.values' and 'residuals' can be used to extract various
 useful features of the value returned by 'glm'.

 'weights' extracts a vector of weights, one for each case in the
 fit (after subsetting and 'na.action').

 An object of class '"glm"' is a list containing at least the
 following components:

coefficients: a named vector of coefficients

residuals: the working residuals, that is the residuals in the final iteration of the IWLS fit. Since cases with zero weights are omitted, their working residuals are ‘NA’.

fitted.values: the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.

rank: the numeric rank of the fitted linear model.

family: the ‘family’ object used.

linear.predictors: the linear fit on link scale.

deviance: up to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero.

 aic: A version of Akaike's _An Information Criterion_, minus twice
      the maximized log-likelihood plus twice the number of
      parameters, computed via the 'aic' component of the family.
       For binomial and Poisson families the dispersion is fixed at
      one and the number of parameters is the number of
      coefficients.  For gaussian, Gamma and inverse gaussian
      families the dispersion is estimated from the residual
      deviance, and the number of parameters is the number of
      coefficients plus one.  For a gaussian family the MLE of the
      dispersion is used so this is a valid value of AIC, but for
      Gamma and inverse gaussian families it is not.  For families
      fitted by quasi-likelihood the value is 'NA'.

null.deviance: The deviance for the null model, comparable with ‘deviance’. The null model will include the offset, and an intercept if there is one in the model. Note that this will be incorrect if the link function depends on the data other than through the fitted mean: specify a zero offset to force a correct calculation.

iter: the number of iterations of IWLS used.

weights: the working weights, that is the weights in the final iteration of the IWLS fit.

prior.weights: the weights initially supplied, a vector of ‘1’s if none were.

df.residual: the residual degrees of freedom.

df.null: the residual degrees of freedom for the null model.

   y: if requested (the default) the 'y' vector used. (It is a
      vector even for a binomial model.)

   x: if requested, the model matrix.

model: if requested (the default), the model frame.

converged: logical. Was the IWLS algorithm judged to have converged?

boundary: logical. Is the fitted value on the boundary of the attainable values?

call: the matched call.

formula: the formula supplied.

terms: the ‘terms’ object used.

data: the 'data argument'.

offset: the offset vector used.

control: the value of the ‘control’ argument used.

method: the name of the fitter function used (when provided as a ‘character’ string to ‘glm()’) or the fitter ‘function’ (when provided as that).

contrasts: (where relevant) the contrasts used.

xlevels: (where relevant) a record of the levels of the factors used in fitting.

na.action: (where relevant) information returned by ‘model.frame’ on the special handling of ‘NA’s.

 In addition, non-empty fits will have components 'qr', 'R' and
 'effects' relating to the final weighted linear fit.

 Objects of class '"glm"' are normally of class 'c("glm", "lm")',
 that is inherit from class '"lm"', and well-designed methods for
 class '"lm"' will be applied to the weighted linear model at the
 final iteration of IWLS.  However, care is needed, as extractor
 functions for class '"glm"' such as 'residuals' and 'weights' do
 *not* just pick out the component of the fit with the same name.

 If a 'binomial' 'glm' model was specified by giving a two-column
 response, the weights returned by 'prior.weights' are the total
 numbers of cases (factored by the supplied case weights) and the
 component 'y' of the result is the proportion of successes.

Fitting functions:

 The argument 'method' serves two purposes.  One is to allow the
 model frame to be recreated with no fitting.  The other is to
 allow the default fitting function 'glm.fit' to be replaced by a
 function which takes the same arguments and uses a different
 fitting algorithm.  If 'glm.fit' is supplied as a character string
 it is used to search for a function of that name, starting in the
 'stats' namespace.

 The class of the object return by the fitter (if any) will be
 prepended to the class returned by 'glm'.

Author(s):

 The original R implementation of 'glm' was written by Simon Davies
 working for Ross Ihaka at the University of Auckland, but has
 since been extensively re-written by members of the R Core team.

 The design was inspired by the S function of the same name
 described in Hastie & Pregibon (1992).

References:

 Dobson, A. J. (1990) _An Introduction to Generalized Linear
 Models._ London: Chapman and Hall.

 Hastie, T. J. and Pregibon, D. (1992) _Generalized linear models._
 Chapter 6 of _Statistical Models in S_ eds J. M. Chambers and T.
 J. Hastie, Wadsworth & Brooks/Cole.

 McCullagh P. and Nelder, J. A. (1989) _Generalized Linear Models._
 London: Chapman and Hall.

 Venables, W. N. and Ripley, B. D. (2002) _Modern Applied
 Statistics with S._ New York: Springer.

See Also:

 'anova.glm', 'summary.glm', etc. for 'glm' methods, and the
 generic functions 'anova', 'summary', 'effects', 'fitted.values',
 and 'residuals'.

 'lm' for non-generalized _linear_ models (which SAS calls GLMs,
 for 'general' linear models).

 'loglin' and 'loglm' (package 'MASS') for fitting log-linear
 models (which binomial and Poisson GLMs are) to contingency
 tables.

 'bigglm' in package 'biglm' for an alternative way to fit GLMs to
 large datasets (especially those with many cases).

 'esoph', 'infert' and 'predict.glm' have examples of fitting
 binomial glms.

Examples:

 ## Dobson (1990) Page 93: Randomized Controlled Trial :
 counts <- c(18,17,15,20,10,20,25,13,12)
 outcome <- gl(3,1,9)
 treatment <- gl(3,3)
 data.frame(treatment, outcome, counts) # showing data
 glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
 anova(glm.D93)
 summary(glm.D93)
 ## Computing AIC [in many ways]:
 (A0 <- AIC(glm.D93))
 (ll <- logLik(glm.D93))
 A1 <- -2*c(ll) + 2*attr(ll, "df")
 A2 <- glm.D93$family$aic(counts, mu=fitted(glm.D93), wt=1) +
         2 * length(coef(glm.D93))
 stopifnot(exprs = {
   all.equal(A0, A1)
   all.equal(A1, A2)
   all.equal(A1, glm.D93$aic)
 })
 
 
 ## an example with offsets from Venables & Ripley (2002, p.189)
 utils::data(anorexia, package = "MASS")
 
 anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
                 family = gaussian, data = anorexia)
 summary(anorex.1)
 
 
 # A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)
 clotting <- data.frame(
     u = c(5,10,15,20,30,40,60,80,100),
     lot1 = c(118,58,42,35,27,25,21,19,18),
     lot2 = c(69,35,26,21,18,16,13,12,12))
 summary(glm(lot1 ~ log(u), data = clotting, family = Gamma))
 summary(glm(lot2 ~ log(u), data = clotting, family = Gamma))
 ## Aliased ("S"ingular) -> 1 NA coefficient
 (fS <- glm(lot2 ~ log(u) + log(u^2), data = clotting, family = Gamma))
 tools::assertError(update(fS, singular.ok=FALSE), verbose=interactive())
 ## -> .. "singular fit encountered"
 
 ## Not run:
 
 ## for an example of the use of a terms object as a formula
 demo(glm.vr)
 ## End(Not run)

Linear regression fit in R

We tend to focus on three arguments:

  • formula – model formula written using names of columns in our data
  • data – our data frame
  • family – error distribution and link function
fit1 <- glm(IgG_concentration~age+gender+slum, data=df, family=gaussian())
fit2 <- glm(seropos~age_group+gender+slum, data=df, family = binomial(link = "logit"))

summary.glm()

The summary() function, when applied to a fit object produced by glm(), actually dispatches to the summary.glm() method and produces details of the model fit. This is a note on R’s object-oriented (S3) code: the generic function picks a method based on the class of its argument.
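
A quick sketch showing the dispatch at work (output not shown):

# glm() returns an object of class c("glm", "lm"); summary() uses this
# class attribute to select the summary.glm() method
class(fit1)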

Summarizing Generalized Linear Model Fits

Description:

 These functions are all 'methods' for class 'glm' or 'summary.glm'
 objects.

Usage:

 ## S3 method for class 'glm'
 summary(object, dispersion = NULL, correlation = FALSE,
         symbolic.cor = FALSE, ...)
 
 ## S3 method for class 'summary.glm'
 print(x, digits = max(3, getOption("digits") - 3),
       symbolic.cor = x$symbolic.cor,
       signif.stars = getOption("show.signif.stars"),
       show.residuals = FALSE, ...)
 

Arguments:

object: an object of class ‘“glm”’, usually, a result of a call to ‘glm’.

   x: an object of class '"summary.glm"', usually, a result of a
      call to 'summary.glm'.

dispersion: the dispersion parameter for the family used. Either a single numerical value or ‘NULL’ (the default), when it is inferred from ‘object’ (see ‘Details’).

correlation: logical; if ‘TRUE’, the correlation matrix of the estimated parameters is returned and printed.

digits: the number of significant digits to use when printing.

symbolic.cor: logical. If ‘TRUE’, print the correlations in a symbolic form (see ‘symnum’) rather than as numbers.

signif.stars: logical. If ‘TRUE’, ‘significance stars’ are printed for each coefficient.

show.residuals: logical. If ‘TRUE’ then a summary of the deviance residuals is printed at the head of the output.

 ...: further arguments passed to or from other methods.

Details:

 'print.summary.glm' tries to be smart about formatting the
 coefficients, standard errors, etc. and additionally gives
 'significance stars' if 'signif.stars' is 'TRUE'.  The
 'coefficients' component of the result gives the estimated
 coefficients and their estimated standard errors, together with
 their ratio.  This third column is labelled 't ratio' if the
 dispersion is estimated, and 'z ratio' if the dispersion is known
 (or fixed by the family).  A fourth column gives the two-tailed
 p-value corresponding to the t or z ratio based on a Student t or
 Normal reference distribution.  (It is possible that the
 dispersion is not known and there are no residual degrees of
 freedom from which to estimate it.  In that case the estimate is
 'NaN'.)

 Aliased coefficients are omitted in the returned object but
 restored by the 'print' method.

 Correlations are printed to two decimal places (or symbolically):
 to see the actual correlations print 'summary(object)$correlation'
 directly.

 The dispersion of a GLM is not used in the fitting process, but it
 is needed to find standard errors.  If 'dispersion' is not
 supplied or 'NULL', the dispersion is taken as '1' for the
 'binomial' and 'Poisson' families, and otherwise estimated by the
 residual Chisquared statistic (calculated from cases with non-zero
 weights) divided by the residual degrees of freedom.

 'summary' can be used with Gaussian 'glm' fits to handle the case
 of a linear regression with known error variance, something not
 handled by 'summary.lm'.

Value:

 'summary.glm' returns an object of class '"summary.glm"', a list
 with components

call: the component from 'object'.

family: the component from ‘object’.

deviance: the component from ‘object’.

contrasts: the component from ‘object’.

df.residual: the component from ‘object’.

null.deviance: the component from ‘object’.

df.null: the component from ‘object’.

deviance.resid: the deviance residuals: see ‘residuals.glm’.

coefficients: the matrix of coefficients, standard errors, z-values and p-values. Aliased coefficients are omitted.

aliased: named logical vector showing if the original coefficients are aliased.

dispersion: either the supplied argument or the inferred/estimated dispersion if the former is ‘NULL’.

  df: a 3-vector of the rank of the model and the number of
      residual degrees of freedom, plus number of coefficients
      (including aliased ones).

cov.unscaled: the unscaled (‘dispersion = 1’) estimated covariance matrix of the estimated coefficients.

cov.scaled: ditto, scaled by ‘dispersion’.

correlation: (only if ‘correlation’ is true.) The estimated correlations of the estimated coefficients.

symbolic.cor: (only if ‘correlation’ is true.) The value of the argument ‘symbolic.cor’.

See Also:

 'glm', 'summary'.

Examples:

 ## For examples see example(glm)

Linear regression fit in R

Let’s look at the output…

summary(fit1)

Call:
glm(formula = IgG_concentration ~ age + gender + slum, family = gaussian(), 
    data = df)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    46.132     16.774   2.750  0.00613 ** 
age             9.324      1.388   6.718 4.15e-11 ***
genderMale     -9.655     11.543  -0.836  0.40321    
slumNon slum  -20.353     14.299  -1.423  0.15513    
slumSlum      -29.705     25.009  -1.188  0.23536    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 20918.39)

    Null deviance: 14141483  on 631  degrees of freedom
Residual deviance: 13115831  on 627  degrees of freedom
  (19 observations deleted due to missingness)
AIC: 8087.9

Number of Fisher Scoring iterations: 2
summary(fit2)

Call:
glm(formula = seropos ~ age_group + gender + slum, family = binomial(link = "logit"), 
    data = df)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.3220     0.2516  -5.254 1.49e-07 ***
age_groupmiddle   1.9020     0.2133   8.916  < 2e-16 ***
age_groupold      2.8443     0.2522  11.278  < 2e-16 ***
genderMale       -0.1725     0.1895  -0.910    0.363    
slumNon slum     -0.1099     0.2329  -0.472    0.637    
slumSlum         -0.1073     0.4118  -0.261    0.794    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 866.98  on 631  degrees of freedom
Residual deviance: 679.10  on 626  degrees of freedom
  (19 observations deleted due to missingness)
AIC: 691.1

Number of Fisher Scoring iterations: 4
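
Because fit2 uses a logit link, exponentiating the coefficients converts them to odds ratios. A minimal sketch using Wald intervals (output not shown):

# odds ratios with 95% Wald confidence intervals
exp(cbind(OR = coef(fit2), confint.default(fit2)))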

Summary

  • the aggregate() function can be used to conduct analyses across groups (i.e., by categorical variables in the data)
  • the table() function can generate frequency tables for two or more variables, and to get proportion tables, prop.table() is useful
  • the chisq.test() function tests independence of factor variables
  • the cor() and cor.test() functions can be used to calculate correlation between two numeric vectors
  • the t.test() function conducts one- and two-sample (paired or unpaired) t-tests
  • the function glm() fits generalized linear models to data and returns a fit object that can be read with the summary() function
  • changing the family argument in the glm() function allows you to fit models with different error distributions and link functions (a hedged sketch follows below)
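
A hedged sketch of that last point, using a hypothetical count outcome (num_infections is not in our data; it is shown only to illustrate the interface):

# hypothetical: a Poisson model for counts just swaps the family argument
# glm(num_infections ~ age_group + slum, data = df, family = poisson(link = "log"))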

Acknowledgements

These are the materials we looked through, modified, or extracted to complete this module’s lecture.