Data Analysis Walkthrough

Learning goals

  • Use logical operators, subsetting functions, and math calculations in R
  • Translate human-understandable problem descriptions into instructions that R can understand.

Remember, R always does EXACTLY what you tell it to do!

Instructions

  • Make a new R script for this case study, and save it to your code folder.
  • We’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.

Instructions

  • Make a new R script for this case study, and save it to your code folder.
  • We’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.
  • The str() of your dataset should look like this.
tibble [250 × 5] (S3: tbl_df/tbl/data.frame)
 $ age_months  : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...
 $ group       : chr [1:250] "urban" "rural" "urban" "urban" ...
 $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...
 $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...
 $ DP_vacc     : num [1:250] 0 1 1 1 1 1 1 1 1 1 ...

Q1: Was the overall prevalence higher in urban or rural areas?

  1. How do we calculate the prevalence from the data?
  2. How do we calculate the prevalence separately for urban and rural areas?
  3. How do we determine which prevalence is higher and if the difference is meaningful?

Q1: How do we calculate the prevalence from the data?

  • The variable DP_infection in our dataset is binary / dichotomous.
  • The prevalence is the number or percent of people who had the disease over some duration.
  • The average of a binary variable gives the prevalence!
mean(diph$DP_infection)
[1] 0.8

Q1: How do we calculate the prevalence separately for urban and rural areas?

mean(diph[diph$group == "urban", ]$DP_infection)
[1] 0.8235294
mean(diph[diph$group == "rural", ]$DP_infection)
[1] 0.778626
  • There are many ways you could write this code! You can use subset() or you can write the indices many ways.
  • Using tbl_df objects from haven uses different [[ rules than a base R data frame.

Q1: How do we calculate the prevalence separately for urban and rural areas?

  • One easy way is to use the aggregate() function.
aggregate(DP_infection ~ group, data = diph, FUN = mean)
  group DP_infection
1 rural    0.7786260
2 urban    0.8235294

Q1: How do we determine which prevalence is higher and if the difference is meaningful?

  • We probably need to include a confidence interval in our calculation.
  • This is actually not so easy without more advanced tools that we will learn in upcoming modules.
  • Right now the best options are to do it by hand or google a function.

Q1: By hand

p_urban <- mean(diph[diph$group == "urban", ]$DP_infection)
p_rural <- mean(diph[diph$group == "rural", ]$DP_infection)
se_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == "urban", ]))
se_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == "rural", ])) 

result_urban <- paste0(
    "Urban: ", round(p_urban, 2), "; 95% CI: (",
    round(p_urban - 1.96 * se_urban, 2), ", ",
    round(p_urban + 1.96 * se_urban, 2), ")"
)

result_rural <- paste0(
    "Rural: ", round(p_rural, 2), "; 95% CI: (",
    round(p_rural - 1.96 * se_rural, 2), ", ",
    round(p_rural + 1.96 * se_rural, 2), ")"
)

cat(result_urban, result_rural, sep = "\n")
Urban: 0.82; 95% CI: (0.76, 0.89)
Rural: 0.78; 95% CI: (0.71, 0.85)

Q1: By hand

  • We can see that the 95% CI’s overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.
  • Some people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code! We will learn techniques for writing this with less work and less repetition in upcoming modules.

Q1: Googling a package

# install.packages("DescTools")
library(DescTools)

aggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)
  group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci
1 rural         0.7786260           0.7065872           0.8506647
2 urban         0.8235294           0.7540334           0.8930254

You try it!

  • Using any of the approaches you can think of, answer this question!
  • How many children under 5 were vaccinated? In children under 5, did vaccination lower the prevalence of infection?

You try it!

# How many children under 5 were vaccinated
sum(diph$DP_vacc[diph$age_months < 60])
[1] 91
# Prevalence in both vaccine groups for children under 5
aggregate(
    DP_infection ~ DP_vacc,
    data = subset(diph, age_months < 60),
    FUN = DescTools::MeanCI
)
  DP_vacc DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci
1       0         0.4285714           0.1977457           0.6593972
2       1         0.6373626           0.5366845           0.7380407

It appears that prevalence was HIGHER in the vaccine group? That is counterintuitive, but the sample size for the unvaccinated group is too small to be sure.

Congratulations for finishing the first case study!

  • What R functions and skills did you practice?
  • What other questions could you answer about the same dataset with the skills you know now?