Module 8: Data Merging and Reshaping

Learning Objectives

After module 8, you should be able to…

  • Merge/join data together
  • Reshape data from wide to long
  • Reshape data from long to wide

Joining types

Pay close attention to the number of rows in your data set before and after a join. This will help flag when an issue has arisen. This will depend on the type of merge:

  • 1:1 merge (one-to-one merge) – Simplest merge (sometimes things go wrong)
  • 1:m merge (one-to-many merge) – More complex (things often go wrong)
    • The “one” suggests that one dataset has the merging variable (e.g., id) each represented once and the “many” implies that one dataset has the merging variable represented multiple times
  • m:m merge (many-to-many merge) – Danger zone (can be unpredictable)

one-to-one merge

  • This means that each row of data represents a unique unit of analysis that exists in another dataset (e.g,. id variable)
  • Will likely have variables that don’t exist in the current dataset (that’s why you are trying to merge it in)
  • The merging variable (e.g., id) each represented a single time
  • You should try to structure your data so that a 1:1 merge or 1:m merge is possible so that fewer things can go wrong.

merge() function

We will use the merge() function to conduct one-to-one merge

?merge
Registered S3 method overwritten by 'printr':
  method                from     
  knit_print.data.frame rmarkdown

Merge Two Data Frames

Description:

 Merge two data frames by common columns or row names, or do other
 versions of database _join_ operations.

Usage:

 merge(x, y, ...)
 
 ## Default S3 method:
 merge(x, y, ...)
 
 ## S3 method for class 'data.frame'
 merge(x, y, by = intersect(names(x), names(y)),
       by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
       sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
       incomparables = NULL, ...)
 

Arguments:

x, y: data frames, or objects to be coerced to one.

by, by.x, by.y: specifications of the columns used for merging. See ‘Details’.

 all: logical; 'all = L' is shorthand for 'all.x = L' and 'all.y =
      L', where 'L' is either 'TRUE' or 'FALSE'.

all.x: logical; if ‘TRUE’, then extra rows will be added to the output, one for each row in ‘x’ that has no matching row in ‘y’. These rows will have ‘NA’s in those columns that are usually filled with values from ’y’. The default is ‘FALSE’, so that only rows with data from both ‘x’ and ‘y’ are included in the output.

all.y: logical; analogous to ‘all.x’.

sort: logical.  Should the result be sorted on the 'by' columns?

suffixes: a character vector of length 2 specifying the suffixes to be used for making unique the names of columns in the result which are not used for merging (appearing in ‘by’ etc).

no.dups: logical indicating that ‘suffixes’ are appended in more cases to avoid duplicated column names in the result. This was implicitly false before R version 3.5.0.

incomparables: values which cannot be matched. See ‘match’. This is intended to be used for merging on one column, so these are incomparable values of that column.

 ...: arguments to be passed to or from methods.

Details:

 'merge' is a generic function whose principal method is for data
 frames: the default method coerces its arguments to data frames
 and calls the '"data.frame"' method.

 By default the data frames are merged on the columns with names
 they both have, but separate specifications of the columns can be
 given by 'by.x' and 'by.y'.  The rows in the two data frames that
 match on the specified columns are extracted, and joined together.
 If there is more than one match, all possible matches contribute
 one row each.  For the precise meaning of 'match', see 'match'.

 Columns to merge on can be specified by name, number or by a
 logical vector: the name '"row.names"' or the number '0' specifies
 the row names.  If specified by name it must correspond uniquely
 to a named column in the input.

 If 'by' or both 'by.x' and 'by.y' are of length 0 (a length zero
 vector or 'NULL'), the result, 'r', is the _Cartesian product_ of
 'x' and 'y', i.e., 'dim(r) = c(nrow(x)*nrow(y), ncol(x) +
 ncol(y))'.

 If 'all.x' is true, all the non matching cases of 'x' are appended
 to the result as well, with 'NA' filled in the corresponding
 columns of 'y'; analogously for 'all.y'.

 If the columns in the data frames not used in merging have any
 common names, these have 'suffixes' ('".x"' and '".y"' by default)
 appended to try to make the names of the result unique.  If this
 is not possible, an error is thrown.

 If a 'by.x' column name matches one of 'y', and if 'no.dups' is
 true (as by default), the y version gets suffixed as well,
 avoiding duplicate column names in the result.

 The complexity of the algorithm used is proportional to the length
 of the answer.

 In SQL database terminology, the default value of 'all = FALSE'
 gives a _natural join_, a special case of an _inner join_.
 Specifying 'all.x = TRUE' gives a _left (outer) join_, 'all.y =
 TRUE' a _right (outer) join_, and both ('all = TRUE') a _(full)
 outer join_.  DBMSes do not match 'NULL' records, equivalent to
 'incomparables = NA' in R.

Value:

 A data frame.  The rows are by default lexicographically sorted on
 the common columns, but for 'sort = FALSE' are in an unspecified
 order.  The columns are the common columns followed by the
 remaining columns in 'x' and then those in 'y'.  If the matching
 involved row names, an extra character column called 'Row.names'
 is added at the left, and in all cases the result has 'automatic'
 row names.

Note:

 This is intended to work with data frames with vector-like
 columns: some aspects work with data frames containing matrices,
 but not all.

 Currently long vectors are not accepted for inputs, which are thus
 restricted to less than 2^31 rows. That restriction also applies
 to the result for 32-bit platforms.

See Also:

 'data.frame', 'by', 'cbind'.

 'dendrogram' for a class which has a 'merge' method.

Examples:

 authors <- data.frame(
     ## I(*) : use character columns of names to get sensible sort order
     surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
     nationality = c("US", "Australia", "US", "UK", "Australia"),
     deceased = c("yes", rep("no", 4)))
 authorN <- within(authors, { name <- surname; rm(surname) })
 books <- data.frame(
     name = I(c("Tukey", "Venables", "Tierney",
              "Ripley", "Ripley", "McNeil", "R Core")),
     title = c("Exploratory Data Analysis",
               "Modern Applied Statistics ...",
               "LISP-STAT",
               "Spatial Statistics", "Stochastic Simulation",
               "Interactive Data Analysis",
               "An Introduction to R"),
     other.author = c(NA, "Ripley", NA, NA, NA, NA,
                      "Venables & Smith"))
 
 (m0 <- merge(authorN, books))
 (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
  m2 <- merge(books, authors, by.x = "name", by.y = "surname")
 stopifnot(exprs = {
    identical(m0, m2[, names(m0)])
    as.character(m1[, 1]) == as.character(m2[, 1])
    all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ])
    identical(dim(merge(m1, m2, by = NULL)),
              c(nrow(m1)*nrow(m2), ncol(m1)+ncol(m2)))
 })
 
 ## "R core" is missing from authors and appears only here :
 merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)
 
 
 ## example of using 'incomparables'
 x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
 y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
 merge(x, y, by = c("k1","k2")) # NA's match
 merge(x, y, by = "k1") # NA's match, so 6 rows
 merge(x, y, by = "k2", incomparables = NA) # 2 rows

Join Types

  • Full join: includes all unique observations in object df.x and df.y
    • merged.df <- merge(df.x, df.y, all.x=T, all.y=T, by=merge_variable)
    • arguments all = TRUE is the same as all.x = TRUE, all.y = TRUE
    • the number of rows in merged.df is >= max(nrow(df.x), nrow(df.y))
  • Inner join: includes observations that are in both df.x and df.y
    • merged.df <- merge(df.x, df.y, all.x=F, all.y=F, by=merge_variable)
    • the number of rows in merged.df is <= min(nrow(df.x), nrow(df.y))
  • Left join: joining on the first object (df.x) so it includes observations that in df.x
    • merged.df <- merge(df.x, df.y, all.x=T, all.y=F, by=merge_variable)
    • the number of rows in merged.df is nrow(df.x)
  • Right join: joining on the second object (df.y) so it includes observations that in df.y
    • merged.df <- merge(df.x, df.y, all.x=F, all.y=T, by=merge_variable)
    • the number of rows in merged.df is nrow(df.y)

Lets import the new data we want to merge and take a look

The new data serodata_new.csv represents a follow-up serological survey four years later. At this follow-up individuals were retested for IgG antibody concentrations and their ages were collected.

df_new <- read.csv("data/serodata_new.csv")
str(df_new)
'data.frame':   636 obs. of  3 variables:
 $ observation_id   : int  5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...
 $ IgG_concentration: num  0.261 2.981 0.282 136.638 0.381 ...
 $ age              : int  6 8 8 8 5 8 8 NA 8 6 ...
summary(df_new)
observation_id IgG_concentration age
Min. :5006 Min. : 0.0051 Min. : 5.00
1st Qu.:6328 1st Qu.: 0.2751 1st Qu.: 7.00
Median :7494 Median : 1.5477 Median :10.00
Mean :7490 Mean : 82.7684 Mean :10.63
3rd Qu.:8736 3rd Qu.:129.6389 3rd Qu.:14.00
Max. :9982 Max. :950.6590 Max. :19.00
NA NA NA’s :9

Merge the new data with the original data

Lets load the old data as well and look for a variable, or variables, to merge by.

df <- read.csv("data/serodata.csv")
colnames(df)
[1] "observation_id"    "IgG_concentration" "age"              
[4] "gender"            "slum"             

We notice that observation_id seems to be the obvious variable by which to merge. However, we also realize that IgG_concentration and age are the exact same names. If we merge now we see that R has forced the IgG_concentration and age to have a .x or .y to make sure that these variables are different.

head(merge(df, df_new, all.x=T, all.y=T, by=c('observation_id')))
observation_id IgG_concentration.x age.x gender slum IgG_concentration.y age.y
5006 164.2979452 7 Male Non slum 155.5811325 11
5024 0.3000000 5 Female Non slum 0.2918605 9
5026 0.3000000 10 Female Non slum 0.2542945 14
5030 0.0555556 7 Female Non slum 0.0533262 11
5035 26.2112514 11 Female Non slum 22.0159300 15
5054 0.3000000 3 Male Non slum 0.2709671 7

Merge the new data with the original data

What do we do?

The first option is to rename the IgG_concentration and age variables before the merge, so that it is clear which is time point 1 and time point 2.

df$IgG_concentration_time1 <- df$IgG_concentration
df$age_time1 <- df$age
df$IgG_concentration <- df$age <- NULL #remove the original variables

df_new$IgG_concentration_time2 <- df_new$IgG_concentration
df_new$age_time2 <- df_new$age
df_new$IgG_concentration <- df_new$age <- NULL #remove the original variables

Now, lets merge.

df_all_wide <- merge(df, df_new, all.x=T, all.y=T, by=c('observation_id'))
str(df_all_wide)
'data.frame':   651 obs. of  7 variables:
 $ observation_id         : int  5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
 $ gender                 : chr  "Male" "Female" "Female" "Female" ...
 $ slum                   : chr  "Non slum" "Non slum" "Non slum" "Non slum" ...
 $ IgG_concentration_time1: num  164.2979 0.3 0.3 0.0556 26.2113 ...
 $ age_time1              : int  7 5 10 7 11 3 3 12 14 6 ...
 $ IgG_concentration_time2: num  155.5811 0.2919 0.2543 0.0533 22.0159 ...
 $ age_time2              : int  11 9 14 11 15 7 7 16 18 10 ...
head(df_all_wide)
observation_id gender slum IgG_concentration_time1 age_time1 IgG_concentration_time2 age_time2
5006 Male Non slum 164.2979452 7 155.5811325 11
5024 Female Non slum 0.3000000 5 0.2918605 9
5026 Female Non slum 0.3000000 10 0.2542945 14
5030 Female Non slum 0.0555556 7 0.0533262 11
5035 Female Non slum 26.2112514 11 22.0159300 15
5054 Male Non slum 0.3000000 3 0.2709671 7

Merge the new data with the original data

The second option is to add a time variable to the two data sets and then merge by observation_id, time, age, and IgG_concentration. Note, I need to read in the data again b/c I removed the IgG_concentration and age variables.

df <- read.csv("data/serodata.csv")
df_new <- read.csv("data/serodata_new.csv")
df$time <- 1 #you can put in one number and it will repeat it
df_new$time <- 2
head(df)
observation_id IgG_concentration age gender slum time
5772 0.3176895 2 Female Non slum 1
8095 3.4368231 4 Female Non slum 1
9784 0.3000000 4 Male Non slum 1
9338 143.2363014 4 Male Non slum 1
6369 0.4476534 1 Male Non slum 1
6885 0.0252708 4 Male Non slum 1
head(df_new)
observation_id IgG_concentration age time
5772 0.2612388 6 2
8095 2.9809049 8 2
9784 0.2819489 8 2
9338 136.6382260 8 2
6369 0.3810119 5 2
6885 0.0245951 8 2

Now, lets merge. Note, “By default the data frames are merged on the columns with names they both have” therefore if I don’t specify the by argument it will merge on all matching variables.

df_all_long <- merge(df, df_new, all.x=T, all.y=T)
str(df_all_long)
'data.frame':   1287 obs. of  6 variables:
 $ observation_id   : int  5006 5006 5024 5024 5026 5026 5030 5030 5035 5035 ...
 $ IgG_concentration: num  155.581 164.298 0.292 0.3 0.254 ...
 $ age              : int  11 7 9 5 14 10 11 7 15 11 ...
 $ time             : num  2 1 2 1 2 1 2 1 2 1 ...
 $ gender           : chr  NA "Male" NA "Female" ...
 $ slum             : chr  NA "Non slum" NA "Non slum" ...
head(df_all_long)
observation_id IgG_concentration age time gender slum
5006 155.5811325 11 2 NA NA
5006 164.2979452 7 1 Male Non slum
5024 0.2918605 9 2 NA NA
5024 0.3000000 5 1 Female Non slum
5026 0.2542945 14 2 NA NA
5026 0.3000000 10 1 Female Non slum

Note, there are 1287 rows, which is the sum of the number of rows of df (651 rows) and df_new (636 rows)

Notice that there are some missing values though, because df_new doesn’t have the gender or slum variables. If we assume that those are constant and don’t change between the two study points, we can fill in the data points before merging for an easy solution. One easy way to make a new dataframe from df_new with extra columns is to use the transform() function, which lets us make multiple column changes to a data frame at one time. We just need to make sure to match the correct observation_id values together, using the match() function.

df_new_filled <- transform(
  df_new,
  gender = df[match(df_new$observation_id, df$observation_id), "gender"],
  slum = df[match(df_new$observation_id, df$observation_id), "slum"]
)

Now we can redo the merge.

df_all_long <- merge(df, df_new_filled, all.x=T, all.y=T)
head(df_all_long)
observation_id IgG_concentration age gender slum time
5006 155.5811325 11 Male Non slum 2
5006 164.2979452 7 Male Non slum 1
5024 0.2918605 9 Female Non slum 2
5024 0.3000000 5 Female Non slum 1
5026 0.2542945 14 Female Non slum 2
5026 0.3000000 10 Female Non slum 1

Looks good now! Another solution would be to edit the data file, or use a function that can actually fill in missing values for the same individual, like zoo::na.locf().

What is wide/long data?

Above, we actually created a wide and long version of the data.

Wide: has many columns

  • multiple columns per individual, values spread across multiple columns
  • easier for humans to read

Long: has many rows

  • column names become data
  • multiple rows per observation, a single column contains the values
  • easier for R to make plots & do analysis

reshape() function

The reshape() function allows you to toggle between wide and long data

?reshape

Reshape Grouped Data

Description:

 This function reshapes a data frame between 'wide' format (with
 repeated measurements in separate columns of the same row) and
 'long' format (with the repeated measurements in separate rows).

Usage:

 reshape(data, varying = NULL, v.names = NULL, timevar = "time",
         idvar = "id", ids = 1:NROW(data),
         times = seq_along(varying[[1]]),
         drop = NULL, direction, new.row.names = NULL,
         sep = ".",
         split = if (sep == "") {
             list(regexp = "[A-Za-z][0-9]", include = TRUE)
         } else {
             list(regexp = sep, include = FALSE, fixed = TRUE)}
         )
 
 ### Typical usage for converting from long to wide format:
 
 # reshape(data, direction = "wide",
 #         idvar = "___", timevar = "___", # mandatory
 #         v.names = c(___),    # time-varying variables
 #         varying = list(___)) # auto-generated if missing
 
 ### Typical usage for converting from wide to long format:
 
 ### If names of wide-format variables are in a 'nice' format
 
 # reshape(data, direction = "long",
 #         varying = c(___), # vector 
 #         sep)              # to help guess 'v.names' and 'times'
 
 ### To specify long-format variable names explicitly
 
 # reshape(data, direction = "long",
 #         varying = ___,  # list / matrix / vector (use with care)
 #         v.names = ___,  # vector of variable names in long format
 #         timevar, times, # name / values of constructed time variable
 #         idvar, ids)     # name / values of constructed id variable
 

Arguments:

data: a data frame

varying: names of sets of variables in the wide format that correspond to single variables in long format (‘time-varying’). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, when ‘direction = “long”’, the names can be replaced by indices which are interpreted as referring to ‘names(data)’. See ‘Details’ for more details and options.

v.names: names of variables in the long format that correspond to multiple variables in the wide format. See ‘Details’.

timevar: the variable in long format that differentiates multiple records from the same group or individual. If more than one record matches, the first will be taken (with a warning).

idvar: Names of one or more variables in long format that identify multiple records from the same group/individual. These variables may also be present in wide format.

 ids: the values to use for a newly created 'idvar' variable in
      long format.

times: the values to use for a newly created ‘timevar’ variable in long format. See ‘Details’.

drop: a vector of names of variables to drop before reshaping.

direction: character string, partially matched to either ‘“wide”’ to reshape to wide format, or ‘“long”’ to reshape to long format.

new.row.names: character or ‘NULL’: a non-null value will be used for the row names of the result.

 sep: A character vector of length 1, indicating a separating
      character in the variable names in the wide format.  This is
      used for guessing 'v.names' and 'times' arguments based on
      the names in 'varying'.  If 'sep == ""', the split is just
      before the first numeral that follows an alphabetic
      character.  This is also used to create variable names when
      reshaping to wide format.

split: A list with three components, ‘regexp’, ‘include’, and (optionally) ‘fixed’. This allows an extended interface to variable name splitting. See ‘Details’.

Details:

 Although 'reshape()' can be used in a variety of contexts, the
 motivating application is data from longitudinal studies, and the
 arguments of this function are named and described in those terms.
 A longitudinal study is characterized by repeated measurements of
 the same variable(s), e.g., height and weight, on each unit being
 studied (e.g., individual persons) at different time points (which
 are assumed to be the same for all units). These variables are
 called time-varying variables. The study may include other
 variables that are measured only once for each unit and do not
 vary with time (e.g., gender and race); these are called
 time-constant variables.

 A 'wide' format representation of a longitudinal dataset will have
 one record (row) for each unit, typically with some time-constant
 variables that occupy single columns, and some time-varying
 variables that occupy multiple columns (one column for each time
 point).  A 'long' format representation of the same dataset will
 have multiple records (rows) for each individual, with the
 time-constant variables being constant across these records and
 the time-varying variables varying across the records.  The 'long'
 format dataset will have two additional variables: a 'time'
 variable identifying which time point each record comes from, and
 an 'id' variable showing which records refer to the same unit.

 The type of conversion (long to wide or wide to long) is
 determined by the 'direction' argument, which is mandatory unless
 the 'data' argument is the result of a previous call to 'reshape'.
 In that case, the operation can be reversed simply using
 'reshape(data)' (the other arguments are stored as attributes on
 the data frame).

 Conversion from long to wide format with 'direction = "wide"' is
 the simpler operation, and is mainly useful in the context of
 multivariate analysis where data is often expected as a
 wide-format matrix. In this case, the time variable 'timevar' and
 id variable 'idvar' must be specified. All other variables are
 assumed to be time-varying, unless the time-varying variables are
 explicitly specified via the 'v.names' argument.  A warning is
 issued if time-constant variables are not actually constant.

 Each time-varying variable is expanded into multiple variables in
 the wide format.  The names of these expanded variables are
 generated automatically, unless they are specified as the
 'varying' argument in the form of a list (or matrix) with one
 component (or row) for each time-varying variable. If 'varying' is
 a vector of names, it is implicitly converted into a matrix, with
 one row for each time-varying variable. Use this option with care
 if there are multiple time-varying variables, as the ordering (by
 column, the default in the 'matrix' constructor) may be
 unintuitive, whereas the explicit list or matrix form is
 unambiguous.

 Conversion from wide to long with 'direction = "long"' is the more
 common operation as most (univariate) statistical modeling
 functions expect data in the long format. In the simpler case
 where there is only one time-varying variable, the corresponding
 columns in the wide format input can be specified as the 'varying'
 argument, which can be either a vector of column names or the
 corresponding column indices. The name of the corresponding
 variable in the long format output combining these columns can be
 optionally specified as the 'v.names' argument, and the name of
 the time variables as the 'timevar' argument. The values to use as
 the time values corresponding to the different columns in the wide
 format can be specified as the 'times' argument.  If 'v.names' is
 unspecified, the function will attempt to guess 'v.names' and
 'times' from 'varying' (an explicitly specified 'times' argument
 is unused in that case).  The default expects variable names like
 'x.1', 'x.2', where 'sep = "."' specifies to split at the dot and
 drop it from the name.  To have alphabetic followed by numeric
 times use 'sep = ""'.

 Multiple time-varying variables can be specified in two ways,
 either with 'varying' as an atomic vector as above, or as a list
 (or a matrix). The first form is useful (and mandatory) if the
 automatic variable name splitting as described above is used; this
 requires the names of all time-varying variables to be suitably
 formatted in the same manner, and 'v.names' to be unspecified. If
 'varying' is a list (with one component for each time-varying
 variable) or a matrix (one row for each time-varying variable),
 variable name splitting is not attempted, and 'v.names' and
 'times' will generally need to be specified, although they will
 default to, respectively, the first variable name in each set, and
 sequential times.

 Also, guessing is not attempted if 'v.names' is given explicitly,
 even if 'varying' is an atomic vector. In that case, the number of
 time-varying variables is taken to be the length of 'v.names', and
 'varying' is implicitly converted into a matrix, with one row for
 each time-varying variable. As in the case of long to wide
 conversion, the matrix is filled up by column, so careful
 attention needs to be paid to the order of variable names (or
 indices) in 'varying', which is taken to be like 'x.1', 'y.1',
 'x.2', 'y.2' (i.e., variables corresponding to the same time point
 need to be grouped together).

 The 'split' argument should not usually be necessary.  The
 'split$regexp' component is passed to either 'strsplit' or
 'regexpr', where the latter is used if 'split$include' is 'TRUE',
 in which case the splitting occurs after the first character of
 the matched string.  In the 'strsplit' case, the separator is not
 included in the result, and it is possible to specify fixed-string
 matching using 'split$fixed'.

Value:

 The reshaped data frame with added attributes to simplify
 reshaping back to the original form.

See Also:

 'stack', 'aperm'; 'relist' for reshaping the result of 'unlist'.
 'xtabs' and 'as.data.frame.table' for creating contingency tables
 and converting them back to data frames.

Examples:

 summary(Indometh) # data in long format
 
 ## long to wide (direction = "wide") requires idvar and timevar at a minimum
 reshape(Indometh, direction = "wide", idvar = "Subject", timevar = "time")
 
 ## can also explicitly specify name of combined variable
 wide <- reshape(Indometh, direction = "wide", idvar = "Subject",
                 timevar = "time", v.names = "conc", sep= "_")
 wide
 
 ## reverse transformation
 reshape(wide, direction = "long")
 reshape(wide, idvar = "Subject", varying = list(2:12),
         v.names = "conc", direction = "long")
 
 ## times need not be numeric
 df <- data.frame(id = rep(1:4, rep(2,4)),
                  visit = I(rep(c("Before","After"), 4)),
                  x = rnorm(4), y = runif(4))
 df
 reshape(df, timevar = "visit", idvar = "id", direction = "wide")
 ## warns that y is really varying
 reshape(df, timevar = "visit", idvar = "id", direction = "wide", v.names = "x")
 
 
 ##  unbalanced 'long' data leads to NA fill in 'wide' form
 df2 <- df[1:7, ]
 df2
 reshape(df2, timevar = "visit", idvar = "id", direction = "wide")
 
 ## Alternative regular expressions for guessing names
 df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),
                   dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))
 reshape(df3, direction = "long", varying = 3:5, sep = "")
 
 
 ## an example that isn't longitudinal data
 state.x77 <- as.data.frame(state.x77)
 long <- reshape(state.x77, idvar = "state", ids = row.names(state.x77),
                 times = names(state.x77), timevar = "Characteristic",
                 varying = list(names(state.x77)), direction = "long")
 
 reshape(long, direction = "wide")
 
 reshape(long, direction = "wide", new.row.names = unique(long$state))
 
 ## multiple id variables
 df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
                   time = rep(c(1,1,2,2), 3), score = rnorm(12))
 wide <- reshape(df3, idvar = c("school", "class"), direction = "wide")
 wide
 ## transform back
 reshape(wide)

wide to long data

Reminder: “typical usage for converting from long to wide format”

### If names of wide-format variables are in a 'nice' format

reshape(data, direction = "long",
       varying = c(___), # vector 
       sep)              # to help guess 'v.names' and 'times'

### To specify long-format variable names explicitly

reshape(data, direction = "long",
       varying = ___,  # list / matrix / vector (use with care)
       v.names = ___,  # vector of variable names in long format
       timevar, times, # name / values of constructed time variable
       idvar, ids)     # name / values of constructed id variable

We can try to apply that to our data.

df_wide_to_long <-
  reshape(
    # First argument is the wide-format data frame to be reshaped
    df_all_wide,
    # We are inputting wide data and expect long format as output
    direction = "long",
    # "varying" argument is a list of vectors. Each vector in the list is a
    # group of time-varying (or grouping-factor-varying) variables which
    # should become one variable after reformat. We want two variables after
    # reformating, so we need two vectors in a list.
    varying = list(
      c("IgG_concentration_time1", "IgG_concentration_time2"),
      c("age_time1", "age_time2")
    ),
    # "v.names" is a vector of names for the new long-format variables, it
    # should have the same length as the list for varying and the names will
    # be assigned in order.
    v.names = c("IgG_concentration", "age"),
    # Name of the variable for the time index that will be created
    timevar = "time",
    # Values of the time variable that should be created. Note that if you
    # have any missing observations over time, they NEED to be in the dataset
    # as NAs or your times will get messed up.
    times = 1:2,
    # 'idvar' is a variable that marks which records belong to each
    # observational unit, for us that is the ID marking individuals.
    idvar = "observation_id"
  )

Notice that this has exactly twice as many rows as our wide data format, and doesn’t appear to have any systematic missingness, so it seems correct.

str(df_wide_to_long)
'data.frame':   1302 obs. of  6 variables:
 $ observation_id   : int  5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
 $ gender           : chr  "Male" "Female" "Female" "Female" ...
 $ slum             : chr  "Non slum" "Non slum" "Non slum" "Non slum" ...
 $ time             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ IgG_concentration: num  164.2979 0.3 0.3 0.0556 26.2113 ...
 $ age              : int  7 5 10 7 11 3 3 12 14 6 ...
 - attr(*, "reshapeLong")=List of 4
  ..$ varying:List of 2
  .. ..$ : chr [1:2] "IgG_concentration_time1" "IgG_concentration_time2"
  .. ..$ : chr [1:2] "age_time1" "age_time2"
  ..$ v.names: chr [1:2] "IgG_concentration" "age"
  ..$ idvar  : chr "observation_id"
  ..$ timevar: chr "time"
nrow(df_wide_to_long)
[1] 1302
nrow(df_all_wide)
[1] 651

long to wide data

Reminder: “typical usage for converting from long to wide format”

reshape(data, direction = "wide",
       idvar = "___", timevar = "___", # mandatory
       v.names = c(___),    # time-varying variables
       varying = list(___)) # auto-generated if missing

We can try to apply that to our data. Note that the arguments are the same as in the wide to long case, but we don’t need to specify the times argument because they are in the data already. The varying argument is optional also, and R will auto-generate names for the wide variables if it is left empty.

df_long_to_wide <-
  reshape(
    df_all_long,
    direction = "wide",
    idvar = "observation_id",
    timevar = "time",
    v.names = c("IgG_concentration", "age"),
    varying = list(
      c("IgG_concentration_time1", "IgG_concentration_time2"),
      c("age_time1", "age_time2")
    )
  )

We can do the same checks to make sure we pivoted correctly.

str(df_long_to_wide)
'data.frame':   651 obs. of  7 variables:
 $ observation_id         : int  5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
 $ gender                 : chr  "Male" "Female" "Female" "Female" ...
 $ slum                   : chr  "Non slum" "Non slum" "Non slum" "Non slum" ...
 $ IgG_concentration_time1: num  155.5811 0.2919 0.2543 0.0533 22.0159 ...
 $ age_time1              : int  11 9 14 11 15 7 7 16 18 10 ...
 $ IgG_concentration_time2: num  164.2979 0.3 0.3 0.0556 26.2113 ...
 $ age_time2              : int  7 5 10 7 11 3 3 12 14 6 ...
 - attr(*, "reshapeWide")=List of 5
  ..$ v.names: chr [1:2] "IgG_concentration" "age"
  ..$ timevar: chr "time"
  ..$ idvar  : chr "observation_id"
  ..$ times  : num [1:2] 2 1
  ..$ varying: chr [1:2, 1:2] "IgG_concentration_time1" "age_time1" "IgG_concentration_time2" "age_time2"
nrow(df_long_to_wide)
[1] 651
nrow(df_all_long)
[1] 1287

Note that this time we don’t have exactly twice as many records because of some quirks in how reshape() works. When we go from wide to long, R will create new records with NA values at the second time point for the individuals who were not in the second study – it won’t do that when we go from long to wide data. This is why it can be important to make sure all of your missing data are explicit rather than implicit.

# For the original long dataset, we can see that not all individuals have 2
# time points
all(table(df_all_long$observation_id) == 2)
[1] FALSE
# But for the reshaped version they do all have 2 time points
all(table(df_wide_to_long$observation_id) == 2)
[1] TRUE

reshape metadata

Whenever you use reshape() to change the data format, it leaves behind some metadata on our new data frame, as an attr.

str(df_wide_to_long)
'data.frame':   1302 obs. of  6 variables:
 $ observation_id   : int  5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
 $ gender           : chr  "Male" "Female" "Female" "Female" ...
 $ slum             : chr  "Non slum" "Non slum" "Non slum" "Non slum" ...
 $ time             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ IgG_concentration: num  164.2979 0.3 0.3 0.0556 26.2113 ...
 $ age              : int  7 5 10 7 11 3 3 12 14 6 ...
 - attr(*, "reshapeLong")=List of 4
  ..$ varying:List of 2
  .. ..$ : chr [1:2] "IgG_concentration_time1" "IgG_concentration_time2"
  .. ..$ : chr [1:2] "age_time1" "age_time2"
  ..$ v.names: chr [1:2] "IgG_concentration" "age"
  ..$ idvar  : chr "observation_id"
  ..$ timevar: chr "time"

This stores information so we can reshape() back to the other format and we don’t have to specify arguments again.

df_back_to_wide <- reshape(df_wide_to_long)

Let’s get real

We recommend checking out the pivot_wider() and pivot_longer() from the tidyr package!

Summary

  • the merge() function can be used to merge datasets.
  • pay close attention to the number of rows in your data set before and after a merge
  • wide data has many columns per observation
  • long data has many rows per observation
  • the reshape()function allows you to toggle between wide and long data. although we highly recommend playing around with the pivot_wider() and pivot_longer() from the tidyr package instead

Acknowledgements

These are the materials we looked through, modified, or extracted to complete this module’s lecture.