After module 8, you should be able to…
Pay close attention to the number of rows in your data set before and after a join. This will help flag when an issue has arisen. This will depend on the type of merge:
merge()
functionWe will use the merge()
function to conduct one-to-one merge
Registered S3 method overwritten by 'printr':
method from
knit_print.data.frame rmarkdown
Merge Two Data Frames
Description:
Merge two data frames by common columns or row names, or do other
versions of database _join_ operations.
Usage:
merge(x, y, ...)
## Default S3 method:
merge(x, y, ...)
## S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
incomparables = NULL, ...)
Arguments:
x, y: data frames, or objects to be coerced to one.
by, by.x, by.y: specifications of the columns used for merging. See ‘Details’.
all: logical; 'all = L' is shorthand for 'all.x = L' and 'all.y =
L', where 'L' is either 'TRUE' or 'FALSE'.
all.x: logical; if ‘TRUE’, then extra rows will be added to the output, one for each row in ‘x’ that has no matching row in ‘y’. These rows will have ‘NA’s in those columns that are usually filled with values from ’y’. The default is ‘FALSE’, so that only rows with data from both ‘x’ and ‘y’ are included in the output.
all.y: logical; analogous to ‘all.x’.
sort: logical. Should the result be sorted on the 'by' columns?
suffixes: a character vector of length 2 specifying the suffixes to be used for making unique the names of columns in the result which are not used for merging (appearing in ‘by’ etc).
no.dups: logical indicating that ‘suffixes’ are appended in more cases to avoid duplicated column names in the result. This was implicitly false before R version 3.5.0.
incomparables: values which cannot be matched. See ‘match’. This is intended to be used for merging on one column, so these are incomparable values of that column.
...: arguments to be passed to or from methods.
Details:
'merge' is a generic function whose principal method is for data
frames: the default method coerces its arguments to data frames
and calls the '"data.frame"' method.
By default the data frames are merged on the columns with names
they both have, but separate specifications of the columns can be
given by 'by.x' and 'by.y'. The rows in the two data frames that
match on the specified columns are extracted, and joined together.
If there is more than one match, all possible matches contribute
one row each. For the precise meaning of 'match', see 'match'.
Columns to merge on can be specified by name, number or by a
logical vector: the name '"row.names"' or the number '0' specifies
the row names. If specified by name it must correspond uniquely
to a named column in the input.
If 'by' or both 'by.x' and 'by.y' are of length 0 (a length zero
vector or 'NULL'), the result, 'r', is the _Cartesian product_ of
'x' and 'y', i.e., 'dim(r) = c(nrow(x)*nrow(y), ncol(x) +
ncol(y))'.
If 'all.x' is true, all the non matching cases of 'x' are appended
to the result as well, with 'NA' filled in the corresponding
columns of 'y'; analogously for 'all.y'.
If the columns in the data frames not used in merging have any
common names, these have 'suffixes' ('".x"' and '".y"' by default)
appended to try to make the names of the result unique. If this
is not possible, an error is thrown.
If a 'by.x' column name matches one of 'y', and if 'no.dups' is
true (as by default), the y version gets suffixed as well,
avoiding duplicate column names in the result.
The complexity of the algorithm used is proportional to the length
of the answer.
In SQL database terminology, the default value of 'all = FALSE'
gives a _natural join_, a special case of an _inner join_.
Specifying 'all.x = TRUE' gives a _left (outer) join_, 'all.y =
TRUE' a _right (outer) join_, and both ('all = TRUE') a _(full)
outer join_. DBMSes do not match 'NULL' records, equivalent to
'incomparables = NA' in R.
Value:
A data frame. The rows are by default lexicographically sorted on
the common columns, but for 'sort = FALSE' are in an unspecified
order. The columns are the common columns followed by the
remaining columns in 'x' and then those in 'y'. If the matching
involved row names, an extra character column called 'Row.names'
is added at the left, and in all cases the result has 'automatic'
row names.
Note:
This is intended to work with data frames with vector-like
columns: some aspects work with data frames containing matrices,
but not all.
Currently long vectors are not accepted for inputs, which are thus
restricted to less than 2^31 rows. That restriction also applies
to the result for 32-bit platforms.
See Also:
'data.frame', 'by', 'cbind'.
'dendrogram' for a class which has a 'merge' method.
Examples:
authors <- data.frame(
## I(*) : use character columns of names to get sensible sort order
surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4)))
authorN <- within(authors, { name <- surname; rm(surname) })
books <- data.frame(
name = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
other.author = c(NA, "Ripley", NA, NA, NA, NA,
"Venables & Smith"))
(m0 <- merge(authorN, books))
(m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
m2 <- merge(books, authors, by.x = "name", by.y = "surname")
stopifnot(exprs = {
identical(m0, m2[, names(m0)])
as.character(m1[, 1]) == as.character(m2[, 1])
all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ])
identical(dim(merge(m1, m2, by = NULL)),
c(nrow(m1)*nrow(m2), ncol(m1)+ncol(m2)))
})
## "R core" is missing from authors and appears only here :
merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)
## example of using 'incomparables'
x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
merge(x, y, by = c("k1","k2")) # NA's match
merge(x, y, by = "k1") # NA's match, so 6 rows
merge(x, y, by = "k2", incomparables = NA) # 2 rows
merged.df <- merge(df.x, df.y, all.x=T, all.y=T, by=merge_variable)
all = TRUE
is the same as all.x = TRUE, all.y = TRUE
merged.df
is >= max(nrow(df.x), nrow(df.y))merged.df <- merge(df.x, df.y, all.x=F, all.y=F, by=merge_variable)
merged.df
is <= min(nrow(df.x), nrow(df.y))merged.df <- merge(df.x, df.y, all.x=T, all.y=F, by=merge_variable)
merged.df
is nrow(df.x)merged.df <- merge(df.x, df.y, all.x=F, all.y=T, by=merge_variable)
merged.df
is nrow(df.y)The new data serodata_new.csv
represents a follow-up serological survey four years later. At this follow-up individuals were retested for IgG antibody concentrations and their ages were collected.
'data.frame': 636 obs. of 3 variables:
$ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...
$ IgG_concentration: num 0.261 2.981 0.282 136.638 0.381 ...
$ age : int 6 8 8 8 5 8 8 NA 8 6 ...
observation_id | IgG_concentration | age | |
---|---|---|---|
Min. :5006 | Min. : 0.0051 | Min. : 5.00 | |
1st Qu.:6328 | 1st Qu.: 0.2751 | 1st Qu.: 7.00 | |
Median :7494 | Median : 1.5477 | Median :10.00 | |
Mean :7490 | Mean : 82.7684 | Mean :10.63 | |
3rd Qu.:8736 | 3rd Qu.:129.6389 | 3rd Qu.:14.00 | |
Max. :9982 | Max. :950.6590 | Max. :19.00 | |
NA | NA | NA’s :9 |
Lets load the old data as well and look for a variable, or variables, to merge by.
[1] "observation_id" "IgG_concentration" "age"
[4] "gender" "slum"
We notice that observation_id
seems to be the obvious variable by which to merge. However, we also realize that IgG_concentration
and age
are the exact same names. If we merge now we see that R has forced the IgG_concentration
and age
to have a .x
or .y
to make sure that these variables are different.
observation_id | IgG_concentration.x | age.x | gender | slum | IgG_concentration.y | age.y |
---|---|---|---|---|---|---|
5006 | 164.2979452 | 7 | Male | Non slum | 155.5811325 | 11 |
5024 | 0.3000000 | 5 | Female | Non slum | 0.2918605 | 9 |
5026 | 0.3000000 | 10 | Female | Non slum | 0.2542945 | 14 |
5030 | 0.0555556 | 7 | Female | Non slum | 0.0533262 | 11 |
5035 | 26.2112514 | 11 | Female | Non slum | 22.0159300 | 15 |
5054 | 0.3000000 | 3 | Male | Non slum | 0.2709671 | 7 |
What do we do?
The first option is to rename the IgG_concentration
and age
variables before the merge, so that it is clear which is time point 1 and time point 2.
df$IgG_concentration_time1 <- df$IgG_concentration
df$age_time1 <- df$age
df$IgG_concentration <- df$age <- NULL #remove the original variables
df_new$IgG_concentration_time2 <- df_new$IgG_concentration
df_new$age_time2 <- df_new$age
df_new$IgG_concentration <- df_new$age <- NULL #remove the original variables
Now, lets merge.
'data.frame': 651 obs. of 7 variables:
$ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
$ gender : chr "Male" "Female" "Female" "Female" ...
$ slum : chr "Non slum" "Non slum" "Non slum" "Non slum" ...
$ IgG_concentration_time1: num 164.2979 0.3 0.3 0.0556 26.2113 ...
$ age_time1 : int 7 5 10 7 11 3 3 12 14 6 ...
$ IgG_concentration_time2: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...
$ age_time2 : int 11 9 14 11 15 7 7 16 18 10 ...
observation_id | gender | slum | IgG_concentration_time1 | age_time1 | IgG_concentration_time2 | age_time2 |
---|---|---|---|---|---|---|
5006 | Male | Non slum | 164.2979452 | 7 | 155.5811325 | 11 |
5024 | Female | Non slum | 0.3000000 | 5 | 0.2918605 | 9 |
5026 | Female | Non slum | 0.3000000 | 10 | 0.2542945 | 14 |
5030 | Female | Non slum | 0.0555556 | 7 | 0.0533262 | 11 |
5035 | Female | Non slum | 26.2112514 | 11 | 22.0159300 | 15 |
5054 | Male | Non slum | 0.3000000 | 3 | 0.2709671 | 7 |
The second option is to add a time variable to the two data sets and then merge by observation_id
, time
, age
, and IgG_concentration
. Note, I need to read in the data again b/c I removed the IgG_concentration
and age
variables.
observation_id | IgG_concentration | age | gender | slum | time |
---|---|---|---|---|---|
5772 | 0.3176895 | 2 | Female | Non slum | 1 |
8095 | 3.4368231 | 4 | Female | Non slum | 1 |
9784 | 0.3000000 | 4 | Male | Non slum | 1 |
9338 | 143.2363014 | 4 | Male | Non slum | 1 |
6369 | 0.4476534 | 1 | Male | Non slum | 1 |
6885 | 0.0252708 | 4 | Male | Non slum | 1 |
observation_id | IgG_concentration | age | time |
---|---|---|---|
5772 | 0.2612388 | 6 | 2 |
8095 | 2.9809049 | 8 | 2 |
9784 | 0.2819489 | 8 | 2 |
9338 | 136.6382260 | 8 | 2 |
6369 | 0.3810119 | 5 | 2 |
6885 | 0.0245951 | 8 | 2 |
Now, lets merge. Note, “By default the data frames are merged on the columns with names they both have” therefore if I don’t specify the by argument it will merge on all matching variables.
'data.frame': 1287 obs. of 6 variables:
$ observation_id : int 5006 5006 5024 5024 5026 5026 5030 5030 5035 5035 ...
$ IgG_concentration: num 155.581 164.298 0.292 0.3 0.254 ...
$ age : int 11 7 9 5 14 10 11 7 15 11 ...
$ time : num 2 1 2 1 2 1 2 1 2 1 ...
$ gender : chr NA "Male" NA "Female" ...
$ slum : chr NA "Non slum" NA "Non slum" ...
observation_id | IgG_concentration | age | time | gender | slum |
---|---|---|---|---|---|
5006 | 155.5811325 | 11 | 2 | NA | NA |
5006 | 164.2979452 | 7 | 1 | Male | Non slum |
5024 | 0.2918605 | 9 | 2 | NA | NA |
5024 | 0.3000000 | 5 | 1 | Female | Non slum |
5026 | 0.2542945 | 14 | 2 | NA | NA |
5026 | 0.3000000 | 10 | 1 | Female | Non slum |
Note, there are 1287 rows, which is the sum of the number of rows of df
(651 rows) and df_new
(636 rows)
Notice that there are some missing values though, because df_new
doesn’t have the gender
or slum
variables. If we assume that those are constant and don’t change between the two study points, we can fill in the data points before merging for an easy solution. One easy way to make a new dataframe from df_new
with extra columns is to use the transform()
function, which lets us make multiple column changes to a data frame at one time. We just need to make sure to match the correct observation_id
values together, using the match()
function.
Now we can redo the merge.
observation_id | IgG_concentration | age | gender | slum | time |
---|---|---|---|---|---|
5006 | 155.5811325 | 11 | Male | Non slum | 2 |
5006 | 164.2979452 | 7 | Male | Non slum | 1 |
5024 | 0.2918605 | 9 | Female | Non slum | 2 |
5024 | 0.3000000 | 5 | Female | Non slum | 1 |
5026 | 0.2542945 | 14 | Female | Non slum | 2 |
5026 | 0.3000000 | 10 | Female | Non slum | 1 |
Looks good now! Another solution would be to edit the data file, or use a function that can actually fill in missing values for the same individual, like zoo::na.locf()
.
Above, we actually created a wide and long version of the data.
Wide: has many columns
Long: has many rows
reshape()
functionThe reshape()
function allows you to toggle between wide and long data
Reshape Grouped Data
Description:
This function reshapes a data frame between 'wide' format (with
repeated measurements in separate columns of the same row) and
'long' format (with the repeated measurements in separate rows).
Usage:
reshape(data, varying = NULL, v.names = NULL, timevar = "time",
idvar = "id", ids = 1:NROW(data),
times = seq_along(varying[[1]]),
drop = NULL, direction, new.row.names = NULL,
sep = ".",
split = if (sep == "") {
list(regexp = "[A-Za-z][0-9]", include = TRUE)
} else {
list(regexp = sep, include = FALSE, fixed = TRUE)}
)
### Typical usage for converting from long to wide format:
# reshape(data, direction = "wide",
# idvar = "___", timevar = "___", # mandatory
# v.names = c(___), # time-varying variables
# varying = list(___)) # auto-generated if missing
### Typical usage for converting from wide to long format:
### If names of wide-format variables are in a 'nice' format
# reshape(data, direction = "long",
# varying = c(___), # vector
# sep) # to help guess 'v.names' and 'times'
### To specify long-format variable names explicitly
# reshape(data, direction = "long",
# varying = ___, # list / matrix / vector (use with care)
# v.names = ___, # vector of variable names in long format
# timevar, times, # name / values of constructed time variable
# idvar, ids) # name / values of constructed id variable
Arguments:
data: a data frame
varying: names of sets of variables in the wide format that correspond to single variables in long format (‘time-varying’). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, when ‘direction = “long”’, the names can be replaced by indices which are interpreted as referring to ‘names(data)’. See ‘Details’ for more details and options.
v.names: names of variables in the long format that correspond to multiple variables in the wide format. See ‘Details’.
timevar: the variable in long format that differentiates multiple records from the same group or individual. If more than one record matches, the first will be taken (with a warning).
idvar: Names of one or more variables in long format that identify multiple records from the same group/individual. These variables may also be present in wide format.
ids: the values to use for a newly created 'idvar' variable in
long format.
times: the values to use for a newly created ‘timevar’ variable in long format. See ‘Details’.
drop: a vector of names of variables to drop before reshaping.
direction: character string, partially matched to either ‘“wide”’ to reshape to wide format, or ‘“long”’ to reshape to long format.
new.row.names: character or ‘NULL’: a non-null value will be used for the row names of the result.
sep: A character vector of length 1, indicating a separating
character in the variable names in the wide format. This is
used for guessing 'v.names' and 'times' arguments based on
the names in 'varying'. If 'sep == ""', the split is just
before the first numeral that follows an alphabetic
character. This is also used to create variable names when
reshaping to wide format.
split: A list with three components, ‘regexp’, ‘include’, and (optionally) ‘fixed’. This allows an extended interface to variable name splitting. See ‘Details’.
Details:
Although 'reshape()' can be used in a variety of contexts, the
motivating application is data from longitudinal studies, and the
arguments of this function are named and described in those terms.
A longitudinal study is characterized by repeated measurements of
the same variable(s), e.g., height and weight, on each unit being
studied (e.g., individual persons) at different time points (which
are assumed to be the same for all units). These variables are
called time-varying variables. The study may include other
variables that are measured only once for each unit and do not
vary with time (e.g., gender and race); these are called
time-constant variables.
A 'wide' format representation of a longitudinal dataset will have
one record (row) for each unit, typically with some time-constant
variables that occupy single columns, and some time-varying
variables that occupy multiple columns (one column for each time
point). A 'long' format representation of the same dataset will
have multiple records (rows) for each individual, with the
time-constant variables being constant across these records and
the time-varying variables varying across the records. The 'long'
format dataset will have two additional variables: a 'time'
variable identifying which time point each record comes from, and
an 'id' variable showing which records refer to the same unit.
The type of conversion (long to wide or wide to long) is
determined by the 'direction' argument, which is mandatory unless
the 'data' argument is the result of a previous call to 'reshape'.
In that case, the operation can be reversed simply using
'reshape(data)' (the other arguments are stored as attributes on
the data frame).
Conversion from long to wide format with 'direction = "wide"' is
the simpler operation, and is mainly useful in the context of
multivariate analysis where data is often expected as a
wide-format matrix. In this case, the time variable 'timevar' and
id variable 'idvar' must be specified. All other variables are
assumed to be time-varying, unless the time-varying variables are
explicitly specified via the 'v.names' argument. A warning is
issued if time-constant variables are not actually constant.
Each time-varying variable is expanded into multiple variables in
the wide format. The names of these expanded variables are
generated automatically, unless they are specified as the
'varying' argument in the form of a list (or matrix) with one
component (or row) for each time-varying variable. If 'varying' is
a vector of names, it is implicitly converted into a matrix, with
one row for each time-varying variable. Use this option with care
if there are multiple time-varying variables, as the ordering (by
column, the default in the 'matrix' constructor) may be
unintuitive, whereas the explicit list or matrix form is
unambiguous.
Conversion from wide to long with 'direction = "long"' is the more
common operation as most (univariate) statistical modeling
functions expect data in the long format. In the simpler case
where there is only one time-varying variable, the corresponding
columns in the wide format input can be specified as the 'varying'
argument, which can be either a vector of column names or the
corresponding column indices. The name of the corresponding
variable in the long format output combining these columns can be
optionally specified as the 'v.names' argument, and the name of
the time variables as the 'timevar' argument. The values to use as
the time values corresponding to the different columns in the wide
format can be specified as the 'times' argument. If 'v.names' is
unspecified, the function will attempt to guess 'v.names' and
'times' from 'varying' (an explicitly specified 'times' argument
is unused in that case). The default expects variable names like
'x.1', 'x.2', where 'sep = "."' specifies to split at the dot and
drop it from the name. To have alphabetic followed by numeric
times use 'sep = ""'.
Multiple time-varying variables can be specified in two ways,
either with 'varying' as an atomic vector as above, or as a list
(or a matrix). The first form is useful (and mandatory) if the
automatic variable name splitting as described above is used; this
requires the names of all time-varying variables to be suitably
formatted in the same manner, and 'v.names' to be unspecified. If
'varying' is a list (with one component for each time-varying
variable) or a matrix (one row for each time-varying variable),
variable name splitting is not attempted, and 'v.names' and
'times' will generally need to be specified, although they will
default to, respectively, the first variable name in each set, and
sequential times.
Also, guessing is not attempted if 'v.names' is given explicitly,
even if 'varying' is an atomic vector. In that case, the number of
time-varying variables is taken to be the length of 'v.names', and
'varying' is implicitly converted into a matrix, with one row for
each time-varying variable. As in the case of long to wide
conversion, the matrix is filled up by column, so careful
attention needs to be paid to the order of variable names (or
indices) in 'varying', which is taken to be like 'x.1', 'y.1',
'x.2', 'y.2' (i.e., variables corresponding to the same time point
need to be grouped together).
The 'split' argument should not usually be necessary. The
'split$regexp' component is passed to either 'strsplit' or
'regexpr', where the latter is used if 'split$include' is 'TRUE',
in which case the splitting occurs after the first character of
the matched string. In the 'strsplit' case, the separator is not
included in the result, and it is possible to specify fixed-string
matching using 'split$fixed'.
Value:
The reshaped data frame with added attributes to simplify
reshaping back to the original form.
See Also:
'stack', 'aperm'; 'relist' for reshaping the result of 'unlist'.
'xtabs' and 'as.data.frame.table' for creating contingency tables
and converting them back to data frames.
Examples:
summary(Indometh) # data in long format
## long to wide (direction = "wide") requires idvar and timevar at a minimum
reshape(Indometh, direction = "wide", idvar = "Subject", timevar = "time")
## can also explicitly specify name of combined variable
wide <- reshape(Indometh, direction = "wide", idvar = "Subject",
timevar = "time", v.names = "conc", sep= "_")
wide
## reverse transformation
reshape(wide, direction = "long")
reshape(wide, idvar = "Subject", varying = list(2:12),
v.names = "conc", direction = "long")
## times need not be numeric
df <- data.frame(id = rep(1:4, rep(2,4)),
visit = I(rep(c("Before","After"), 4)),
x = rnorm(4), y = runif(4))
df
reshape(df, timevar = "visit", idvar = "id", direction = "wide")
## warns that y is really varying
reshape(df, timevar = "visit", idvar = "id", direction = "wide", v.names = "x")
## unbalanced 'long' data leads to NA fill in 'wide' form
df2 <- df[1:7, ]
df2
reshape(df2, timevar = "visit", idvar = "id", direction = "wide")
## Alternative regular expressions for guessing names
df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),
dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))
reshape(df3, direction = "long", varying = 3:5, sep = "")
## an example that isn't longitudinal data
state.x77 <- as.data.frame(state.x77)
long <- reshape(state.x77, idvar = "state", ids = row.names(state.x77),
times = names(state.x77), timevar = "Characteristic",
varying = list(names(state.x77)), direction = "long")
reshape(long, direction = "wide")
reshape(long, direction = "wide", new.row.names = unique(long$state))
## multiple id variables
df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
time = rep(c(1,1,2,2), 3), score = rnorm(12))
wide <- reshape(df3, idvar = c("school", "class"), direction = "wide")
wide
## transform back
reshape(wide)
Reminder: “typical usage for converting from long to wide format”
### If names of wide-format variables are in a 'nice' format
reshape(data, direction = "long",
varying = c(___), # vector
sep) # to help guess 'v.names' and 'times'
### To specify long-format variable names explicitly
reshape(data, direction = "long",
varying = ___, # list / matrix / vector (use with care)
v.names = ___, # vector of variable names in long format
timevar, times, # name / values of constructed time variable
idvar, ids) # name / values of constructed id variable
We can try to apply that to our data.
df_wide_to_long <-
reshape(
# First argument is the wide-format data frame to be reshaped
df_all_wide,
# We are inputting wide data and expect long format as output
direction = "long",
# "varying" argument is a list of vectors. Each vector in the list is a
# group of time-varying (or grouping-factor-varying) variables which
# should become one variable after reformat. We want two variables after
# reformating, so we need two vectors in a list.
varying = list(
c("IgG_concentration_time1", "IgG_concentration_time2"),
c("age_time1", "age_time2")
),
# "v.names" is a vector of names for the new long-format variables, it
# should have the same length as the list for varying and the names will
# be assigned in order.
v.names = c("IgG_concentration", "age"),
# Name of the variable for the time index that will be created
timevar = "time",
# Values of the time variable that should be created. Note that if you
# have any missing observations over time, they NEED to be in the dataset
# as NAs or your times will get messed up.
times = 1:2,
# 'idvar' is a variable that marks which records belong to each
# observational unit, for us that is the ID marking individuals.
idvar = "observation_id"
)
Notice that this has exactly twice as many rows as our wide data format, and doesn’t appear to have any systematic missingness, so it seems correct.
'data.frame': 1302 obs. of 6 variables:
$ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
$ gender : chr "Male" "Female" "Female" "Female" ...
$ slum : chr "Non slum" "Non slum" "Non slum" "Non slum" ...
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
$ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...
$ age : int 7 5 10 7 11 3 3 12 14 6 ...
- attr(*, "reshapeLong")=List of 4
..$ varying:List of 2
.. ..$ : chr [1:2] "IgG_concentration_time1" "IgG_concentration_time2"
.. ..$ : chr [1:2] "age_time1" "age_time2"
..$ v.names: chr [1:2] "IgG_concentration" "age"
..$ idvar : chr "observation_id"
..$ timevar: chr "time"
[1] 1302
[1] 651
Reminder: “typical usage for converting from long to wide format”
We can try to apply that to our data. Note that the arguments are the same as in the wide to long case, but we don’t need to specify the times
argument because they are in the data already. The varying
argument is optional also, and R will auto-generate names for the wide variables if it is left empty.
We can do the same checks to make sure we pivoted correctly.
'data.frame': 651 obs. of 7 variables:
$ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
$ gender : chr "Male" "Female" "Female" "Female" ...
$ slum : chr "Non slum" "Non slum" "Non slum" "Non slum" ...
$ IgG_concentration_time1: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...
$ age_time1 : int 11 9 14 11 15 7 7 16 18 10 ...
$ IgG_concentration_time2: num 164.2979 0.3 0.3 0.0556 26.2113 ...
$ age_time2 : int 7 5 10 7 11 3 3 12 14 6 ...
- attr(*, "reshapeWide")=List of 5
..$ v.names: chr [1:2] "IgG_concentration" "age"
..$ timevar: chr "time"
..$ idvar : chr "observation_id"
..$ times : num [1:2] 2 1
..$ varying: chr [1:2, 1:2] "IgG_concentration_time1" "age_time1" "IgG_concentration_time2" "age_time2"
[1] 651
[1] 1287
Note that this time we don’t have exactly twice as many records because of some quirks in how reshape()
works. When we go from wide to long, R will create new records with NA values at the second time point for the individuals who were not in the second study – it won’t do that when we go from long to wide data. This is why it can be important to make sure all of your missing data are explicit rather than implicit.
reshape
metadataWhenever you use reshape()
to change the data format, it leaves behind some metadata on our new data frame, as an attr
.
'data.frame': 1302 obs. of 6 variables:
$ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
$ gender : chr "Male" "Female" "Female" "Female" ...
$ slum : chr "Non slum" "Non slum" "Non slum" "Non slum" ...
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
$ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...
$ age : int 7 5 10 7 11 3 3 12 14 6 ...
- attr(*, "reshapeLong")=List of 4
..$ varying:List of 2
.. ..$ : chr [1:2] "IgG_concentration_time1" "IgG_concentration_time2"
.. ..$ : chr [1:2] "age_time1" "age_time2"
..$ v.names: chr [1:2] "IgG_concentration" "age"
..$ idvar : chr "observation_id"
..$ timevar: chr "time"
This stores information so we can reshape()
back to the other format and we don’t have to specify arguments again.
We recommend checking out the pivot_wider()
and pivot_longer()
from the tidyr package!
merge()
function can be used to merge datasets.reshape()
function allows you to toggle between wide and long data. although we highly recommend playing around with the pivot_wider()
and pivot_longer()
from the tidyr package insteadThese are the materials we looked through, modified, or extracted to complete this module’s lecture.