Generating data structures in a memory-efficient manner
We often generate data structures, such as vectors or data frames, in for loops or in our own user-defined functions. Unfortunately, if you're not careful, generating a data structure can be very memory intensive. Specifically, in each iteration of a for loop we often concatenate the next element onto the existing vector to build up the final vector. However, every time you do this, R implicitly copies the existing vector and then adds the additional element. At that moment you therefore need memory for roughly 2n + 1 elements (the old vector of length n plus the new vector of length n + 1), where n is the length of the vector during that iteration of the for loop, just to add a single element. For large n, this can exhaust your available memory.
A much more efficient way of generating a vector is to define an "empty" vector of the final length up front, provided you know what that final length will be.
For example, suppose we want to generate a numeric vector of 100 elements. Instead of:
a <- NULL
for (i in 1:100) {
  a <- c(a, rnorm(1))  # copies the whole vector on every iteration
}
We can do the following:
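a <- numeric(100)  # allocate all 100 elements once
a                  # print the vector: 100 zeros
a[] <- NA          # optional: mark every element as not yet filled
for (i in 1:100) {
  a[i] <- rnorm(1) # fill element i in place
}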
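A rough way to see the difference between the two loops above is to time them. The sketch below wraps each approach in a function (grow_vector and prealloc_vector are hypothetical names used only for illustration) and uses sqrt(i) instead of rnorm(1) so that both results can be compared exactly. The timings depend on your machine and R version, so treat them as indicative only.

```r
# Hypothetical helper: grow the vector one element at a time.
grow_vector <- function(n) {
  a <- NULL
  for (i in seq_len(n)) {
    a <- c(a, sqrt(i))  # builds a new, longer vector on every iteration
  }
  a
}

# Hypothetical helper: allocate the final length once, then fill in place.
prealloc_vector <- function(n) {
  a <- numeric(n)
  for (i in seq_len(n)) {
    a[i] <- sqrt(i)
  }
  a
}

n <- 10000
t_grow <- system.time(x <- grow_vector(n))
t_pre  <- system.time(y <- prealloc_vector(n))

# Both approaches produce the same vector; only the cost differs.
cat("grow:    ", t_grow[["elapsed"]], "seconds\n")
cat("prealloc:", t_pre[["elapsed"]], "seconds\n")
```

The gap typically widens as n increases, because the growing version copies a longer vector on each pass.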
The numeric() function generates a numeric vector of the specified length, where each element has the value 0. We could also have used the character() function to generate a character vector of the specified length, where each element has the value "". In either case, we can easily replace all of the elements of the vector with NA using the code a[] <- NA.
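As a small illustration of the functions just described, the following sketch shows what numeric() and character() return, and why the empty brackets matter when assigning NA. The variable names are arbitrary.

```r
a <- numeric(3)    # three zeros
b <- character(2)  # two empty strings
str(a)
str(b)

# Empty brackets replace every element but keep the length and the type.
a[] <- NA
b[] <- NA
str(a)  # still a numeric vector of length 3
str(b)  # still a character vector of length 2

# Without the brackets, the whole object is replaced by a single logical NA.
a2 <- numeric(3)
a2 <- NA
str(a2)
```

Keeping the brackets is what preserves the preallocated length, which is the point of allocating the vector up front.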
We can use a similar process to efficiently generate a data frame, that is, create the data frame with its final dimensions already defined. For example,
c <- data.frame(a = numeric(100), b = character(100))
c[ , ] <- NA
If you wanted a completely numeric (or completely character) data frame, you could also do the following:
b <- numeric(100)
attr(b, "dim") <- c(10, 10)  # reshape the vector into a 10 x 10 matrix
c <- as.data.frame(b)        # convert the matrix b, not c, to a data frame
c[ , ] <- NA
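To round out the data frame case, here is a minimal sketch of filling a preallocated data frame inside a loop. It is a variation on the code above rather than the same method: the columns are created directly from typed NA constants (NA_integer_, NA_real_, NA_character_) so that each column starts out with the intended type. The column names and the values written in the loop are made up for illustration.

```r
n <- 5

# Preallocate every column at its final length with typed NA values.
df <- data.frame(
  id    = rep(NA_integer_, n),
  value = rep(NA_real_, n),
  label = rep(NA_character_, n),
  stringsAsFactors = FALSE  # keep 'label' as character in older R versions
)

# Fill one row per iteration; the number of rows never changes.
for (i in seq_len(n)) {
  df$id[i]    <- i
  df$value[i] <- i^2
  df$label[i] <- paste0("row", i)
}

str(df)
```

For long loops it is often cheaper still to fill plain preallocated vectors and assemble the data frame once at the end, since modifying a data frame inside a loop carries its own overhead.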