R Internals
This page is about how R stores objects internally. I've had to re-figure this out several times, so I thought a TWiki page about this might be prudent.
SEXP
All R objects are accessible in C via the
SEXP
type.
SEXP
is just a pointer defined by a typedef, and you can access the actual data inside an SEXP object by using macros which call other macros which use some typedefs of macros and other typedefs, etc. (basically, a real pain to understand). Here's an attempted trace of what
SEXP
actually is:
SEXP
is a pointer to an
SEXPREC
object:
typedef struct SEXPREC *SEXP;
SEXPREC
is a struct:
typedef struct SEXPREC {
SEXPREC_HEADER;
union {
struct primsxp_struct primsxp;
struct symsxp_struct symsxp;
struct listsxp_struct listsxp;
struct envsxp_struct envsxp;
struct closxp_struct closxp;
struct promsxp_struct promsxp;
} u;
} SEXPREC, *SEXP;
SEXPREC_HEADER
is a preprocessor-defined variable:
#define SEXPREC_HEADER \
struct sxpinfo_struct sxpinfo; \
struct SEXPREC *attrib; \
struct SEXPREC *gengc_next_node, *gengc_prev_node
spxinfo_struct
looks like this:
struct sxpinfo_struct {
SEXPTYPE type : 5;
unsigned int obj : 1;
unsigned int named : 2;
unsigned int gp : 16;
unsigned int mark : 1;
unsigned int debug : 1;
unsigned int trace : 1;
unsigned int fin : 1; /* has finalizer installed */
unsigned int gcgen : 1; /* old generation number */
unsigned int gccls : 3; /* node class */
}; /* Tot: 32 */
SEXPTYPE
is an integer that defines what this object is (list, string vector, numeric vector, etc.). The other variables in
sxpinfo_struct
seem to be flags of some kind.
The union in
SEXPREC
(called
u
) can contain several structs, which I assume depend on what type the R object happens to be. Here they are for reference:
struct primsxp_struct {
int offset;
};
struct symsxp_struct {
struct SEXPREC *pname;
struct SEXPREC *value;
struct SEXPREC *internal;
};
struct listsxp_struct {
struct SEXPREC *carval;
struct SEXPREC *cdrval;
struct SEXPREC *tagval;
};
struct envsxp_struct {
struct SEXPREC *frame;
struct SEXPREC *enclos;
struct SEXPREC *hashtab;
};
struct closxp_struct {
struct SEXPREC *formals;
struct SEXPREC *body;
struct SEXPREC *env;
};
struct promsxp_struct {
struct SEXPREC *value;
struct SEXPREC *expr;
struct SEXPREC *env;
};
DATAPTR
The
DATAPTR
macro is used to access the actual data of an
SEXP
object. Here's what I found by tracing
DATAPTR
.
DATAPTR
is called from lots of macros such as
REAL
and
INTEGER
, where
x
is an
SEXP
object:
#define CHAR(x) ((char *) DATAPTR(x))
#define LOGICAL(x) ((int *) DATAPTR(x))
#define INTEGER(x) ((int *) DATAPTR(x))
#define RAW(x) ((Rbyte *) DATAPTR(x))
#define COMPLEX(x) ((Rcomplex *) DATAPTR(x))
#define REAL(x) ((double *) DATAPTR(x))
DATAPTR
is also a preprocessor-defined macro:
#define DATAPTR(x) (((SEXPREC_ALIGN *) (x)) + 1)
SEXPREC_ALIGN
is a typedef'd union:
typedef union { VECTOR_SEXPREC s; double align; } SEXPREC_ALIGN;
VECTOR_SEXPREC
is a struct that is a mini-version of
SEXPREC
for use with vectors:
typedef struct VECTOR_SEXPREC {
SEXPREC_HEADER;
struct vecsxp_struct vecsxp;
} VECTOR_SEXPREC, *VECSEXP;
Getting Data
So how do you get at the actual data? It depends on the type of R object you're dealing with.
Reals and Integers
double *myArray = REAL(some_SEXP_object);
which is equivalent to:
double *myArray = ((double *) (((SEXPREC_ALIGN *) (some_SEXP_object)) + 1))
What this code does is take an
SEXP
object, which is really a pointer to an
SEXPREC
struct, and cast it as an
SEXPREC_ALIGN
pointer.
SEXPREC_ALIGN
is either another pointer or a double. So what I think is happening here, is that by using pointer arithmetic, you magically get a pointer of type
double
that points to the actual data you're trying to access. I don't completely understand this yet, but I'm working on it.
Memory Allocation
The information above might lead to the question, "How does R allocate memory for its objects?" Maybe the answer will also answer some questions about how to handle
SEXP
objects.
R_alloc
When you create an object in R, a function named
R_alloc
that lives in
src/main/memory.c is called.
R_alloc
is a small function that looks like this:
char *R_alloc(long nelem, int eltsize)
{
R_size_t size = nelem * eltsize;
double dsize = nelem * eltsize;
if (dsize > 0) { /* precaution against integer overflow */
SEXP s;
#if SIZEOF_LONG > 4
if(dsize < R_LEN_T_MAX)
s = allocString(size); /**** avoid extra null byte?? */
else if(dsize < sizeof(double) * (R_LEN_T_MAX - 1))
s = allocVector(REALSXP, (int)(0.99+dsize/sizeof(double)));
else {
s = R_NilValue; /* -Wall */
error(_("cannot allocate memory block of size %.0f"), dsize);
}
#else
if(dsize > R_LEN_T_MAX)
error(_("cannot allocate memory block of size %.0f"), dsize);
s = allocString(size); /**** avoid extra null byte?? */
#endif
ATTRIB(s) = R_VStack;
R_VStack = s;
#if VALGRIND_LEVEL > 0
VALGRIND_MAKE_WRITABLE(CHAR(s), (int) dsize);
#endif
return CHAR(s);
}
else return NULL;
}
R_alloc
either calls
allocString
or
allocVector
, but since
allocString
is nothing more than a wrapper around
allocVector
,
allocVector
seems to handle most object allocation requests.
allocVector
This is where the rubber meets the road, if you will. Its prototype looks like this:
SEXP allocVector(SEXPTYPE type, R_len_t length);
So you tell it what you want and how big you want it, and it spits out an allocated R object (as long as there's memory available and you don't exceed R's limits). First, there's a large
switch
statement that calculates how much memory space it needs based on what kind of object you want and how big you want it. It does this through some macros like
BYTE2VEC
and
INT2VEC
that are defined as such (where
VECREC
is a struct/union that helps with stack maintenance):
#define BYTE2VEC(n) (((n)>0)?(((n)-1)/sizeof(VECREC)+1):0)
#define INT2VEC(n) (((n)>0)?(((n)*sizeof(int)-1)/sizeof(VECREC)+1):0)
Next, R determines whether or not it needs to garbage collect (along with doing some other stuff related to garbage collection). If there isn't enough space, R will clean up objects that aren't in use anymore to make room for new objects. As far as I can tell there are two different ways to classify R objects: large and small. Objects deemed as "small" are stored as nodes on a stack designated for "small" objects, and objects deemed as "large" are stores as vectors on a different stack (vector stack) from the "small" objects. Both stacks are maintained independently.
From the code:
Node Classes. Non-vector nodes are of class zero. Small vector nodes are in classes 1, ..., NUM_SMALL_NODE_CLASSES, and large vector nodes are in class LARGE_NODE_CLASS. For vector nodes the node header is followed in memory by the vector data, offset from the header by SEXPREC_ALIGN.
The node header is
VECREC
, which looks like this (from
src/include/Defn.h):
typedef struct {
union {
SEXP backpointer;
double align;
} u;
} VECREC, *VECP;
There is a lot more information in the source code comments about the heap structure, and it's pretty interesting stuff, although I won't go into detail about it here.
Garbage Collection
As I was stepping through
allocVector
I noticed some code dedicated to garbage collection, and I found some helpful comments at the top of
src/main/memory.c that explain what's happening.
There are 3 levels of garbage collection in R. A level 0 collection deals with the "youngest generation". Level 1 collections occur after a certain number of level 0 collections (i.e. 20), and level 2 collections occur after a certain number of level 1 collections (i.e. 5). So level 2 collections happen after every 100 level 0 collections. If a level N collection fails to free a certain amount of space, the next collection will be level N + 1.
There is a maximum and minimum size for the two object stacks (nodes and vectors), and these stacks grow and shrink according to a 6 or so constants named
R_NGrowFac
,
R_VGrowFac
,
R_NShrinkFac
, etc.
Relevant files
Here is a list of R files I waded through to get this information (relative to an unpacked R source tree):
The attached files are taken from the R-2.2.1 source tree.