How to write your own functions and good R programming techniques

Good R Programming Techniques

R executes code quite differently from most programming languages. Because of these eccentricities, there are several guidelines to keep in mind when writing functions that run faster and consume less memory. All of the guidelines are optional; however, the larger the dataset, the greater the benefit of following them.

  • Avoid recursion. Every recursive call adds a new frame to R's call stack, so recursive functions run slower and consume far more memory than an equivalent loop or vectorized expression.
  • If the same complex calculation is used several times in a function and its value does not change, consider computing it once and assigning it to its own variable.
    • Before:
      w <- log(x) + log(y) + 10
      v <- log(x) + log(y) + 32
      
    • After:
      t <- log(x) + log(y)
      w <- t + 10
      v <- t + 32
      

  • If you are only using a variable once, consider eliminating that variable.
    • Before:
      f <- function(n = 125000) {
        x <- runif(n)
        sum(x + 1)
      }
      
    • After:
      f <- function(n = 125000) {
        sum(runif(n) + 1)
      }
      
  • When temporary variables are needed, using the same name for ones of the same size that are not required simultaneously can avoid unneeded copying.
    • Before:
      g <- function(n = 125000) {
        tmp <- runif(n)
        tmp1 <- 2 * tmp + tmp^2
        tmp2 <- tmp1 - trunc(tmp1)
        mean(tmp2 > 0.5)
      }
      
    • After:
      g1 <- function(n = 125000) {
        tmp <- runif(n)
        tmp <- 2 * tmp + tmp^2
        tmp <- tmp - trunc(tmp)
        mean(tmp > 0.5)
      }
      

  • Try not to use loops.
    • Explicit loops in R are slow compared with vectorized operations. In this example 15 is added to each element of a 100000-element vector, and four approaches are timed.
      n <- 100000
      a <- runif(n)
      d <- vector(mode=typeof(a), n)
      
      system.time(a + 15, gcFirst=TRUE)
      system.time(lapply(a, function(x) x + 15), gcFirst=TRUE)
      system.time(sapply(a, function(x) x + 15), gcFirst=TRUE)
      system.time(for(i in seq(along.with=a)) d[i] <- a[i] + 15, gcFirst=TRUE)
      
    • Use vectorized arithmetic, sapply, or lapply instead.
      • Before
        over.thresh <- function(x, threshold) {
          for (i in seq_along(x))
            if (x[i] < threshold)
              x[i] <- 0
          x
        }
        
      • After
        over.thresh2 <- function(x, threshold) {
          x[x < threshold] <- 0
          x
        }
        

  • For operations on the individual elements of a list, use the apply family of functions, such as lapply, sapply, tapply, etc.

  • Avoid looping over a named data set. If necessary save any names by tnames <- names(x) and then remove them by names(x) <- NULL, perform the loop, then reassign the names by names(x) <- tnames.

  • Avoid growing data sets in a loop. Always create a data set of the desired size before entering the loop; this greatly improves memory allocation. If you don't know the exact size, overestimate it and then shorten the vector at the end of the loop.
    • Before:
      grow <- function() {
        nrow <- 1000
        x <- NULL
        for(i in 1:(nrow)) {
          x <- rbind(x, i:(i+9))
        }
        x
      }
      
      system.time(grow(), gcFirst=TRUE)
      
    • After:
      no.grow <- function() {
        nrow <- 1000
        x <- matrix(0, nrow = nrow, ncol = 10)
        for(i in 1:nrow) {
          x[i, ] <- i:(i + 9)
        }
        x
      }
      
      system.time(no.grow(), gcFirst=TRUE)
      
    • When an element is added to an existing vector, R allocates a new vector whose length is that of the current vector plus the additional element. It then copies the existing vector and the new element into the new vector. In contrast, overwriting an element in a vector requires copying only the replacement element.
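The overestimate-then-shorten advice above can be sketched as follows. The function name collect.evens and its input are hypothetical, chosen only for illustration. In real code a vectorized filter such as values[values %% 2 == 0] would be simpler, so treat this as a sketch of the allocation pattern for loops that cannot be vectorized.

```r
# Hypothetical example: keep only the even values of a vector.
# The result can never be longer than the input, so the input
# length is a safe overestimate of the size needed.
collect.evens <- function(values) {
  out <- numeric(length(values))   # allocate once, before the loop
  k <- 0                           # number of slots actually used
  for (v in values) {
    if (v %% 2 == 0) {
      k <- k + 1
      out[k] <- v                  # overwrite a slot, no growing
    }
  }
  out[seq_len(k)]                  # shorten to the part that was filled
}

collect.evens(1:10)   # [1]  2  4  6  8 10
```

Using seq_len(k) rather than 1:k keeps the final step correct when no element was kept, because seq_len(0) is empty while 1:0 is not.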

These are some good programming techniques in general.
  • Always use parentheses to make groupings explicit.
    • In x < 5 && y > 10 || z < 6 it is not clear which grouping is intended.
      x <- 6
      y <- 10
      z <- 5
      w <- 15
      
      x < 5 && y > 10 || z < 6 && w < 25      # TRUE
      (x < 5 && y > 10) || (z < 6 && w < 25)  # TRUE
      x < 5 && (y > 10 || z < 6) && w < 25    # FALSE
      (x < 5 && y > 10 || z < 6) && w < 25    # TRUE
      x < 5 && (y > 10 || z < 6 && w < 25)    # FALSE
      
    • -2^2 equals -4 not 4 like (-2)^2.

  • Always use { and } in functions, loops, if, else, and other statements.
    • Example of ambiguity:
      test <- function(x, y) {
        if(x)
          if(y) 5
        else 6
      }
      
      (test(TRUE, TRUE))       # [1] 5
      (test(TRUE, FALSE))      # [1] 6
      (test(FALSE, TRUE))      # NULL
      (test(FALSE, FALSE))     # NULL
      

  • Use the return() function at the end of functions.
  • Always use TRUE and FALSE instead of T and F.
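The ambiguity in the brace example above goes away once every if and else has its own braces. The two functions below are a sketch with hypothetical names (test.inner and test.outer); they show how the braces decide which if the else belongs to.

```r
# else belongs to the inner if: same behaviour as the ambiguous version
test.inner <- function(x, y) {
  if (x) {
    if (y) {
      5
    } else {
      6
    }
  }
}

# else belongs to the outer if: a different function altogether
test.outer <- function(x, y) {
  if (x) {
    if (y) {
      5
    }
  } else {
    6
  }
}

test.inner(TRUE, FALSE)   # [1] 6
test.outer(TRUE, FALSE)   # NULL
test.outer(FALSE, TRUE)   # [1] 6
```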

Writing Functions

Functions in R can do 3 things:
  1. Be passed values.
  2. Return a value.
  3. Produce side effects, meaning anything the function causes other than returning a value, such as the text output of print() or the opening of a dvi viewer by print.latex(). We are not going to deal with this topic in this lecture.

The R function Statement

The basic R function statement looks like this
FUNname <- function( arglist ) { code }

  • FUNname is the name that you have selected as your function name.
    • You can use any valid name that is not a reserved word as your function name.
    • However, a function and a variable cannot share the same name in the same environment; assigning one replaces the other.

  • arglist is a comma-separated list of 0 or more arguments that can be passed to the function.
  • code is the statements that perform the actions of the function.
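As a small sketch of the naming rule above, the following defines a function and then reuses its name for a number. The name square is hypothetical. After the second assignment the name refers to the number, and the function is gone from this environment.

```r
# Define a function under the name 'square'
square <- function(x) {
  x * x
}
is.function(square)   # [1] TRUE
square(3)             # [1] 9

# Reusing the name for a variable replaces the function
square <- 4
is.function(square)   # [1] FALSE
```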

Function Return Values

Functions are designed to return values. It is called returning because the value is taken from the function and handed back to the calling code.
  • Functions have 2 ways to return values.
    1. The value of the last statement evaluated in the function.
      • Example:
        f1 <- function() {
          10
        }
        
        > f1()
        [1] 10
        

    2. The value passed to the return statement. Calling a return statement always causes the function to return immediately, so any code after it is not run.
      • Example:
        f2 <- function() {
          return(20)
          10
        }
        
        > f2()
        [1] 20
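The two ways of returning a value are often combined: return() is used for an early exit, and the last statement supplies the normal result. The following is a minimal sketch; the function name safe.sqrt is hypothetical.

```r
# Return NA early for negative input, otherwise fall through
# to the last statement, whose value becomes the result.
safe.sqrt <- function(x) {
  if (x < 0) {
    return(NA)
  }
  sqrt(x)
}

safe.sqrt(16)   # [1] 4
safe.sqrt(-1)   # [1] NA
```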
        

Function Arguments

The function argument statement looks like this
VARname
or
VARname = VALUE

  • Function arguments are how values are passed to the function by the calling code.
  • By convention the first argument is the main data object being passed to the function. The class of first argument is used for matching S3 methods.
  • The function arguments are treated like variables inside the functions code.
    • Example:
      f3 <- function(x) {
        x + 5
      }
      
      > f3(5)
      [1] 10
      

  • Arguments can be given a default value, which is used if the calling code does not explicitly set the argument. This is done using the VARname = VALUE form.
    • Example:
      f4 <- function(x, y=5) {
        x + y
      }
      
      > f4(5)
      [1] 10
      > f4(5,7)
      [1] 12
      

  • You can test to see if the calling code has set an argument to a value using the test function missing. Function missing(x) returns a logical TRUE if the value of x has not been set by the calling code.
    • Example:
      f5 <- function(x) {
        if( missing(x) ) {
          return("x is missing")
        } else {
          return("x is not missing")
        }
      }
      
      > f5()
      [1] "x is missing"
      > f5(10)
      [1] "x is not missing"
      

  • The arglist can also have a special type of argument, ... . This argument can hold a variable number of arguments. In R functions it is mostly used for passing parameters on to other functions.
    • Example:
      f6 <- function(z, ...) {
        print(paste("The value of z is", z, sep=" "))
        
        paste("The value of f5(...) is", f5(...), sep=" ")
      }
      
      > f6(5, x=3)
      [1] "The value of z is 5"
      [1] "The value of f5(...) is x is not missing"
      > f6(5)
      [1] "The value of z is 5"
      [1] "The value of f5(...) is x is missing"
      
      f6.5 <- function(z, ...) {
        print(paste("The value of z is", z, sep=" "))
        
        b <- list(...)
        cat(paste(names(b), b, sep=':', collapse=" "))
        cat("\n")
      }
      
      f6.5(5)
      f6.5(5, a="sdd", "desg")
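The remark above that the class of the first argument is used for matching S3 methods can be sketched as follows. The generic describe and its methods are hypothetical names made up for this illustration; UseMethod() is the standard base R dispatch mechanism.

```r
# A generic function: dispatches on the class of its first argument
describe <- function(x, ...) {
  UseMethod("describe")
}

# Method used when x is numeric
describe.numeric <- function(x, ...) {
  paste("a number with mean", mean(x))
}

# Method used when x is a character vector
describe.character <- function(x, ...) {
  paste("text of length", length(x))
}

# Fallback for every other class
describe.default <- function(x, ...) {
  "something else"
}

describe(c(1, 2, 3))   # [1] "a number with mean 2"
describe("abc")        # [1] "text of length 1"
describe(TRUE)         # [1] "something else"
```

Each method keeps ... in its argument list so that extra arguments passed to the generic are accepted rather than causing an error.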
      

Useful Functions for Use in Functions

There are many functions that are designed to only be used in other functions. In this section we go over some of the more useful functions.

  • The as. family of coercion functions, such as as.vector(x), as.data.frame(x), or as.double(x). These functions are used to change the data type of x to a new data type.
    • Example:
      f7 <-  function(x) {
        as.character(x)
      }
      
      > class(f7(7))
      [1] "character"
      > class(f7("a"))
      [1] "character"
      

  • The is. family of test functions, such as is.null(), is.numeric(), and is.data.frame(). These functions are used to tell what kind of object the function has been passed.
    • Example:
      f8 <- function(x) {
        is.character(x)
      }
      
      > f8(9)
      [1] FALSE
      > f8("abc")
      [1] TRUE
      

  • The on.exit() function records the expression passed to it and executes that expression when the function exits, either naturally or as the result of an error.
    • Example:
      f9 <- function() {
        opar <- par(mai = c(1,1,1,1))
        on.exit(par(opar))
        par()$mai
      }
      
      par()$mai     # [1] 1.0627614 0.8543768 0.8543768 0.4376077
      f9()          # [1] 1 1 1 1
      par()$mai     # [1] 1.0627614 0.8543768 0.8543768 0.4376077
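The helper functions in this section are typically combined. The sketch below uses hypothetical function names (to.number and risky.step): the first tests its input with is. functions before coercing with an as. function, and the second shows that an on.exit() expression still runs when the function stops with an error. It uses options() rather than par() so that no graphics device is opened.

```r
# Test the type first, then coerce only when it makes sense
to.number <- function(x) {
  if (is.numeric(x)) {
    return(x)
  }
  if (is.character(x)) {
    return(as.numeric(x))
  }
  stop("x must be numeric or character")
}

to.number("3.5")   # [1] 3.5

# on.exit() restores the setting even though the function fails
risky.step <- function() {
  old <- options(digits = 3)   # options() returns the previous values
  on.exit(options(old))        # registered before anything can go wrong
  stop("something went wrong")
}

before <- getOption("digits")
try(risky.step(), silent = TRUE)
identical(getOption("digits"), before)   # [1] TRUE
```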
      

R Gotchas

R has several 'features' that can trip up those who are not aware of them.

Environments

  • Environments are the places where R stores its objects (variables, functions, etc.).
  • All environments (except the empty environment) have a parent environment.
    • Code running in an environment can read objects in the parent environment. If it assigns to a name that exists in the parent environment using <-, the parent's object is not changed; instead a new object holding the new value is created in the current environment. The new object now masks the object from the parent environment. Use the alternative assignment operator <<- to modify variables in the parent environment.
      a <- 1
      test <- function() {
         print(a)
         a <- 2
         print(a)
         invisible(NULL)
      }
      test()
      print(a)
      test2 <- function() {
         print(a)
         a <<- 3
         print(a)
         invisible(NULL)
      }
      test2()
      print(a)
      
  • Each call to a function receives its own environment, whose parent is the environment in which the function was defined. When the function returns, that environment is discarded unless something still refers to it.
    • Because of this, errors of omission can lead to very odd bugs.
      • This is the correct code
        a <- 5
        test1 <- function(b=1) {
           a <- 10 * b
           if(a > 5) {
              return(45)
           }
           return(b)
        }
        
        test1()        # [1] 45
        
      • Here is the same code, but a critical line has been commented out, leaving a function that still runs but produces incorrect output because it silently picks up the a defined outside the function.
        a <- 5
        test2 <- function(b=1) {
           #a <- 10 * b
           if(a > 5) {
               return(45)
           }
           return(b)
        }
        
        test2()       # [1] 1
        
    • When debugging, it's a good idea to check that all of the variables in question have been assigned in the function.
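The lookup behaviour behind this kind of bug can be sketched as follows. All function names here (show.a, caller, has.local.a) are hypothetical. A variable that is not assigned inside a function is looked up in the environment where the function was defined, so a local variable of the same name in the calling function has no effect. One way to check whether a name is really local is exists() with inherits = FALSE.

```r
a <- 5

# 'a' is not assigned inside show.a, so it is found where
# show.a was defined
show.a <- function() {
  a
}

# The local 'a' in caller is invisible to show.a
caller <- function() {
  a <- 100
  show.a()
}

caller()   # [1] 5

# Check whether 'a' has been assigned in the function itself
has.local.a <- function(b = 1) {
  # a <- 10 * b    # the omitted line
  exists("a", envir = environment(), inherits = FALSE)
}

has.local.a()   # [1] FALSE
```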

T and F vs. TRUE and FALSE

  • Always use TRUE and FALSE instead of T and F. TRUE and FALSE are reserved words, while T and F are ordinary variables that can be reassigned.
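A short sketch of why T and F are risky: they are ordinary variables whose default values are TRUE and FALSE, so any assignment to them changes the meaning of code that relies on them.

```r
T <- 0                     # legal: T is just a variable name
isTRUE(T)                  # [1] FALSE
if (T) "yes" else "no"     # [1] "no"

rm(T)                      # remove the local copy
isTRUE(T)                  # [1] TRUE, the built-in value is visible again
```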

Object name confusion

  • If the attach or with functions are used, there is a good chance that this will lead to confusion on the user's part about which objects are being referenced.
    • Let's say we want to add the object mod to column a in the data frame junk.
      mod <- 15
      junk <- data.frame(a = 1:10)
      
      with(junk, a + mod)     # [1] 16 17 18 19 20 21 22 23 24 25
      
    • What happens if there is a column in 'junk' named 'mod'? The column masks the object mod, and the result silently changes.
      mod <- 15
      junk <- data.frame(a = 1:10, mod = 6:15)
      
      with(junk, a + mod)     #[1]  7  9 11 13 15 17 19 21 23 25
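One way to avoid this confusion, sketched below with the same objects as in the example above, is to name the data frame explicitly with $ so that it is always clear which mod is meant.

```r
mod <- 15
junk <- data.frame(a = 1:10, mod = 6:15)

# The workspace object 'mod' is used, regardless of the column
junk$a + mod        # [1] 16 17 18 19 20 21 22 23 24 25

# The column 'mod' is used, and the code says so
junk$a + junk$mod   # [1]  7  9 11 13 15 17 19 21 23 25
```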
      

Basic Intro to debugging in R

The 4 most important debugging functions in R are cat(), debug(), traceback(), and str().
  • cat() - This is good for determining whether a part of the code is being run.
  • str() - This is the key to determining what an object actually contains. It gives a compact summary of the contents of the object. It takes a while to learn to read its output easily, but it is invaluable for determining the structure of complex lists.
  • traceback() - When a function dies with an error, the user can call traceback(). This prints the stack trace at the time of the error. The higher the number, the deeper in the call stack the function is.
  • debug() - Marks a function so that the debugger is invoked whenever it is called.
    • To advance a line, press Enter at an empty prompt or enter n.
    • To continue to the next debugged function call, enter c.
    • To print a stack trace of all active function calls, enter where.
    • To quit, enter Q.
    • Anything else entered at the prompt is evaluated as an expression in the current environment.
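The following sketch shows cat() and str() used inside a function, and how a function is marked and unmarked for debugging. The function fit.summary is hypothetical. The debugger itself is interactive, so the marked function is not called here; isdebugged() only reports whether the mark is set.

```r
fit.summary <- function(x) {
  # cat() confirms that this part of the code is reached
  cat("fit.summary: received", length(x), "values\n")

  result <- list(n = length(x),
                 stats = list(mean = mean(x), range = range(x)))

  # str() shows the structure of the nested list
  str(result)
  result
}

out <- fit.summary(c(2, 4, 9))

debug(fit.summary)         # the next call would enter the debugger
isdebugged(fit.summary)    # [1] TRUE
undebug(fit.summary)       # remove the mark again
isdebugged(fit.summary)    # [1] FALSE
```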
-- CharlesDupont - 03 Jun 2005
Topic revision: r7 - 13 Mar 2008, WillGray
 
