R Programming Gotchas

R Gotchas

R has several 'features' that can trip up those who are not aware of them.

Non-Value Values

R contains several values that are treated specially. The atomic vector types (logical, integer, numeric, complex, character) each have an NA value. Lists can contain NULL. Numeric vectors have Inf, -Inf, and NaN values in addition to the NA value. All of these values are distinct from each other.

The NA Value

  • Stands for 'Not Available' and marks a missing value. It is best thought of as an unknown value that could be any possible value. For this reason almost any operation applied to an NA returns NA.
    • NA + 5 is equivalent to all possible values + 5 which should equal all possible values.
      NA + 5
      [1] NA
      
    • A[c(1,NA),] asks for row 1 and for an unknown row. Because the unknown row could be any row, R returns a row of NAs for it.
      A <- matrix(1:25, ncol=5)
      A[c(1,NA),]
           [,1] [,2] [,3] [,4] [,5]
      [1,]    1    6   11   16   21
      [2,]   NA   NA   NA   NA   NA
      
  • Two operations that do not always return NA when applied to an NA are the AND ("&") and the OR ("|") operations.
    • A & FALSE must be false. There is no possible value for A that would make this statement true. Therefore NA & FALSE is equal to FALSE.
      NA & FALSE
      [1] FALSE
      
    • A | TRUE must be true. There is no possible value for A that would make this statement false. Therefore NA | TRUE is equal to TRUE.
      NA | TRUE
      [1] TRUE
      
  • It is impossible to directly compare an NA to anything. To check whether a value is NA, the is.na() function must be used. Otherwise we are asking whether a value is equal to an unknown value, and the answer is itself unknown.
    5 == NA
    [1] NA
    
    is.na(NA)
    [1] TRUE
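
The same rule applies element by element to vectors. The following is a minimal sketch using a made-up vector x (not part of the original examples) that shows why x == NA cannot locate missing values and how is.na() and anyNA() can be used instead.

```r
# Hypothetical data: a numeric vector with one missing value.
x <- c(3, NA, 7)

# Comparing with == gives NA for every element, so it finds nothing.
eq_result <- x == NA

# is.na() is TRUE exactly where a value is missing.
missing_idx <- which(is.na(x))

# anyNA() is a quick check for whether any value is missing.
has_missing <- anyNA(x)

# Subsetting with !is.na() keeps only the observed values.
observed <- x[!is.na(x)]
```

Here missing_idx is 2 and observed holds 3 and 7, while eq_result contains nothing but NA.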
    

Some functions include a parameter for ignoring NA in the computation. For example, sum(..., na.rm=FALSE). By specifying na.rm=TRUE, you can get a non-NA result.
sum(c(1,5,10,NA))
[1] NA

sum(c(1,5,10,NA), na.rm=TRUE)
[1] 16
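
Other summary functions such as mean() and max() accept na.rm as well, but not every function does. The sketch below reuses the vector from the sum() example; the variable names are illustrative only. For a function without an na.rm parameter (cumsum() is one), the missing values can be dropped first.

```r
x <- c(1, 5, 10, NA)

# mean() and max() both accept na.rm.
avg     <- mean(x, na.rm = TRUE)   # mean of 1, 5 and 10
largest <- max(x, na.rm = TRUE)

# cumsum() has no na.rm parameter, so remove the NA values beforehand.
running <- cumsum(x[!is.na(x)])
```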

The NULL Value

  • NULL is a special value meaning essentially 'has no value'.
  • Comparing NULL to anything is an invalid question. To check whether a value is NULL, the is.null() function must be used.
    • The A == B statement asks whether the value A is the same as the value B.
    • The A == NULL statement asks whether the value A is the same as a value that does not exist. As there is no value to compare against, the operation returns a logical vector of length 0.
      A <- 5
      A == NULL
      logical(0)
      
      is.null(A)
      [1] FALSE
      
  • An element of a list that is equal to NULL means that this element contains no data.
  • Assigning the NULL value to an element of a list removes that element from the list.

Trying to access a list element by a name that does not exist also returns NULL, but trying to access an index greater than the length of the list with [[ ]] raises a 'subscript out of bounds' error.
l <- list()
l[[1]] <- 4
l$x
NULL
l[[2]]
Error in l[[2]] : subscript out of bounds
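
Because a missing name and a stored NULL both return NULL, it can help to see how assignment behaves. This is a small sketch with a made-up list; it shows that assigning NULL shrinks the list, and that a NULL can still be stored on purpose by assigning list(NULL) with single brackets.

```r
l <- list(a = 1, b = 2, c = 3)

# Assigning NULL removes the element entirely; the list shrinks.
l$b <- NULL
len_after_removal <- length(l)   # 2

# To keep a slot whose value is NULL, assign list(NULL) with single brackets.
l["b"] <- list(NULL)
len_after_store <- length(l)     # 3

# l$b and l$z both return NULL, so check names() to tell them apart.
has_b <- "b" %in% names(l)       # TRUE: the slot exists and holds NULL
has_z <- "z" %in% names(l)       # FALSE: there is no such slot
```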

The Inf, -Inf, and NaN Values

  • Inf, -Inf, and NaN are special numeric values.
  • Inf and -Inf represent positive and negative infinity and behave accordingly.
    A <- 1/0
    B <- -1/0
    A
    [1] Inf
    B
    [1] -Inf
    5 * A
    [1] Inf
    
  • NaN is short for Not a Number. It is the result of any undefined mathematical operation.
    0/0
    [1] NaN
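
These values can be told apart with the is.na(), is.nan(), is.finite(), and is.infinite() functions. The sketch below uses an illustrative vector; note that is.na() is also TRUE for NaN, and that NaN cannot be tested with == any more than NA can.

```r
vals <- c(1, NA, NaN, Inf, -Inf)

na_flags  <- is.na(vals)        # TRUE for both NA and NaN
nan_flags <- is.nan(vals)       # TRUE only for NaN
fin_flags <- is.finite(vals)    # TRUE only for ordinary numbers
inf_flags <- is.infinite(vals)  # TRUE for Inf and -Inf

undefined <- Inf - Inf          # NaN: infinity minus infinity is undefined
nan_equal <- NaN == NaN         # NA, not TRUE
```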
    

Environments

  • Locations where objects are stored.
  • All environments (except the empty environment) have a parent environment.
  • Each function call is executed in its own environment, whose parent is the environment in which the function was defined (not necessarily the one it was called from).
  • Values stored in a parent environment are visible from the child environment.
    a <- 1
    test <- function() {
       print(a)
       invisible(NULL)
    }
    test()
    [1] 1
    
  • However, assigning to a name inside the child environment creates a new local variable there. It does not change the variable of the same name in the parent environment.
    a <- 1
    test <- function() {
       print(a)
       a <- 2
       print(a)
       invisible(NULL)
    }
    
    test()
    [1] 1
    [1] 2
    print(a)
    [1] 1
    
  • A parent environment cannot access any values from the child environment.
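
The points above can be sketched as follows. The function names are made up for illustration. The example contrasts a plain assignment with the <<- operator, which assigns in an enclosing environment, and shows that a free variable is looked up where the function was defined rather than where it was called.

```r
a <- 1

# A plain assignment creates a new local variable; the outer 'a' is untouched.
set_local <- function() {
   a <- 2
   invisible(NULL)
}
set_local()
a_after_local <- a      # still 1

# '<<-' searches the enclosing environments for 'a' and assigns there.
set_outer <- function() {
   a <<- 3
   invisible(NULL)
}
set_outer()
a_after_outer <- a      # now 3

# Lookup follows where a function was defined, not where it is called from.
show_a <- function() a
caller <- function() {
   a <- 100             # local to caller(); show_a() does not see it
   show_a()
}
caller_result <- caller()   # 3, the outer value
```

Using <<- makes a function depend on state outside itself, so returning a value and assigning it in the caller is usually easier to follow.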

Object Name Confusion

Object name confusion occurs when the variable that you think you are using is not the same as the variable that you are actually using.
  • This means that a typo can produce a function that runs without error but silently uses the wrong value from a parent environment.
    • This is the function we want to write:
      test1 <- function(cat) {
         5 + cat
      }
      
      test1(5)
      [1] 10
      
    • This is the function we actually wrote:
      test1 <- function(cat) {
         5 + car     ## Typo should be cat
      }
      
      test1(5)
      Error in test1(5) : object 'car' not found
      
    • What happens if the object 'car' exists in your working environment?
      car <- 2
      test1(5)
      [1] 7
      
  • When attach(), with(), or within() is used, it can be unclear which objects are being referenced, because names are looked up in the data frame before the surrounding environment.
    • Let's say we want to add the object 'mod' to column 'a' in the data frame 'junk'.
      mod <- 15
      junk <- data.frame(a = 1:10)
      
      with(junk, a + mod)     
      [1] 16 17 18 19 20 21 22 23 24 25
      
    • What happens if there is a column in 'junk' named 'mod'?
      mod <- 15
      junk <- data.frame(a = 1:10, mod = 6:15)
      
      with(junk, a + mod)     
      [1]  7  9 11 13 15 17 19 21 23 25
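
Two defensive habits can reduce this kind of confusion. The sketch below is only one possible approach. It names the data frame explicitly instead of relying on with(), and it applies a crude typo check that assumes every variable used in a function body should be one of its arguments (local variables would be reported too, so this is a rough aid and not a real linter).

```r
mod <- 15
junk <- data.frame(a = 1:10, mod = 6:15)

# Naming the data frame makes it clear that 'a' is a column
# and 'mod' is the free-standing variable.
explicit <- junk$a + mod

# Crude check: list the names used in the body that are not arguments.
test1 <- function(cat) {
   5 + car     ## Typo should be cat
}
unexpected <- setdiff(all.vars(body(test1)), names(formals(test1)))
```

Here explicit holds 16 through 25, and unexpected contains "car", which flags the typo before the function is ever called.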
      

TRUE and FALSE vs. T and F

  • Always use TRUE and FALSE. They are reserved words and cannot be reassigned, whereas T and F are ordinary variables that can be given other values.
    T <- 0
    TRUE == T
    [1] FALSE
    
    TRUE <- 0
    Error in TRUE <- 0 : invalid (do_set) left-hand side to assignment
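
A reassigned T can change results silently. The following sketch (variable names are illustrative) shows a call to sum() that stops removing NA values once T has been masked, and that removing the masking variable restores the usual meaning.

```r
x <- c(1, 5, 10, NA)

T <- 0                           # an ordinary variable now masks base R's T
masked <- sum(x, na.rm = T)      # na.rm is effectively FALSE, so the result is NA

rm(T)                            # remove the masking variable; T is TRUE again
restored <- sum(x, na.rm = T)    # 16

safe <- sum(x, na.rm = TRUE)     # unaffected by any variable named T
```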
    

&& vs &

  • Logical operators come in two forms: vectorized and non-vectorized. The single-character version ('&') is vectorized and the double-character version ('&&') is non-vectorized.
    • The vectorized operation works element by element, returning one result per element.
      a <- c(1:10)
      b <- c(1:10)
      a < 7 & b > 3
      [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
      
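    • The non-vectorized operation expects single (length-one) values and returns a single value. Older versions of R silently used only the first element of longer arguments; since R 4.3.0, passing a longer vector is an error.
      a <- c(1:10)
      b <- c(1:10)
      a[1] < 7 && b[1] > 3
      [1] FALSE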
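
When a single TRUE or FALSE is needed from whole vectors, all() or any() can collapse the element-wise result. The sketch below also illustrates that '&&' stops evaluating as soon as the answer is known, which makes it suitable for guard conditions. The variable x is a made-up example.

```r
a <- 1:10
b <- 1:10

# Collapse an element-wise result to a single value.
all_ok <- all(a < 7 & b > 3)   # FALSE: not every element passes
any_ok <- any(a < 7 & b > 3)   # TRUE: elements 4, 5 and 6 pass

# '&&' evaluates its right-hand side only when the left-hand side is TRUE.
x <- NULL
positive <- !is.null(x) && x > 0   # FALSE; 'x > 0' is never evaluated
```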
      

Partial Argument Matching

R performs partial argument matching on all function calls. It attempts to match names of the arguments passed to the function with the names of defined arguments of the function.
  • Partial argument matching can cause difficulties when attempting to pass arguments through the '...' argument.
    test1 <- function(x=5, b=2) {
       b*x-5
    }
    
    test2 <- function(f, bob="Hi", ...) {
       print(bob)
       f(...)
    }
    
    test2(test1)
    [1] "Hi"
    [1] 5
    
    test2(test1, b=8)
    [1] 8
    [1] 5
    
  • We would expect b=8 to alter the value returned by the function. Looking closer at the arguments of test2() reveals the answer. test2() has an argument 'bob' that comes before the '...' argument. Partial argument matching maps b=8 to bob=8, so no unmatched arguments remain to be passed through the '...' argument.
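
One way to avoid this is to place such arguments after '...', because arguments that follow '...' are matched only by their full names. The sketch below is a variation on test2() from the example above, not the original author's code.

```r
test1 <- function(x=5, b=2) {
   b*x-5
}

# 'bob' now comes after '...', so only the full name bob= can match it.
test2 <- function(f, ..., bob="Hi") {
   print(bob)
   f(...)
}

result <- test2(test1, b=8)   # prints "Hi"; b=8 is passed on to test1()
result                        # 35
```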

The read.table() Function

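  • Before R 4.0.0, the read.table() and data.frame() functions converted string vectors into factor vectors by default. Since R 4.0.0 the default is stringsAsFactors=FALSE, so the examples below set stringsAsFactors=TRUE explicitly.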
  • NA value conversion
    • By default read.table() converts the string value "NA" to the R NA value.
    • For non-character vectors the zero length string value "" is converted to the R NA value.
    • For character vectors the zero length string value is kept as is.
      cat('A,B
      1,a
      ,b
      5,
      NA,"NA"
      6,NA
      ', file="tmp.csv")
      read.table(file="tmp.csv", sep=',', header=TRUE, stringsAsFactors=TRUE)
         A    B
      1  1    a
      2 NA    b
      3  5
      4 NA <NA>
      5  6 <NA>
      
  • By default read.table() treats "#" as a comment character. It ignores everything from the "#" to the end of the line.
    cat('A,B
    7,Patient #5
    8,Patient #8
    ', file='tmp.csv')
    read.table(file="tmp.csv", sep=',', header=TRUE, stringsAsFactors=TRUE)
      A        B
    1 7 Patient
    2 8 Patient
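
Comment handling can be switched off with the comment.char argument. The sketch below reads the same data through the text argument of read.table() instead of a temporary file; the variable names are illustrative.

```r
csv_text <- 'A,B
7,Patient #5
8,Patient #8
'

# comment.char="" turns comment handling off, so '#' is ordinary data.
dat <- read.table(text=csv_text, sep=',', header=TRUE,
                  comment.char="", stringsAsFactors=FALSE)
dat$B
```

With comment handling off, column B keeps the full strings "Patient #5" and "Patient #8". The read.csv() wrapper already uses comment.char="" by default.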
    

Programming Tips For Statisticians

-- CharlesDupont - 23 Jun 2009
Topic revision: r3 - 29 Jun 2009, WillGray
 
