R Cook Book

This page is dedicated to the little R functions that we write to do various things. Please feel free to add to the list or make improvements to current functions. If you do add (or change) something, please put your code into a (existing or new) category and include a small description that explains what your code does (for extra points include an example or two as well). Thanks!

String Manipulation

Splitting into segments

Here's a function to split a string into segments of length n. The default value of n is 80, since most terminals have rows that are 80 characters long. strSegs <- function(x, n=80) { retval <- c() while(nchar(x) > n) { segment <- substr(x, 1, n) x <- substr(x, n+1, nchar(x))

retval <- c(retval, segment) } retval <- c(retval, x) }
Example: myString <- "This is a string that I want to split into segments." segments <- strSegs(myString, 10) cat(segments, sep="\n")

Data Frame Manipulation

Multisort

Here's a function that sorts each column of a data frame in decreasing or increasing order (the normal sort method only does either decreasing or increasing for all columns). I took the idea for this function from Cole Beck. You give the function your data frame and a series of name/value pairs, where each name is the name of a column, and each value is either TRUE or FALSE (TRUE for a decreasing sort and FALSE for an increasing sort.) The sort precedence goes from left to right. You do not have to specify all columns, but you must specify at least one. multisort <- function(x, ...) { # given the data frame x, and a list of name/value pairs, sort pairList <- list(...)

# make sure names from ... are valid tmp <- intersect(names(pairList), names(x)) if (length(tmp) = 0 || tmp names(pairList)) { stop("Invalid column names.") }

retval <- x for (i in length(pairList):1) { name <- names(pairList[i]) if (pairListi = TRUE && pairListi = FALSE) { # make sure values are TRUE or FALSE only stop(paste("Invalid value (", pairListi, ") for ", name, " (should be TRUE or FALSE).", sep="")) }

retval <- retval[order(retval[,name], decreasing=pairListi),] } return(retval) }

Example: x <- data.frame(uno=sample(c("a","b","c"), 10, replace=TRUE), dos=sample(c("x","y","z"), 10, replace=TRUE), tres=sample(1:10, 10)) multisort(x, uno=TRUE, dos=FALSE) multisort(x, dos=FALSE, uno=TRUE, tres=FALSE)

order.data.frame

The multisort function is inefficient because it has to rearrange the data several times. This function produces the order necessary to sort a data.frame or matrix based on a set of columns, how to handle NAs for each, and whether the column(s) should be decreasing or increasing. The sort precedence goes from left to right.

order.data.frame <- function(x, ..., na.last=TRUE, decreasing=FALSE) { len <- ncol(x) len2 <- length(na.last) len3 <- length(decreasing) if (len < len2) na.last <- na.last[1 : len] if (len > len2) na.last[(len2 + 1) : len] <- TRUE if (len < len3) decreasing <- decreasing[1 : len] if (len > len3) decreasing[(len3 + 1) : len] <- FALSE ox <- 1:nrow(x) for (i in len:1) { ox <- order(x[,i], order(ox) * (1 - 2 * decreasing[i]), na.last=na.last[i], decreasing=decreasing[i]) } return(ox) }

Example: x <- data.frame(uno=sample(c("a","b","c"), 10, replace=TRUE), dos=sample(c("x","y","z"), 10, replace=TRUE), tres=sample(1:10, 10)) x[order.data.frame(x[c("uno", "dos")], decreasing=c(TRUE, FALSE)), ] x[order.data.frame(x[c("dos", "uno", "tres")], decreasing=c(FALSE, TRUE, FALSE)), ]

Printing (small) data.frames

I wanted print.data.frame to look more like the results of a query in MySql. So I overwrote it with this: setMethod("print", signature(x="data.frame"), function(x, ...) { size=c() # for each column for(j in 1:dim(x)[2]) { len=nchar(names(x)[j]) for(i in 1:dim(x)[1]) { if(nchar(as.character(x[i,j])) > len) { len<-nchar(as.character(x[i,j])) } } size<-append(size, len+2) } # size is the number of '-' characters to display newrow="" for(s in size) { newrow<-paste(c(newrow,"+",rep("-",s)),collapse="") } newrow<-paste(c(newrow,"+\n"),collapse="") # newrow is an empty row to print out header="" for(j in 1:dim(x)[2]) { diff=size[j]-nchar(names(x)[j])-1 header<-paste(c(header,"| ",names(x)[j],rep(" ",diff)),collapse="") } header<-paste(c(header,"|\n"),collapse="") cat(newrow) cat(header) cat(newrow) for(i in 1:dim(x)[1]) { row="" for(j in 1:dim(x)[2]) { diff=size[j]-nchar(as.character(x[i,j]))-1 row<-paste(c(row,"| ",as.character(x[i,j]),rep(" ",diff)),collapse="") } row<-paste(c(row,"|\n"),collapse="") cat(row) } cat(newrow) } ) This may not look very good for a data.frame that has many columns. (someone want to extend that functionality?)

Here's an example:

data.frame(Car=c(16, 22, 07, 31, 2, 5, 8, 99, 7, 24, 66, 11, 29, 88, 48, 9, 17, 32, 43, 14, 6, 19, 26, 42, 01, 12, 45, 96, 10, 38, 21, 41, 20, 40, 1, 25, 55, 4, 18), Driver=c("Greg Biffle", "Dave Blaney", "Clint Bowyer", "Jeff Burton", "Kurt Busch", "Kyle Busch", "Dale Earnhardt", "Carl Edwards", "Robby Gordon", "Jeff Gordon", "Jeff Green", "Denny Hamlin", "Kevin Harvick", "Dale Jarrett", "Jimmie Johnson", "Kasey Kahne", "Matt Kenseth", "Travis Kvapil", "Bobby Labonte", "Sterling Marlin", "Mark Martin", "Jeremy Mayfield", "Jamie McMurray", "Casey Mears", "Joe Nemechek", "Ryan Newman", "Kyle Petty", "Tony Raines", "Scott Riggs", "Elliott Sadler", "Ken Schrader", "Reed Sorenson", "Tony Stewart", "David Stremme", "Martin Truex", "Brian Vickers", "Michael Waltrip", "Scott Wimmer", "J.J. Yeley"), Crew_Chief=c("Doug Richert", "Kevin Hamlin", "Gil Martin", "Scott Miller", "Roy McCauley", "Alan Gustafson", "Tony Eury Jr.", "Wally Brown", "Greg Erwin", "Steve Letarte", "Bootie Barker", "Mike Ford", "Todd Berrier", "Slugger Labbe", "Chad Knaus", "Kenny Francis", "Robbie Reiser", "Gary Putnam", "Todd Parrott", "Doug Randolph", "Pat Tryson", "Chris Andrews", "Bob Osborne", "Donnie Wingo", "Ryan Pemberton", "Matt Borland", "Paul Andrews", "Philippe Lopez", "Rodney Childers", "Tommy Baldwin Jr.", "David Hyder", "Jimmy Elledge", "Greg Zipadelli", "Steven Lane", "Kevin Manion", "Lance McGrew", "Joe Shear Jr.", "Chris Carrier", "Steve Addington"))

And here's the output:
+-----+-----------------+-------------------+
| Car | Driver          | Crew_Chief        |
+-----+-----------------+-------------------+
| 16  | Greg Biffle     | Doug Richert      |
| 22  | Dave Blaney     | Kevin Hamlin      |
| 7   | Clint Bowyer    | Gil Martin        |
| 31  | Jeff Burton     | Scott Miller      |
| 2   | Kurt Busch      | Roy McCauley      |
| 5   | Kyle Busch      | Alan Gustafson    |
| 8   | Dale Earnhardt  | Tony Eury Jr.     |
| 99  | Carl Edwards    | Wally Brown       |
| 7   | Robby Gordon    | Greg Erwin        |
| 24  | Jeff Gordon     | Steve Letarte     |
| 66  | Jeff Green      | Bootie Barker     |
| 11  | Denny Hamlin    | Mike Ford         |
| 29  | Kevin Harvick   | Todd Berrier      |
| 88  | Dale Jarrett    | Slugger Labbe     |
| 48  | Jimmie Johnson  | Chad Knaus        |
| 9   | Kasey Kahne     | Kenny Francis     |
| 17  | Matt Kenseth    | Robbie Reiser     |
| 32  | Travis Kvapil   | Gary Putnam       |
| 43  | Bobby Labonte   | Todd Parrott      |
| 14  | Sterling Marlin | Doug Randolph     |
| 6   | Mark Martin     | Pat Tryson        |
| 19  | Jeremy Mayfield | Chris Andrews     |
| 26  | Jamie McMurray  | Bob Osborne       |
| 42  | Casey Mears     | Donnie Wingo      |
| 1   | Joe Nemechek    | Ryan Pemberton    |
| 12  | Ryan Newman     | Matt Borland      |
| 45  | Kyle Petty      | Paul Andrews      |
| 96  | Tony Raines     | Philippe Lopez    |
| 10  | Scott Riggs     | Rodney Childers   |
| 38  | Elliott Sadler  | Tommy Baldwin Jr. |
| 21  | Ken Schrader    | David Hyder       |
| 41  | Reed Sorenson   | Jimmy Elledge     |
| 20  | Tony Stewart    | Greg Zipadelli    |
| 40  | David Stremme   | Steven Lane       |
| 1   | Martin Truex    | Kevin Manion      |
| 25  | Brian Vickers   | Lance McGrew      |
| 55  | Michael Waltrip | Joe Shear Jr.     |
| 4   | Scott Wimmer    | Chris Carrier     |
| 18  | J.J. Yeley      | Steve Addington   |
+-----+-----------------+-------------------+

Formula Manipulation

Arbittray Formula Element Replacement

  • Input Variables * Formula - formula object * VarName - character vector, containing a variable name
  • Output Variables

VarName <- "a" Formula <- y ~ .(AltVar)

NewFormula <- eval(substitute(eval(bquote(Formula, where=list(AltVar=as.name(VarName)))), env=list(Formula=Formula)))

Misc vector operations

Edge triggering

2D Edge triggering

Trigger on rising edge of Condition
  • Input variables
    • Condition - logical vector
  • Output Variable
    • Trigger - logical vector

Trigger <- diff(c(0, Condition)) > 0
Trigger on falling edge of Condition
  • Input variables
    • Condition - logical vector
  • Output Variable
    • Trigger - logical vector
Trigger <- diff(c(0, Condition)) < 0
Trigger on any edge of Condition
  • Input variables
    • Condition - logical vector
  • Output Variable
    • Trigger - logical vector
Trigger <- diff(c(0, Condition)) = 0

3D Edge triggering

Trigger on any edge of Condtion
  • Input variables
    • Condition - logical matrix
  • Output Variable
    • Trigger - logical matrix
index <- seq_along(Condition) ## Wrap in FALSES bCond <- cbind(FALSE,rbind(FALSE, Condition, FALSE), FALSE)

## matrix dimentions condDim <- dim(Condition) nRowBCond <- nrow(bCond)

## function to convert matrix row col pos to vector index matrixIndex <- function(x,y, nRow) y*nRow + x

index <- outer(seq_len(condDim[1]), seq_len(condDim[2]), FUN=matrixIndex, nRow=nRowBCond) + 1

## Row/column index adjustments RP1adj <- 1 RM1adj <- -1 CP1adj <- nRowBCond CM1adj <- -nRowBCond

Trigger <- ifelse(Condition, 8, 0) - (bCond[index + RM1adj + CM1adj] + bCond[index + CM1adj] + bCond[index + RP1adj + CM1adj] + bCond[index + RM1adj] + bCond[index + RP1adj] + bCond[index + RM1adj + CP1adj] + bCond[index + CP1adj] + bCond[index + RP1adj + CP1adj]) = 0

cumsum of Values with reset at Condition

Does what the title says it does

  • Input variables
    • Values - vector
    • Condition - logical vector
  • Output Variable
    • Resp - vector

edgeNum <- cumsum(Condition) + 1 csValues <- cumsum(Values)

resetVals <- c(0, csValues[Condition] - 1) Resp <- csValues - resetVals[edgeNum]

Example set.seed(1234) Values <- rep(TRUE, 50) Condition <- sample(c(TRUE,FALSE), 50, replace=TRUE, prob=c(1, 10))

edgeNum <- cumsum(Condition) + 1 csValues <- cumsum(Values)

resetVals <- c(0, csValues[Condition] - 1) Resp <- csValues - resetVals[edgeNum]

data.frame(Values,Condition,edgeNum,csValues,Resp)

Output
> set.seed(1234)                                                                
> Values <- rep(TRUE, 50)                                                       
> Condition <- sample(c(TRUE,FALSE), 50, replace=TRUE, prob=c(1, 10))
>
> edgeNum <- cumsum(Condition) + 1
> csValues <- cumsum(Values)
>
> resetVals <- c(0, csValues[Condition] - 1)
> Resp <- csValues - resetVals[edgeNum]
>
> data.frame(Values,Condition,edgeNum,csValues,Resp)
   Values Condition edgeNum csValues Resp
1    TRUE     FALSE       1        1    1
2    TRUE     FALSE       1        2    2
3    TRUE     FALSE       1        3    3
4    TRUE     FALSE       1        4    4
5    TRUE     FALSE       1        5    5
6    TRUE     FALSE       1        6    6
7    TRUE     FALSE       1        7    7
8    TRUE     FALSE       1        8    8
9    TRUE     FALSE       1        9    9
10   TRUE     FALSE       1       10   10
11   TRUE     FALSE       1       11   11
12   TRUE     FALSE       1       12   12
13   TRUE     FALSE       1       13   13
14   TRUE      TRUE       2       14    1
15   TRUE     FALSE       2       15    2
16   TRUE     FALSE       2       16    3
17   TRUE     FALSE       2       17    4
18   TRUE     FALSE       2       18    5
19   TRUE     FALSE       2       19    6
20   TRUE     FALSE       2       20    7
21   TRUE     FALSE       2       21    8
22   TRUE     FALSE       2       22    9
23   TRUE     FALSE       2       23   10
24   TRUE     FALSE       2       24   11
25   TRUE     FALSE       2       25   12
26   TRUE     FALSE       2       26   13
27   TRUE     FALSE       2       27   14
28   TRUE      TRUE       3       28    1
29   TRUE     FALSE       3       29    2
30   TRUE     FALSE       3       30    3
31   TRUE     FALSE       3       31    4
32   TRUE     FALSE       3       32    5
33   TRUE     FALSE       3       33    6
34   TRUE     FALSE       3       34    7
35   TRUE     FALSE       3       35    8
36   TRUE     FALSE       3       36    9
37   TRUE     FALSE       3       37   10
38   TRUE     FALSE       3       38   11
39   TRUE      TRUE       4       39    1
40   TRUE     FALSE       4       40    2
41   TRUE     FALSE       4       41    3
42   TRUE     FALSE       4       42    4
43   TRUE     FALSE       4       43    5
44   TRUE     FALSE       4       44    6
45   TRUE     FALSE       4       45    7
46   TRUE     FALSE       4       46    8
47   TRUE     FALSE       4       47    9
48   TRUE     FALSE       4       48   10
49   TRUE     FALSE       4       49   11
50   TRUE     FALSE       4       50   12

Trigger on each of N or more sequential TRUE Conditions

Does what the title says it does

  • Input variables
    • N - integer
    • Condition - logical vector
  • Output Variable
    • Trigger - logical vector

edge <- diff(c(0, Condition)) == 1L

edgeIndex <- cumsum(ifelse(Condition, edge, FALSE)) waveDuration <- table(edgeIndex)

waveLength <- waveDuration[edgeIndex]

Trigger <- waveLength >= N

Fast long to wide reshaping

provides reshape functionality without using loops

  • Input variables
    • Id - sorted vector of record ids
    • Value - vector sorted by Id of repeated values
  • Output variables
    • ResultVals - matrix of values with single record per id.
    • ResultIds - vector of ids

condition <- duplicated(Id) csValues <- seq_along(Id)

repCount <- csValues - csValues[condition][as.numeric(Id)]

ResultIds <- unique(Id) numRow <- length(ResultIds) numCol <- max(repCount) + 1L

ResultVals <- matrix(Value[0], nrow=numRow, ncol=numCol, dimnames=list(c(ResultIds), paste('Value', seq_len(numCol), sep='')))

ResultVals[repCount*numRow + as.numeric(Id)] <- Value

Example set.seed(1234L)

Id <- factor(sample(LETTERS, 50, replace=TRUE)) Value <- sample(letters, 50, replace=TRUE)

Value <- Value[order(Id)] Id <- sort(Id)

condition <- duplicated(Id) csValues <- seq_along(Id)

repCount <- csValues - csValues[condition][as.numeric(Id)]

ResultIds <- unique(Id) numRow <- length(ResultIds) numCol <- max(repCount) + 1L

ResultVals <- matrix(Value[0], nrow=numRow, ncol=numCol, dimnames=list(c(ResultIds), paste('Value', seq_len(numCol), sep='')))

ResultVals[repCount*numRow + as.numeric(Id)] <- Value

ResultVals

Unlabelling objects

Original source: http://scs.math.yorku.ca/index.php/R/Importing_data_from_SPSS

unlabel <- function(x,...) UseMethod('unlabel')

unlabel.default <- function(x,...) { if (inherits(x, 'labelled')) return(x) cl <- class(x) cl <- cl[cl = 'labelled'] attr(x, 'label') <- NULL if (length(cl) == 0) return(unclass(x)) class(x) <- cl x }

unlabel.data.frame <- function(x,...) { for (i in seq_along(x)) xi <- unlabel(xi) x }

This topic: Main > RCookBook
Topic revision: 22 Jun 2012, JeremyStephens
 
This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback