You are here: Vanderbilt Biostatistics Wiki>Main Web>RCookBook (revision 9)EditAttach

# R Cook Book

This page is dedicated to the little R functions that we write to do various things. Please feel free to add to the list or make improvements to current functions. If you do add (or change) something, please put your code into a (existing or new) category and include a small description that explains what your code does (for extra points include an example or two as well). Thanks!

## String Manipulation

### Splitting into segments

Here's a function to split a string into segments of length `n`. The default value of `n` is 80, since most terminals have rows that are 80 characters long.

```strSegs <- function(x, n=80) {
retval <- c()
while(nchar(x) > n) {
segment <- substr(x, 1, n)
x <- substr(x, n+1, nchar(x))

retval <- c(retval, segment)
}
retval <- c(retval, x)
}
```
Example:

```myString <- "This is a string that I want to split into segments."
segments <- strSegs(myString, 10)
cat(segments, sep="\n")
```

## Data Frame Manipulation

### Multisort

Here's a function that sorts each column of a data frame in decreasing or increasing order (the normal sort method only does either decreasing or increasing for all columns). I took the idea for this function from Cole Beck. You give the function your data frame and a series of name/value pairs, where each name is the name of a column, and each value is either `TRUE` or `FALSE` (`TRUE` for a decreasing sort and `FALSE` for an increasing sort.) The sort precedence goes from left to right. You do not have to specify all columns, but you must specify at least one.

```multisort <- function(x, ...) {
# given the data frame x, and a list of name/value pairs, sort
pairList <- list(...)

# make sure names from ... are valid
tmp <- intersect(names(pairList), names(x))
if (length(tmp) == 0 || tmp != names(pairList)) {
stop("Invalid column names.")
}

retval <- x
for (i in length(pairList):1) {
name <- names(pairList[i])
if (pairList[[i]] != TRUE && pairList[[i]] != FALSE) {
# make sure values are TRUE or FALSE only
stop(paste("Invalid value (", pairList[[i]], ") for ", name,
" (should be TRUE or FALSE).", sep=""))
}

retval <- retval[order(retval[,name], decreasing=pairList[[i]]),]
}
return(retval)
}
```

Example:

```x <- data.frame(uno=sample(c("a","b","c"), 10, replace=TRUE),
dos=sample(c("x","y","z"), 10, replace=TRUE),
tres=sample(1:10, 10))
multisort(x, uno=TRUE, dos=FALSE)
multisort(x, dos=FALSE, uno=TRUE, tres=FALSE)
```

### order.data.frame

The multisort function is inefficient because it has to rearrange the data several times. This function produces the order necessary to sort a data.frame or matrix based on a set of columns, how to handle NAs for each, and whether the column(s) should be decreasing or increasing. The sort precedence goes from left to right.

```order.data.frame <- function(x, ..., na.last=TRUE, decreasing=FALSE) {
len <- ncol(x)
len2 <- length(na.last)
len3 <- length(decreasing)
if (len < len2)
na.last <- na.last[1 : len]
if (len > len2)
na.last[(len2 + 1) : len] <- TRUE
if (len < len3)
decreasing <- decreasing[1 : len]
if (len > len3)
decreasing[(len3 + 1) : len] <- FALSE
ox <- 1:nrow(x)
for (i in len:1) {
ox <- order(x[,i], order(ox) * (1 - 2 * decreasing[i]),
na.last=na.last[i], decreasing=decreasing[i])
}
return(ox)
}
```

Example:

```x <- data.frame(uno=sample(c("a","b","c"), 10, replace=TRUE),
dos=sample(c("x","y","z"), 10, replace=TRUE),
tres=sample(1:10, 10))
x[order.data.frame(x[c("uno", "dos")],
decreasing=c(TRUE, FALSE)), ]
x[order.data.frame(x[c("dos", "uno", "tres")],
decreasing=c(FALSE, TRUE, FALSE)), ]
```

### Printing (small) data.frames

I wanted print.data.frame to look more like the results of a query in MySql. So I overwrote it with this:

```setMethod("print", signature(x="data.frame"),
function(x, ...) {
size=c()
#         for each column
for(j in 1:dim(x)[2]) {
len=nchar(names(x)[j])
for(i in 1:dim(x)[1]) {
if(nchar(as.character(x[i,j])) > len) {
len<-nchar(as.character(x[i,j]))
}
}
size<-append(size, len+2)
}
#         size is the number of '-' characters to display
newrow=""
for(s in size) {
newrow<-paste(c(newrow,"+",rep("-",s)),collapse="")
}
newrow<-paste(c(newrow,"+\n"),collapse="")
#         newrow is an empty row to print out
for(j in 1:dim(x)[2]) {
diff=size[j]-nchar(names(x)[j])-1
}
cat(newrow)
cat(newrow)
for(i in 1:dim(x)[1]) {
row=""
for(j in 1:dim(x)[2]) {
diff=size[j]-nchar(as.character(x[i,j]))-1
row<-paste(c(row,"| ",as.character(x[i,j]),rep(" ",diff)),collapse="")
}
row<-paste(c(row,"|\n"),collapse="")
cat(row)
}
cat(newrow)
}
)
```
This may not look very good for a data.frame that has many columns. (someone want to extend that functionality?)

Here's an example:

```data.frame(Car=c(16, 22, 07, 31, 2, 5, 8, 99, 7, 24, 66, 11, 29,
88, 48, 9, 17, 32, 43, 14, 6, 19, 26, 42, 01, 12, 45, 96, 10,
38, 21, 41, 20, 40, 1, 25, 55, 4, 18),
Driver=c("Greg Biffle", "Dave Blaney", "Clint Bowyer",
"Jeff Burton", "Kurt Busch", "Kyle Busch", "Dale Earnhardt",
"Carl Edwards", "Robby Gordon", "Jeff Gordon", "Jeff Green",
"Denny Hamlin", "Kevin Harvick", "Dale Jarrett",
"Jimmie Johnson", "Kasey Kahne", "Matt Kenseth",
"Travis Kvapil", "Bobby Labonte", "Sterling Marlin",
"Mark Martin", "Jeremy Mayfield", "Jamie McMurray",
"Casey Mears", "Joe Nemechek", "Ryan Newman", "Kyle Petty",
"Reed Sorenson", "Tony Stewart", "David Stremme",
"Martin Truex", "Brian Vickers", "Michael Waltrip",
"Scott Wimmer", "J.J. Yeley"),
Crew_Chief=c("Doug Richert", "Kevin Hamlin", "Gil Martin",
"Scott Miller", "Roy McCauley", "Alan Gustafson",
"Tony Eury Jr.", "Wally Brown", "Greg Erwin", "Steve Letarte",
"Bootie Barker", "Mike Ford", "Todd Berrier", "Slugger Labbe",
"Chad Knaus", "Kenny Francis", "Robbie Reiser", "Gary Putnam",
"Todd Parrott", "Doug Randolph", "Pat Tryson", "Chris Andrews",
"Bob Osborne", "Donnie Wingo", "Ryan Pemberton", "Matt Borland",
"Paul Andrews", "Philippe Lopez", "Rodney Childers",
"Tommy Baldwin Jr.", "David Hyder", "Jimmy Elledge",
"Greg Zipadelli", "Steven Lane", "Kevin Manion", "Lance McGrew",
"Joe Shear Jr.", "Chris Carrier", "Steve Addington"))
```

And here's the output:
```+-----+-----------------+-------------------+
| Car | Driver          | Crew_Chief        |
+-----+-----------------+-------------------+
| 16  | Greg Biffle     | Doug Richert      |
| 22  | Dave Blaney     | Kevin Hamlin      |
| 7   | Clint Bowyer    | Gil Martin        |
| 31  | Jeff Burton     | Scott Miller      |
| 2   | Kurt Busch      | Roy McCauley      |
| 5   | Kyle Busch      | Alan Gustafson    |
| 8   | Dale Earnhardt  | Tony Eury Jr.     |
| 99  | Carl Edwards    | Wally Brown       |
| 7   | Robby Gordon    | Greg Erwin        |
| 24  | Jeff Gordon     | Steve Letarte     |
| 66  | Jeff Green      | Bootie Barker     |
| 11  | Denny Hamlin    | Mike Ford         |
| 29  | Kevin Harvick   | Todd Berrier      |
| 88  | Dale Jarrett    | Slugger Labbe     |
| 48  | Jimmie Johnson  | Chad Knaus        |
| 9   | Kasey Kahne     | Kenny Francis     |
| 17  | Matt Kenseth    | Robbie Reiser     |
| 32  | Travis Kvapil   | Gary Putnam       |
| 43  | Bobby Labonte   | Todd Parrott      |
| 14  | Sterling Marlin | Doug Randolph     |
| 6   | Mark Martin     | Pat Tryson        |
| 19  | Jeremy Mayfield | Chris Andrews     |
| 26  | Jamie McMurray  | Bob Osborne       |
| 42  | Casey Mears     | Donnie Wingo      |
| 1   | Joe Nemechek    | Ryan Pemberton    |
| 12  | Ryan Newman     | Matt Borland      |
| 45  | Kyle Petty      | Paul Andrews      |
| 96  | Tony Raines     | Philippe Lopez    |
| 10  | Scott Riggs     | Rodney Childers   |
| 38  | Elliott Sadler  | Tommy Baldwin Jr. |
| 21  | Ken Schrader    | David Hyder       |
| 41  | Reed Sorenson   | Jimmy Elledge     |
| 20  | Tony Stewart    | Greg Zipadelli    |
| 40  | David Stremme   | Steven Lane       |
| 1   | Martin Truex    | Kevin Manion      |
| 25  | Brian Vickers   | Lance McGrew      |
| 55  | Michael Waltrip | Joe Shear Jr.     |
| 4   | Scott Wimmer    | Chris Carrier     |
| 18  | J.J. Yeley      | Steve Addington   |
+-----+-----------------+-------------------+
```

## Formula Manipulation

### Arbittray Formula Element Replacement

• Input Variables * Formula - formula object * VarName - character vector, containing a variable name
• Output Variables

```VarName <- "a"
Formula <- y ~ .(AltVar)

NewFormula <- eval(substitute(eval(bquote(Formula, where=list(AltVar=as.name(VarName)))), env=list(Formula=Formula)))
```

## Misc vector operations

### Edge triggering

#### 2D Edge triggering

##### Trigger on rising edge of Condition
• Input variables
• Condition - logical vector
• Output Variable
• Trigger - logical vector

```Trigger <- diff(c(0, Condition)) > 0
```
##### Trigger on falling edge of Condition
• Input variables
• Condition - logical vector
• Output Variable
• Trigger - logical vector

```Trigger <- diff(c(0, Condition)) < 0
```
##### Trigger on any edge of Condition
• Input variables
• Condition - logical vector
• Output Variable
• Trigger - logical vector

```Trigger <- diff(c(0, Condition)) != 0
```

#### 3D Edge triggering

##### Trigger on any edge of Condtion
• Input variables
• Condition - logical matrix
• Output Variable
• Trigger - logical matrix

```index <- seq_along(Condition)
## Wrap in FALSES
bCond <- cbind(FALSE,rbind(FALSE, Condition, FALSE), FALSE)

## matrix dimentions
condDim <- dim(Condition)
nRowBCond <- nrow(bCond)

## function to convert matrix row col pos to vector index
matrixIndex <- function(x,y, nRow) y*nRow + x

index <- outer(seq_len(condDim[1]), seq_len(condDim[2]), FUN=matrixIndex, nRow=nRowBCond) + 1

Trigger <- ifelse(Condition, 8, 0) -
```

### cumsum of Values with reset at Condition

Does what the title says it does

• Input variables
• Values - vector
• Condition - logical vector
• Output Variable
• Resp - vector

```edgeNum <- cumsum(Condition) + 1
csValues <- cumsum(Values)

resetVals <- c(0, csValues[Condition] - 1)
Resp <- csValues - resetVals[edgeNum]
```

Example

```set.seed(1234)
Values <- rep(TRUE, 50)
Condition <- sample(c(TRUE,FALSE), 50, replace=TRUE, prob=c(1, 10))

edgeNum <- cumsum(Condition) + 1
csValues <- cumsum(Values)

resetVals <- c(0, csValues[Condition] - 1)
Resp <- csValues - resetVals[edgeNum]

data.frame(Values,Condition,edgeNum,csValues,Resp)
```

Output
```> set.seed(1234)
> Values <- rep(TRUE, 50)
> Condition <- sample(c(TRUE,FALSE), 50, replace=TRUE, prob=c(1, 10))
>
> edgeNum <- cumsum(Condition) + 1
> csValues <- cumsum(Values)
>
> resetVals <- c(0, csValues[Condition] - 1)
> Resp <- csValues - resetVals[edgeNum]
>
> data.frame(Values,Condition,edgeNum,csValues,Resp)
Values Condition edgeNum csValues Resp
1    TRUE     FALSE       1        1    1
2    TRUE     FALSE       1        2    2
3    TRUE     FALSE       1        3    3
4    TRUE     FALSE       1        4    4
5    TRUE     FALSE       1        5    5
6    TRUE     FALSE       1        6    6
7    TRUE     FALSE       1        7    7
8    TRUE     FALSE       1        8    8
9    TRUE     FALSE       1        9    9
10   TRUE     FALSE       1       10   10
11   TRUE     FALSE       1       11   11
12   TRUE     FALSE       1       12   12
13   TRUE     FALSE       1       13   13
14   TRUE      TRUE       2       14    1
15   TRUE     FALSE       2       15    2
16   TRUE     FALSE       2       16    3
17   TRUE     FALSE       2       17    4
18   TRUE     FALSE       2       18    5
19   TRUE     FALSE       2       19    6
20   TRUE     FALSE       2       20    7
21   TRUE     FALSE       2       21    8
22   TRUE     FALSE       2       22    9
23   TRUE     FALSE       2       23   10
24   TRUE     FALSE       2       24   11
25   TRUE     FALSE       2       25   12
26   TRUE     FALSE       2       26   13
27   TRUE     FALSE       2       27   14
28   TRUE      TRUE       3       28    1
29   TRUE     FALSE       3       29    2
30   TRUE     FALSE       3       30    3
31   TRUE     FALSE       3       31    4
32   TRUE     FALSE       3       32    5
33   TRUE     FALSE       3       33    6
34   TRUE     FALSE       3       34    7
35   TRUE     FALSE       3       35    8
36   TRUE     FALSE       3       36    9
37   TRUE     FALSE       3       37   10
38   TRUE     FALSE       3       38   11
39   TRUE      TRUE       4       39    1
40   TRUE     FALSE       4       40    2
41   TRUE     FALSE       4       41    3
42   TRUE     FALSE       4       42    4
43   TRUE     FALSE       4       43    5
44   TRUE     FALSE       4       44    6
45   TRUE     FALSE       4       45    7
46   TRUE     FALSE       4       46    8
47   TRUE     FALSE       4       47    9
48   TRUE     FALSE       4       48   10
49   TRUE     FALSE       4       49   11
50   TRUE     FALSE       4       50   12
```

### Trigger on each of N or more sequential TRUE Conditions

Does what the title says it does

• Input variables
• N - integer
• Condition - logical vector
• Output Variable
• Trigger - logical vector

```edge <- diff(c(0, Condition)) == 1L

edgeIndex <- cumsum(ifelse(Condition, edge, FALSE))
waveDuration <- table(edgeIndex)

waveLength <- waveDuration[edgeIndex]

Trigger <- waveLength >= N
```

### Fast long to wide reshaping

provides reshape functionality without using loops

• Input variables
• Id - sorted vector of record ids
• Value - vector sorted by Id of repeated values
• Output variables
• ResultVals - matrix of values with single record per id.
• ResultIds - vector of ids

```condition <- !duplicated(Id)
csValues <- seq_along(Id)

repCount <- csValues - csValues[condition][as.numeric(Id)]

ResultIds <- unique(Id)
numRow <- length(ResultIds)
numCol <- max(repCount) + 1L

ResultVals <- matrix(Value[0], nrow=numRow, ncol=numCol, dimnames=list(c(ResultIds), paste('Value', seq_len(numCol), sep='')))

ResultVals[repCount*numRow + as.numeric(Id)] <- Value
```

Example

```set.seed(1234L)

Id <- factor(sample(LETTERS, 50, replace=TRUE))
Value <- sample(letters, 50, replace=TRUE)

Value <- Value[order(Id)]
Id <- sort(Id)

condition <- !duplicated(Id)
csValues <- seq_along(Id)

repCount <- csValues - csValues[condition][as.numeric(Id)]

ResultIds <- unique(Id)
numRow <- length(ResultIds)
numCol <- max(repCount) + 1L

ResultVals <- matrix(Value[0], nrow=numRow, ncol=numCol, dimnames=list(c(ResultIds), paste('Value', seq_len(numCol), sep='')))

ResultVals[repCount*numRow + as.numeric(Id)] <- Value

ResultVals
```
Edit | Attach | Print version |  | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r9 - 07 May 2010, CharlesDupont

• Biostatistics Webs

Copyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback