BaseballDataManip (27 Nov 2006, TheresaScott)
---+ Cole Beck's Baseball data set problem

The goal of this problem was to reshape the original data set (=bos_pit.csv=), which has one row per game, into the desired data set, which has one row per Boston pitcher with that pitcher's starts, wins as starting pitcher, winning percentage, wins, and saves.

   * [[%ATTACHURL%/bos_pit.csv][bos_pit.csv]]

The following is a possible solution. It uses =upData= and =Cs= from the Hmisc package.

<highlight>
library(Hmisc)  # provides upData() and Cs()

x <- read.table("bos_pit.csv", header = TRUE, sep = ",", as.is = TRUE,
    row.names = 1)
# Each row represents a game between BOSTON and
# some other team

# (1) Add some columns to x that will be needed when
# we "reshape" x from wide to long
x <- upData(x,
    # The team that won the game
    won = ifelse(v_score > h_score, v_team, h_team),
    # Boston's starting pitcher (regardless of whether BOSTON was
    # the home or visiting team)
    bstp = ifelse(v_team == "BOS", v_sp_name,
        ifelse(h_team == "BOS", h_sp_name, NA)),
    # Boston's winning pitcher (NA if BOSTON lost)
    bwp = ifelse(won == "BOS", winning_name, NA),
    # Boston's saving pitcher (NA if BOSTON lost or nobody earned a save)
    bsvp = ifelse(won == "BOS" & saving_name != "(none)", saving_name, NA),
    # Boston's starting pitcher, if he was also the winning pitcher
    bstpwon = ifelse(bstp == bwp, bstp, NA)
)

# (2) "Reshape" x from wide to long: stack the four pitcher columns
# and label each block of rows with its outcome
longx <- with(x, data.frame(pitcher = c(bstp, bwp, bsvp, bstpwon)))
longx$outcome <- factor(c(rep("started", nrow(x)), rep("won", nrow(x)),
    rep("saved", nrow(x)), rep("won_as_stp", nrow(x))))

# (3) Remove any missing pitcher values
longx <- subset(longx, !is.na(pitcher))

# (4) Calculate the number of starts, saves, wins, and wins as
# starting pitcher for each pitcher
newx <- with(longx, aggregate(x = outcome, by = list(pitcher, outcome),
    FUN = length))

# (5) Reshape newx so each level of outcome is its own column
wide.newx <- reshape(newx, direction = "wide", v.names = "x",
    timevar = "Group.2", idvar = "Group.1")
# Rename the columns; this relies on the new columns coming out in the
# order of the factor levels of outcome (saved, started, won, won_as_stp)
names(wide.newx) <- Cs(pitcher, saves, starts, wins, wins_as_stp)

# (6) Make some changes to wide.newx
wide.newx <- upData(wide.newx,
    # Replace all missing values with 0
    starts = ifelse(is.na(starts), 0, starts),
    wins_as_stp = ifelse(is.na(wins_as_stp), 0, wins_as_stp),
    wins = ifelse(is.na(wins), 0, wins),
    saves = ifelse(is.na(saves), 0, saves),
    # Add a "win_per" column = wins_as_stp/starts
    # --> set win_per to 0 where the ratio would be NaN (no starts)
    win_per = ifelse(wins_as_stp != 0 & starts != 0, wins_as_stp/starts, 0))
# Change the order of the columns
wide.newx <- wide.newx[, Cs(pitcher, starts, wins_as_stp, win_per, wins, saves)]

# (7) Sort wide.newx by pitcher's last name but keep
# pitcher column as "firstname lastname"
# (the last name is taken to be the second space-separated word)
last_name_order <- order(sapply(as.character(wide.newx$pitcher),
    function(i) unlist(strsplit(i, " "))[2]))
finalx <- wide.newx[last_name_order, ]
</highlight>
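The wide-to-long-to-wide pattern in steps (2) through (5) can be tried on a small made-up data set, without the CSV file or Hmisc. The sketch below is not part of the original solution. The =games= data frame and the pitcher names are invented for illustration. It also swaps =aggregate()= and =reshape()= for base R's =table()=, which counts each pitcher-by-outcome combination in one step and reports 0 for combinations that never occur.

```r
# Hypothetical toy data (not from bos_pit.csv): one row per game,
# already reduced to the Boston-specific columns built in step (1).
games <- data.frame(
    bstp = c("Al Able", "Bo Baker", "Al Able", "Bo Baker"),
    bwp  = c("Al Able", NA,         "Cy Cole", "Bo Baker"),
    bsvp = c("Cy Cole", NA,         NA,        "Cy Cole"),
    stringsAsFactors = FALSE
)
# Starting pitcher, if he was also the winning pitcher
games$bstpwon <- ifelse(games$bstp == games$bwp, games$bstp, NA)

# Stack the four pitcher columns into one long column and label each
# block of rows with the outcome it represents
long <- data.frame(
    pitcher = with(games, c(bstp, bwp, bsvp, bstpwon)),
    outcome = rep(c("starts", "wins", "saves", "wins_as_stp"),
        each = nrow(games)),
    stringsAsFactors = FALSE
)
long <- subset(long, !is.na(pitcher))

# Cross-tabulate: one row per pitcher, one column per outcome;
# combinations that never occur are counted as 0
counts <- as.data.frame.matrix(table(long$pitcher, long$outcome))

# Winning percentage as a starter; 0 for pitchers with no starts
counts$win_per <- with(counts, ifelse(starts != 0, wins_as_stp / starts, 0))
print(counts)
```

With this toy data, each of the two starters ends up with two starts and one win as starting pitcher, and the reliever has two saves and no starts. Because =table()= already fills in zeros, the missing-value replacement from step (6) is not needed in this variant.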
Topic revision: r2 - 27 Nov 2006, TheresaScott