
---+ Cole Beck's Baseball data set problem

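The goal of this problem was to reshape the original data set (=bos_pit.csv=), which has one row per Boston game, into a summary with one row per Boston pitcher giving his starts, wins as the starting pitcher, winning percentage as a starter, total wins, and saves.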
   * [[%ATTACHURL%/bos_pit.csv][bos_pit.csv]]

The following is one possible solution. It relies on =upData= and =Cs= from the Hmisc package.
<highlight>
# upData() and Cs() come from the Hmisc package
library(Hmisc)

x <- read.table("bos_pit.csv", header = TRUE, sep = ",",
   as.is = TRUE, row.names = 1)
# Each row represents a game between BOSTON and 
#    some other team

# (1) Add some columns to x that will be needed when
#    we "reshape" x from wide to long
x <- upData(x,
   # Team that won the game (compare with "BOS" to see whether
   #      BOSTON won)
   won = ifelse(v_score > h_score, v_team, h_team),
   # Boston's starting pitcher (regardless of whether BOSTON was
   #      the home or visiting team)
   bstp = ifelse(v_team == "BOS", v_sp_name, 
      ifelse(h_team == "BOS", h_sp_name, NA)),
   # Boston's winning pitcher (NA if BOSTON lost the game)
   bwp = ifelse(won == "BOS", winning_name, NA),
   # Boston's saving pitcher (NA if BOSTON lost the game or
   #      no save was recorded)
   bsvp = ifelse(won == "BOS" & saving_name != "(none)",
      saving_name, NA),
   # Boston's starting pitcher if he was also the winning
   #      pitcher (NA otherwise)
   bstpwon = ifelse(bstp == bwp, bstp, NA)
)

# (2) "Reshape" x from wide to long
longx <- with(x, data.frame(pitcher = c(bstp, bwp, bsvp, bstpwon)))
longx$outcome <- factor(c(rep("started", nrow(x)),
   rep("won", nrow(x)), rep("saved", nrow(x)),
   rep("won_as_stp", nrow(x))))

# (3) Remove any missing pitcher values
longx <- subset(longx, !is.na(pitcher))

# (4) Calculate the number of starts, saves, wins, and wins as
#    starting pitcher for each pitcher
newx <- with(longx, aggregate(x = outcome, 
   by = list(pitcher, outcome), FUN = length))

# (5) Reshape newx so each level of outcome is its own column
wide.newx <- reshape(newx, direction = "wide",
   v.names = "x", timevar = "Group.2", idvar = "Group.1")
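# Rename the columns (the count columns come out in the
#    alphabetical order of the outcome levels:
#    saved, started, won, won_as_stp)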
names(wide.newx) <- Cs(pitcher, saves, starts, wins, wins_as_stp)

# (6) Make some changes to wide.newx
wide.newx <- upData(wide.newx,
   # Replace all missing values with 0
   starts = ifelse(is.na(starts), 0, starts),
   wins_as_stp = ifelse(is.na(wins_as_stp), 0, wins_as_stp),
   wins = ifelse(is.na(wins), 0, wins),
   saves = ifelse(is.na(saves), 0, saves),
   # Add a "win_per" column = wins_as_stp/starts
   #   --> replace any win_per values of NaN with 0
   win_per = ifelse(wins_as_stp != 0 & starts != 0,
      wins_as_stp/starts, 0))
# Change the order of the columns
wide.newx <- wide.newx[, Cs(pitcher, starts, wins_as_stp,
   win_per, wins, saves)]

# (7) Sort wide.newx by pitcher's last name (taken here as the
#    second space-separated word of the name) but keep the
#    pitcher column as "firstname lastname"
last_name_order <- order(with(wide.newx, 
   mapply(FUN = function(i) unlist(strsplit(i, " "))[2],
   as.character(pitcher))))
finalx <- wide.newx[last_name_order,]
</highlight>
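
Steps (2) through (6) above stack the pitcher columns, count each pitcher/outcome pair, and spread the counts back out into columns. The sketch below shows the same idea on a small made-up data set so that it can be run without =bos_pit.csv= or Hmisc. It is not the solution above: it swaps =aggregate= and =reshape= for a base R cross-tabulation with =table=, and the pitcher names and the =toy=, =long=, and =counts= objects are invented for illustration.

```r
# Toy data: hypothetical pitcher names, NOT taken from bos_pit.csv.
# One row per game, using the same helper columns as step (1).
toy <- data.frame(
   bstp = c("Al Able", "Bo Baker", "Al Able"),  # Boston's starter
   bwp  = c("Al Able", NA, "Cy Cole"),          # winning pitcher (NA = loss)
   bsvp = c(NA, NA, "Dee Dunn"),                # saving pitcher, if any
   stringsAsFactors = FALSE
)
# Starter's name if he was also the winning pitcher, NA otherwise
toy$bstpwon <- ifelse(toy$bstp == toy$bwp, toy$bstp, NA)

# Stack the four columns into long format and label each block
long <- data.frame(
   pitcher = unlist(toy[c("bstp", "bwp", "bsvp", "bstpwon")],
      use.names = FALSE),
   outcome = rep(c("starts", "wins", "saves", "wins_as_stp"),
      each = nrow(toy)),
   stringsAsFactors = FALSE
)
long <- subset(long, !is.na(pitcher))

# Cross-tabulate pitcher by outcome; combinations that never
# occur are counted as 0, so no NA replacement is needed
counts <- as.data.frame.matrix(table(long$pitcher, long$outcome))

# Winning percentage as a starter, set to 0 when there are no starts
counts$win_per <- ifelse(counts$starts > 0,
   counts$wins_as_stp / counts$starts, 0)
counts <- counts[, c("starts", "wins_as_stp", "win_per", "wins", "saves")]
print(counts)
```

In this sketch the pitcher names end up as the row names of =counts= rather than in a =pitcher= column, so a further step would be needed to match the layout of =finalx=.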

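Step (7) sorts on the second word of each name, which works when every name has the form "firstname lastname". As a hedged alternative, the sketch below sorts on the last word instead, using =sub= and =order=. The names are hypothetical, and whether the last word is the right sort key for a multi-word surname is an assumption that should be checked against the actual names in the data.

```r
# Hypothetical names for illustration only
pitcher <- c("Cy Cole", "Al Able", "Dee Van Dunn", "Bo Baker")

# Drop everything up to and including the last space,
# leaving the final word of each name
last_name <- sub("^.* ", "", pitcher)

# order() gives the permutation that sorts by last_name;
# the names themselves are left as "firstname lastname"
sorted_pitcher <- pitcher[order(last_name)]
print(sorted_pitcher)
```

The same =order(last_name)= index could be used to reorder the rows of a data frame such as =wide.newx=.
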
Topic revision: r2 - 27 Nov 2006, TheresaScott
