You are here:
Vanderbilt Biostatistics Wiki
>
Main Web
>
Seminars
>
RClinic
>
OddDateManipulation
(15 Nov 2006,
TheresaScott
)
(raw view)
E
dit
A
ttach
---+ Converting dates of the form 'mddyy', 'mddyyyy', 'mmddyy', or 'mmddyyyy' to the form 'YYYY-MM-DD' ---++ Problem: A data file contains multiple date columns, whose format is somewhat consistent within each column, but not consistent across the columns. ---++ Data specifics: * The month, day, and year elements of each date are not seperated by any character (i.e. they are in the form 01012001, not 01/01/2001 or 01-01-2001). * The years are between 1980 and 2005. * The month element does not have a leading 0 if the month is < 10. * The data columns have one of four possible formats: 1 'mddyy' - 23180 2 'mmddyy' - 113180 3 'mddyyyy' - 2311980 4 'mmddyyyy' - 12311980 * Within the same column, there may be dates of the format 'mddyy' and 'mddyyyy' or of the format 'mmddyy' or 'mmddyyyy'. ---++ "Solution": First, let's load some needed add-on packages. As noted, you might have to first install the =chron= package from CRAN to then load it. <highlight> library(Hmisc) library(chron) </highlight> For illustration sake, I created a dummy data file with five columns: 1 A subject ID column 2 A date column of form 'mddyy' 3 A date column of form 'mmddyy' 4 A date column of form 'mddyyyy' 5 A date column of form 'mmddyyyy' <highlight> outFile <- file("tmp.txt", open="wt") cat("subject,date5,date6,date7,date8", "\n", file=outFile) cat("100001,12285,102280,5241980,11101982", "\n", file=outFile) cat("100002,22502,120503,6082001,12302004", "\n", file=outFile) close(outFile) </highlight> Let's read in the data file, and see what it looks like. It's important to note the class of the date column. <highlight> datta<-csv.get("tmp.txt") datta class(datta$date5) # "integer" class(datta$date6) # "integer" class(datta$date7) # "integer" class(datta$date8) # "integer" </highlight> I now attempt to convert each date format, one by one. The conversion process of any of the date formats follows the same basic steps: 1 Convert the integer date to a character (i.e. =01012001= is now interpreted as ="01012001"=, a character string). 2 Insert any additional needed month or year elements to the date, such as a leading zero, or the century to the year (="19"= or ="20"=). 3 Convert the character date to a recognized date-time format. The next sections assume that all the dates in one column have all the same format. __'mddyy' date format:__ <highlight> # 1. Convert the integer date to a character datta$date5<-as.character(datta$date5) class(datta$date5) # [1] "character" # 2. Insert a leading zero into the month element, and insert the century ("19" or "20") # into the year element datta$date5 # 3. Convert the character date to a recognized date-time format. datta$date5 <- as.character(strptime(datta$date5, "%m%d%Y")) datta$date5 datta$date5<-ifelse( # IF 'mddyy' (number of characters = 5) nchar(datta$date5)==5 & # & two digit year element > 79 as.numeric(substr(datta$date5, start=rep(4,dim(datta)[1]), stop=rep(5,dim(datta)[1]))) > 79, # THEN insert a 0 before the one digit month element paste("0", substr(datta$date5, start=rep(1,dim(datta)[1]), stop=rep(3,dim(datta)[1])), # AND insert 19 before the two digit year element "19", substr(datta$date5, start=rep(4,dim(datta)[1]), stop=rep(5,dim(datta)[1])), sep=""), # ELSE insert a 0 before the one digit month element paste("0", substr(datta$date5, start=rep(1,dim(datta)[1]), stop=rep(3,dim(datta)[1])), "20", # AND insert 20 before the two digit year element substr(datta$date5, start=rep(4,dim(datta)[1]), stop=rep(5,dim(datta)[1])),sep="")) datta$date5 # [1] "01221985" "02252002" datta$date5 <- as.Date(datta$date5) datta$date5 # [1] "1985-01-22" "2002-02-25" datta$date5 <- as.Date(datta$date5) datta$date5 # [1] "1985-01-22" "2002-02-25" class(datta$date5) # [1] "Date" </highlight> I repeat the conversion process in a similar fashion for the other date formats. __'mmddyy' date format:__ <highlight> datta$date6<-as.character(datta$date6) datta$date6<-ifelse( # IF 'mmddyy' (6 characters) nchar(datta$date6)==6 & # & two digit year element > 79 as.numeric(substr(datta$date6, start=rep(5,dim(datta)[1]), stop=rep(6,dim(datta)[1]))) > 79, # THEN insert a 19 before the two digit year element paste(substr(datta$date6, start=rep(1,dim(datta)[1]), stop=rep(4,dim(datta)[1])), "19", substr(datta$date6, start=rep(5,dim(datta)[1]), stop=rep(6,dim(datta)[1])), sep=""), # ELSE insert a 20 before the two digit year element paste(substr(datta$date6, start=rep(1,dim(datta)[1]), stop=rep(4,dim(datta)[1])), "20", substr(datta$date6, start=rep(5,dim(datta)[1]), stop=rep(6,dim(datta)[1])),sep="")) datta$date6 = as.character(strptime(datta$date6, "%m%d%Y")) datta$date6 = as.Date(datta$date6) datta$date6 class(datta$date6) </highlight> __'mddyyyy' date format:__ <highlight> datta$date7<-as.character(datta$date7) datta$date7<- ifelse( # IF 'mddyyyy' (7 characters) nchar(datta$date7)==7, # THEN insert a 0 before the one digit month element paste("0",datta$date7,sep=""), # ELSE leave as is datta$date7) datta$date7 = as.character(strptime(datta$date7, "%m%d%Y")) datta$date7 = as.Date(datta$date7) datta$date7 class(datta$date7) </highlight> __'mmddyyyy' date format:__ <highlight> datta$date8<-as.character(datta$date8) datta$date8 = as.character(strptime(datta$date8, "%m%d%Y")) datta$date8 = as.Date(datta$date8) datta$date8 class(datta$date8) </highlight> Now I wanted to try using the real data set associated with this problem. I also defined the conversion as a funcion that handles any of the date formats. The date set I used to test this function (=r1.RData=) is a subset of the actual data set the problem arose from. =r1.RData= has N = 8325 records, while the actual data set has over 1 million records. __NOTE:__ I tried writing this function several different ways, but this is the way it seemed to work best. <highlight> # Call datecon function on x, where x is the form df$col, # and df is just the name of the data frame datecon<-function(x, df) { # Define the numbers of rows of the dataframe len<-dim(df)[1] # Define x as a character vector in order to be able to manipulate # its month, day, and year elements x<-as.character(x) # Convert date using the steps outlined above x<-ifelse( # IF 'mddyy' (number of characters = 5) nchar(x)==5 & # AND two digit year element > 79 as.numeric(substr(x, start=rep(4, len), stop=rep(5, len))) > 79, # THEN insert a 0 before the 1 digit month element paste("0", substr(x, start=rep(1, len), stop=rep(3, len)), # And insert 19 for the century before the two # digit year element in order to create a 4 # digit year element "19", substr(x, start=rep(4, len), stop=rep(5, len)), sep=""), # ELSE # IF 'mddyy' (number of characters = 5) ifelse(nchar(x)==5 & #AND two digit year element <= 79 as.numeric(substr(x, start=rep(4, len), stop=rep(5, len))) <= 79, # THEN insert a 0 before the 1 digit month element paste("0", substr(x, start=rep(1, len), stop=rep(3, len)), # And insert 20 for the century before the two # digit year element in order to create a 4 # digit year element "20", substr(x, start=rep(4, len), stop=rep(5, len)), sep=""), # ELSE # IF 'mmddyy' (number of characters = 6) ifelse(nchar(x)==6 & # AND two digit year element > 79 as.numeric(substr(x, start=rep(5, len), stop=rep(6, len))) > 79, # THEN insert 19 for the century before the two digit # year element in order to create a 4 digit year element paste(substr(x, start=rep(1, len), stop=rep(4, len)), "19", substr(x, start=rep(5, len), stop=rep(6, len)), sep=""), # ELSE # IF 'mmddyy' (number of characters = 6) ifelse(nchar(x)==6 & # AND two digit year element <= 79 as.numeric(substr(x, start=rep(5, len), stop=rep(6, len))) <= 79, # THEN insert 19 for the century before the two digit # year element in order to create a 4 digit year element paste(substr(x, start=rep(1, len), stop=rep(4, len)), "20", substr(x, start=rep(5, len), stop=rep(6, len)), sep=""), # ELSE # IF 'mddyyyy' (number of characters =7) ifelse(nchar(x)==7, # THEN insert a 0 before the 1 digit month element paste("0", x, sep=""), # ELSE --- 'mmddyyyy' (number of characters = 8) # THEN no need to expand the date format, and leave as is x))))) x <- as.character(strptime(x, "%m%d%Y")) x <- as.Date(x) } </highlight> Let's actually try calling the =datecon()= function on one of the columns in =r1.RData=. First we need to read in the =r1.RData= data set, and let's look at the original column (just the first 10 values). <highlight> load("r1.RData") r1$FROM[1:10] # [1] 7081999 7151996 7151996 11291999 10281999 11151999 4191999 1291999 # [9] 1231999 2221999 </highlight> The =datecon()= function doesn't print out anything, so you simply use it to define a new column: <highlight> r1$FROM2<-datecon(r1$FROM, r1) r1$FROM2[1:10] # [1] "1999-07-08" "1996-07-15" "1996-07-15" "1999-11-29" "1999-10-28" # [6] "1999-11-15" "1999-04-19" "1999-01-29" "1999-01-23" "1999-02-22" </highlight> For those of you wanting to, the =datecon()= function also works within the =Hmisc= package's =upData()= function in order to define new columns, or redefine old ones. <highlight> r1<-upData(r1, SDATE02=datecon(r1$SDATE01, r1), FROM2=datecon(r1$FROM, r1)) </highlight> All the previous R code is give in the attached =dateconversion.R= file, and the file containing the =r1.RData= data set is also attached below. To "read" the =r1.Data= file, use the =load()= function: <highlight> load("r1.Data") </highlight> ---++ Acknowledgements: I would like to thank Richard Urbano for posing this problem, and Svetlana Eden for showing me the usefulness of the =chron= package.
Attachments
2
Attachments
2
Topic attachments
I
Attachment
Action
Size
Date
Who
Comment
R
dateconversion.R
manage
6.8 K
12 Apr 2005 - 13:21
TheresaScott
RData
r1.RData
manage
161.7 K
12 Apr 2005 - 13:21
TheresaScott
E
dit
|
A
ttach
|
P
rint version
|
H
istory
: r2
<
r1
|
B
acklinks
|
V
iew topic
|
Edit
w
iki text
|
M
ore topic actions
Topic revision: r2 - 15 Nov 2006,
TheresaScott
Main
Department Home Page
Biostatistics Graduate Program
Vanderbilt University Medical Center
Main Web
Main Web Home
Search
Recent Changes
Changes
Topic list
Biostatistics Webs
Archive
Main
Sandbox
System
Register
|
Log In
Copyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki?
Send feedback