Converting dates of the form 'mddyy', 'mddyyyy', 'mmddyy', or 'mmddyyyy' to the form 'YYYY-MM-DD'

Problem:

A data file contains multiple date columns, whose format is somewhat consistent within each column, but not consistent across the columns.

Data specifics:

  • The month, day, and year elements of each date are not seperated by any character (i.e. they are in the form 01012001, not 01/01/2001 or 01-01-2001).
  • The years are between 1980 and 2005.
  • The month element does not have a leading 0 if the month is < 10.
  • The data columns have one of four possible formats:
    1. 'mddyy' - 23180
    2. 'mmddyy' - 113180
    3. 'mddyyyy' - 2311980
    4. 'mmddyyyy' - 12311980
  • Within the same column, there may be dates of the format 'mddyy' and 'mddyyyy' or of the format 'mmddyy' or 'mmddyyyy'.

"Solution":

First, let's load some needed add-on packages. As noted, you might have to first install the chron package from CRAN to then load it. library(Hmisc) library(chron)

For illustration sake, I created a dummy data file with five columns:
  1. A subject ID column
  2. A date column of form 'mddyy'
  3. A date column of form 'mmddyy'
  4. A date column of form 'mddyyyy'
  5. A date column of form 'mmddyyyy'

outFile <- file("tmp.txt", open="wt") cat("subject,date5,date6,date7,date8", "\n", file=outFile) cat("100001,12285,102280,5241980,11101982", "\n", file=outFile) cat("100002,22502,120503,6082001,12302004", "\n", file=outFile) close(outFile)

Let's read in the data file, and see what it looks like. It's important to note the class of the date column.

datta<-csv.get("tmp.txt") datta class(datta$date5) # "integer" class(datta$date6) # "integer" class(datta$date7) # "integer" class(datta$date8) # "integer"

I now attempt to convert each date format, one by one.

The conversion process of any of the date formats follows the same basic steps:
  1. Convert the integer date to a character (i.e. 01012001 is now interpreted as "01012001", a character string).
  2. Insert any additional needed month or year elements to the date, such as a leading zero, or the century to the year ("19" or "20").
  3. Convert the character date to a recognized date-time format.

The next sections assume that all the dates in one column have all the same format.

'mddyy' date format: # 1. Convert the integer date to a character datta$date5<-as.character(datta$date5) class(datta$date5) # [1] "character" # 2. Insert a leading zero into the month element, and insert the century ("19" or "20") # into the year element datta$date5 # 3. Convert the character date to a recognized date-time format. datta$date5 <- as.character(strptime(datta$date5, "%m%d%Y")) datta$date5 datta$date5<-ifelse( # IF 'mddyy' (number of characters = 5) nchar(datta$date5)==5 & # & two digit year element > 79 as.numeric(substr(datta$date5, start=rep(4,dim(datta)[1]), stop=rep(5,dim(datta)[1]))) > 79, # THEN insert a 0 before the one digit month element paste("0", substr(datta$date5, start=rep(1,dim(datta)[1]), stop=rep(3,dim(datta)[1])), # AND insert 19 before the two digit year element "19", substr(datta$date5, start=rep(4,dim(datta)[1]), stop=rep(5,dim(datta)[1])), sep=""), # ELSE insert a 0 before the one digit month element paste("0", substr(datta$date5, start=rep(1,dim(datta)[1]), stop=rep(3,dim(datta)[1])), "20", # AND insert 20 before the two digit year element substr(datta$date5, start=rep(4,dim(datta)[1]), stop=rep(5,dim(datta)[1])),sep="")) datta$date5 # [1] "01221985" "02252002" datta$date5 <- as.Date(datta$date5) datta$date5 # [1] "1985-01-22" "2002-02-25" datta$date5 <- as.Date(datta$date5) datta$date5 # [1] "1985-01-22" "2002-02-25" class(datta$date5) # [1] "Date"

I repeat the conversion process in a similar fashion for the other date formats.

'mmddyy' date format: datta$date6<-as.character(datta$date6) datta$date6<-ifelse( # IF 'mmddyy' (6 characters) nchar(datta$date6)==6 & # & two digit year element > 79 as.numeric(substr(datta$date6, start=rep(5,dim(datta)[1]), stop=rep(6,dim(datta)[1]))) > 79, # THEN insert a 19 before the two digit year element paste(substr(datta$date6, start=rep(1,dim(datta)[1]), stop=rep(4,dim(datta)[1])), "19", substr(datta$date6, start=rep(5,dim(datta)[1]), stop=rep(6,dim(datta)[1])), sep=""), # ELSE insert a 20 before the two digit year element paste(substr(datta$date6, start=rep(1,dim(datta)[1]), stop=rep(4,dim(datta)[1])), "20", substr(datta$date6, start=rep(5,dim(datta)[1]), stop=rep(6,dim(datta)[1])),sep="")) datta$date6 = as.character(strptime(datta$date6, "%m%d%Y")) datta$date6 = as.Date(datta$date6) datta$date6 class(datta$date6)

'mddyyyy' date format: datta$date7<-as.character(datta$date7) datta$date7<- ifelse( # IF 'mddyyyy' (7 characters) nchar(datta$date7)==7, # THEN insert a 0 before the one digit month element paste("0",datta$date7,sep=""), # ELSE leave as is datta$date7) datta$date7 = as.character(strptime(datta$date7, "%m%d%Y")) datta$date7 = as.Date(datta$date7) datta$date7 class(datta$date7)

'mmddyyyy' date format: datta$date8<-as.character(datta$date8) datta$date8 = as.character(strptime(datta$date8, "%m%d%Y")) datta$date8 = as.Date(datta$date8) datta$date8 class(datta$date8)

Now I wanted to try using the real data set associated with this problem. I also defined the conversion as a funcion that handles any of the date formats.

The date set I used to test this function (r1.RData) is a subset of the actual data set the problem arose from. r1.RData has N = 8325 records, while the actual data set has over 1 million records.

NOTE: I tried writing this function several different ways, but this is the way it seemed to work best.

# Call datecon function on x, where x is the form df$col, # and df is just the name of the data frame datecon<-function(x, df) { # Define the numbers of rows of the dataframe len<-dim(df)[1] # Define x as a character vector in order to be able to manipulate # its month, day, and year elements x<-as.character(x) # Convert date using the steps outlined above x<-ifelse( # IF 'mddyy' (number of characters = 5) nchar(x)==5 & # AND two digit year element > 79 as.numeric(substr(x, start=rep(4, len), stop=rep(5, len))) > 79, # THEN insert a 0 before the 1 digit month element paste("0", substr(x, start=rep(1, len), stop=rep(3, len)), # And insert 19 for the century before the two # digit year element in order to create a 4 # digit year element "19", substr(x, start=rep(4, len), stop=rep(5, len)), sep=""), # ELSE # IF 'mddyy' (number of characters = 5) ifelse(nchar(x)==5 & #AND two digit year element <= 79 as.numeric(substr(x, start=rep(4, len), stop=rep(5, len))) <= 79, # THEN insert a 0 before the 1 digit month element paste("0", substr(x, start=rep(1, len), stop=rep(3, len)), # And insert 20 for the century before the two # digit year element in order to create a 4 # digit year element "20", substr(x, start=rep(4, len), stop=rep(5, len)), sep=""), # ELSE # IF 'mmddyy' (number of characters = 6) ifelse(nchar(x)==6 & # AND two digit year element > 79 as.numeric(substr(x, start=rep(5, len), stop=rep(6, len))) > 79, # THEN insert 19 for the century before the two digit # year element in order to create a 4 digit year element paste(substr(x, start=rep(1, len), stop=rep(4, len)), "19", substr(x, start=rep(5, len), stop=rep(6, len)), sep=""), # ELSE # IF 'mmddyy' (number of characters = 6) ifelse(nchar(x)==6 & # AND two digit year element <= 79 as.numeric(substr(x, start=rep(5, len), stop=rep(6, len))) <= 79, # THEN insert 19 for the century before the two digit # year element in order to create a 4 digit year element paste(substr(x, start=rep(1, len), stop=rep(4, len)), "20", substr(x, start=rep(5, len), stop=rep(6, len)), sep=""), # ELSE # IF 'mddyyyy' (number of characters =7) ifelse(nchar(x)==7, # THEN insert a 0 before the 1 digit month element paste("0", x, sep=""), # ELSE --- 'mmddyyyy' (number of characters = 8) # THEN no need to expand the date format, and leave as is x))))) x <- as.character(strptime(x, "%m%d%Y")) x <- as.Date(x) }

Let's actually try calling the datecon() function on one of the columns in r1.RData.

First we need to read in the r1.RData data set, and let's look at the original column (just the first 10 values).

load("r1.RData") r1$FROM[1:10] # [1] 7081999 7151996 7151996 11291999 10281999 11151999 4191999 1291999 # [9] 1231999 2221999

The datecon() function doesn't print out anything, so you simply use it to define a new column: r1$FROM2<-datecon(r1$FROM, r1) r1$FROM2[1:10] # [1] "1999-07-08" "1996-07-15" "1996-07-15" "1999-11-29" "1999-10-28" # [6] "1999-11-15" "1999-04-19" "1999-01-29" "1999-01-23" "1999-02-22"

For those of you wanting to, the datecon() function also works within the Hmisc package's upData() function in order to define new columns, or redefine old ones.

r1<-upData(r1, SDATE02=datecon(r1$SDATE01, r1), FROM2=datecon(r1$FROM, r1))

All the previous R code is give in the attached dateconversion.R file, and the file containing the r1.RData data set is also attached below. To "read" the r1.Data file, use the load() function: load("r1.Data")

Acknowledgements:

I would like to thank Richard Urbano for posing this problem, and Svetlana Eden for showing me the usefulness of the chron package.

Topic attachments
I Attachment Action Size Date Who Comment
dateconversion.RR dateconversion.R manage 6.8 K 12 Apr 2005 - 13:21 TheresaScott  
r1.RDataRData r1.RData manage 161.7 K 12 Apr 2005 - 13:21 TheresaScott  
This topic: Main > WebHome > Seminars > RClinic > OddDateManipulation
Topic revision: 15 Nov 2006, TheresaScott
 
This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback