Converting dates of the form 'mddyy' or 'mddyyyy' or 'mmddyy' or 'mmddyyyy' to the form 'YYYY-MM-DD'

Theresa A Scott, M.S.
Biostatistician II, Department of Biostatistics
Vanderbilt University School of Medicine

Problem:

A data file contains multiple date columns, whose format is somewhat consistent within each column, but not consistent across the columns.

Data specifics:

  • The month, day, and year elements of each date are not seperated by any character (i.e. they are in the form 01012001, not 01/01/2001 or 01-01-2001).
  • The years are between 1980 and 2005.
  • The month element does not have a leading 0 if the month is < 10.
  • The data columns have one of four possible formats:
    1. 'mddyy' - 23180
    2. 'mmddyy' - 113180
    3. 'mddyyyy' - 2311980
    4. 'mmddyyyy' - 12311980
  • Within the same column, there may be dates of the format 'mddyy' and 'mddyyyy' or of the format 'mmddyy' or 'mmddyyyy'.

"Solution":

NOTE: I use the term "Solution" loosely; in R, there is never just one solution. This is just the solution I have found.

NOTE: The code distinguished with a ">" is what you would type at the R command line. The R output, if any, and any of my comments have been commented out using a "#".

First, let's load some needed add-on packages. As noted, you might have to first install the chron package from CRAN to then load it.
> library(Hmisc)
# Note: must install chron from CRAN
> library(chron)

For illustration sake, I created a dummy data file with five columns:
  1. A subject ID column
  2. A date column of form 'mddyy'
  3. A date column of form 'mmddyy'
  4. A date column of form 'mddyyyy'
  5. A date column of form 'mmddyyyy'

> outFile <- file("tmp.txt", open="wt")
> cat("subject,date5,date6,date7,date8", "\n", file=outFile)
> cat("100001,12285,102280,5241980,11101982", "\n",
   file=outFile)
> cat("100002,22502,120503,6082001,12302004", "\n",
   file=outFile)
> close(outFile)

Let's read in the data file, and see what it looks like. It's important to note the class of the date column.

  
> datta<-csv.get("tmp.txt")
> datta
#   subject date5  date6   date7    date8
# 1  100001 12285 102280 5241980 11101982
# 2  100002 22502 120503 6082001 12302004
> class(datta$date5)
# [1] "integer"
> class(datta$date6)
# [1] "integer"
> class(datta$date7)
# [1] "integer"
> class(datta$date8)
# [1] "integer"

I now attempt to convert each date format, one by one.

The conversion process of any of the date formats follows the same basic steps:
  1. Convert the integer date to a character (i.e. 01012001 is now interpreted as "01012001", a character string).
  2. Insert any additional needed month or year elements to the date, such as a leading zero, or the century to the year ("19" or "20").
  3. Convert the character date to a recognized date-time format.

The next sections assume that all the dates in one column have all the same format.

'mddyy' date format:
# 1. Convert the integer date to a character
> datta$date5<-as.character(datta$date5)
> class(datta$date5)
# [1] "character"
# 2. Insert a leading zero into the month element, and insert the century ("19" or "20")
#   into the year element
> datta$date5
# 3. Convert the character date to a recognized date-time format.
> datta$date5 <- as.character(strptime(datta$date5, "%m%d%Y"))
> datta$date5
> datta$date5<-ifelse(
   # IF 'mddyy' (number of characters = 5)
   nchar(datta$date5)==5 &
   # & two digit year element > 79
   as.numeric(substr(datta$date5, start=rep(4,dim(datta)[1]),
      stop=rep(5,dim(datta)[1]))) > 79,
   # THEN insert a 0 before the one digit month element
   paste("0", 
         substr(datta$date5, start=rep(1,dim(datta)[1]),
            stop=rep(3,dim(datta)[1])), 
         # AND insert 19 before the two digit year element
         "19", 
         substr(datta$date5, start=rep(4,dim(datta)[1]),
            stop=rep(5,dim(datta)[1])), sep=""),
   # ELSE insert a 0 before the one digit month element
   paste("0", 
         substr(datta$date5, start=rep(1,dim(datta)[1]),
            stop=rep(3,dim(datta)[1])), 
         "20", 
         # AND insert 20 before the two digit year element
         substr(datta$date5, start=rep(4,dim(datta)[1]),
            stop=rep(5,dim(datta)[1])),sep=""))
> datta$date5
# [1] "01221985" "02252002"
> datta$date5 <- as.Date(datta$date5)
> datta$date5
# [1] "1985-01-22" "2002-02-25"
> datta$date5 <- as.Date(datta$date5)
> datta$date5
# [1] "1985-01-22" "2002-02-25"
> class(datta$date5)
# [1] "Date"

I repeat the conversion process in a similar fashion for the other date formats.

'mmddyy' date format:
datta$date6<-as.character(datta$date6)
datta$date6<-ifelse(
   # IF 'mmddyy' (6 characters)
   nchar(datta$date6)==6 &
   # & two digit year element > 79
   as.numeric(substr(datta$date6,  start=rep(5,dim(datta)[1]),
      stop=rep(6,dim(datta)[1]))) > 79,
   # THEN insert a 19 before the two digit year element
   paste(substr(datta$date6, start=rep(1,dim(datta)[1]),
            stop=rep(4,dim(datta)[1])), 
         "19", 
         substr(datta$date6, start=rep(5,dim(datta)[1]),
            stop=rep(6,dim(datta)[1])), sep=""),
   # ELSE insert a 20 before the two digit year element
   paste(substr(datta$date6, start=rep(1,dim(datta)[1]),
            stop=rep(4,dim(datta)[1])), 
         "20", 
         substr(datta$date6, start=rep(5,dim(datta)[1]),
            stop=rep(6,dim(datta)[1])),sep=""))
datta$date6 = as.character(strptime(datta$date6, "%m%d%Y"))
datta$date6 = as.Date(datta$date6)
datta$date6
class(datta$date6)

'mddyyyy' date format:
datta$date7<-as.character(datta$date7)
datta$date7<- ifelse(
   # IF 'mddyyyy' (7 characters)
   nchar(datta$date7)==7,
   # THEN insert a 0 before the one digit month element
   paste("0",datta$date7,sep=""), 
   # ELSE leave as is
   datta$date7)
datta$date7 = as.character(strptime(datta$date7, "%m%d%Y"))
datta$date7 = as.Date(datta$date7)
datta$date7
class(datta$date7)

'mmddyyyy' date format:
datta$date8<-as.character(datta$date8)
datta$date8 = as.character(strptime(datta$date8, "%m%d%Y"))
datta$date8 = as.Date(datta$date8)
datta$date8
class(datta$date8)

Now I wanted to try using the real data set associated with this problem. I also defined the conversion as a funcion that handles any of the date formats.

The date set I used to test this function (r1.RData) is a subset of the actual data set the problem arose from. r1.RData has N = 8325 records, while the actual data set has over 1 million records.

NOTE: I tried writing this function several different ways, but this is the way it seemed to work best.

# Call datecon function on x, where x is the form df$col, 
# and df is just the name of the data frame
> datecon<-function(x, df) {
   # Define the numbers of rows of the dataframe
   len<-dim(df)[1]
   # Define x as a character vector in order to be able to manipulate
   #   its month, day, and year elements
   x<-as.character(x)
   # Convert date using the steps outlined above   
   x<-ifelse(
      # IF 'mddyy' (number of characters = 5)
      nchar(x)==5 & 
      # AND two digit year element > 79
      as.numeric(substr(x, 
          start=rep(4, len), stop=rep(5, len))) > 79,
      # THEN insert a 0 before the 1 digit month element
      paste("0", 
            substr(x, start=rep(1, len), 
               stop=rep(3, len)), 
            # And insert 19 for the century before the two
            # digit year element in order to create a 4 
            # digit year element
            "19", 
            substr(x, start=rep(4, len), 
               stop=rep(5, len)), 
            sep=""),
      # ELSE
      # IF 'mddyy' (number of characters = 5)
      ifelse(nchar(x)==5 & 
      #AND two digit year element <= 79
      as.numeric(substr(x, 
          start=rep(4, len), stop=rep(5, len))) <= 79,
      # THEN insert a 0 before the 1 digit month element
      paste("0", 
            substr(x, start=rep(1, len), 
               stop=rep(3, len)), 
            # And insert 20 for the century before the two
            # digit year element in order to create a 4 
            # digit year element
            "20", 
            substr(x, start=rep(4, len), 
               stop=rep(5, len)), 
            sep=""),
      # ELSE
      # IF 'mmddyy' (number of characters = 6)
      ifelse(nchar(x)==6 & 
      # AND two digit year element > 79
      as.numeric(substr(x, 
          start=rep(5, len), stop=rep(6, len))) > 79,
      # THEN insert 19 for the century before the two digit
      # year element in order to create a 4  digit year element
      paste(substr(x, start=rep(1, len), 
               stop=rep(4, len)), 
            "19", 
            substr(x, start=rep(5, len), 
               stop=rep(6, len)), 
            sep=""),
      # ELSE
      # IF 'mmddyy' (number of characters = 6)
      ifelse(nchar(x)==6 & 
      # AND two digit year element <= 79
      as.numeric(substr(x, 
          start=rep(5, len), stop=rep(6, len))) <= 79,
      # THEN insert 19 for the century before the two digit
      # year element in order to create a 4  digit year element
      paste(substr(x, start=rep(1, len), 
               stop=rep(4, len)), 
            "20", 
            substr(x, start=rep(5, len), 
               stop=rep(6, len)), 
            sep=""),
      # ELSE
      # IF 'mddyyyy' (number of characters =7)
      ifelse(nchar(x)==7,
      # THEN insert a 0 before the 1 digit month element
      paste("0", x, sep=""),
      # ELSE --- 'mmddyyyy' (number of characters = 8)
      # THEN no need to expand the date format, and leave as is
      x)))))
   x <- as.character(strptime(x, "%m%d%Y"))
   x <- as.Date(x)
}

Let's actually try calling the datecon function on one of the columns in r1.RData.

First we need to read in the r1.RData data set, and let's look at the original column (just the first 10 values).

> load("r1.RData")
> r1$FROM[1:10]
# [1]  7081999  7151996  7151996 11291999 10281999 11151999  4191999  1291999
# [9]  1231999  2221999

The datecon function doesn't print out anything, so you simply use it to define a new column:
> r1$FROM2<-datecon(r1$FROM, r1)
> r1$FROM2[1:10]
# [1] "1999-07-08" "1996-07-15" "1996-07-15" "1999-11-29" "1999-10-28"
# [6] "1999-11-15" "1999-04-19" "1999-01-29" "1999-01-23" "1999-02-22"

For those of you wanting to, the datecon function also works within the Hmisc upData function in order to define new columns, or redefine old ones.

> r1<-upData(r1, SDATE02=datecon(r1$SDATE01, r1),
   FROM2=datecon(r1$FROM, r1))

All the previous R code is give in the attached dateconversion.R file, and the file containing the r1.RData data set is also attached below. To "read" the r1.Data file, use the load function:
> load("r1.Data")

Acknowledgements:

I would like to thank Richard Urbano for posing this problem, and Svetlana Eden for showing me the usefulness of the chron package.
Topic attachments
I Attachment Action Size Date Who Comment
dateconversion.RR dateconversion.R manage 6.8 K 12 Apr 2005 - 13:21 TheresaScott  
r1.RDataRData r1.RData manage 161.7 K 12 Apr 2005 - 13:21 TheresaScott  
Edit | Attach | Print version | History: r2 < r1 | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r1 - 12 Apr 2005, TheresaScott
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback