Converting dates of the form 'mddyy', 'mddyyyy', 'mmddyy', or 'mmddyyyy' to the form 'YYYY-MM-DD'
Problem:
A data file contains multiple date columns, whose format is somewhat consistent within each column, but not consistent across the columns.
Data specifics:
- The month, day, and year elements of each date are not seperated by any character (i.e. they are in the form 01012001, not 01/01/2001 or 01-01-2001).
- The years are between 1980 and 2005.
- The month element does not have a leading 0 if the month is < 10.
- The data columns have one of four possible formats:
- 'mddyy' - 23180
- 'mmddyy' - 113180
- 'mddyyyy' - 2311980
- 'mmddyyyy' - 12311980
- Within the same column, there may be dates of the format 'mddyy' and 'mddyyyy' or of the format 'mmddyy' or 'mmddyyyy'.
"Solution":
First, let's load some needed add-on packages. As noted, you might have to first install the
chron
package from CRAN to then load it.
library(Hmisc)
library(chron)
For illustration sake, I created a dummy data file with five columns:
- A subject ID column
- A date column of form 'mddyy'
- A date column of form 'mmddyy'
- A date column of form 'mddyyyy'
- A date column of form 'mmddyyyy'
outFile <- file("tmp.txt", open="wt")
cat("subject,date5,date6,date7,date8", "\n", file=outFile)
cat("100001,12285,102280,5241980,11101982", "\n",
file=outFile)
cat("100002,22502,120503,6082001,12302004", "\n",
file=outFile)
close(outFile)
Let's read in the data file, and see what it looks like. It's important to note the class of the date column.
datta<-csv.get("tmp.txt")
datta
class(datta$date5) # "integer"
class(datta$date6) # "integer"
class(datta$date7) # "integer"
class(datta$date8) # "integer"
I now attempt to convert each date format, one by one.
The conversion process of any of the date formats follows the same basic steps:
- Convert the integer date to a character (i.e.
01012001
is now interpreted as "01012001"
, a character string).
- Insert any additional needed month or year elements to the date, such as a leading zero, or the century to the year (
"19"
or "20"
).
- Convert the character date to a recognized date-time format.
The next sections assume that all the dates in one column have all the same format.
'mddyy' date format:
# 1. Convert the integer date to a character
datta$date5<-as.character(datta$date5)
class(datta$date5) # [1] "character"
# 2. Insert a leading zero into the month element, and insert the century ("19" or "20")
# into the year element
datta$date5
# 3. Convert the character date to a recognized date-time format.
datta$date5 <- as.character(strptime(datta$date5, "%m%d%Y"))
datta$date5
datta$date5<-ifelse(
# IF 'mddyy' (number of characters = 5)
nchar(datta$date5)==5 &
# & two digit year element > 79
as.numeric(substr(datta$date5, start=rep(4,dim(datta)[1]),
stop=rep(5,dim(datta)[1]))) > 79,
# THEN insert a 0 before the one digit month element
paste("0",
substr(datta$date5, start=rep(1,dim(datta)[1]),
stop=rep(3,dim(datta)[1])),
# AND insert 19 before the two digit year element
"19",
substr(datta$date5, start=rep(4,dim(datta)[1]),
stop=rep(5,dim(datta)[1])), sep=""),
# ELSE insert a 0 before the one digit month element
paste("0",
substr(datta$date5, start=rep(1,dim(datta)[1]),
stop=rep(3,dim(datta)[1])),
"20",
# AND insert 20 before the two digit year element
substr(datta$date5, start=rep(4,dim(datta)[1]),
stop=rep(5,dim(datta)[1])),sep=""))
datta$date5
# [1] "01221985" "02252002"
datta$date5 <- as.Date(datta$date5)
datta$date5
# [1] "1985-01-22" "2002-02-25"
datta$date5 <- as.Date(datta$date5)
datta$date5
# [1] "1985-01-22" "2002-02-25"
class(datta$date5)
# [1] "Date"
I repeat the conversion process in a similar fashion for the other date formats.
'mmddyy' date format:
datta$date6<-as.character(datta$date6)
datta$date6<-ifelse(
# IF 'mmddyy' (6 characters)
nchar(datta$date6)==6 &
# & two digit year element > 79
as.numeric(substr(datta$date6, start=rep(5,dim(datta)[1]),
stop=rep(6,dim(datta)[1]))) > 79,
# THEN insert a 19 before the two digit year element
paste(substr(datta$date6, start=rep(1,dim(datta)[1]),
stop=rep(4,dim(datta)[1])),
"19",
substr(datta$date6, start=rep(5,dim(datta)[1]),
stop=rep(6,dim(datta)[1])), sep=""),
# ELSE insert a 20 before the two digit year element
paste(substr(datta$date6, start=rep(1,dim(datta)[1]),
stop=rep(4,dim(datta)[1])),
"20",
substr(datta$date6, start=rep(5,dim(datta)[1]),
stop=rep(6,dim(datta)[1])),sep=""))
datta$date6 = as.character(strptime(datta$date6, "%m%d%Y"))
datta$date6 = as.Date(datta$date6)
datta$date6
class(datta$date6)
'mddyyyy' date format:
datta$date7<-as.character(datta$date7)
datta$date7<- ifelse(
# IF 'mddyyyy' (7 characters)
nchar(datta$date7)==7,
# THEN insert a 0 before the one digit month element
paste("0",datta$date7,sep=""),
# ELSE leave as is
datta$date7)
datta$date7 = as.character(strptime(datta$date7, "%m%d%Y"))
datta$date7 = as.Date(datta$date7)
datta$date7
class(datta$date7)
'mmddyyyy' date format:
datta$date8<-as.character(datta$date8)
datta$date8 = as.character(strptime(datta$date8, "%m%d%Y"))
datta$date8 = as.Date(datta$date8)
datta$date8
class(datta$date8)
Now I wanted to try using the real data set associated with this problem. I also defined the conversion as a funcion that handles any of the date formats.
The date set I used to test this function (
r1.RData
) is a subset of the actual data set the problem arose from.
r1.RData
has N = 8325 records, while the actual data set has over 1 million records.
NOTE: I tried writing this function several different ways, but this is the way it seemed to work best.
# Call datecon function on x, where x is the form df$col,
# and df is just the name of the data frame
datecon<-function(x, df) {
# Define the numbers of rows of the dataframe
len<-dim(df)[1]
# Define x as a character vector in order to be able to manipulate
# its month, day, and year elements
x<-as.character(x)
# Convert date using the steps outlined above
x<-ifelse(
# IF 'mddyy' (number of characters = 5)
nchar(x)==5 &
# AND two digit year element > 79
as.numeric(substr(x,
start=rep(4, len), stop=rep(5, len))) > 79,
# THEN insert a 0 before the 1 digit month element
paste("0",
substr(x, start=rep(1, len),
stop=rep(3, len)),
# And insert 19 for the century before the two
# digit year element in order to create a 4
# digit year element
"19",
substr(x, start=rep(4, len),
stop=rep(5, len)),
sep=""),
# ELSE
# IF 'mddyy' (number of characters = 5)
ifelse(nchar(x)==5 &
#AND two digit year element <= 79
as.numeric(substr(x,
start=rep(4, len), stop=rep(5, len))) <= 79,
# THEN insert a 0 before the 1 digit month element
paste("0",
substr(x, start=rep(1, len),
stop=rep(3, len)),
# And insert 20 for the century before the two
# digit year element in order to create a 4
# digit year element
"20",
substr(x, start=rep(4, len),
stop=rep(5, len)),
sep=""),
# ELSE
# IF 'mmddyy' (number of characters = 6)
ifelse(nchar(x)==6 &
# AND two digit year element > 79
as.numeric(substr(x,
start=rep(5, len), stop=rep(6, len))) > 79,
# THEN insert 19 for the century before the two digit
# year element in order to create a 4 digit year element
paste(substr(x, start=rep(1, len),
stop=rep(4, len)),
"19",
substr(x, start=rep(5, len),
stop=rep(6, len)),
sep=""),
# ELSE
# IF 'mmddyy' (number of characters = 6)
ifelse(nchar(x)==6 &
# AND two digit year element <= 79
as.numeric(substr(x,
start=rep(5, len), stop=rep(6, len))) <= 79,
# THEN insert 19 for the century before the two digit
# year element in order to create a 4 digit year element
paste(substr(x, start=rep(1, len),
stop=rep(4, len)),
"20",
substr(x, start=rep(5, len),
stop=rep(6, len)),
sep=""),
# ELSE
# IF 'mddyyyy' (number of characters =7)
ifelse(nchar(x)==7,
# THEN insert a 0 before the 1 digit month element
paste("0", x, sep=""),
# ELSE --- 'mmddyyyy' (number of characters = 8)
# THEN no need to expand the date format, and leave as is
x)))))
x <- as.character(strptime(x, "%m%d%Y"))
x <- as.Date(x)
}
Let's actually try calling the
datecon()
function on one of the columns in
r1.RData
.
First we need to read in the
r1.RData
data set, and let's look at the original column (just the first 10 values).
load("r1.RData")
r1$FROM[1:10]
# [1] 7081999 7151996 7151996 11291999 10281999 11151999 4191999 1291999
# [9] 1231999 2221999
The
datecon()
function doesn't print out anything, so you simply use it to define a new column:
r1$FROM2<-datecon(r1$FROM, r1)
r1$FROM2[1:10]
# [1] "1999-07-08" "1996-07-15" "1996-07-15" "1999-11-29" "1999-10-28"
# [6] "1999-11-15" "1999-04-19" "1999-01-29" "1999-01-23" "1999-02-22"
For those of you wanting to, the
datecon()
function also works within the
Hmisc
package's
upData()
function in order to define new columns, or redefine old ones.
r1<-upData(r1, SDATE02=datecon(r1$SDATE01, r1),
FROM2=datecon(r1$FROM, r1))
All the previous R code is give in the attached
dateconversion.R
file, and the file containing the
r1.RData
data set is also attached below. To "read" the
r1.Data
file, use the
load()
function:
load("r1.Data")
Acknowledgements:
I would like to thank Richard Urbano for posing this problem, and Svetlana Eden for showing me the usefulness of the
chron
package.