Coding factors with numerical levels instead of character ones
Problem:
Often, the categorical variables in a data set we have read in have levels denoted by character strings (e.g. "No" and "Yes"). Sometimes we want these same categorical variables to have numerically denoted levels (e.g. 0 and 1).
Data specifics:
- The data file you have read in contains categorical variables whose levels are denoted with character values, not numerical ones.
"Solution":
First, let's read in a dummy data file to illustrate the problem ("file.txt" is attached at the bottom of this page):
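# Read in data file with column assigned No/Yes
# (stringsAsFactors=TRUE is needed in R >= 4.0.0 to get a factor column)
x <- read.table("file.txt", header=TRUE, stringsAsFactors=TRUE)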
x
#    id weight smoker
# 1   1    120     No
# 2   2    125    Yes
# 3   3    130    Yes
# 4   4    135     No
# 5   5    140     No
# 6   6    145    Yes
# 7   7    150    Yes
# 8   8    155     No
# 9   9    160     No
# 10 10    165    Yes
# 11 11    170    Yes
# 12 12    175     No
# 13 13    180     No
# 14 14    185    Yes
# 15 15    190    Yes
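If the attached file is not at hand, the same data frame can be rebuilt directly in code. This is a minimal sketch that assumes the rows printed above are the complete contents of "file.txt"; the use of data.frame() and seq() here is an illustration, not part of the original recipe.

```r
# Rebuild the example data without reading a file.
# Assumption: the 15 rows printed above are everything in file.txt.
x <- data.frame(
  id     = 1:15,
  weight = seq(120, 190, by = 5),
  smoker = c("No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No",
             "No", "Yes", "Yes", "No", "No", "Yes", "Yes"),
  stringsAsFactors = TRUE  # store smoker as a factor, not as character
)

# Inspect the column types; smoker should be a factor with 2 levels
str(x)
```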
If we look at the class of the smoker column, we see that it is a "factor".
class(x$smoker)
# [1] "factor"
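When a vector of character strings is included as a column of a data frame, versions of R before 4.0.0 turn the vector into a factor by default. From R 4.0.0 onward, the column stays a character vector unless stringsAsFactors=TRUE is passed to read.table() or data.frame().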
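Whether a character column becomes a factor is controlled by the stringsAsFactors argument, whose default changed from TRUE to FALSE in R 4.0.0. The sketch below sets the argument explicitly both ways so the result does not depend on the R version; the vector s and the data frames d_chr and d_fac are made-up names for illustration.

```r
# A small character vector standing in for the smoker column
s <- c("No", "Yes", "Yes", "No")

# Set the argument explicitly so the outcome is version-independent
d_chr <- data.frame(smoker = s, stringsAsFactors = FALSE)
d_fac <- data.frame(smoker = s, stringsAsFactors = TRUE)

class(d_chr$smoker)  # "character"
class(d_fac$smoker)  # "factor"

# A character column can also be converted after the fact
d_chr$smoker <- factor(d_chr$smoker)
levels(d_chr$smoker)  # "No" "Yes"
```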
There are a few ways we can extract the codes 1, 2, ... from the categorical variable.
The easiest way to extract the codes 1, 2, ... is to use the as.numeric() function:
as.numeric(x$smoker)
# [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
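One caveat is worth a quick illustration: as.numeric() always returns the internal codes, even when the level labels themselves look like numbers. The sketch below uses a made-up factor called dose to show the difference between the codes and the labelled values.

```r
# A factor whose labels happen to be numbers (hypothetical example data)
dose <- factor(c("10", "20", "10", "50"))

# as.numeric() gives the internal codes, not the labelled values
as.numeric(dose)                # 1 2 1 3

# To recover the labelled values, convert through character first ...
as.numeric(as.character(dose))  # 10 20 10 50

# ... or index the numeric levels by the factor's codes
as.numeric(levels(dose))[dose]  # 10 20 10 50
```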
The unclass() function returns a copy of an object with its class attribute removed. So, if we "unclass" the smoker column, we get the same codes that as.numeric() returns, along with the levels attribute:
unclass(x$smoker)
# [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
# attr(,"levels")
# [1] "No" "Yes"
As seen, factors have a levels attribute, which holds the level names.
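The relationship between the codes and the levels attribute can be sketched as follows: each code is simply a position in the vector of level names. The short smoker vector here is a stand-in for the column in the data frame.

```r
# Stand-in for the smoker column
smoker <- factor(c("No", "Yes", "Yes", "No"))

# The level names, in the order that defines the codes
levels(smoker)             # "No" "Yes"
nlevels(smoker)            # 2

# The integer codes are positions within levels(smoker)
codes <- as.integer(smoker)
codes                      # 1 2 2 1

# Indexing the level names by the codes recovers the original labels
levels(smoker)[codes]      # "No" "Yes" "Yes" "No"
```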
We can manipulate this as.numeric() or unclass() output in order to extract the numeric codes we want, such as 0, 1, ... instead of 1, 2, ...:
as.numeric(x$smoker) - 1
# [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
unclass(x$smoker) - 1
# [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
# attr(,"levels")
# [1] "No" "Yes"
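Subtracting 1 works here because the levels are sorted alphabetically by default, so "No" gets code 1 and "Yes" gets code 2. Where that ordering cannot be relied on, the mapping can be stated explicitly. The following is a sketch of two such alternatives, not part of the original recipe; smoker_ord is a made-up name.

```r
# Stand-in for the smoker column
smoker <- factor(c("No", "Yes", "Yes", "No"))

# Option 1: test against the label directly, so level order is irrelevant
as.integer(smoker == "Yes")        # 0 1 1 0

# Option 2: fix the level order explicitly, then subtract 1 as before
smoker_ord <- factor(smoker, levels = c("No", "Yes"))
as.integer(smoker_ord) - 1L        # 0 1 1 0
```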
So, if we wanted to, we could define a new column in our data frame that contains the numerical codes extracted from our factor variable:
# Define a new column which is assigned 0/1
x$smoker2<-as.numeric(x$smoker)-1
x
#    id weight smoker smoker2
# 1   1    120     No       0
# 2   2    125    Yes       1
# 3   3    130    Yes       1
# 4   4    135     No       0
# 5   5    140     No       0
# 6   6    145    Yes       1
# 7   7    150    Yes       1
# 8   8    155     No       0
# 9   9    160     No       0
# 10 10    165    Yes       1
# 11 11    170    Yes       1
# 12 12    175     No       0
# 13 13    180     No       0
# 14 14    185    Yes       1
# 15 15    190    Yes       1
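A quick way to confirm that the recoding did what was intended is to cross-tabulate the old and new columns. This is a small sketch using a shortened stand-in data frame (called d here for illustration); every "No" should land under 0 and every "Yes" under 1.

```r
# Shortened stand-in for the data frame built above
d <- data.frame(smoker = factor(c("No", "Yes", "Yes", "No", "No", "Yes")))
d$smoker2 <- as.numeric(d$smoker) - 1

# Cross-tabulate labels against codes; off-diagonal cells should be 0
tab <- table(d$smoker, d$smoker2)
tab

# The same check, expressed as a single logical test
all((d$smoker == "Yes") == (d$smoker2 == 1))
```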
For more information on factors and their potentially surprising characteristics, see the following book:
- Data Analysis and Graphics Using R by John Maindonald and John Braun
- Sections 1.4.5 Factors and 12.4 Factors - Additional Comments
Acknowledgements:
I would like to thank Richard Urbano for posing this problem.