Coding factors with numerical levels instead of character ones

Theresa A Scott, M.S.
Biostatistician II, Department of Biostatistics
Vanderbilt University School of Medicine

Problem:

Often, the categorical variables in a data set we have read in have levels denoted by character strings (e.g. "No" and "Yes"). Sometimes we want those same categorical variables to have numerically denoted levels instead (e.g. 0 and 1).

Data specifics:

  • The data file you have read in contains categorical variables whose levels are denoted with character values, not numerical ones.

"Solution":

NOTE: I use the term "Solution" loosely; in R, there is never just one solution. This is just the solution I have found.

NOTE: The code distinguished with a ">" is what you would type at the R command line. The R output, if any, and any of my comments have been commented out using a "#".

First let's read in a dummy data file to illustrate the problem ("file.txt" is attached at the bottom of this page):

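# Read in data file with column assigned No/Yes
# (stringsAsFactors=TRUE is needed in R 4.0.0 and later, where read.table
#  no longer converts character columns to factors by default)
> x<-read.table("file.txt", header=TRUE, stringsAsFactors=TRUE)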
> x
#    id weight smoker
# 1   1    120     No
# 2   2    125    Yes
# 3   3    130    Yes
# 4   4    135     No
# 5   5    140     No
# 6   6    145    Yes
# 7   7    150    Yes
# 8   8    155     No
# 9   9    160     No
# 10 10    165    Yes
# 11 11    170    Yes
# 12 12    175     No
# 13 13    180     No
# 14 14    185    Yes
# 15 15    190    Yes
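
If file.txt is not at hand, the same data frame can be rebuilt directly in R. The following is a minimal sketch reconstructed from the printed output above, not the original file; the rep() and seq() shortcuts are assumptions about how the values were generated, and weight comes out as a double rather than the integer read.table would produce.

```r
# Rebuild the dummy data shown above without reading file.txt.
# stringsAsFactors = TRUE makes smoker a factor in every version of R.
x <- data.frame(
  id     = 1:15,
  weight = seq(120, 190, by = 5),
  smoker = rep(c("No", "Yes", "Yes", "No"), length.out = 15),
  stringsAsFactors = TRUE
)

# Inspect the column types; smoker should be a factor with 2 levels
str(x)
```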

If we look at the class of the smoker column, we see that it is a "factor".
> class(x$smoker)
# [1] "factor"

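When a vector of character strings is included as a column of a data frame, versions of R before 4.0.0 turn the vector into a factor by default. Since R 4.0.0 the default is stringsAsFactors=FALSE, so the conversion has to be requested explicitly (for example with stringsAsFactors=TRUE in read.table).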
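
The conversion from character strings to a factor is controlled by the stringsAsFactors argument of data.frame() and read.table(). A small sketch showing both settings side by side; the vector and data frame names here are made up for illustration:

```r
smoker_chr <- c("No", "Yes", "Yes", "No")   # hypothetical example vector

# Keep the strings as they are
d_chr <- data.frame(smoker = smoker_chr, stringsAsFactors = FALSE)
class(d_chr$smoker)
# [1] "character"

# Convert the strings to a factor when the data frame is built
d_fac <- data.frame(smoker = smoker_chr, stringsAsFactors = TRUE)
class(d_fac$smoker)
# [1] "factor"

# A character column can also be converted after the fact
d_chr$smoker <- factor(d_chr$smoker)
levels(d_chr$smoker)
# [1] "No"  "Yes"
```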

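There are a few ways we can extract the codes 1, 2, ... from the categorical variable. The codes follow the order of the factor's levels, which by default is the sorted order of the values, so here "No" is 1 and "Yes" is 2.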

The easiest way to extract the codes 1, 2, ... is to use the as.numeric function:
> as.numeric(x$smoker)
#  [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
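
The codes returned by as.numeric are positions in the factor's levels, not values in their own right. The sketch below, using small made-up factors, shows how the codes map back to the level names and why this matters when the levels themselves look like numbers:

```r
smoker <- factor(c("No", "Yes", "Yes", "No"))   # hypothetical example factor

# The codes index into levels(smoker)
codes <- as.numeric(smoker)
codes
# [1] 1 2 2 1
levels(smoker)[codes]
# [1] "No"  "Yes" "Yes" "No"

# Caution: for a factor whose levels look numeric, as.numeric still
# returns the codes, not the numbers printed as level names
f <- factor(c("10", "20", "10"))
as.numeric(f)
# [1] 1 2 1
as.numeric(as.character(f))   # go through character to recover the values
# [1] 10 20 10
```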

The unclass function returns a copy of its argument with the class attribute removed. So, if we "unclass" the smoker column we get the same codes as from the as.numeric function, along with the levels attribute:
> unclass(x$smoker)
#  [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
# attr(,"levels")
# [1] "No"  "Yes"

As seen, factors have a "levels" attribute, which holds the level names.

We can manipulate this as.numeric or unclass output in order to extract the numeric codes we want, such as 0, 1, ... instead of 1, 2, ...:
> as.numeric(x$smoker) - 1
#  [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
> unclass(x$smoker) - 1
#  [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
# attr(,"levels")
# [1] "No"  "Yes"
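
Subtracting 1 gives the intended 0/1 coding only because "No" happens to be the first level. As an alternative sketch (not the approach used on this page), an explicit comparison against the level name does not depend on the level order; the factors below are made up for illustration:

```r
smoker <- factor(c("No", "Yes", "Yes", "No"))   # hypothetical example factor

# Explicit comparison: 1 where the level is "Yes", 0 otherwise
as.numeric(smoker == "Yes")
# [1] 0 1 1 0

# If "Yes" is the first level, subtracting 1 flips the coding ...
smoker_rev <- factor(smoker, levels = c("Yes", "No"))
as.numeric(smoker_rev) - 1
# [1] 1 0 0 1

# ... while the explicit comparison is unaffected
as.numeric(smoker_rev == "Yes")
# [1] 0 1 1 0
```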

So, if we wanted to, we could define a new column in our data frame that contains the extracted numerical codes of our factor variable:

# Define a new column which is assigned 0/1
> x$smoker2<-as.numeric(x$smoker)-1
> x
#    id weight smoker smoker2
# 1   1    120     No       0
# 2   2    125    Yes       1
# 3   3    130    Yes       1
# 4   4    135     No       0
# 5   5    140     No       0
# 6   6    145    Yes       1
# 7   7    150    Yes       1
# 8   8    155     No       0
# 9   9    160     No       0
# 10 10    165    Yes       1
# 11 11    170    Yes       1
# 12 12    175     No       0
# 13 13    180     No       0
# 14 14    185    Yes       1
# 15 15    190    Yes       1
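
A quick way to confirm that the new column lines up with the original factor is to cross-tabulate the two. This is a minimal sketch that rebuilds only the smoker column from the output above (the rep() pattern is an assumption read off the printed rows) rather than reading file.txt:

```r
# Rebuild the smoker column as a factor and recode it to 0/1
x <- data.frame(
  smoker = rep(c("No", "Yes", "Yes", "No"), length.out = 15),
  stringsAsFactors = TRUE
)
x$smoker2 <- as.numeric(x$smoker) - 1

# Cross-tabulate the factor against the numeric codes.
# Every "No" should pair with 0 (7 rows) and every "Yes" with 1 (8 rows),
# so the off-diagonal cells should both be zero.
tab <- table(smoker = x$smoker, smoker2 = x$smoker2)
tab
```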

For more information on factors and their potentially surprising characteristics, see the following book:
  • Data Analysis and Graphics Using R by John Maindonald and John Braun
    • Sections 1.4.5 Factors and 12.4 Factors - Additional Comments

Acknowledgements:

I would like to thank Richard Urbano for posing this problem.

Topic attachments:

  • file.txt (0.2 K, 12 Apr 2005 - 13:09, TheresaScott): Dummy data file
Topic revision: r1 - 12 Apr 2005, TheresaScott
 
