Coding factors with numerical levels instead of character ones

Problem:

Often, the categorical variables in a data set we have read in have levels denoted by character strings (e.g. "No" and "Yes"). Sometimes we want these same categorical variables to have numerically denoted levels (e.g. 0 and 1).

Data specifics:

  • The data file you have read in contains categorical variables whose levels are denoted with character values, not numerical ones.

"Solution":

First let's read in a dummy data file to illustrate the problem ("file.txt" is attached at the bottom of this page):

# Read in data file with column assigned No/Yes
# (stringsAsFactors = TRUE is needed in R >= 4.0.0, where the default is FALSE)
x <- read.table("file.txt", header = TRUE, stringsAsFactors = TRUE)
x
#    id weight smoker
# 1   1    120     No
# 2   2    125    Yes
# 3   3    130    Yes
# 4   4    135     No
# 5   5    140     No
# 6   6    145    Yes
# 7   7    150    Yes
# 8   8    155     No
# 9   9    160     No
# 10 10    165    Yes
# 11 11    170    Yes
# 12 12    175     No
# 13 13    180     No
# 14 14    185    Yes
# 15 15    190    Yes

If we look at the class of the smoker column, we see that it is a "factor".

class(x$smoker)
# [1] "factor"

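When a vector of character strings is included as a column of a data frame, versions of R before 4.0.0 turn the vector into a factor by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the column stays a character vector unless stringsAsFactors = TRUE is passed to read.table() or data.frame().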
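If file.txt is not at hand, a data frame with the same contents can be built inline. This is only a sketch: the values are copied from the printed data frame above, and stringsAsFactors = TRUE is passed explicitly so that the result does not depend on the default of the installed R version.

```r
# Rebuild the dummy data by hand (values copied from the printed data frame)
x <- data.frame(
  id     = 1:15,
  weight = seq(120, 190, by = 5),
  smoker = rep(c("No", "Yes", "Yes", "No"), length.out = 15),
  stringsAsFactors = TRUE  # request the character-to-factor conversion explicitly
)

class(x$smoker)   # "factor"
levels(x$smoker)  # "No" "Yes"
```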

There are a few ways we can extract the codes 1, 2, ... from the categorical variable.

The easiest way to extract the codes 1, 2, ... is to use the as.numeric() function:

as.numeric(x$smoker)
# [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2

The unclass() function returns a copy of its argument with the class attribute removed. So, if we "unclass" the smoker column, we get the same codes as from the as.numeric() function, along with the levels attribute:

unclass(x$smoker)
# [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
# attr(,"levels")
# [1] "No"  "Yes"

As seen, factors have an attribute levels which holds the level names.
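Because the codes are positions in the levels attribute, the order of the levels decides which label gets which code. The sketch below uses a small made-up vector (not the data file) to show that the default order is alphabetical and that the levels argument of factor() can set a different one.

```r
# Small illustrative vector; not taken from file.txt
smoker <- c("No", "Yes", "Yes", "No")

# Default: levels are sorted, so "No" is code 1 and "Yes" is code 2
f_default <- factor(smoker)
levels(f_default)      # "No" "Yes"
as.integer(f_default)  # 1 2 2 1

# Explicit level order: "Yes" becomes code 1 and "No" becomes code 2
f_custom <- factor(smoker, levels = c("Yes", "No"))
levels(f_custom)       # "Yes" "No"
as.integer(f_custom)   # 2 1 1 2
```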

We can manipulate this as.numeric() or unclass() output in order to extract the numeric codes we want, such as 0, 1, ... instead of 1, 2, ...:

as.numeric(x$smoker) - 1
# [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

unclass(x$smoker) - 1
# [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
# attr(,"levels")
# [1] "No"  "Yes"
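For a two-level factor, an alternative that may be worth considering is to compare against the label directly, which gives 0/1 regardless of how the levels happen to be ordered. This is a sketch on a made-up vector, and the name smoker01 is only illustrative.

```r
# Small illustrative factor; not taken from file.txt
smoker <- factor(c("No", "Yes", "Yes", "No", "No"))

# TRUE/FALSE from the comparison is converted to 1/0
smoker01 <- as.integer(smoker == "Yes")
smoker01  # 0 1 1 0 0
```

Subtracting 1 from the codes gives the same result here only because "No" is the first level.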

So, if we wanted to, we could define a new column in our data frame which contains the desired extracted numerical codes of our factor variable:

# Define a new column which is assigned 0/1
x$smoker2 <- as.numeric(x$smoker) - 1
x
#    id weight smoker smoker2
# 1   1    120     No       0
# 2   2    125    Yes       1
# 3   3    130    Yes       1
# 4   4    135     No       0
# 5   5    140     No       0
# 6   6    145    Yes       1
# 7   7    150    Yes       1
# 8   8    155     No       0
# 9   9    160     No       0
# 10 10    165    Yes       1
# 11 11    170    Yes       1
# 12 12    175     No       0
# 13 13    180     No       0
# 14 14    185    Yes       1
# 15 15    190    Yes       1
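A quick way to check such a recoding is to cross-tabulate the original factor against the new column: every "No" should fall under 0 and every "Yes" under 1. The sketch below uses a small made-up data frame rather than file.txt.

```r
# Small illustrative data frame; not taken from file.txt
x <- data.frame(smoker = factor(c("No", "Yes", "Yes", "No", "No")))
x$smoker2 <- as.numeric(x$smoker) - 1

# Rows are the original labels, columns are the new numeric codes
tab <- table(x$smoker, x$smoker2)
tab
```

Non-zero counts off the diagonal would indicate that a label was mapped to an unexpected code.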

For more information on factors and their potentially surprising characteristics, see the following book:
  • Data Analysis and Graphics Using R by John Maindonald and John Braun
    • Sections 1.4.5 Factors and 12.4 Factors - Additional Comments
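One such surprise bears directly on the as.numeric() approach used on this page: when the level labels themselves look like numbers, as.numeric() still returns the internal codes, not the labelled values. A minimal sketch on a made-up factor:

```r
# Factor whose labels look like numbers (illustrative values)
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                # internal codes: 1 2 1 3
as.numeric(as.character(f))  # labelled values: 10 20 10 30
as.numeric(levels(f))[f]     # same values, converting each level only once
```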

Acknowledgements:

I would like to thank Richard Urbano for posing this problem.
Topic attachments:

  • file.txt (0.2 K, 12 Apr 2005 - 13:09, TheresaScott): Dummy data file
Topic revision: r2 - 15 Nov 2006, TheresaScott
 
