NumericFactorLevels (15 Nov 2006, TheresaScott)
---+ Coding factors with numerical levels instead of character ones

---++ Problem:

Often the categorical variables in a data set we have read in have levels denoted by character strings (e.g., ="No"= and ="Yes"=). Sometimes we want these same categorical variables to have numerically denoted levels instead (e.g., =0= and =1=).

---++ Data specifics:

   * The data file you have read in contains categorical variables whose levels are denoted with character values, not numerical ones.

---++ "Solution":

First let's read in a dummy data file to illustrate the problem (="file.txt"= is attached at the bottom of this page):

<highlight>
# Read in a data file whose smoker column is coded No/Yes.
# stringsAsFactors = TRUE is required in R >= 4.0.0; it was the default before that.
x <- read.table("file.txt", header = TRUE, stringsAsFactors = TRUE)
x
#    id weight smoker
# 1   1    120     No
# 2   2    125    Yes
# 3   3    130    Yes
# 4   4    135     No
# 5   5    140     No
# 6   6    145    Yes
# 7   7    150    Yes
# 8   8    155     No
# 9   9    160     No
# 10 10    165    Yes
# 11 11    170    Yes
# 12 12    175     No
# 13 13    180     No
# 14 14    185    Yes
# 15 15    190    Yes
</highlight>

If we look at the _class_ of the =smoker= column, we see that it is a ="factor"=.

<highlight>
class(x$smoker)
# [1] "factor"
</highlight>

In versions of =R= before 4.0.0, a vector of character strings included as a column of a data frame was turned into a factor by default. From =R= 4.0.0 onward the column stays a character vector unless you ask for the conversion with =stringsAsFactors = TRUE=, as done above.

There are a few ways we can extract the codes 1, 2, ... from the categorical variable. The easiest way is to use the =as.numeric()= function:

<highlight>
as.numeric(x$smoker)
# [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
</highlight>

The =unclass()= function will temporarily remove the effects of a class. So, if we "unclass" the =smoker= column we get the same codes as from =as.numeric()=, with the =levels= attribute still attached:

<highlight>
unclass(x$smoker)
# [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
# attr(,"levels")
# [1] "No"  "Yes"
</highlight>

As seen, factors have an _attribute_ =levels= which holds the level names. By default the levels are sorted alphabetically, so ="No"= is code 1 and ="Yes"= is code 2. We can manipulate the =as.numeric()= or =unclass()= output in order to extract the numeric codes we want, such as 0, 1, ... instead of 1, 2, ...:

<highlight>
as.numeric(x$smoker) - 1
# [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

unclass(x$smoker) - 1
# [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
# attr(,"levels")
# [1] "No"  "Yes"
</highlight>

So, if we wanted to, we could define a new column in our data frame that contains the extracted numerical codes of our factor variable:

<highlight>
# Define a new column coded 0/1 (0 = No, 1 = Yes)
x$smoker2 <- as.numeric(x$smoker) - 1
x
#    id weight smoker smoker2
# 1   1    120     No       0
# 2   2    125    Yes       1
# 3   3    130    Yes       1
# 4   4    135     No       0
# 5   5    140     No       0
# 6   6    145    Yes       1
# 7   7    150    Yes       1
# 8   8    155     No       0
# 9   9    160     No       0
# 10 10    165    Yes       1
# 11 11    170    Yes       1
# 12 12    175     No       0
# 13 13    180     No       0
# 14 14    185    Yes       1
# 15 15    190    Yes       1
</highlight>

For more information on factors and their potentially surprising characteristics, see the following book:

   * _Data Analysis and Graphics Using R_ by John Maindonald and John Braun
      * Sections _1.4.5 Factors_ and _12.4 Factors - Additional Comments_

---++ Acknowledgements:

I would like to thank Richard Urbano for posing this problem.
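
---++ A note on level order:

The 0/1 recoding in the solution above works because ="No"= happens to sort before ="Yes"=. The following is a minimal sketch, not part of the original solution, of two ways to make that mapping explicit. The small data frame is a hypothetical inline stand-in for the first rows of =file.txt=, and the column name =smoker3= is introduced only for illustration.

```r
# Hypothetical stand-in for the first six rows of file.txt, built inline
# so the sketch runs without the attachment.
x <- data.frame(
  id     = 1:6,
  weight = seq(120, 145, by = 5),
  smoker = c("No", "Yes", "Yes", "No", "No", "Yes")
)

# State the level order explicitly instead of relying on alphabetical sorting.
x$smoker <- factor(x$smoker, levels = c("No", "Yes"))

# Option 1: integer codes minus 1; the level order is now explicit.
x$smoker2 <- as.integer(x$smoker) - 1L

# Option 2: compare against the label directly; independent of level order.
x$smoker3 <- as.integer(x$smoker == "Yes")

x[, c("smoker", "smoker2", "smoker3")]
```

Both columns should agree row by row. The second form may be preferable when the labels do not sort in the desired order, since it never looks at the underlying codes.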
Topic attachments

| *Attachment* | *Size* | *Date* | *Who* | *Comment* |
| file.txt | 0.2 K | 12 Apr 2005 - 13:09 | TheresaScott | Dummy data file |