The documentation for factors gives this code as its first example of constructing a factor variable:
(ff <- factor(substring("statistics", 1:10, 1:10), levels = letters))
Said documentation suggests the following:
To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).
But when I try those on their example, I get nonsense:
> (ff <- factor(substring("statistics", 1:10, 1:10), levels = letters))
[1] s t a t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> ff
[1] s t a t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> as.numeric(levels(ff))[ff]
[1] NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion
> as.numeric(as.character(ff))
[1] NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion
Where is my misunderstanding? I see nothing abnormal about the ff factor variable. It definitely has underlying numbers:
> as.integer(ff)
[1] 19 20 1 20 9 19 20 9 3 19
and although its levels are characters, I don't see anything strange about that either - factor variables always have levels that are characters.
Once you have created ff, run table(ff). It reports the frequency of every letter in the levels, including letters that never occur in the data, which get a count of 0.
levels(ff) returns all 26 letters as character strings. None of them can be parsed as a number, so as.numeric(levels(ff)) returns NA for every level. The same goes for as.numeric(as.character(ff)).
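As a minimal sketch of why both idioms produce NA here, the snippet below recreates ff from the question and coerces its levels. The variable names lev_num and res are mine, and suppressWarnings() is only used to keep the coercion warning out of the way.

```r
# Recreate the factor from the question.
ff <- factor(substring("statistics", 1:10, 1:10), levels = letters)

# The levels are letters, so coercing them to numeric gives NA for each one.
lev_num <- suppressWarnings(as.numeric(levels(ff)))
all(is.na(lev_num))   # TRUE: none of the 26 letters parses as a number

# Indexing that all-NA vector by the factor uses the integer codes,
# so the result is one NA per element of ff.
res <- lev_num[ff]
length(res)           # 10
all(is.na(res))       # TRUE
```

The idiom itself works as documented; it only returns meaningful numbers when the levels are strings that look like numbers.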
My guess is that you may be confusing labels with levels. If you run labels(ff), you get the numbers 1 to 10 as quoted strings, and as.numeric(labels(ff)) converts them to the 10 numbers 1 through 10. Note that these are only the positions of the elements, not the factor's underlying integer codes, which as.integer(ff) returns.
I hope this clears up the confusion. Please let me know otherwise.
Output:
R>table(ff)
ff
a b c d e f g h i j k l m n o p q r
1 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
s t u v w x y z
3 3 0 0 0 0 0 0
R>levels(ff)
[1] "a" "b" "c" "d" "e" "f" "g" "h"
[9] "i" "j" "k" "l" "m" "n" "o" "p"
[17] "q" "r" "s" "t" "u" "v" "w" "x"
[25] "y" "z"
R>labels(ff)
[1] "1" "2" "3" "4" "5" "6"
[7] "7" "8" "9" "10"
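To make the difference between labels(ff) and the underlying codes concrete, here is a small sketch that compares the two on the same factor. The names positions and codes are only illustrative.

```r
ff <- factor(substring("statistics", 1:10, 1:10), levels = letters)

# labels() on an unnamed vector gives the element positions as strings.
positions <- as.numeric(labels(ff))

# as.integer() gives, for each element, the index of its level in levels(ff).
codes <- as.integer(ff)

positions   # 1 2 3 4 5 6 7 8 9 10
codes       # 19 20 1 20 9 19 20 9 3 19

# The codes map back to the original letters through the levels.
identical(levels(ff)[codes], as.character(ff))   # TRUE
```

So the positions say nothing about which letter each element holds, while the codes do, but only in combination with levels(ff).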
EDIT:
Okay, it seems the OP's issue is with this passage in the documentation:
The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).
The passage above says that if you have a factor that was created from numbers, you should not convert it back with as.numeric() directly. For example:
nums <- c(1, 2, 3, 10)
new_fact <- as.factor(nums)
Now, if we try to get the numbers back from new_fact by running as.numeric(new_fact), we get 1 2 3 4, which is wrong: those are the internal integer codes, not the original values. All the documentation is saying is that to recover the original numbers, one has to use as.numeric(as.character(new_fact)) or as.numeric(levels(new_fact))[new_fact], both of which return 1 2 3 10. I hope this helps.
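Putting the three conversions side by side, as a short sketch using the same nums example (the names direct, via_char and via_levels are mine):

```r
nums <- c(1, 2, 3, 10)
new_fact <- as.factor(nums)

# Direct coercion returns the internal integer codes, not the values.
direct <- as.numeric(new_fact)

# Both documented idioms go through the levels and recover the values.
via_char   <- as.numeric(as.character(new_fact))
via_levels <- as.numeric(levels(new_fact))[new_fact]

direct       # 1 2 3 4
via_char     # 1 2 3 10
via_levels   # 1 2 3 10
```

The via_levels form converts each distinct level once and then indexes by the codes, which is presumably why the documentation calls it slightly more efficient.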