I frequently need to normalize columns of DataFrames that have a mixture of numeric and non-numeric columns. Sometimes I know the names of the numeric columns, other times not.
I have tried what seem to me to be very logical tidy eval methods. Most don't work. I have only found one that does.
Towards better understanding tidy evaluation, could I please have an explanation of why the following either work or don't work?
library(tidyverse)
df = data.frame(
A=runif(10, 1, 10),
B=runif(10, 1, 10),
C=rep(0, 10),
D=LETTERS[1:10]
)
df
#> A B C D
#> 1 2.157171 1.434351 0 A
#> 2 7.746638 6.987983 0 B
#> 3 7.861337 1.528145 0 C
#> 4 8.657990 4.101441 0 D
#> 5 8.307844 5.809815 0 E
#> 6 1.376084 9.202047 0 F
#> 7 7.197999 5.532681 0 G
#> 8 1.878676 1.012917 0 H
#> 9 2.231955 4.572273 0 I
#> 10 4.340488 2.640728 0 J
print("Does normalize columns, but can't handle col of 0s")
#> [1] "Does normalize columns, but can't handle col of 0s"
test = df %>% mutate_if(is.numeric, ~./sum(.))
test %>% select_if(is.numeric) %>% colSums()
#> A B C
#> 1 1 NaN
print("Virtually the same as above, but tries to handle col of 0s, but doesn't work")
#> [1] "Virtually the same as above, but tries to handle col of 0s, but doesn't work"
test = df %>% mutate_if(is.numeric, ~ifelse(sum(.)>0, ./sum(.), 0))
test %>% select_if(is.numeric) %>% colSums()
#> A B C
#> 0.4167949 0.3349536 0.0000000
print("Does normalize columns, but can't handle col of 0s")
#> [1] "Does normalize columns, but can't handle col of 0s"
test = df %>% mutate_if(is.numeric, function(x) x/sum(x))
test %>% select_if(is.numeric) %>% colSums()
#> A B C
#> 1 1 NaN
print("Virtually the same as above, but tries to handle col of 0s, but doesn't work")
#> [1] "Virtually the same as above, but tries to handle col of 0s, but doesn't work"
test = df %>% mutate_if(is.numeric, function(x) ifelse(sum(x)>0, x/sum(x), 0))
test %>% select_if(is.numeric) %>% colSums()
#> A B C
#> 0.4167949 0.3349536 0.0000000
print("Strange error I don't understand")
#> [1] "Strange error I don't understand"
test = df %>% mutate_if(is.numeric, ~apply(., 2, function(x) x/sum(x)))
#> Error in apply(., 2, function(x) x/sum(x)): dim(X) must have a positive length
print("THIS DOES WORK! Why?")
#> [1] "THIS DOES WORK! Why?"
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0) x/sum(x))
test %>% select_if(is.numeric) %>% colSums()
#> A B
#> 1 1
Created on 2019-10-29 by the reprex package (v0.3.0)
Ack! Just noticed a huge problem In the last example, that "works", the column of 0s is dropped. I do not understand this at all. I want to keep that column, just not try to normalize it.
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0) x/sum(x))
> test
# A B D
# 1 0.15571120 0.12033237 A
# 2 0.10561824 0.11198394 B
# 3 0.06041408 0.12068372 C
# 4 0.16785724 0.06241538 D
# 5 0.03112945 0.02559354 E
# 6 0.02791520 0.06363215 F
# 7 0.17132200 0.16625761 G
# 8 0.06641540 0.14038458 H
# 9 0.04015548 0.12420858 I
# 10 0.17346171 0.06450813 J
Figured out I need to include else
.
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0) {x/sum(x)}else{0})
> test
# A B C D
# 1 0.15571120 0.12033237 0 A
# 2 0.10561824 0.11198394 0 B
# 3 0.06041408 0.12068372 0 C
# 4 0.16785724 0.06241538 0 D
# 5 0.03112945 0.02559354 0 E
# 6 0.02791520 0.06363215 0 F
# 7 0.17132200 0.16625761 0 G
# 8 0.06641540 0.14038458 0 H
# 9 0.04015548 0.12420858 0 I
# 10 0.17346171 0.06450813 0 J
numeric_columns =
df %>%
select_if(is.numeric) %>%
colnames()
test = df %>% mutate_at(numeric_columns, function(x) if (sum(x) > 0) x/sum(x))
> test
# A B C D
# 1 0.15571120 0.12033237 0 A
# 2 0.10561824 0.11198394 0 B
# 3 0.06041408 0.12068372 0 C
# 4 0.16785724 0.06241538 0 D
# 5 0.03112945 0.02559354 0 E
# 6 0.02791520 0.06363215 0 F
# 7 0.17132200 0.16625761 0 G
# 8 0.06641540 0.14038458 0 H
# 9 0.04015548 0.12420858 0 I
# 10 0.17346171 0.06450813 0 J
test = df %>% mutate_if(is.numeric, ~./sum(.))
test %>% select_if(is.numeric) %>% colSums( ,na.rm = T)
test = df %>% mutate_if(is.numeric, function(x) x/sum(x))
test %>% select_if(is.numeric) %>% colSums()
You can handle your problem specifying na.rm = T
such that you don't keep NA
.
They occur because you divide by 0.
It is the same thing for the second syntax which does the same. mutate_if
apply for each numeric column the desired operation so for the third one it returns Nan because of 0.
test = df %>% mutate_if(is.numeric, function(x){ifelse(x > 0, x/sum(x), rep(0, length(x)))})
test %>% select_if(is.numeric) %>% colSums()
test = df %>% mutate_if(is.numeric, function(x) ifelse(sum(x)>0, x/sum(x), 0))
test %>% select_if(is.numeric) %>% colSums()
ifelse returns a value with the same shape as test so in your case because you check 'sum(x) > 0' you return only the first value. See :
https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/ifelse
test = df %>% mutate_if(is.numeric, ~apply(., 2, function(x) x/sum(x)))
Here, it is tricky, mutate_if apply by vector and you want to use apply next but your object is a vector and apply is correct only for object like matrix
or data.frame
with at least two columns.
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0) x/sum(x))
test %>% select_if(is.numeric) %>% colSums()
Indeed it is a right syntax because if
doesn't require to return a specific size of object.
However you could also use ifelse
but with a vector condition indeed a sum of positive value isn't nul if at least one element is different from 0.
test = df %>% mutate_if(is.numeric, function(x){ifelse(x > 0, x/sum(x), rep(0, length(x)))})
test %>% select_if(is.numeric) %>% colSums()
I hope it helps you to understand what is going on when error appears. The solution isn't unique.
The reason is : you return something only if your sum is stricly greater than 0. You must specify what to do if not. Like this for instance :
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0){x/sum(x)}else{0})