I'm in the process of converting my existing R code to Python as a way to teach myself, but I've run into something that I can't seem to crack.
Here's an example of the R code, which works as expected:
library(data.table)

var <- 0.08
a <- data.frame(a = runif(10, 0, 1),
                b = runif(10, 0, 1),
                c = runif(10, 0, 1),
                d = runif(10, 0, 1))
b <- data.frame(a = c(0, 4, 6, 8, 10, 12, 12, 14, 16, 18),
                b = c(2, 6, 8, 10, 12, 14, 14, 16, 18, 20),
                c = c(4, 8, 10, 12, 14, 16, 16, 18, 20, 22),
                d = c(6, 10, 12, 14, 16, 18, 18, 20, 22, 24))
output <- data.table(total = seq(0, 10))
output[total %% 2 == 0, prob := apply(output[total %% 2 == 0], 1, function(x) { sum(a[, 1:4] * (b[, 1:4] == x[1])) })]
output[total %% 2 == 1, prob := apply(output[total %% 2 == 1], 1, function(x) { sum(a[, 1:4] * (b[, 1:4] == (x[1] - 1))) * var / (1 - var) })]
and here's what I tried in Python, which is returning NaN values in the 'prob' column:
import numpy as np
import pandas as pd
var = 0.08
a = pd.DataFrame(np.random.uniform(0, 1, size=(10, 4)), columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame({'a': [0, 4, 6, 8, 10, 12, 12, 14, 16, 18],
                  'b': [2, 6, 8, 10, 12, 14, 14, 16, 18, 20],
                  'c': [4, 8, 10, 12, 14, 16, 16, 18, 20, 22],
                  'd': [6, 10, 12, 14, 16, 18, 18, 20, 22, 24]})
output = pd.DataFrame({'total': range(0, 11)})
output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x[0])), axis=1)
output.loc[output['total'] % 2 == 1, 'prob'] = output[output['total'] % 2 == 1].apply(lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == (x[0] - 1))) * var / (1 - var), axis=1)
Any help would be appreciated!
Thanks
Unfortunately, this is one of the things you will need to learn when migrating R code to Python. In R you know that calling sum on a data.frame will sum every element; this is not the case with pandas. For example, see this question.
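As a quick illustration of the difference (a minimal standalone example, not taken from your data):

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
print(df.sum())        # a Series of per-column totals: x -> 3, y -> 7
print(df.sum().sum())  # 10, the grand total that R's sum(df) would give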
By default, when you call sum on a DataFrame it sums each column separately and returns a Series, not a single value. So inside apply you end up producing a Series for every row when in fact you were expecting a single scalar; those Series cannot be aligned with the lone prob column, which is why it fills up with NaN. You can test this if you print each iteration:
output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(
    lambda x: print(np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x[0]))),
    axis=1
)
You will see a bunch of Series. The solution to your problem is to sum the values of each of those Series once more, or to convert the DataFrame into a numpy.array first.
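If you prefer the numpy.array route, a minimal sketch (using DataFrame.to_numpy(), available since pandas 0.24; .values does the same on older versions, and the total of 4 is just an arbitrary example) would be:

masked = a.iloc[:, 0:4].to_numpy() * (b.iloc[:, 0:4] == 4).to_numpy()
masked.sum()  # a single float, since numpy sums over every element by default

Below is the full corrected version, which keeps your structure and simply adds the extra .sum():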
import numpy as np
import pandas as pd

var = 0.08
a = pd.DataFrame(np.random.uniform(0, 1, size=(10, 4)), columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame({'a': [0, 4, 6, 8, 10, 12, 12, 14, 16, 18],
                  'b': [2, 6, 8, 10, 12, 14, 14, 16, 18, 20],
                  'c': [4, 8, 10, 12, 14, 16, 16, 18, 20, 22],
                  'd': [6, 10, 12, 14, 16, 18, 18, 20, 22, 24]})
output = pd.DataFrame({'total': range(0, 11)})

# even totals: the extra .sum() collapses the per-column Series into one scalar
output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(
    lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x[0])).sum(),
    axis=1
)

# odd totals: same idea, matching against total - 1 and scaling by var / (1 - var)
output.loc[output['total'] % 2 == 1, 'prob'] = output[output['total'] % 2 == 1].apply(
    lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == (x[0] - 1))).sum() * var / (1 - var),
    axis=1
)
output
|    | total | prob      |
|----|-------|-----------|
| 0  | 0     | 0.503596  |
| 1  | 1     | 0.0437909 |
| 2  | 2     | 0.20748   |
| 3  | 3     | 0.0180417 |
| 4  | 4     | 0.666049  |
| 5  | 5     | 0.0579173 |
| 6  | 6     | 1.35971   |
| 7  | 7     | 0.118235  |
| 8  | 8     | 1.33156   |
| 9  | 9     | 0.115787  |
| 10 | 10    | 2.5496    |
Which I guess is what you want. You should definitely provide the desired output in future questions, as that makes it much easier to help.