Converting R apply function to Python


I'm in the process of converting my existing R code to Python as a way to teach myself, but I've run into something that I can't seem to crack.

Here's an example of the R code, which works as expected:

library(data.table)

var <- 0.08

a <- data.frame(a = runif(10, 0, 1), 
                b = runif(10, 0, 1), 
                c = runif(10, 0, 1), 
                d = runif(10, 0, 1))

b <- data.frame(a = c(0,4,6,8,10,12,12,14,16,18), 
                b = c(2,6,8,10,12,14,14,16,18,20), 
                c = c(4,8,10,12,14,16,16,18,20,22),
                d = c(6,10,12,14,16,18,18,20,22,24))

output <- data.table(total = seq(0, 10))

output[total%%2==0, prob:= apply(output[total%%2==0], 1, function(x) { sum(a[, 1:4] * (b[, 1:4]==x[1]))})]
output[total%%2==1, prob:= apply(output[total%%2==1], 1, function(x) { sum(a[, 1:4] * (b[, 1:4]==(x[1]-1))) * var/(1-var)})]

And here's what I tried in Python, which returns NaN values in the 'prob' column:

import numpy as np
import pandas as pd

var = 0.08

a = pd.DataFrame(np.random.uniform(0, 1, size=(10, 4)), columns=['a', 'b', 'c', 'd'])

b = pd.DataFrame({'a': [0, 4, 6, 8, 10, 12, 12, 14, 16, 18],
                  'b': [2, 6, 8, 10, 12, 14, 14, 16, 18, 20],
                  'c': [4, 8, 10, 12, 14, 16, 16, 18, 20, 22],
                  'd': [6, 10, 12, 14, 16, 18, 18, 20, 22, 24]})

output = pd.DataFrame({'total': range(0, 11)})

output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x[0])), axis=1)
output.loc[output['total'] % 2 == 1, 'prob'] = output[output['total'] % 2 == 1].apply(lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == (x[0] - 1))) * var / (1 - var), axis=1)

Any help would be appreciated!

Thanks


Solution

  • Unfortunately, this is one of the things you have to learn when migrating R code to Python. In R, calling sum on a data.frame adds up every element; that is not the case with pandas. For example, see this question.
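
To make the difference concrete, here is a minimal sketch on a small, made-up frame (the name `df` and its values are illustrative only, not part of the question's data). It shows that `DataFrame.sum()` with its default axis returns one total per column, and two ways to get the single grand total that R's `sum` would give.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame, used only to illustrate the reduction semantics.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 20.0]})

# With the default axis, DataFrame.sum() reduces each column separately,
# so the result is a Series with one entry per column.
per_column = df.sum()

# Two ways to get one grand total over every element.
grand_total_twice = df.sum().sum()       # sum the per-column totals again
grand_total_numpy = df.to_numpy().sum()  # reduce the underlying ndarray

print(per_column)
print(grand_total_twice, grand_total_numpy)
```
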

    By default, calling sum on a DataFrame adds up each column separately and returns one total per column, not a single total over all values. So each call of your lambda inside apply produces a Series where you were expecting a single value. You can check this by printing the result of each iteration.

    output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(
        lambda x: print(np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x[0]))),
        axis = 1
    )
    

    You will see a bunch of Series printed. The fix is either to sum the values of each Series again or to convert the DataFrame to a NumPy array before summing.

    import numpy as np
    import pandas as pd
    
    var = 0.08
    
    a = pd.DataFrame(np.random.uniform(0, 1, size=(10, 4)), columns=['a', 'b', 'c', 'd'])
    
    b = pd.DataFrame({'a': [0, 4, 6, 8, 10, 12, 12, 14, 16, 18],
                      'b': [2, 6, 8, 10, 12, 14, 14, 16, 18, 20],
                      'c': [4, 8, 10, 12, 14, 16, 16, 18, 20, 22],
                      'd': [6, 10, 12, 14, 16, 18, 18, 20, 22, 24]})
    
    output = pd.DataFrame({'total': range(0, 11)})
    
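    # Even totals: add up every element of a where b equals the total.
    # The trailing .sum() collapses the per-column Series into one number.
    # x['total'] selects by label; x[0] relies on a positional fallback
    # that newer pandas versions deprecate.
    output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(
        lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x['total'])).sum(),
        axis = 1
    )
    
    # Odd totals: the same sum for total - 1, scaled by var / (1 - var).
    output.loc[output['total'] % 2 == 1, 'prob'] = output[output['total'] % 2 == 1].apply(
        lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == (x['total'] - 1))).sum() * var / (1 - var),
        axis = 1
    )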
    
    output
    
        total       prob
     0      0   0.503596
     1      1  0.0437909
     2      2    0.20748
     3      3  0.0180417
     4      4   0.666049
     5      5  0.0579173
     6      6    1.35971
     7      7   0.118235
     8      8    1.33156
     9      9   0.115787
    10     10     2.5496

    I assume this is what you want. In future questions, please include the desired output, as it makes it much easier to help.
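
The answer also mentions converting to a NumPy array but only shows the "sum again" variant. Below is a rough sketch of the array route that drops `apply` altogether. It assumes `a` and `b` have the same shape, index and columns, so that matching entries by position gives the same result as the pandas label alignment. The helper name `compute_prob` and the fixed random seed are my own additions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

def compute_prob(a, b, totals, var):
    """Hypothetical helper: vectorised version of the two apply calls."""
    a_vals = a.to_numpy().ravel()   # weights, flattened to 1-D
    b_vals = b.to_numpy().ravel()   # values to match, same order as a_vals
    totals = np.asarray(totals)
    # Even totals are matched as-is, odd totals against total - 1.
    even_base = totals - totals % 2
    # mask[i, j] is True where the j-th entry of b equals even_base[i].
    mask = even_base[:, None] == b_vals[None, :]
    mass = (mask * a_vals).sum(axis=1)
    # Odd totals are additionally scaled by var / (1 - var).
    scale = np.where(totals % 2 == 1, var / (1 - var), 1.0)
    return mass * scale

var = 0.08
rng = np.random.default_rng(0)  # seeded only so that runs are repeatable
a = pd.DataFrame(rng.uniform(0, 1, size=(10, 4)), columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame({'a': [0, 4, 6, 8, 10, 12, 12, 14, 16, 18],
                  'b': [2, 6, 8, 10, 12, 14, 14, 16, 18, 20],
                  'c': [4, 8, 10, 12, 14, 16, 16, 18, 20, 22],
                  'd': [6, 10, 12, 14, 16, 18, 18, 20, 22, 24]})

output = pd.DataFrame({'total': range(0, 11)})
output['prob'] = compute_prob(a, b, output['total'], var)
print(output)
```

Because the mask has one row per total and one column per element of `b`, memory grows with the product of the two sizes. That is negligible here, but for very large inputs the `apply` version or a groupby-based sum may be the better choice.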