How can I jit a function with nopython=True if it uses pandas read_csv inside?

I am trying to speed up some computations using Numba's @njit. There is a function in my code, call it fn1(alias), which needs to be jitted. fn1 calls another function, fn2(alias). As I understand it, both functions need to be jitted.

Below is a function similar to fn2. The code is not a minimal reproducible example because it requires reading CSV files; it is only meant to convey the essence of the question.

import pandas as pd
from numba import njit

@njit
def calculate_prob(list_of_value):
  list_to_return = []
  for xx in list_of_value:
    xdf = pd.read_csv(r'../corpus/{}.csv'.format(xx))
    # do computations with xdf data and get a probability value pr.
    list_to_return.append(pr)
  return list_to_return

calculate_prob(list_of_value)

I also tried a POC (see function test_value) to see whether a NumPy ndarray is supported by Numba, but it is also not working for me. I was hoping to convert the dataframes to NumPy ndarrays in a preprocessing step and then pass them to my fn2 as an input parameter.

df = pd.DataFrame(columns = ['a','b','c'], data = [[1,'a',3],[2,'b',4]])

@njit 
def test_value(df_np):
  print(df_np[:,0])

test_value(df.to_numpy())

My question is: is it possible to read a CSV file inside a jitted function (which I think is not possible)? If not, can someone suggest what options I am left with?

Thanks.

Solution

You cannot directly manipulate pandas dataframes in a Numba function; this is stated explicitly on the first page of the documentation. There is an experimental objmode context manager that lets you execute arbitrary Python code such as pd.read_csv, but the main problem is the type of variables like xdf or list_to_return: Numba cannot type them (at least not yet). Types are mandatory for compiling a Numba function (this is what makes Numba fast compared to CPython).

The main way to fix this is to put the computational work in a separate function operating on NumPy arrays. If your dataframe contains a variable number of columns with dynamically defined heterogeneous types, then it is not possible to process the dataframe directly with a Numba function. The reading part cannot be in a Numba function unless the dataframe does not escape a well-defined region (which will not be jitted). Note that you can convert the dataframe to a NumPy array, process it with Numba, and then convert the result back to a dataframe (with some additional overhead).
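A minimal sketch of that split, under some assumptions: the kernel fraction_above_mean is a made-up stand-in for the real probability computation, and the CSV content is invented. Only the numeric columns are converted, because calling to_numpy() on a dataframe that mixes numbers and strings (as in test_value above) yields an object-dtype array, which Numba cannot type in nopython mode.

```python
import io

import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (un-jitted) where Numba is missing.
    def njit(func):
        return func


@njit
def fraction_above_mean(values):
    # Hypothetical kernel: share of first-column entries above the column mean.
    col = values[:, 0]
    mean = col.mean()
    count = 0
    for v in col:
        if v > mean:
            count += 1
    return count / col.shape[0]


def calculate_prob(csv_sources):
    # Plain Python wrapper: reading and conversion stay outside the jitted code.
    results = []
    for src in csv_sources:
        xdf = pd.read_csv(src)
        # Keep numeric columns only and force a homogeneous float64 array.
        arr = xdf.select_dtypes(include="number").to_numpy(dtype=np.float64)
        results.append(fraction_above_mean(arr))
    return results


# In-memory CSV standing in for the files under ../corpus/.
sources = [io.StringIO("a,b,c\n1,x,3\n2,y,4\n6,z,5\n")]
probs = calculate_prob(sources)
```

Here only the numeric kernel is compiled; the outer loop over files stays in regular Python, where the file reading is likely to dominate the runtime anyway.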