I am trying to speed up some computations using Numba's @njit. There is a function in my code, call it fn1 (alias), which needs to be jitted. fn1 calls another function, fn2 (alias). In my understanding, both functions need to be jitted.
Below is a function similar to fn2. It is not a minimal reproducible example, since it requires reading CSV files; it is only meant to convey the essence of the question.
import pandas as pd
from numba import njit

@njit
def calculate_prob(list_of_value):
    list_to_return = []
    for xx in list_of_value:
        xdf = pd.read_csv(r'../corpus/{}.csv'.format(xx))
        # do computations with xdf data and get a probability value pr.
        list_to_return.append(pr)
    return list_to_return

calculate_prob(list_of_value)
I also tried a proof of concept (see the function test_value below) to check whether NumPy ndarrays are supported by Numba, but that did not work for me either. I was hoping to convert the dataframes to NumPy ndarrays in a preprocessing step and then pass them to fn2 as input parameters.
df = pd.DataFrame(columns=['a', 'b', 'c'], data=[[1, 'a', 3], [2, 'b', 4]])

@njit
def test_value(df_np):
    print(df_np[:, 0])

test_value(df.to_numpy())
My question is: is it possible to read a CSV file inside a jitted function (which I think is not possible)? If not, what options am I left with?
Thanks.
You cannot directly manipulate Pandas dataframes in a Numba function. This is explicitly stated on the first page of the documentation. There is an experimental objmode context manager that can execute arbitrary Python code such as pd.read_csv, but the main problem is the type of variables like xdf or list_to_return: Numba cannot type them (at least not yet). Types are mandatory to compile a Numba function (this is what makes Numba fast compared to CPython).
The main way to fix this is to put the computational work in a separate function operating on NumPy arrays. If your dataframe contains a variable number of columns with dynamically defined heterogeneous types, then it is not possible to process the dataframe directly in a Numba function. The reading part cannot be in a Numba function unless the dataframe does not escape a well-defined region (that will not be jitted). Note that you can convert the dataframe to a NumPy array, process it with Numba, and then convert the result back to a dataframe (with some additional overhead).
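A rough sketch of that split, under a few assumptions: each CSV has a numeric column named x (a made-up name), the real probability computation is replaced by a hypothetical stand-in called fraction_positive, and in-memory strings stand in for the files. The try/except around the import is only there so the sketch still runs where Numba is not installed.

```python
import io

import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to a no-op decorator when Numba is missing.
    def njit(func):
        return func


@njit
def fraction_positive(values):
    # Hypothetical stand-in for the real probability computation.
    # It only touches a 1-D float64 array, which Numba can type.
    count = 0
    for i in range(values.shape[0]):
        if values[i] > 0.0:
            count += 1
    return count / values.shape[0]


def calculate_prob(csv_sources):
    # Not jitted: reading files and handling dataframes stays in Python.
    probs = []
    for src in csv_sources:
        xdf = pd.read_csv(src)
        values = xdf['x'].to_numpy(dtype=np.float64)
        probs.append(fraction_positive(values))
    return probs


# In-memory CSVs used in place of paths on disk.
sources = [io.StringIO("x\n1\n-2\n3\n4\n"), io.StringIO("x\n-1\n-2\n")]
probs = calculate_prob(sources)
```

Selecting numeric columns before calling to_numpy matters: a dataframe that mixes strings and numbers, like the one in the question, converts to an array of dtype object, which Numba cannot compile against.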