Search code examples
pythonpandasdataframevectorization

How to correctly vectorize Pandas column from lookup


I am not a Python whiz so forgive me if this is a stupid question.

I have imported a csv which has been imported into pandas, called pw and looks like this:


                 datetime  outlet  inlet  flow
0     2023-04-12 00:00:00    38.5   48.5   3.3
1     2023-04-11 23:55:00    38.5   48.0   3.0
2     2023-04-11 23:50:00    38.5   48.5   3.6
3     2023-04-11 23:45:00    38.5   48.0   3.3
4     2023-04-11 23:40:00    38.5   48.0   3.3
...                   ...     ...    ...   ...

I have made some conditional probability lookup tables:

pgf = pw.groupby('flow')['outlet'].value_counts()/pw.groupby('flow')['outlet'].count()
fgp = pw.groupby('outlet')['flow'].value_counts()/pw.groupby('outlet')['flow'].count()

which can be accessed like this:

pgf[some flow value][some pressure value]
fgp[some pressure value][some flow value]

i want to create a pgf and fgp column in pw. i thought i should be able to write:

pw['p_p'] = pgf[pw['flow']][pw['outlet']]

which i thought would take the flow and outlet value from each row to calculate the probability value and place it in the 'p_p' column, however when i run this i get the following message:

KeyError: '[38.5 38.5 38.5 ... 47.5 47.5 47.5] not in index'

but I do not understand how this can be so when the pgf was created using the data of pw

pgf is <class 'pandas.core.series.Series'>


i am able to get around this issue using:

for i in range(len(pw)):
  flow = pw['flow'].loc[i]
  pres = pw['outlet'].loc[i]

however i feel that this must be the slowest way to achieve this


Solution

  • The problem you're facing is related to the way indexing works with multi-index Pandas series. Using the for loop as you've done is definitely not the most efficient way, especially when working with large datasets. Instead, you can use the .apply() method with a lambda function to access the multi-indexed series.

    pw['p_p'] = pw.apply(lambda row: pgf.get((row['flow'], row['outlet']), 0), axis=1)
    pw['f_p'] = pw.apply(lambda row: fgp.get((row['outlet'], row['flow']), 0), axis=1)
    

    I used the apply method on the dataframe pw along axis=1 (which means you are applying the function row-wise). For each row, I used a lambda function to look up the values in the pgf or fgp series. Then I used the get method of the series to handle the possibility of a key not being present (this will return 0 in that case, but you can modify it as you see fit).