I am not a Python whiz so forgive me if this is a stupid question.
I have imported a csv which has been imported into pandas, called pw
and looks like this:
datetime outlet inlet flow
0 2023-04-12 00:00:00 38.5 48.5 3.3
1 2023-04-11 23:55:00 38.5 48.0 3.0
2 2023-04-11 23:50:00 38.5 48.5 3.6
3 2023-04-11 23:45:00 38.5 48.0 3.3
4 2023-04-11 23:40:00 38.5 48.0 3.3
... ... ... ... ...
I have made some conditional probability lookup tables:
pgf = pw.groupby('flow')['outlet'].value_counts()/pw.groupby('flow')['outlet'].count()
fgp = pw.groupby('outlet')['flow'].value_counts()/pw.groupby('outlet')['flow'].count()
which can be accessed like this:
pgf[some flow value][some pressure value]
fgp[some pressure value][some flow value]
i want to create a pgf
and fgp
column in pw
. i thought i should be able to write:
pw['p_p'] = pgf[pw['flow']][pw['outlet']]
which i thought would take the flow and outlet value from each row to calculate the probability value and place it in the 'p_p' column, however when i run this i get the following message:
KeyError: '[38.5 38.5 38.5 ... 47.5 47.5 47.5] not in index'
but I do not understand how this can be so when the pgf
was created using the data of pw
pgf is <class 'pandas.core.series.Series'>
i am able to get around this issue using:
for i in range(len(pw)):
flow = pw['flow'].loc[i]
pres = pw['outlet'].loc[i]
however i feel that this must be the slowest way to achieve this
The problem you're facing is related to the way indexing works with multi-index Pandas series. Using the for loop as you've done is definitely not the most efficient way, especially when working with large datasets. Instead, you can use the .apply()
method with a lambda function to access the multi-indexed series.
pw['p_p'] = pw.apply(lambda row: pgf.get((row['flow'], row['outlet']), 0), axis=1)
pw['f_p'] = pw.apply(lambda row: fgp.get((row['outlet'], row['flow']), 0), axis=1)
I used the apply
method on the dataframe pw
along axis=1
(which means you are applying the function row-wise). For each row, I used a lambda function to look up the values in the pgf
or fgp
series.
Then I used the get
method of the series to handle the possibility of a key not being present (this will return 0
in that case, but you can modify it as you see fit).