Search code examples
pythonpandasdataframeperformancelarge-files

Quickest way to iterate over a pandas dataframe and perform an operation on a column, when what the operation is depends on the row


I have a table that is laid out somewhat like this:

t linenum many other columns
1234567 0 ...
1234568 0 ...
1234569 0 ...
1234570 1 ...
1234571 1 ...

Except it is very, very large. As in, the raw .dat files can get up to 20 gb. I have them converted into .h5 files so they are slightly smaller, but still large (about half the size, I'd say.)

I want to add a column that is time within line, so it subtracts the first time value for the line from each time, so I end up having something like this:

t linenum time within line
1234567 0 0
1234568 0 1
1234569 0 2
1234570 1 0
1234571 1 1

The thing is, while I know that doing an operation on the whole dataframe at once is much faster, I haven't been able to figure out how to do this without using a for loop, since the number that needs to be subtracted depends on linenum, and it takes ages. (Yesterday, I tested this on a file about 9gb big, and I gave up and went home after it had been processing for half an hour, only to find this morning that my computer had restarted overnight so the jupyter server had to restart and I lost the processed dataframe...) Here is the relevant parts of the code I currently have:

import pandas as pd
file = [h5 file address]
df = pd.read_hdf(file)
for linenum in pd.unique(df['linenum']):
  line_df = df.loc[df['linenum'] == linenum]
  first_t = int(line_df['t'].iloc[0])
  df.loc[df['linenum'] == linenum, 't_adjusted'] = (df.loc[df['linenum'] == linenum, 't'] - first_t)

Is there any way to do this without a for loop, and if not, is there any way to make it faster? I'm trying to graph one of the other columns using matplotlib.pyplot.tricontourf, with linenum on the x axis and time within line on the y axis, if that's relevant at all. There is another column I can use as a workaround because it's approximately proportional to time within line but I'd prefer to find a way to use the time. Thank you!

Edit: Also, if it's relevant, I am using Python 3.7. For some reason some of the computers my programs have to run on at my work are still on Windows 7 so I can't update...


Solution

  • You can use a groupby on 'linenum' and then transform to populate each group

    df['timewithinline'] = df.groupby('linenum')['t'].transform(lambda x: x - min(x))
    

    If the times are already sorted, you can use:

    df['timewithinline'] = df.groupby('linenum')['t'].transform(lambda x: x - x.iloc[0])