Tags: python, pandas, dataframe, csv, data-science

How do I combine my dataframes while iterating through rows and adding a new column every time?


I have a raw dataframe, similar to below:

Original DF I have

index    text
0        i am happy today ...
1        i am confused because ...
2        i would love to do ...
...      ...
1000000  i am exhausted about ...

I have to run all these texts through several different models, each of which produces a score. Afterwards, I need to combine the results into one dataframe, as below:

Processed DF I want

index    text                       score_1  score_2  score_3
0        i am happy today ...       0.2      0.4      0.238
1        i am confused because ...  0.8      0.3      0.64
2        i would love to do ...     0.67     0.546    0.35
...      ...                        ...      ...      ...
1000000  i am exhausted about ...   0.21     0.41     0.8

I have to load an individual model for each score (which isn't instant), and because there are so many rows, I have to split the work into batches (of 100, for example). After that I have to combine my dataframes. My code is something like this:

import pandas as pd

full_df = pd.read_csv('fulldf.csv')
batch_size = 100
num_batches = len(full_df) // batch_size # assume it divides evenly
df_list = []
new_df = []

for i in range(num_batches):
    # Breaking up the main dataframe
    df_list.append(full_df.iloc[i*batch_size:(i+1)*batch_size])

for model in list_of_models:
    model.load() # Time-consuming step, so I only do it once per model
    for df in df_list:
        df = df.reset_index()
        # Some code to generate scores for each row of the df subset
        df['score_' + str(model.name)] = scores
        df.reset_index(drop = True, inplace = True)
        new_df.append(df)

total_df = pd.concat(new_df)

However, the result comes out wrong.

DF I am getting with the above code

index    text                       score_1  score_2  score_3
0        i am happy today ...       0.2      NA       NA
1        i am confused because ...  0.8      NA       NA
...      ...                        ...      ...      ...
1000000  i am exhausted about ...   0.21     NA       NA
0        i am happy today ...       NA       0.4      NA
1        i am confused because ...  NA       0.3      NA
...      ...                        ...      ...      ...
1000000  i am exhausted about ...   NA       0.41     NA
0        i am happy today ...       NA       NA       0.238
1        i am confused because ...  NA       NA       0.64
...      ...                        ...      ...      ...
1000000  i am exhausted about ...   NA       NA       0.8

As you can see, the numbers are correctly aligned to the index, but the rows basically repeat 3 times (or more, if there are more score columns).
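To make the symptom concrete, here is a minimal sketch with two toy frames standing in for my per-model results; it is not my real data, just the same concat pattern:

import pandas as pd

# Toy stand-ins for the per-model frames that end up in new_df
a = pd.DataFrame({'text': ['t0', 't1'], 'score_1': [0.2, 0.8]})
b = pd.DataFrame({'text': ['t0', 't1'], 'score_2': [0.4, 0.3]})

# The default axis=0 stacks the frames as rows: columns missing from a
# frame become NA, which is exactly the repeating pattern above
print(pd.concat([a, b]))
#   text  score_1  score_2
# 0   t0      0.2      NaN
# 1   t1      0.8      NaN
# 0   t0      NaN      0.4
# 1   t1      NaN      0.3

# axis=1 glues the frames side by side instead, duplicating the text column
print(pd.concat([a, b], axis = 1))
#   text  score_1 text  score_2
# 0   t0      0.2   t0      0.4
# 1   t1      0.8   t1      0.3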

I have the constraint that I cannot load all the rows into memory at once, so I have to process them in batches. Moreover, I cannot load a model, score 100 rows, then load another model and score the same 100 rows again, as that takes too long due to model loading time.

I have tried several things, such as changing the concat to `total_df = pd.concat(new_df, axis = 1)`, but that doesn't work as it just appends sideways.

Is there any way to fix this and get the desired result?


Solution

  • OK, so I managed to derive a solution which works for me and gives the desired output.

    The basic logic behind it is that I create multiple temporary dataframes to hold the different portions of the data.

    1. I create two empty lists, df_list and midway_df.
    2. I split the test-set dataframe into batches and place them in df_list.
    3. I iterate through each model I want to run the data through.
    4. For each model, I create another empty list, predicted_dataframes.
    5. I then iterate through the whole test dataframe (in batches, as per df_list) and derive the scores for each row of the subset.
    6. I append this scored dataframe to the predicted_dataframes list.
    7. After finishing inference with one model, I combine the batches into one model-specific dataframe, i.e. a df which contains that model's scores for the whole test set.
    8. I append the combined model-specific dataframe to the midway_df list.
    9. Repeat for every model.
    10. Once done, concat the dataframes in midway_df into one. This results in a messy dataframe with many repeated column names, which needs to be filtered down.
    11. Get the score columns.
    12. Get the remaining columns (from the first dataframe only, to avoid repeats).
    13. Filter the big dataframe down to just these columns.
    import pandas as pd

    full_df = pd.read_csv('fulldf.csv')
    batch_size = 100
    num_batches = len(full_df) // batch_size # assume it divides evenly

    df_list = []
    midway_df = []

    for i in range(num_batches):
        # Breaking up the main dataframe
        df_list.append(full_df.iloc[i*batch_size:(i+1)*batch_size])

    for model in list_of_models:
        model.load() # Time-consuming step, so I only do it once per model
        predicted_dataframes = [] # Captures the predictions for one model
        for df in df_list:
            df = df.reset_index()
            # Some code to generate scores for each row of the df subset
            df['score_' + str(model.name)] = scores
            df.reset_index(drop = True, inplace = True)

            # After the model has completed on one subset of the full test dataset
            predicted_dataframes.append(df)

        # After one model has completed for all rows of the test dataset;
        # reset to a clean 0..N-1 index so the axis=1 concat below aligns
        temp_df = pd.concat(predicted_dataframes).reset_index(drop = True)
        midway_df.append(temp_df) # Question: should I be doing a deepcopy somewhere here?

    # After all models have completed and all scores have been saved in midway_df
    total_df = pd.concat(midway_df, axis = 1)
    score_cols = [f'score_{model.name}' for model in list_of_models]
    standard_cols = [col for col in midway_df[0].columns if col not in score_cols]
    total_df = pd.concat([midway_df[0][standard_cols], total_df[score_cols]], axis = 1)
    total_df = total_df.reset_index(drop = True)
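
    As a design note: every temp_df above carries a full copy of the text column, so memory grows with the number of models and the repeated columns have to be filtered out at the end. A leaner sketch of the same idea (assuming the same list_of_models objects with .load() and .name, and the same placeholder scores step as in the post) keeps only one Series of scores per model and lets pandas align them by the original index:

    import pandas as pd

    full_df = pd.read_csv('fulldf.csv')
    batch_size = 100
    num_batches = len(full_df) // batch_size # assume it divides evenly
    df_list = [full_df.iloc[i*batch_size:(i+1)*batch_size] for i in range(num_batches)]

    for model in list_of_models:
        model.load() # Still only loaded once per model
        batch_scores = []
        for df in df_list:
            # Placeholder scoring step, as in the post; keep the batch's
            # original index so alignment back into full_df is automatic
            batch_scores.append(pd.Series(scores, index = df.index))
        # One new column per model, aligned by index; nothing to filter afterwards
        full_df[f'score_{model.name}'] = pd.concat(batch_scores)

    This avoids both the repeated rows from the question and the repeated columns that step 10 has to clean up.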