Tags: python, pandas, dataframe, csv, data-science

How do I combine my dataframes while iterating through rows and adding a new column every time?


I have a raw dataframe, similar to below:

Original DF I have

index    text
0        i am happy today ...
1        i am confused because ...
2        i would love to do ...
...      ...
1000000  i am exhausted about ...

I have to run all these texts through several different models, each of which produces a score. Afterwards, I need to combine the results into one dataframe, as below:

Processed DF I want

index    text                       score_1  score_2  score_3
0        i am happy today ...       0.2      0.4      0.238
1        i am confused because ...  0.8      0.3      0.64
2        i would love to do ...     0.67     0.546    0.35
...      ...                        ...      ...      ...
1000000  i am exhausted about ...   0.21     0.41     0.8

I have to load an individual model for each score (which isn't instant), and because there are so many rows, I have to split the work into batches (of 100, for example). After that I have to combine my dataframes. My code is something like this:

import pandas as pd

full_df = pd.read_csv('fulldf.csv')
batch_size = 100
num_batches = len(full_df) // batch_size # assume it divides evenly
df_list = []
new_df = []

for i in range(num_batches):
    # Breaking up the main dataframe
    df_list.append(full_df.iloc[i*batch_size:(i+1)*batch_size])

for model in list_of_models:
    model.load() # Time-consuming step, so I only do it once per model
    for df in df_list:
        df = df.reset_index()
        # Some code to generate scores for each row of the df subset
        df['score_' + str(model.name)] = scores
        df.reset_index(drop = True, inplace = True)
        new_df.append(df)

total_df = pd.concat(new_df)

However, the result comes out wrong.

DF I am getting with the above code

index    text                       score_1  score_2  score_3
0        i am happy today ...       0.2      NA       NA
1        i am confused because ...  0.8      NA       NA
...      ...                        ...      ...      ...
1000000  i am exhausted about ...   0.21     NA       NA
0        i am happy today ...       NA       0.4      NA
1        i am confused because ...  NA       0.3      NA
...      ...                        ...      ...      ...
1000000  i am exhausted about ...   NA       0.41     NA
0        i am happy today ...       NA       NA       0.238
1        i am confused because ...  NA       NA       0.64
...      ...                        ...      ...      ...
1000000  i am exhausted about ...   NA       NA       0.8

As you can see, the numbers are correctly aligned to the index, but the rows basically repeat 3 times (or more, if there are more score columns).
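To make the symptom concrete, here is a minimal sketch with two toy frames standing in for my per-model results; it is not my real data, just the same concat pattern:

import pandas as pd

# Toy stand-ins for the per-model frames that end up in new_df
a = pd.DataFrame({'text': ['t0', 't1'], 'score_1': [0.2, 0.8]})
b = pd.DataFrame({'text': ['t0', 't1'], 'score_2': [0.4, 0.3]})

# The default axis=0 stacks the frames as rows: columns missing from a
# frame become NA, which is exactly the repeating pattern above
print(pd.concat([a, b]))
#   text  score_1  score_2
# 0   t0      0.2      NaN
# 1   t1      0.8      NaN
# 0   t0      NaN      0.4
# 1   t1      NaN      0.3

# axis=1 glues the frames side by side instead, duplicating the text column
print(pd.concat([a, b], axis = 1))
#   text  score_1 text  score_2
# 0   t0      0.2   t0      0.4
# 1   t1      0.8   t1      0.3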

I have the constraint that I cannot load all the rows into memory at once, so I have to process them in batches. Moreover, I cannot load a model, score 100 rows, then load another model and score the same 100 rows again, as that takes too long due to model loading time.

I have tried several things, such as changing the concat to `total_df = pd.concat(new_df, axis = 1)`, but that doesn't work as it just appends sideways.

Is there any way to fix this and get the desired result?


Solution

  • OK, so I managed to derive a solution which works for me and gives the desired output.

    The basic logic behind it is that I create multiple temporary dataframes to hold the different portions of the data.

    1. I create two empty lists, df_list and midway_df.
    2. I split the test-set dataframe into batches and place them in df_list.
    3. I iterate through each model I want to run the data through.
    4. For each model, I create another empty list, predicted_dataframes.
    5. I then iterate through the whole test dataframe (in batches, as per df_list) and derive the scores for each row of the subset.
    6. I append this scored dataframe to the predicted_dataframes list.
    7. After finishing inference with one model, I combine the batches into one model-specific dataframe, i.e. a df which contains that model's scores for the whole test set.
    8. I append the combined model-specific dataframe to the midway_df list.
    9. Repeat for every model.
    10. Once done, concat the dataframes in midway_df into one. This results in a messy dataframe with many repeated column names, which needs to be filtered down.
    11. Get the score columns.
    12. Get the remaining columns (from the first dataframe only, to avoid repeats).
    13. Filter the big dataframe down to just these columns.
    import pandas as pd

    full_df = pd.read_csv('fulldf.csv')
    batch_size = 100
    num_batches = len(full_df) // batch_size # assume it divides evenly

    df_list = []
    midway_df = []

    for i in range(num_batches):
        # Breaking up the main dataframe
        df_list.append(full_df.iloc[i*batch_size:(i+1)*batch_size])

    for model in list_of_models:
        model.load() # Time-consuming step, so I only do it once per model
        predicted_dataframes = [] # Captures the predictions for one model
        for df in df_list:
            df = df.reset_index()
            # Some code to generate scores for each row of the df subset
            df['score_' + str(model.name)] = scores
            df.reset_index(drop = True, inplace = True)

            # After the model has completed on one subset of the full test dataset
            predicted_dataframes.append(df)

        # After one model has completed for all rows of the test dataset;
        # reset to a clean 0..N-1 index so the axis=1 concat below aligns
        temp_df = pd.concat(predicted_dataframes).reset_index(drop = True)
        midway_df.append(temp_df) # Question: should I be doing a deepcopy somewhere here?

    # After all models have completed and all scores have been saved in midway_df
    total_df = pd.concat(midway_df, axis = 1)
    score_cols = [f'score_{model.name}' for model in list_of_models]
    standard_cols = [col for col in midway_df[0].columns if col not in score_cols]
    total_df = pd.concat([midway_df[0][standard_cols], total_df[score_cols]], axis = 1)
    total_df = total_df.reset_index(drop = True)
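
    As a design note: every temp_df above carries a full copy of the text column, so memory grows with the number of models and the repeated columns have to be filtered out at the end. A leaner sketch of the same idea (assuming the same list_of_models objects with .load() and .name, and the same placeholder scores step as in the post) keeps only one Series of scores per model and lets pandas align them by the original index:

    import pandas as pd

    full_df = pd.read_csv('fulldf.csv')
    batch_size = 100
    num_batches = len(full_df) // batch_size # assume it divides evenly
    df_list = [full_df.iloc[i*batch_size:(i+1)*batch_size] for i in range(num_batches)]

    for model in list_of_models:
        model.load() # Still only loaded once per model
        batch_scores = []
        for df in df_list:
            # Placeholder scoring step, as in the post; keep the batch's
            # original index so alignment back into full_df is automatic
            batch_scores.append(pd.Series(scores, index = df.index))
        # One new column per model, aligned by index; nothing to filter afterwards
        full_df[f'score_{model.name}'] = pd.concat(batch_scores)

    This avoids both the repeated rows from the question and the repeated columns that step 10 has to clean up.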