python · pandas · join · merge

How to merge/join 2 rows after long columns were misread into separate rows?


My dataframe has 3 columns:

  • File-Name
  • Time Stamp
  • File Size

The read adds an extra row wherever the File-Name is too long. For each occurrence of this, I need to:

  • merge the two rows
  • remove the None and NaN values
  • finally remove the extra row.

Here I have rows 20 and 21; row 21 is the overflow of the file path and name:

20  ./.c9/dependencies/extensionHost/956a43f1ecc7f34424174768925cd9cc5a3e006d2ad71e6bd939f2a    2024-04-17 12:56    135268
21  None    NaN e04e9ef74933c4cab24ec0f62a973d0b06e59e466286a73382db011071887b5eafad4b8b9/package

Here is how it should look ...

And again, another extra row (row 75) was generated for row 74's entry:

70  ./.nvm/versions/node/v20.12.0/include/node/ope...   2024-03-26 16:18    53440
71  ./.cache/node-gyp/18.19.0/include/node/openssl...   2024-03-06 19:24    53440
72  ./.nvm/versions/node/v20.11.1/include/node/ope...   2024-02-13 23:50    53440
73  ./.c9/node/include/node/openssl/archs   2023-11-29 14:59    53440
74  ./.c9/dependencies/node18-linux-x64/7e2ea6e5ed...   2023-11-29 14:59    53440
75  None    NaN 85de0311e77fec00f3ce2303de1affbe390e7b22b96402...

I'm trying to use merge or combine to fold the rows with NaN values back into the rows above them.
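Before merging anything, the broken rows can be located with an isna mask. A minimal sketch with hypothetical short values standing in for the real long paths:

```python
import numpy as np
import pandas as pd

# hypothetical mis-read data: row 1 is the overflow of row 0's long name
df = pd.DataFrame({
    'File-Name': ['./long/pa', None],
    'Time Stamp': ['2024-04-17 12:56', np.nan],
    'File Size': [135268, 'th/rest'],
})

overflow_mask = df['File-Name'].isna()  # True on the spill-over rows
print(overflow_mask.tolist())  # [False, True]
```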


Solution

  • Option 1 (always just one trailing erroneous row)

    Minimal reproducible example

    import pandas as pd
    import numpy as np
    
    data = {'File-Name': {0: 'file', 1: None, 2: 'file2', 3: 'file', 4: None}, 
             'Time Stamp': {0: '2024-01-01 00:00:00', 1: np.nan, 2: '2024-01-01 00:00:00', 
                           3: '2024-01-01 00:00:00', 4: np.nan}, 
             'File Size': {0: 1, 1: '1_too_long', 2: 2, 3: 3, 4: '3_too_long'}
             }
    df = pd.DataFrame(data)
    
    df
    
      File-Name           Time Stamp   File Size
    0      file  2024-01-01 00:00:00           1
    1      None                  NaN  1_too_long
    2     file2  2024-01-01 00:00:00           2
    3      file  2024-01-01 00:00:00           3
    4      None                  NaN  3_too_long
    

    Code

    out = (
        df.assign(**{'File-Name': (df['File-Name'] + (df['File Size']
                                                      .where(df['File-Name'].isna())
                                                      .shift(-1)
                                                      .fillna(''))
                                   )
                     }
                  )
        .dropna(subset='File-Name')
        .reset_index(drop=True)
    )
    

    Output

            File-Name           Time Stamp File Size
    0  file1_too_long  2024-01-01 00:00:00         1
    1           file2  2024-01-01 00:00:00         2
    2  file3_too_long  2024-01-01 00:00:00         3
    

    Explanation

    • Use Series.where on df['File Size'] with the condition df['File-Name'].isna(), then apply Series.shift(-1) to move the overflow strings one row up, next to the entries they belong to.
    • Use Series.fillna to convert the remaining NaN values to empty strings (''), add the result to df['File-Name'], and overwrite the original column via df.assign.
    • Now get rid of the None rows with df.dropna on subset='File-Name', and use df.reset_index(drop=True) for a new continuous index.
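    To make the shift step concrete, here is the intermediate overflow Series for the sample df above (a sketch re-using the minimal reproducible example's data):

    ```python
    import numpy as np
    import pandas as pd

    data = {'File-Name': {0: 'file', 1: None, 2: 'file2', 3: 'file', 4: None},
            'Time Stamp': {0: '2024-01-01 00:00:00', 1: np.nan, 2: '2024-01-01 00:00:00',
                           3: '2024-01-01 00:00:00', 4: np.nan},
            'File Size': {0: 1, 1: '1_too_long', 2: 2, 3: 3, 4: '3_too_long'}}
    df = pd.DataFrame(data)

    # keep only the overflow text, move it one row up, blank out the rest
    overflow = (df['File Size']
                .where(df['File-Name'].isna())
                .shift(-1)
                .fillna(''))
    print(overflow.tolist())  # ['1_too_long', '', '', '3_too_long', '']
    ```

    Each position now holds exactly the text that must be appended to that row's File-Name, which is why a plain `+` finishes the job.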

    Option 2 (possibly more than one trailing erroneous row)

    Option 1 won't work as desired if the file names are so long that there are multiple trailing rows. In this case, we can use df.groupby:

    Sample

    data2 = {'File-Name': {0: 'file', 1: None, 2: 'file2', 3: 'file', 4: None, 5: None}, 
             'Time Stamp': {0: '2024-01-01 00:00:00', 1: np.nan, 2: '2024-01-01 00:00:00', 
                           3: '2024-01-01 00:00:00', 4: np.nan, 5: np.nan}, 
             'File Size': {0: 1, 1: '1_too_long', 2: 2, 3: 3, 4: '3_way', 5: '_too_long'}
             }
    df2 = pd.DataFrame(data2)
    
    df2
    
      File-Name           Time Stamp   File Size
    0      file  2024-01-01 00:00:00           1
    1      None                  NaN  1_too_long
    2     file2  2024-01-01 00:00:00           2
    3      file  2024-01-01 00:00:00           3
    4      None                  NaN       3_way # multiple rows
    5      None                  NaN   _too_long # multiple rows
    

    Code

    gr = df2['File-Name'].notna().cumsum()
    
    out2 = (
        df2.assign(**{'File-Name': df2['File-Name'].where(df2['File-Name'].notna(), 
                                                    df2['File Size'])
                      }
                   )
        .groupby(gr, as_index=False)
        .agg({'File-Name': lambda x: ''.join(x), 
              'Time Stamp': 'first', 
              'File Size': 'first'})
    )
    

    Output

                File-Name           Time Stamp File Size
    0      file1_too_long  2024-01-01 00:00:00         1
    1               file2  2024-01-01 00:00:00         2
    2  file3_way_too_long  2024-01-01 00:00:00         3
    

    Explanation

    • Use Series.notna with Series.cumsum to build a grouping key: every run of rows that belongs to one 'File-Name' gets the same group number.
    • Use Series.where on df2['File-Name']: keep the value where it is notna, otherwise take df2['File Size'], and overwrite the original 'File-Name' column via assign.
    • Now apply df.groupby with our key (gr) and groupby.agg: join the pieces for 'File-Name', and simply take 'first' for 'Time Stamp' and 'File Size'.
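    The grouping key is the part worth inspecting: the cumulative sum over the notna mask labels each overflow row with the id of the entry it continues. A sketch re-using df2 from the sample above, with the aggregation reduced to the 'File-Name' column:

    ```python
    import numpy as np
    import pandas as pd

    data2 = {'File-Name': {0: 'file', 1: None, 2: 'file2', 3: 'file', 4: None, 5: None},
             'Time Stamp': {0: '2024-01-01 00:00:00', 1: np.nan, 2: '2024-01-01 00:00:00',
                            3: '2024-01-01 00:00:00', 4: np.nan, 5: np.nan},
             'File Size': {0: 1, 1: '1_too_long', 2: 2, 3: 3, 4: '3_way', 5: '_too_long'}}
    df2 = pd.DataFrame(data2)

    # each non-NaN File-Name starts a new group; NaN rows inherit the group above
    gr = df2['File-Name'].notna().cumsum()
    print(gr.tolist())  # [1, 1, 2, 3, 3, 3] (rows 3..5 share group 3)

    # replace NaN names with the spilled text, then join each group's pieces
    merged = (df2['File-Name'].where(df2['File-Name'].notna(), df2['File Size'])
              .groupby(gr).agg(''.join))
    print(merged.tolist())  # ['file1_too_long', 'file2', 'file3_way_too_long']
    ```

    Because every group contributes exactly one output row, this handles any number of trailing overflow rows per file name, including the single-row case of Option 1.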