Search code examples
pythonpandasdataframetabula-py

Combine Consecutive Rows for given index values in Pandas DataFrame


I was extracting tables from a PDF with tabula-py. But in a table where some rows were more than one line, but in tabula-py, a single-table row is converted as multiple rows in DataFrame. I'm giving a sample here.

    Serial No.  Name    Type    Total
0   1   Easter         Multiple    19   
1   2   Costeri        Roundabout  16   
2   3   Zhiop            Tee       16   
3   4   Nesss           Cross      10   
4   5   Uoar Lhahara    Tee        10   
5   6   Trino Nishra (KX) Tee       9   
6   7   Old-FX Box      Cross       8
7   8   Gardeners    Roundabout     8   
8   9   Max Detter   Roundabout     7   
9   NaN Others (Asynco, NaN        NaN  
10  10  D+ E,           Cross       7   
11  NaN etc)            NaN        NaN  

If you look at the sample you will see that rows in 9, 10, and 11 indices are actually a single row. There was multiple line in the table (in pdf). This table has more than 100 rows and at least 12 places those issues have occurred. Some places it is 2 consecutive rows and in some places it is 3 consecutive rows. How can we merge those rows with index values?


Solution

  • You can try:

    df['Serial No.'] = df['Serial No.'].bfill().ffill()
    df['Total'] = df['Total'].astype(str).replace('nan', np.nan)
    
    df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ''.join(x.dropna()))
    df_out['Total'] = df_out['Total'].replace('', np.nan, regex=True).astype(float)
    

    Result:

    print(df_out)
    
       Serial No.                      Name        Type  Total
    0         1.0                    Easter    Multiple   19.0
    1         2.0                   Costeri  Roundabout   16.0
    2         3.0                     Zhiop         Tee   16.0
    3         4.0                     Nesss       Cross   10.0
    4         5.0              Uoar Lhahara         Tee   10.0
    5         6.0          Trino Nishra(KX)         Tee    9.0
    6         7.0                Old-FX Box       Cross    8.0
    7         8.0                 Gardeners  Roundabout    8.0
    8         9.0                Max Detter  Roundabout    7.0
    9        10.0  Others (Asynco,D+ E,etc)       Cross    7.0