I was extracting tables from a PDF with tabula-py. But in a table where some rows were more than one line, but in tabula-py, a single-table row is converted as multiple rows in DataFrame. I'm giving a sample here.
Serial No. Name Type Total
0 1 Easter Multiple 19
1 2 Costeri Roundabout 16
2 3 Zhiop Tee 16
3 4 Nesss Cross 10
4 5 Uoar Lhahara Tee 10
5 6 Trino Nishra (KX) Tee 9
6 7 Old-FX Box Cross 8
7 8 Gardeners Roundabout 8
8 9 Max Detter Roundabout 7
9 NaN Others (Asynco, NaN NaN
10 10 D+ E, Cross 7
11 NaN etc) NaN NaN
If you look at the sample you will see that rows in 9, 10, and 11 indices are actually a single row. There was multiple line in the table (in pdf). This table has more than 100 rows and at least 12 places those issues have occurred. Some places it is 2 consecutive rows and in some places it is 3 consecutive rows. How can we merge those rows with index values?
You can try:
df['Serial No.'] = df['Serial No.'].bfill().ffill()
df['Total'] = df['Total'].astype(str).replace('nan', np.nan)
df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ''.join(x.dropna()))
df_out['Total'] = df_out['Total'].replace('', np.nan, regex=True).astype(float)
Result:
print(df_out)
Serial No. Name Type Total
0 1.0 Easter Multiple 19.0
1 2.0 Costeri Roundabout 16.0
2 3.0 Zhiop Tee 16.0
3 4.0 Nesss Cross 10.0
4 5.0 Uoar Lhahara Tee 10.0
5 6.0 Trino Nishra(KX) Tee 9.0
6 7.0 Old-FX Box Cross 8.0
7 8.0 Gardeners Roundabout 8.0
8 9.0 Max Detter Roundabout 7.0
9 10.0 Others (Asynco,D+ E,etc) Cross 7.0