
Cleaning Messy Scraped Table Data Rows in Pandas


I am relatively new to pandas and Python. I am trying to clean up some scraped PDF data that was read into a DataFrame but wasn't parsed correctly: some single rows in the source PDF table are split into three rows, like below (a snippet to rebuild this frame follows the table):

index  Item                                   Quantity  Price
0      some string that is split...                NaN    NaN
1      NaN                                           4     20
2      ... rest of string                          NaN    NaN
3      some string that parsed as it should          1     22
4      another string that is split...             NaN    NaN
5      NaN                                           3    121
6      ... rest of the other string                NaN    NaN
7      3rd string that is split...                 NaN    NaN
8      NaN                                          15      5
9      ... rest of the 3rd string                  NaN    NaN
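
For reference, this frame can be rebuilt like so (values copied from the table above; just a sketch, not the actual scraper output):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Item': ['some string that is split...', np.nan, '... rest of string',
             'some string that parsed as it should',
             'another string that is split...', np.nan,
             '... rest of the other string',
             '3rd string that is split...', np.nan,
             '... rest of the 3rd string'],
    'Quantity': [np.nan, 4, np.nan, 1, np.nan, 3, np.nan, np.nan, 15, np.nan],
    'Price': [np.nan, 20, np.nan, 22, np.nan, 121, np.nan, np.nan, 5, np.nan],
})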

Whenever this pattern occurs, I basically need to merge the three lines into one.

So the desired output is:

index  Item                                                           Quantity  Price
0      some string that is split... ... rest of string                       4     20
3      some string that parsed as it should                                  1     22
4      another string that is split... ... rest of the other string          3    121
7      3rd string that is split... ... rest of the 3rd string               15      5

What would be the best way to do this?

What I have done so far is very clumsy and feels quite wrong, even though it sometimes works on some of the scraped data. I just can't trust it.


# Get the rows that have NaN in the Quantity column
split_rows = df[df['Quantity'].isna()]

# Every second of these rows (the '... rest of string' ones) needs to be
# removed later, but not before its data is copied
rows_to_remove = split_rows.iloc[1::2, :].index

# Merge each triple: pull Quantity/Price up from the row below and append
# the rest of the Item string from two rows below (assumes a default
# RangeIndex, so that label + 1 really is the next row)
i = 0
while i < len(split_rows):
    df.at[split_rows.index[i], 'Quantity'] = df.at[split_rows.index[i] + 1, 'Quantity']
    df.at[split_rows.index[i], 'Price'] = df.at[split_rows.index[i] + 1, 'Price']
    df.at[split_rows.index[i], 'Item'] = str(df.at[split_rows.index[i], 'Item']) + " " + str(df.at[split_rows.index[i] + 2, 'Item'])
    i += 2  # skip the '... rest of string' row, which is also in split_rows

# Remove the rows like row 2 (the '... rest of string' rows)
df = df.drop(rows_to_remove, axis=0)

# Drop the rows like row 1 (Quantity/Price only, Item is NaN)
df.dropna(inplace=True)

Is there a better way to do this? What would be the canonical way to approach something like this?

Much obliged.

EDIT:

Made a table of the desired output, made the input more like the data I am dealing with, and clarified the question.


Solution

  • The solution is a bit tricky. First, create a boolean mask that selects the bad rows (the ones containing NaN). Next, with this mask create virtual groups of bad rows: a cumulative sum over the rows whose Item is NaN increments once per triple, and shift(-1) followed by ffill pushes each group number back onto the first row of its triple. Then reindex the result with the original index of your dataframe and use interpolate so each good row gets its own fractional group. The rest is pretty standard: group the rows, then aggregate the data:

    # Bad rows are those with at least one NaN
    m = df.isna().any(axis=1)
    # Triple counter: the middle (Item-NaN) row bumps it, shift(-1)/ffill
    # aligns the first row, interpolate gives good rows fractional groups
    g = df.loc[m, 'Item'].isna().cumsum().shift(-1).ffill().reindex(df.index).interpolate()

    # Join the Item fragments; max keeps the single non-NaN Quantity/Price
    out = (df.fillna({'Item': ' '}).groupby(g, as_index=False)
             .agg({'Item': ''.join, 'Quantity': max, 'Price': max}))
    

    Output:

    >>> out
                                                               Item  Quantity  Price
    0               some string that is split... ... rest of string       4.0   20.0
    1                          some string that parsed as it should       1.0   22.0
    2  another string that is split... ... rest of the other string       3.0  121.0
    3        3rd string that is split... ... rest of the 3rd string      15.0    5.0
    
    >>> g
    0    1.0  # bad row, group 1
    1    1.0  # bad row, group 1
    2    1.0  # bad row, group 1
    3    1.5  # good row, group 1.5 due to interpolation
    4    2.0  # bad row, group 2
    5    2.0  # bad row, group 2
    6    2.0  # bad row, group 2
    7    3.0  # bad row, group 3
    8    3.0  # bad row, group 3
    9    3.0  # bad row, group 3
    Name: Item, dtype: float64
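
    One caveat: Quantity and Price come back as floats (the NaNs forced the columns to float64), and the aggregation drops the original index (0, 3, 4, 7) shown in the question's desired output. Both are easy to restore afterwards; a minimal follow-up sketch, assuming the df, g and out from above:

    # Reattach the first original index of each merged group
    out.index = df.index.to_series().groupby(g).first()

    # No NaN is left after aggregation, so the columns can go back to int
    out = out.astype({'Quantity': 'int64', 'Price': 'int64'})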