Tags: pandas, pdf, tabula

Update row index when all columns of the next row are NaN in a Pandas DataFrame


I have a Pandas DataFrame extracted from a PDF with tabula-py.

The PDF is like this:

+--------------+--------+-------+
|     name     | letter | value |
+--------------+--------+-------+
| A short name | a      |     1 |
+-------------------------------+
| Another      | b      |     2 |
+-------------------------------+
| A very large | c      |     3 |
| name         |        |       |
+-------------------------------+
| other one    | d      |     4 |
+-------------------------------+
| My name is   | e      |     5 |
| big          |        |       |
+--------------+--------+-------+

As you can see, "A very large name" has a line break and, since the original PDF has no borders, two rows end up in the DataFrame: one with ['A very large', 'c', 3] and another with ['name', NaN, NaN], when I want only a single row with ['A very large name', 'c', 3].

The same happens with "My name is big".

As this happens for several rows, what I'm trying to achieve is to concatenate the content of the name cell with the previous one whenever the rest of the cells in the row are NaN, and then delete the NaN rows.

But any other strategy that obtains the same result is welcome.

import pandas as pd
import numpy as np

data = {
    "name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
    "letter": ["a", "b", "c", np.nan, "d", "e", np.nan],
    "value": [1, 2, 3, np.nan, 4, 5, np.nan],
}
df = pd.DataFrame(data)

data_expected = {
    "name": ["A short name", "Another", "A very large name", "other one", "My name is big"],
    "letter": ["a", "b", "c", "d", "e"],
    "value": [1, 2, 3, 4, 5],
}

df_expected = pd.DataFrame(data_expected)

I'm trying code like this, but it is not working:

# Doesn't work, and isn't very `pandastonic`
nan_indexes = df[df.iloc[:, 1:].isna().all(axis='columns')].index
df.loc[nan_indexes - 1, "name"] = df.loc[nan_indexes - 1, "name"].str.cat(df.loc[nan_indexes, "name"], ' ')

# remove NaN rows
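For what it's worth, the shift-based idea in the attempt above can be made to work: the problem is that `str.cat` aligns a passed Series on its index, so the continuation rows never line up with the rows they continue. A sketch that aligns positionally instead, assuming (as in the example) each name spans at most two lines:

```python
import pandas as pd
import numpy as np

data = {
    "name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
    "letter": ["a", "b", "c", np.nan, "d", "e", np.nan],
    "value": [1, 2, 3, np.nan, 4, 5, np.nan],
}
df = pd.DataFrame(data)

# Rows where every column except "name" is NaN, i.e. continuation lines
nan_rows = df.iloc[:, 1:].isna().all(axis='columns')

# Bring each continuation's name up to the row it continues; NaN elsewhere
cont = df['name'].shift(-1).where(nan_rows.shift(-1, fill_value=False))

# Append the continuation text only where there is one
df['name'] = np.where(cont.notna(), df['name'] + ' ' + cont, df['name'])

# Drop the now-redundant continuation rows
df = df[~nan_rows].reset_index(drop=True)
```

Note that `value` stays float (the NaN placeholders forced the column to float64); cast with `astype(int)` if needed.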




Solution

  • You can try `groupby.agg`, using `' '.join` for `name` and `'first'` for the other columns. The groups are built by checking where `letter` and `value` are not NaN (`notna`) and taking the cumulative sum (`cumsum`), so each continuation row falls into the same group as the row above it.

    print(df.groupby(df[['letter', 'value']].notna().any(axis=1).cumsum())
            .agg({'name': ' '.join, 'letter': 'first', 'value': 'first'})
         )
                    name letter  value
    1       A short name      a    1.0
    2            Another      b    2.0
    3  A very large name      c    3.0
    4          other one      d    4.0
    5     My name is big      e    5.0
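If you need the output to match `df_expected` exactly (default integer index, integer `value` column), you can, as a sketch, reset the index and cast the dtype back, since the NaN placeholders forced `value` to float:

```python
import pandas as pd
import numpy as np

data = {
    "name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
    "letter": ["a", "b", "c", np.nan, "d", "e", np.nan],
    "value": [1, 2, 3, np.nan, 4, 5, np.nan],
}
df = pd.DataFrame(data)

# Same grouping as above: a new group starts wherever letter/value are present
groups = df[['letter', 'value']].notna().any(axis=1).cumsum()
out = (df.groupby(groups)
         .agg({'name': ' '.join, 'letter': 'first', 'value': 'first'})
         .reset_index(drop=True))

# Cast value back to int now that the NaN rows are gone
out['value'] = out['value'].astype(int)
```

After this, `out.equals(df_expected)` should hold.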