I have a Pandas DataFrame extracted from a PDF with tabula-py.
The PDF is like this:
+--------------+--------+-------+
| name | letter | value |
+--------------+--------+-------+
| A short name | a | 1 |
+-------------------------------+
| Another | b | 2 |
+-------------------------------+
| A very large | c | 3 |
| name | | |
+-------------------------------+
| other one | d | 4 |
+-------------------------------+
| My name is | e | 5 |
| big | | |
+--------------+--------+-------+
As you can see A very large name
has a line break and, as the original pdf does not have borders, a row with ['name', NaN, NaN]
and another with ['A very large', 'c', 3]
are created in the DataFrame, when I want only a sigle one with content: ['A very large name', 'c', 3]
.
Same happens with My name is big
As this happens for several rows which I'm trying to achieve is concatenate the content of the name
cell with the previous one when the rest of the cells in the row are NaN
. Then delete the NaN rows.
But any other strategy that obtain the same result is welcome.
import pandas as pd
import numpy as np
data = {
"name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
"letter": ["a", "b", "c", np.NaN, "d", "e", np.NaN],
"value": [1, 2, 3, np.NaN, 4, 5, np.NaN],
}
df = pd.DataFrame(data)
data_expected = {
"name": ["A short name", "Another", "A very large name", "other one", "My name is big"],
"letter": ["a", "b", "c", "d", "e"],
"value": [1, 2, 3, 4, 5],
}
df_expected = pd.DataFrame(data_expected)
I'm trying code like this, but is not working
# Not works and not very `pandastonic`
nan_indexes = df[df.iloc[:, 1:].isna().all(axis='columns')].index
df.loc[nan_indexes - 1, "name"] = df.loc[nan_indexes - 1, "name"].str.cat(df.loc[nan_indexes, "name"], ' ')
# remove NaN rows
you can try with groupby.agg
with join
or first
depending on the columns. the groups are created with checking where it is notna
in the column letter and value and cumsum
.
print (df.groupby(df[['letter', 'value']].notna().any(1).cumsum())
.agg({'name': ' '.join, 'letter':'first', 'value':'first'})
)
name letter value
1 A short name a 1.0
2 Another b 2.0
3 A very large name c 3.0
4 other one d 4.0
5 My name is big e 5.0