My dataframe has 3 columns:
File-Name
Time Stamp
File Size
The read adds an extra row wherever the file name is too long. For each occurrence of this, I need to merge the spilled-over row back into the row above it.
Here I have row 20 and row 21; row 21 is the excess of the file path and name:
20 ./.c9/dependencies/extensionHost/956a43f1ecc7f34424174768925cd9cc5a3e006d2ad71e6bd939f2a 2024-04-17 12:56 135268
21 None NaN e04e9ef74933c4cab24ec0f62a973d0b06e59e466286a73382db011071887b5eafad4b8b9/package
Here is how it should look ...
And again another extra row was generated for row 74:
70 ./.nvm/versions/node/v20.12.0/include/node/ope... 2024-03-26 16:18 53440
71 ./.cache/node-gyp/18.19.0/include/node/openssl... 2024-03-06 19:24 53440
72 ./.nvm/versions/node/v20.11.1/include/node/ope... 2024-02-13 23:50 53440
73 ./.c9/node/include/node/openssl/archs 2023-11-29 14:59 53440
74 ./.c9/dependencies/node18-linux-x64/7e2ea6e5ed... 2023-11-29 14:59 53440
75 None NaN 85de0311e77fec00f3ce2303de1affbe390e7b22b96402...
I'm trying to merge or combine all the rows with NaN values back into the row above them.
Option 1 (always just one trailing erroneous row)
import pandas as pd
import numpy as np
data = {'File-Name': {0: 'file', 1: None, 2: 'file2', 3: 'file', 4: None},
'Time Stamp': {0: '2024-01-01 00:00:00', 1: np.nan, 2: '2024-01-01 00:00:00',
3: '2024-01-01 00:00:00', 4: np.nan},
'File Size': {0: 1, 1: '1_too_long', 2: 2, 3: 3, 4: '3_too_long'}
}
df = pd.DataFrame(data)
df
File-Name Time Stamp File Size
0 file 2024-01-01 00:00:00 1
1 None NaN 1_too_long
2 file2 2024-01-01 00:00:00 2
3 file 2024-01-01 00:00:00 3
4 None NaN 3_too_long
Code
out = (
    df.assign(**{'File-Name': (df['File-Name'] + df['File Size']
                                                 .where(df['File-Name'].isna())
                                                 .shift(-1)
                                                 .fillna(''))
                 })
    .dropna(subset='File-Name')
    .reset_index(drop=True)
)
Output
File-Name Time Stamp File Size
0 file1_too_long 2024-01-01 00:00:00 1
1 file2 2024-01-01 00:00:00 2
2 file3_too_long 2024-01-01 00:00:00 3
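To make the chain easier to follow, here is a minimal sketch (using the same sample df as above) that builds the intermediate suffix series step by step before it is added to 'File-Name':

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'File-Name': ['file', None, 'file2', 'file', None],
    'Time Stamp': ['2024-01-01 00:00:00', np.nan, '2024-01-01 00:00:00',
                   '2024-01-01 00:00:00', np.nan],
    'File Size': [1, '1_too_long', 2, 3, '3_too_long'],
})

# Keep 'File Size' only on the spill-over rows (where 'File-Name' is missing),
# everything else becomes NaN:
suffix = df['File Size'].where(df['File-Name'].isna())

# Move each suffix up one row, onto the row it belongs to, and turn the
# remaining NaN values into empty strings so string addition is a no-op there:
suffix = suffix.shift(-1).fillna('')
print(suffix.tolist())  # ['1_too_long', '', '', '3_too_long', '']

# Adding this to 'File-Name' leaves the spill-over rows as NaN (None + '' is
# NaN), so dropna removes exactly those rows:
print((df['File-Name'] + suffix).dropna().tolist())
# ['file1_too_long', 'file2', 'file3_too_long']
```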
Explanation

- Series.where for df['File Size'], checking the condition Series.isna for df['File-Name'], and apply Series.shift with -1 to shift the strings that need to be added to df['File-Name'] one row back.
- Series.fillna to convert NaN values to an empty string (''), add the result to df['File-Name'], and overwrite the original column via df.assign.
- Drop the None rows with df.dropna on subset='File-Name', and apply df.reset_index for a new continuous index.

Option 2 (possibility of more than one trailing erroneous row)
Option 1 won't work as desired if the file names are so long that there are multiple trailing rows. In this case, we can use df.groupby:
Sample
data2 = {'File-Name': {0: 'file', 1: None, 2: 'file2', 3: 'file', 4: None, 5: None},
'Time Stamp': {0: '2024-01-01 00:00:00', 1: np.nan, 2: '2024-01-01 00:00:00',
3: '2024-01-01 00:00:00', 4: np.nan, 5: np.nan},
'File Size': {0: 1, 1: '1_too_long', 2: 2, 3: 3, 4: '3_way', 5: '_too_long'}
}
df2 = pd.DataFrame(data2)
df2
File-Name Time Stamp File Size
0 file 2024-01-01 00:00:00 1
1 None NaN 1_too_long
2 file2 2024-01-01 00:00:00 2
3 file 2024-01-01 00:00:00 3
4 None NaN 3_way # multiple rows
5 None NaN _too_long # multiple rows
Code
gr = df2['File-Name'].notna().cumsum()

out2 = (
    df2.assign(**{'File-Name': df2['File-Name'].where(df2['File-Name'].notna(),
                                                      df2['File Size'])
                  })
    .groupby(gr, as_index=False)
    .agg({'File-Name': lambda x: ''.join(x),
          'Time Stamp': 'first',
          'File Size': 'first'})
)
Output
File-Name Time Stamp File Size
0 file1_too_long 2024-01-01 00:00:00 1
1 file2 2024-01-01 00:00:00 2
2 file3_way_too_long 2024-01-01 00:00:00 3
Explanation
Series.notna
with Series.cumsum
to group rows that belong to one 'File-Name'.Series.where
on df2['File-Name']
: keep value if notna
, else get df2['File Size']
and overwrite original 'File-Name' column via assign
.df.groupby
, passing our groups (gr
), and apply groupby.agg
. For 'File-Name' we want a join
. For 'Time Stamp' and 'File Size' simply get first
.
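To see why this handles any number of trailing rows, here is a minimal sketch (same df2 as above) showing the grouping key: notna() is True only on a "real" row, so cumsum() increments there and stays flat across all spill-over rows that follow it.

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({
    'File-Name': ['file', None, 'file2', 'file', None, None],
    'Time Stamp': ['2024-01-01 00:00:00', np.nan, '2024-01-01 00:00:00',
                   '2024-01-01 00:00:00', np.nan, np.nan],
    'File Size': [1, '1_too_long', 2, 3, '3_way', '_too_long'],
})

# Each real row starts a new group; its trailing NaN rows inherit the same id:
gr = df2['File-Name'].notna().cumsum()
print(gr.tolist())  # [1, 1, 2, 3, 3, 3]
```

Rows 3, 4, and 5 all get group id 3, which is why the ''.join in the agg step can stitch together any number of consecutive spill-over fragments.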