I have some annoying csv files with multiple headers of different lengths that look something like this:
data = {'Line': ['0', '0', 'Line', '0', '0'],
'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'],
'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB',
'LibraryFile','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}
df = pd.DataFrame(data)
which looks like:
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
2 Line Date LibraryFile
3 0 8/25/2021 PSI_210825_G2_ASD3.LIB
4 0 8/25/2021 PSI_210825_G2_ASD3.LIB
Each "header" Lines has different column names after the LibraryFile
column so what I want to do is split up the files at each "Line" row and retain that row as the new header with the data underneath it. I've tried to look at options that use split functions but with no luck. Currently I'm trying to use the LibraryFile
column which is unique for each chunk of data. I've tried using the pandas groupby
function
grouped = df.groupby(df['LibraryFile'])
path_to_directory = 'filepath'
for lib in df['LibraryFile'].unique():
temporary_df = grouped.get_group(lib)
temporary_df.to_csv(f'filepath/temp.csv')
This gets me a chunk of data but I can't figure out how best to go from here to retain the "Line" row as the new header for all data chunks.
I also tried numpy:
dfs = np.split(df, np.flatnonzero(df[0] == 'Line'))
print(*dfs, sep='\n\n')
But this just throws an error. I'm unfortunately relearning Python after a long time of not using it so I'm sure there's a solution I'm just ignorant of.
I did a brute force method below.
I also used a snippet from @enke to get the row-header indices. I originally thought all the row-headers had Line
as their value so I commented that out and used the snippet from @enke
Note, I changed the data you had to make the output split data easier to see. I changed row index 2 to have DateX
and LibraryFileX
to see the new header applied, and the values in the column Line
"""
Extract row as header and split up df into chunks
"""
import pandas as pd
# Original data
#data = {'Line': ['0', '0', 'Line', '0', '0'], 'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'],
# 'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFile','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}
# Changed 'Line' Values and the row-header values to show outputs better
data = {'Line': ['0', '1', 'Line', '2', '3'], 'Date': ['8/25/2021', '8/25/2021', 'DateX', '8/25/2021', '8/25/2021'],
'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFileX','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}
# Create DataFrame.
df = pd.DataFrame(data)
# Print the output.
print("INPUT")
print(df)
print("")
FIELD_TO_SEARCH = 'Line'
# VALUE_TO_MATCH_FIELD = 'Line'
# header_indices = df.index[df[FIELD_TO_SEARCH] == VALUE_TO_MATCH_FIELD].tolist()
header_indices = df.index[pd.to_numeric(df[FIELD_TO_SEARCH], errors='coerce').isna()].tolist()
# Add one row past the end so we can have a stopping point later
header_indices.append(df.shape[0] + 1)
# Preallocate output df list with the first chunk (using existing headers).
list_of_dfs = [df.iloc[0:header_indices[0]]]
if len(header_indices) > 1:
for idx in range(len(header_indices) - 1):
# Extract new header
header_index = header_indices[idx]
next_header_index = header_indices[idx + 1]
current_header = df.iloc[[header_index]].values.flatten().tolist()
# Make a df from this chunk
current_df = df[header_index + 1:next_header_index]
# Apply the new header
current_df.columns = current_header
current_df.reset_index(drop=True, inplace=True)
list_of_dfs.append(current_df)
# Show output
print("OUTPUT")
for df_index, current_df in enumerate(list_of_dfs):
print("DF chunk index: {}".format(df_index))
print(current_df)
Here's the output from what I ran:
INPUT
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 1 8/25/2021 PSI_210825_G2_ASD4_F.LIB
2 Line DateX LibraryFileX
3 2 8/25/2021 PSI_210825_G2_ASD3.LIB
4 3 8/25/2021 PSI_210825_G2_ASD3.LIB
OUTPUT
DF chunk index: 0
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 1 8/25/2021 PSI_210825_G2_ASD4_F.LIB
DF chunk index: 1
Line DateX LibraryFileX
0 2 8/25/2021 PSI_210825_G2_ASD3.LIB
1 3 8/25/2021 PSI_210825_G2_ASD3.LIB