python pandas numpy split pandas-groupby

Split a dataframe with multiple header rows into unique dataframes

I have some annoying csv files with multiple headers of different lengths that look something like this:

data = {'Line': ['0', '0', 'Line', '0', '0'], 
        'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'], 
        'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB',
                       'LibraryFile','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}
df = pd.DataFrame(data)

which looks like:

   Line       Date               LibraryFile
0     0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1     0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
2  Line       Date               LibraryFile
3     0  8/25/2021    PSI_210825_G2_ASD3.LIB
4     0  8/25/2021    PSI_210825_G2_ASD3.LIB

Each "header" Lines has different column names after the LibraryFile column so what I want to do is split up the files at each "Line" row and retain that row as the new header with the data underneath it. I've tried to look at options that use split functions but with no luck. Currently I'm trying to use the LibraryFile column which is unique for each chunk of data. I've tried using the pandas groupby function

grouped = df.groupby(df['LibraryFile'])
path_to_directory = 'filepath'
for lib in df['LibraryFile'].unique():
    temporary_df = grouped.get_group(lib)
    temporary_df.to_csv(f'filepath/temp.csv')

This gets me a chunk of data but I can't figure out how best to go from here to retain the "Line" row as the new header for all data chunks.

I also tried numpy:

dfs = np.split(df, np.flatnonzero(df[0] == 'Line'))
print(*dfs, sep='\n\n')

But this just throws an error. I'm unfortunately relearning Python after a long time of not using it so I'm sure there's a solution I'm just ignorant of.

Solution

I did a brute force method below.

I also used a snippet from @enke to get the row-header indices. I originally thought all the row-headers had Line as their value so I commented that out and used the snippet from @enke

Note, I changed the data you had to make the output split data easier to see. I changed row index 2 to have DateX and LibraryFileX to see the new header applied, and the values in the column Line

"""
Extract row as header and split up df into chunks
"""

import pandas as pd

# Original data
#data = {'Line': ['0', '0', 'Line', '0', '0'], 'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'],
#        'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFile','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}

# Changed 'Line' Values and the row-header values to show outputs better
data = {'Line': ['0', '1', 'Line', '2', '3'], 'Date': ['8/25/2021', '8/25/2021', 'DateX', '8/25/2021', '8/25/2021'],
        'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFileX','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}

# Create DataFrame.
df = pd.DataFrame(data)
# Print the output.
print("INPUT")
print(df)
print("")

FIELD_TO_SEARCH = 'Line'
# VALUE_TO_MATCH_FIELD = 'Line'
# header_indices = df.index[df[FIELD_TO_SEARCH] == VALUE_TO_MATCH_FIELD].tolist()
header_indices = df.index[pd.to_numeric(df[FIELD_TO_SEARCH], errors='coerce').isna()].tolist()
# Add one row past the end so we can have a stopping point later
header_indices.append(df.shape[0] + 1)

# Preallocate output df list with the first chunk (using existing headers).
list_of_dfs = [df.iloc[0:header_indices[0]]]

if len(header_indices) > 1:
    for idx in range(len(header_indices) - 1):
        # Extract new header
        header_index = header_indices[idx]
        next_header_index = header_indices[idx + 1]
        current_header = df.iloc[[header_index]].values.flatten().tolist()
    
        # Make a df from this chunk
        current_df = df[header_index + 1:next_header_index]
        # Apply the new header
        current_df.columns = current_header
        current_df.reset_index(drop=True, inplace=True)
        list_of_dfs.append(current_df)

# Show output
print("OUTPUT")
for df_index, current_df in enumerate(list_of_dfs):
    print("DF chunk index: {}".format(df_index))
    print(current_df)

Here's the output from what I ran:

INPUT
   Line       Date               LibraryFile
0     0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1     1  8/25/2021  PSI_210825_G2_ASD4_F.LIB
2  Line      DateX              LibraryFileX
3     2  8/25/2021    PSI_210825_G2_ASD3.LIB
4     3  8/25/2021    PSI_210825_G2_ASD3.LIB

OUTPUT
DF chunk index: 0
  Line       Date               LibraryFile
0    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1    1  8/25/2021  PSI_210825_G2_ASD4_F.LIB
DF chunk index: 1
  Line      DateX            LibraryFileX
0    2  8/25/2021  PSI_210825_G2_ASD3.LIB
1    3  8/25/2021  PSI_210825_G2_ASD3.LIB