Search code examples
pythonpandasdataframecsvreshape

Reshape a stacked style data file into a dataframe using pandas


I have a csv input file with the below format and i'm looking for a fairly easy way to convert to normal shape dataframe in pandas. the csv data file has all data stacked up into two columns with each data block separated by an empty row like below. note for the ease of explanation, i made the timestamp values the same for the three blockset, but in reality they can be different:

Trace Name,SignalName1
Signal,<signal info>
Timestamp,Value
2023-10-04 15:36:43.757193 EDT,13
2023-10-04 15:36:43.829083 EDT,14
2023-10-04 15:36:43.895651 EDT,17
2023-10-04 15:36:43.931145 EDT,11
,
Trace Name,SignalName2
Signal,<signal info>
Timestamp,Value
2023-10-04 15:36:43.757193 EDT,131
2023-10-04 15:36:43.829083 EDT,238
2023-10-04 15:36:43.895651 EDT,413
2023-10-04 15:36:43.931145 EDT,689
,
Trace Name,SignalName3
Signal,<signal info>
Timestamp,Value
2023-10-04 15:36:43.757193 EDT,9867
2023-10-04 15:36:43.829083 EDT,1257
2023-10-04 15:36:43.895651 EDT,5736
2023-10-04 15:36:43.931145 EDT,4935

the desired output after reshaping should look like the following:

    Timestamp           SignalName1 SignalName2 SignalName3
0   10/4/2023 15:36:43  13          131         9867
1   10/4/2023 15:36:43  14          238         1257
2   10/4/2023 15:36:43  17          413         5736
3   10/4/2023 15:36:43  11          689         4935

Solution

  • You could split the file using the multi newlines, then read_csv on each chunk and concat:

    import re
    import io
    
    with open('csv_file.csv') as f:
        out = (pd.concat([pd.read_csv(io.StringIO(chunk),
                                      header=0, skiprows=[1,2])
                            .set_index('Trace Name')
                          for chunk in re.split('(?:\n,)+\n', f.read())
                          if chunk], axis=1)
                 .rename_axis('Timestamp').reset_index()
              )
    

    Assumptions (can be tweaked if needed):

    • there are 3 rows in each header
    • the first row of each header is "Trace Name" then the name to be used as column

    Output:

                 Timestamp  SignalName1  SignalName2  SignalName3
    0  2023-10-04 15:36:43           13          131         9867
    1  2023-10-04 15:36:43           14          238         1257
    2  2023-10-04 15:36:43           17          413         5736
    3  2023-10-04 15:36:43           11          689         4935