Normalising data in dat file with sections into a pandas dataframe

I have a file that show an example of a data as below

CAR REPORT                                  AREA 1                                                                                                                                
               Car  Honda              Country  America                        
                  Type  Car                                  
                 Built. location  Japan             Ship. location China                       
                 Date  2023/01/02        transport  shipping                                                   
                                                                               
============================================================================================================================
 PPM         Price   Age
============================================================================================================================
   2,000     100     12
   3,000     100     13
   3,000     100     13
   3,000     100     13
   3,000     100     13




CAR REPORT                                  AREA 2                                                                                                                                
               Car  Toyota              Country  America                        
                  Type  Car                                
                 Built. location  Japan             Ship. location China                       
                 Date  2023/01/02        transport  shipping                                                
                                                                     
============================================================================================================================
 PPM         Price   Age
============================================================================================================================
   2,000     100     12
   3,000     100     13
   3,000     100     13
   3,000     100     13
   3,000     100     13

Im trying to write a script in pandas/python so that I can produce a dataframe as the below

I tried to do an open file via python and read line by line and what I did was to do a regex to put the subsections like the header portions into a separate dataframe and then perform a merge back. It works but its a little inefficient.

I wonder is there any ready pandas functions which I can leverage instead of looping over the file as there are multiple sections involved and multiple files.

CAR | COUNTRY | TYPE | Built.Location | ship.Location | Date | Transport | PPM | Price| Age
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 2,000 | 100| 12
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 2,000 | 100| 12
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13

Solution

I don't think there are any ready made functions that can read this arbitrary datastructure into a nice table.

First step would be to check where this data comes from. Most likely the original source is in a neat table, and if you can read that (e.g. through some API), your problem disappears.

If not, here is how I would try to transform this data.

# import
from io import StringIO

import pandas as pd

# paste data
# Pasted from the question
data = """
CAR REPORT                                  AREA 1                                                                                                                                
               Car  Honda              Country  America                        
                  Type  Car                                  
<SNIPPED FOR BREVITY>
   3,000     100     13
   3,000     100     13
   3,000     100     13
"""

Now read the data into a list and clean up a bit

# Read the pasted data into a list, one line at a time
lines = StringIO(data).readlines()
# Strip out trailing and leading whitespace
no_whitespace_lines = [x.strip() for x in lines]
clean_lines = [x for x in no_whitespace_lines if x != '']

# Break up the reports, a new report starts when the words "CAR REPORT" appear
reports = []
report = []
for line in clean_lines:
    if line.startswith("CAR REPORT"):
        # Store the existing report before starting anew
        if len(report) > 0:
            reports.append(report)
        # Start a new report
        report = []
        report.append(line)
    else:
        report.append(line)
# The final report never hits a subsequent "CAR REPORT" so store final time here
reports.append(report)

This is "the heart" of the parser and can be improved a lot, especially the metadata parsing. It's now very fragile and breaks if a value (e.g. the country) contains a space. You can look into regex to build a more robust parser.

# Now build a parser for each report
def parser(report: list[str]) -> pd.DataFrame:
    # Assume that this header is identical.
    # This breaks if there are whitespaces in a value, e.g. 'South Korea'
    metadata = dict(
        area=report[0].split()[3],
        car=report[1].split()[1],
        country=report[1].split()[3],
        vehicle_type=report[2].split()[1],
        built_loc=report[3].split()[2],
        ship_loc=report[3].split()[5],
        date=report[4].split()[1],
        transport=report[4].split()[3],
    )

    columns = report[6].split()

    # Find where the tabular data starts and place that into a dataframe
    mark_counter = 0
    dataset = []
    for line in report:
        if line.startswith("====="):
            mark_counter += 1
            continue
        if mark_counter == 2:
            dataset.append(line.split())

    df = pd.DataFrame(data=dataset, columns=columns)
    for k, v in metadata.items():
        df[k] = v

    return df

Now parse your list of reports and concatenate the various reports into one dataframe:

dfs = []
for report in reports:
    dfs.append(parser(report))
pd.concat(dfs).reset_index()

PPM	Price	Age	area	car	country	vehicle_type	built_loc	ship_loc	date	transport
2,000	100	12	1	Honda	America	Car	Japan	China	2023/01/02	shipping
3,000	100	13	1	Honda	America	Car	Japan	China	2023/01/02	shipping
3,000	100	13	1	Honda	America	Car	Japan	China	2023/01/02	shipping
3,000	100	13	1	Honda	America	Car	Japan	China	2023/01/02	shipping
3,000	100	13	1	Honda	America	Car	Japan	China	2023/01/02	shipping
2,000	100	12	2	Toyota	America	Car	Japan	China	2023/01/02	shipping
3,000	100	13	2	Toyota	America	Car	Japan	China	2023/01/02	shipping
3,000	100	13	2	Toyota	America	Car	Japan	China	2023/01/02	shipping
3,000	100	13	2	Toyota	America	Car	Japan	China	2023/01/02	shipping
3,000	100	13	2	Toyota	America	Car	Japan	China	2023/01/02	shipping

Alternatively, if this is a one off, you could try ChatGPT. Provide it with your data and "ask" it to transform it into a tabular format with the columns you want. In my brief test the outcome is quite good.