Search code examples
pythonpandas

Normalising data in dat file with sections into a pandas dataframe


I have a file that show an example of a data as below

CAR REPORT                                  AREA 1                                                                                                                                
               Car  Honda              Country  America                        
                  Type  Car                                  
                 Built. location  Japan             Ship. location China                       
                 Date  2023/01/02        transport  shipping                                                   
                                                                               
============================================================================================================================
 PPM         Price   Age
============================================================================================================================
   2,000     100     12
   3,000     100     13
   3,000     100     13
   3,000     100     13
   3,000     100     13




CAR REPORT                                  AREA 2                                                                                                                                
               Car  Toyota              Country  America                        
                  Type  Car                                
                 Built. location  Japan             Ship. location China                       
                 Date  2023/01/02        transport  shipping                                                
                                                                     
============================================================================================================================
 PPM         Price   Age
============================================================================================================================
   2,000     100     12
   3,000     100     13
   3,000     100     13
   3,000     100     13
   3,000     100     13

Im trying to write a script in pandas/python so that I can produce a dataframe as the below

I tried to do an open file via python and read line by line and what I did was to do a regex to put the subsections like the header portions into a separate dataframe and then perform a merge back. It works but its a little inefficient.

I wonder is there any ready pandas functions which I can leverage instead of looping over the file as there are multiple sections involved and multiple files.

CAR | COUNTRY | TYPE | Built.Location | ship.Location | Date | Transport | PPM | Price| Age
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 2,000 | 100| 12
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Honda| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 2,000 | 100| 12
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13
Toyota| America | Car | Japan | China | 2023/01/02 | shipping | 3,000 | 100| 13

Solution

  • I don't think there are any ready made functions that can read this arbitrary datastructure into a nice table.

    First step would be to check where this data comes from. Most likely the original source is in a neat table, and if you can read that (e.g. through some API), your problem disappears.

    If not, here is how I would try to transform this data.

    # import
    from io import StringIO
    
    import pandas as pd
    
    # paste data
    # Pasted from the question
    data = """
    CAR REPORT                                  AREA 1                                                                                                                                
                   Car  Honda              Country  America                        
                      Type  Car                                  
    <SNIPPED FOR BREVITY>
       3,000     100     13
       3,000     100     13
       3,000     100     13
    """
    

    Now read the data into a list and clean up a bit

    # Read the pasted data into a list, one line at a time
    lines = StringIO(data).readlines()
    # Strip out trailing and leading whitespace
    no_whitespace_lines = [x.strip() for x in lines]
    clean_lines = [x for x in no_whitespace_lines if x != '']
    
    # Break up the reports, a new report starts when the words "CAR REPORT" appear
    reports = []
    report = []
    for line in clean_lines:
        if line.startswith("CAR REPORT"):
            # Store the existing report before starting anew
            if len(report) > 0:
                reports.append(report)
            # Start a new report
            report = []
            report.append(line)
        else:
            report.append(line)
    # The final report never hits a subsequent "CAR REPORT" so store final time here
    reports.append(report)
    

    This is "the heart" of the parser and can be improved a lot, especially the metadata parsing. It's now very fragile and breaks if a value (e.g. the country) contains a space. You can look into regex to build a more robust parser.

    # Now build a parser for each report
    def parser(report: list[str]) -> pd.DataFrame:
        # Assume that this header is identical.
        # This breaks if there are whitespaces in a value, e.g. 'South Korea'
        metadata = dict(
            area=report[0].split()[3],
            car=report[1].split()[1],
            country=report[1].split()[3],
            vehicle_type=report[2].split()[1],
            built_loc=report[3].split()[2],
            ship_loc=report[3].split()[5],
            date=report[4].split()[1],
            transport=report[4].split()[3],
        )
    
        columns = report[6].split()
    
        # Find where the tabular data starts and place that into a dataframe
        mark_counter = 0
        dataset = []
        for line in report:
            if line.startswith("====="):
                mark_counter += 1
                continue
            if mark_counter == 2:
                dataset.append(line.split())
    
        df = pd.DataFrame(data=dataset, columns=columns)
        for k, v in metadata.items():
            df[k] = v
    
        return df
    

    Now parse your list of reports and concatenate the various reports into one dataframe:

    dfs = []
    for report in reports:
        dfs.append(parser(report))
    pd.concat(dfs).reset_index()
    
    PPM Price Age area car country vehicle_type built_loc ship_loc date transport
    2,000 100 12 1 Honda America Car Japan China 2023/01/02 shipping
    3,000 100 13 1 Honda America Car Japan China 2023/01/02 shipping
    3,000 100 13 1 Honda America Car Japan China 2023/01/02 shipping
    3,000 100 13 1 Honda America Car Japan China 2023/01/02 shipping
    3,000 100 13 1 Honda America Car Japan China 2023/01/02 shipping
    2,000 100 12 2 Toyota America Car Japan China 2023/01/02 shipping
    3,000 100 13 2 Toyota America Car Japan China 2023/01/02 shipping
    3,000 100 13 2 Toyota America Car Japan China 2023/01/02 shipping
    3,000 100 13 2 Toyota America Car Japan China 2023/01/02 shipping
    3,000 100 13 2 Toyota America Car Japan China 2023/01/02 shipping

    Alternatively, if this is a one off, you could try ChatGPT. Provide it with your data and "ask" it to transform it into a tabular format with the columns you want. In my brief test the outcome is quite good.