Tags: python, pandas, csv, large-data, data-augmentation

What is the most efficient way to read and augment (copy samples and change some values) a large dataset in .csv?


Currently, I have managed to solve this, but it is slower than what I need: it takes approximately 1 hour for 500k samples, and the entire dataset is ~100M samples, which would require ~200 hours.

Hardware/Software specs: RAM 8GB, Windows 11 64bit, Python 3.8.8

The problem:
I have a dataset in .csv (~13GB) where each sample has a value and a respective start-end period of a few months. I want to create a dataset where each sample will have the same value but referring to each specific month.

For example:

from:

idx | start date | end date   | month | year | value
0   | 20/05/2022 | 20/07/2022 | 0     | 0    | X

to:

idx | start date | end date   | month | year | value
0   | 20/05/2022 | 20/07/2022 | 5     | 2022 | X
1   | 20/05/2022 | 20/07/2022 | 6     | 2022 | X
2   | 20/05/2022 | 20/07/2022 | 7     | 2022 | X

Ideas: do it in parallel (e.g. with Dask), though I am not sure how to set that up for this task; a sketch follows below.
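For what it's worth, the per-row expansion parallelizes naturally over Dask partitions, since each row is independent. Below is a minimal, untested sketch assuming the column names and DD/MM/YYYY date format from the example above, the ';' separator used further down, and an arbitrary 64MB block size:

import dask.dataframe as dd
import pandas as pd

def explode_months(pdf):
    """Expand a pandas partition: one output row per month between start and end."""
    pdf = pdf.copy()
    pdf["period"] = [
        list(pd.period_range(s, e, freq="M"))
        for s, e in zip(
            pd.to_datetime(pdf["start date"], format="%d/%m/%Y"),
            pd.to_datetime(pdf["end date"], format="%d/%m/%Y"),
        )
    ]
    pdf = pdf.explode("period")
    pdf["month"] = pdf["period"].map(lambda p: p.month)
    pdf["year"] = pdf["period"].map(lambda p: p.year)
    return pdf.drop(columns="period")

ddf = dd.read_csv("enc_star_logar_ek.csv", sep=";", blocksize="64MB")
meta = explode_months(ddf.head(1))  # run once on a small sample to infer the output schema
ddf.map_partitions(explode_months, meta=meta).to_csv(
    "augmented-*.csv", sep=";", index=False
)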

My implementation:
Chunked reads in pandas, augmentation in dictionaries, appending to a CSV. I use a function that, given a df, calculates for each sample the months from start date to end date, creates a copy of the sample for each month, and appends it to a dictionary. It then returns the final dictionary.

The calculations are done in dictionaries because they were found to be much faster than doing them in pandas. I then iterate through the original CSV in chunks, apply the function to each chunk, and append the resulting augmented data to another CSV.

The function:

import pandas as pd


def augment_to_monthly_dict(chunk):
    '''
    Takes a df (or sub-df) of data and returns an augmented dataset with
    monthly data, in dictionary form (for efficiency).
    '''
    aug = {}  # avoid shadowing the builtin `dict`
    l = 1
    for i in range(len(chunk)):  # iterate through every sample
        # Find the month/year period (dates are YYYYMM...-style strings)
        mst = int(str(chunk.iloc[i].start)[4:6])   # start month
        mend = int(str(chunk.iloc[i].end)[4:6])    # end month
        yst = int(str(chunk.iloc[i].start)[:4])    # start year
        yend = int(str(chunk.iloc[i].end)[:4])     # end year

        if yend == yst:
            months = list(range(mst, mend + 1))
            years = [yend] * len(months)
        elif yend == yst + 1:  # the period crosses a year boundary
            months = list(range(mst, 13)) + list(range(1, mend + 1))
            years = [yst] * (13 - mst) + [yend] * mend
        else:  # periods spanning more than two calendar years are skipped
            continue
        # `months` lists each month in the sample's period, and `years` is a
        # same-length list of the respective years,
        # e.g. months=[11, 12, 1, 2], years=[2021, 2022, 2022, 2022]

        for j in range(len(months)):  # iterate through the list of months
            # copy the original sample and make it a dictionary
            tmp = pd.DataFrame(chunk.iloc[i]).transpose().to_dict(orient='records')

            # change the month and year values accordingly (they were 0 initially)
            tmp[0]['month'] = months[j]
            tmp[0]['year'] = years[j]
            # More calculations could go here, e.g. dropping irrelevant columns
            # or changing datatypes to reduce size.

            # Append the new row to the augmented data
            aug[l] = tmp[0]
            l += 1
    return aug
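As an aside, most of the runtime in the loop above goes into building a one-row DataFrame for every output row. A hedged sketch of the same logic (same column names and string-slicing assumptions as above) that calls to_dict once per chunk and copies plain dicts instead:

def augment_to_monthly_records(chunk):
    """Same augmentation, but the chunk is converted to records once; each
    output row is then a cheap shallow copy of a plain dict."""
    out = []
    for rec in chunk.to_dict(orient='records'):
        mst, yst = int(str(rec['start'])[4:6]), int(str(rec['start'])[:4])
        mend, yend = int(str(rec['end'])[4:6]), int(str(rec['end'])[:4])
        if yend == yst:
            pairs = [(m, yst) for m in range(mst, mend + 1)]
        elif yend == yst + 1:  # period crosses a year boundary
            pairs = ([(m, yst) for m in range(mst, 13)]
                     + [(m, yend) for m in range(1, mend + 1)])
        else:
            continue
        for m, y in pairs:
            new_rec = dict(rec)  # shallow copy
            new_rec['month'], new_rec['year'] = m, y
            out.append(new_rec)
    return out  # a list of row-dicts, ready for csv.DictWriter.writerows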

Reading the original dataset (.csv, ~13GB), augmenting it with the function, and appending the result to a new .csv:

import csv

import pandas as pd

chunk_count = 0
for chunk in pd.read_csv('enc_star_logar_ek.csv', delimiter=';', chunksize=10000):

    chunk = chunk.reset_index(drop=True)

    aug_dict = augment_to_monthly_dict(chunk)  # augment the chunk via dictionaries, which is faster
    chunk_count += 1

    if chunk_count == 1:  # get the column names, then write the header and the 1st chunk
        # The column names are the keys of the first dict (no need to read all the data)
        for kk in aug_dict.values():
            key_names = list(kk.keys())
            print(key_names)
            break  # break after the first dict

        # Open the csv file and write ';'-separated data
        with open('dic_to_csv2.csv', 'w', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, delimiter=';', fieldnames=key_names)
            writer.writeheader()
            writer.writerows(aug_dict.values())

    else:  # append the remaining chunks
        print('added chunk: ', chunk_count)
        with open('dic_to_csv2.csv', 'a', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, delimiter=';', fieldnames=key_names)
            writer.writerows(aug_dict.values())

Solution

  • Pandas's efficiency comes into play when you need to manipulate columns of data; to do that, Pandas reads the input row by row, building up a series of data for each column. That's a lot of extra computation your problem doesn't benefit from, and in fact it just slows your solution down.

    You actually need to manipulate rows, and for that the fastest way is to use the standard csv module; all you need to do is read a row in, write the derived rows out, and repeat:

    import csv
    import sys
    
    from datetime import datetime
    
    
    def parse_dt(s):
        return datetime.strptime(s, r"%d/%m/%Y")
    
    
    def get_dt_range(beg_dt, end_dt):
        """
        Returns a range of (month, year) tuples, from beg_dt up-to-and-including end_dt.
        """
        if end_dt < beg_dt:
            raise ValueError(f"end {end_dt} is before beg {beg_dt}")
    
        mo, yr = beg_dt.month, beg_dt.year
    
        dt_range = []
        while True:
            dt_range.append((mo, yr))
            if mo == 12:
                mo = 1
                yr = yr + 1
            else:
                mo += 1
            if (yr, mo) > (end_dt.year, end_dt.month):
                break
    
        return dt_range
    
    
    fname = sys.argv[1]
    with open(fname, newline="") as f_in, open("output_csv.csv", "w", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        writer.writerow(next(reader))  # transfer header
    
        for row in reader:
            beg_dt = parse_dt(row[1])
            end_dt = parse_dt(row[2])
            for mo, yr in get_dt_range(beg_dt, end_dt):
                row[3] = mo
                row[4] = yr
                writer.writerow(row)
    

    And, to compare with Pandas in general, let's examine @abokey's specific Pandas solution. I'm not sure if there is a better Pandas implementation, but this one kinda does the right thing:

    import sys
    import pandas as pd
    
    fname = sys.argv[1]
    df = pd.read_csv(fname)
    
    df["start date"] = pd.to_datetime(df["start date"], format="%d/%m/%Y")
    df["end date"] = pd.to_datetime(df["end date"], format="%d/%m/%Y")
    
    df["month"] = df.apply(
        lambda x: pd.date_range(
            start=x["start date"], end=x["end date"] + pd.DateOffset(months=1), freq="M"
        ).month.tolist(),
        axis=1,
    )
    df["year"] = df["start date"].dt.year
    
    out = df.explode("month").reset_index(drop=True)
    
    out.to_csv("output_pd.csv")
    

    Let's start with the basics, though: do the programs actually do the right thing? Given this input:

    idx,start date,end date,month,year,value
    0,20/05/2022,20/05/2022,0,0,X
    0,20/05/2022,20/07/2022,0,0,X
    0,20/12/2022,20/01/2023,0,0,X
    

    My program, ./main.py input.csv, produces:

    idx,start date,end date,month,year,value
    0,20/05/2022,20/05/2022,5,2022,X
    0,20/05/2022,20/07/2022,5,2022,X
    0,20/05/2022,20/07/2022,6,2022,X
    0,20/05/2022,20/07/2022,7,2022,X
    0,20/12/2022,20/01/2023,12,2022,X
    0,20/12/2022,20/01/2023,1,2023,X
    

    I believe that's what you're looking for.

    The Pandas solution, ./main_pd.py input.csv, produces:

    ,idx,start date,end date,month,year,value
    0,0,2022-05-20,2022-05-20,5,2022,X
    1,0,2022-05-20,2022-07-20,5,2022,X
    2,0,2022-05-20,2022-07-20,6,2022,X
    3,0,2022-05-20,2022-07-20,7,2022,X
    4,0,2022-12-20,2023-01-20,12,2022,X
    5,0,2022-12-20,2023-01-20,1,2022,X
    

    Ignoring the added column for the frame index, and the fact that the date format has changed (both fixable, see the sketch below), it does the right thing with regards to creating new rows with the appropriate date range. There is one real discrepancy, though: the year is taken from the start date only, so the last row above says 2022 where it should say 2023.
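    A possible fix, sketched here but not re-timed: derive the year the same way as the month, explode both columns together (which needs pandas >= 1.3), and pass to_csv arguments that drop the index and keep the original date format. This would replace the year, explode, and to_csv lines of the Pandas script above:

    df["year"] = df.apply(
        lambda x: pd.date_range(
            start=x["start date"], end=x["end date"] + pd.DateOffset(months=1), freq="M"
        ).year.tolist(),
        axis=1,
    )
    out = df.explode(["month", "year"]).reset_index(drop=True)

    # index=False drops the frame index; date_format restores DD/MM/YYYY
    out.to_csv("output_pd.csv", index=False, date_format="%d/%m/%Y")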

    So, both do essentially the right thing. Now, on to performance. I duplicated your initial sample, just the 1 row, for 1_000_000 and 10_000_000 rows:

    import sys
    
    nrows = int(sys.argv[1])
    with open(f"input_{nrows}.csv", "w") as f:
        f.write("idx,start date,end date,month,year,value\n")
        for _ in range(nrows):
            f.write("0,20/05/2022,20/07/2022,0,0,X\n")
    

    I'm running a 2020 M1 MacBook Air with a 2TB SSD (which gives very good read/write speeds):

                  1M rows (sec, RAM)    10M rows (sec, RAM)
    csv module    7.8s, 6MB             78s, 6MB
    Pandas        75s, 569MB            750s, 5.8GB

    You can see both programs' run time increasing linearly with the number of rows. The csv module's memory use remains constant and negligible because it streams data in and out, holding on to virtually nothing; Pandas's memory use rises with the number of rows it has to hold so that it can do the actual date-range computations, again on whole columns. Also, not shown, but for the 10M-row Pandas test, Pandas spent nearly 2 minutes just writing the CSV, longer than the csv-module approach took to complete the entire task.

    Now, for all my putting-down of Pandas: its solution is far fewer lines and, apart from the year issue noted above, worked from the get-go. I did have a problem writing get_dt_range(), and had to spend about 5 minutes thinking about what it actually needed to do and debugging it.

    You can view my setup with the small test harness, and the results, here.