Tags: python, pandas, dataframe, large-data, large-files

How to remove part of a pandas dataframe without loading the full file?


I have a very large dataframe with many millions of rows, and it is normally not feasible to load the entire file into memory to work with it. Recently some bad data got in and I need to remove it from the database. So far what I've done is:

    import pandas as pd

    file = '/path to database'
    rf = pd.read_csv(f'{file}.csv', chunksize=3000000, index_col=False)
    res = pd.concat([chunk[chunk['timestamp'] < 1.6636434764745E+018] for chunk in rf])
    res.to_csv(f'{file}.csv', index=False)

Basically it opens the database, keeps the part I want, and overwrites the original file.

However, the data has grown so large that even this approach no longer fits in memory. Is there a better way to truncate part of the dataframe based on a simple query?

The part to be removed is usually very small compared to the rest, say 100k rows, and it is always at the end.
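
For reference, the memory blow-up in the snippet above comes from pd.concat holding every filtered chunk at once; appending each filtered chunk to a new file as it is read keeps only one chunk in memory at a time. A rough sketch of that idea, reusing the column name and threshold from the snippet above (the output filename is just a placeholder):

    import pandas as pd

    file = '/path to database'
    # stream the csv in chunks and append each filtered chunk to a new file,
    # so only one chunk is ever held in memory at a time
    first = True
    for chunk in pd.read_csv(f'{file}.csv', chunksize=3000000, index_col=False):
        kept = chunk[chunk['timestamp'] < 1.6636434764745E+018]
        kept.to_csv(f'{file}.filtered.csv', mode='w' if first else 'a',
                    header=first, index=False)
        first = False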


Solution

  • I would avoid using pandas in this case and just edit the CSV file directly. For example:

    import csv

    # stream the rows: read one line at a time and write back only the rows to keep
    with open("test_big.csv", "r", newline="") as f_in, open("test_out.csv", "w", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        for row in reader:
            if int(row[-1]) > 9900:  # your condition here
                writer.writerow(row)
    

    For context, test_big.csv looks like this:

    1,2,3,4,5891
    1,2,3,4,7286
    1,2,3,4,7917
    1,2,3,4,937
    ...
    

    And is 400,000 records long. Execution took 0.2s.

    Edit: Ran with 40,000,000 records and took 15s.
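
    If, as in the question, the result should end up replacing the original file, one way to finish is to write the filtered rows to a separate output as above and then swap it in. A minimal sketch, assuming the same placeholder filenames used above:

    import os

    # atomically swap the filtered copy in place of the original
    # (both files should live on the same filesystem)
    os.replace("test_out.csv", "test_big.csv")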