python, pandas, large-files, geopandas, geopackage

Slicing a large file, removing duplicates and merging into output using Pandas


So, I have a GeoPackage with 1.25 billion features. The file doesn't actually contain any geometry and has only one attribute, 'id', which is meant to be a unique id. There are a lot of duplicates, and I want to remove the duplicated 'id' values and keep only unique ones. Because of the sheer amount of data (the GeoPackage is 19 GB), I went with slicing the file into chunks. I tried multiprocessing, but that didn't work out: I have to keep track of the unique 'id' values across chunks, and (to my knowledge at least) multiprocessing doesn't let the workers share that state.

What I have:

import geopandas as gpd
import pandas as pd

slice_count = 200
start = 0
end = slice_count
fname = "path/Output.gpkg"

# Read the first slice of rows from the GeoPackage
file_gpd = gpd.read_file(fname, rows=slice(start, end))
chunk = pd.DataFrame(file_gpd)

only_ids = pd.DataFrame(columns=['id'])
loop = True
while loop:
    try:
        # Drop duplicates within the current chunk
        chunk = chunk.drop_duplicates(subset=['id'])

        # Keep only the 'id' column from the chunk to save memory
        only_ids_in_chunk = chunk[['id']]

        # Merge into the running set of unique ids
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        only_ids = pd.concat([only_ids, only_ids_in_chunk])
        only_ids = only_ids.drop_duplicates(subset=['id'])

        # We must not keep all chunks in memory at once, so release the
        # current chunk before loading the next one
        del chunk

        # Load the next chunk
        start += slice_count
        end += slice_count
        file_gpd = gpd.read_file(fname, rows=slice(start, end))
        chunk = pd.DataFrame(file_gpd)
        if len(chunk) == 0:
            print(len(only_ids))
            loop = False
    except Exception:
        loop = False
        print("Iteration is stopped")

I am getting an infinite loop. I thought the if statement would catch the point where the length of the chunk drops to 0, i.e. when the slicing reaches the end of the file.
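
It turns out the chunk never comes back empty at all, which the solution below pins down. A quick way to see it on a given file is to count the features with fiona and then request a slice that starts past the end (a minimal diagnostic sketch, not part of the original code):

import fiona
import geopandas as gpd

fname = "path/Output.gpkg"

# Total number of features in the layer
with fiona.open(fname) as src:
    total = len(src)

# A slice starting past the last feature should come back empty, but
# the read wraps around to the start of the file instead, so the
# len(chunk) == 0 check above is never reached.
past_end = gpd.read_file(fname, rows=slice(total, total + 10))
print(len(past_end))  # expected 0, actually > 0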


Solution

  • So, here is the final script. The issue I was having is that when you slice a GeoPackage with geopandas and the slice runs past the end of the file, the read wraps around and starts again from the beginning instead of returning an empty frame, so the len(chunk) == 0 check never fires. I added an if statement at the end of the loop to cover that: a chunk shorter than slice_count can only be the last one.

    import geopandas as gpd
    import pandas as pd
    import logging
    import time

    FORMAT = '%(asctime)s:%(name)s:%(levelname)s - %(message)s'
    logging.basicConfig(format=FORMAT, level=logging.INFO)

    slice_count = 20000000
    start = 0
    end = slice_count
    fname = "/Output.gpkg"

    # ignore_geometry=True skips geometry parsing, which this file doesn't need
    chunk = gpd.read_file(fname, rows=slice(start, end), ignore_geometry=True)

    only_ids = pd.DataFrame(columns=['id'])
    chunk_num = 1
    while True:
        start_time = time.time()

        # Drop duplicates within the chunk, then merge into the running
        # set of unique ids (pd.concat replaces the removed DataFrame.append)
        chunk = chunk.drop_duplicates(subset=['id'])
        only_ids = pd.concat([only_ids, chunk])
        only_ids = only_ids.drop_duplicates(subset=['id'])

        # Release the chunk before loading the next one to save memory
        del chunk

        # Load the next chunk
        start += slice_count
        end += slice_count
        chunk = gpd.read_file(fname, rows=slice(start, end), ignore_geometry=True)

        logging.info(f"Chunk {chunk_num} done")
        print(f"Duration: {time.time() - start_time}")
        chunk_num += 1

        # A chunk shorter than slice_count can only be the final one:
        # process it and stop before the next read wraps around to the
        # start of the file.
        if len(chunk) != slice_count:
            chunk = chunk.drop_duplicates(subset=['id'])
            only_ids = pd.concat([only_ids, chunk])
            only_ids = only_ids.drop_duplicates(subset=['id'])
            del chunk
            break

    only_ids.to_csv('output.csv')
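
A note on the design: only_ids is a single-column DataFrame that gets re-deduplicated on every iteration. The same bookkeeping can be done with a plain Python set, which makes each membership test and insertion O(1) per id and avoids rebuilding the frame on every pass. A minimal sketch of that variant (not the original script; it keeps the same stopping rule and, like the script above, assumes the feature count is not an exact multiple of slice_count):

    import geopandas as gpd
    import pandas as pd

    slice_count = 20000000
    start = 0
    fname = "/Output.gpkg"

    unique_ids = set()  # every distinct id seen so far
    while True:
        # ignore_geometry=True skips geometry parsing entirely
        chunk = gpd.read_file(
            fname, rows=slice(start, start + slice_count), ignore_geometry=True
        )
        unique_ids.update(chunk['id'])
        start += slice_count
        # A short chunk can only be the final one; stop before the next
        # read wraps around to the beginning of the file.
        if len(chunk) < slice_count:
            break

    pd.Series(list(unique_ids), name='id').to_csv('output.csv', index=False)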