Tags: python, pandas, parquet, fastparquet

Is it possible to store a parquet file on disk, while appending, and also retrieving rows by index?


I have 185 files of data, containing a total of 30 million rows. Each file has two columns: a single int which I want to use as an index, and a list of 512 ints.

So it looks something like this

IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131    [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716    [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098    [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...

The data is too large to load into memory, but I would like to retrieve, say, a couple hundred rows at a time using a list of indices.

I got advice from a comment on this question to use Parquet: "Most efficient way of saving a pandas dataframe or 2d numpy array into h5py, with each row a seperate key, using a column".

I've been looking at the official PyArrow Parquet guide (https://arrow.apache.org/docs/python/parquet.html) and the fastparquet guide (https://fastparquet.readthedocs.io/en/latest/api.html).

But I can't find any way to retrieve a row by its index, or any way to tell whether the table stays on disk or gets loaded entirely into memory.

Is this possible? If so, how would I do something like this?

For example

ParquetTable[22861131, 15760716]

[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1... [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...


Solution

  • Parquet is a columnar data store, optimized for scanning whole columns rather than looking up individual rows by key, so it will not fit your use case. If your goal is to store data too large to fit in memory, and yet still be able to retrieve a few rows at a time to work on, I would suggest you use a database.

    The simplest database to start with is sqlite3, which is built into Python itself. SQLite databases are stored as single files, so you don't need to set up a separate database server.

    Before we continue, you'll need to break down your column of lists into 512 separate columns, because (1) that makes the values easier to work with, and (2) SQLite has no native column type for storing Python lists. A rough sketch of that split follows below.
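
    One way to do the split (a sketch, assuming a DataFrame df with an IndexID index and an Ids column of 512-element lists, as in your sample above) is:

    # Expand the list column into 512 separate integer columns (sketch)
    import pandas as pd

    wide = pd.DataFrame(
        df['Ids'].tolist(),                        # each 512-element list becomes one row of 512 values
        index=df.index,                            # keep IndexID as the index
        columns=[f'id_{i}' for i in range(512)]    # id_0 ... id_511
    )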

    Here's a minimal example on how you can dump your data and retrieve the rows you need:

    # Creating sample data to work on
    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(
            data=np.random.randint(low=1, high=1000, size=(1000, 3)),
            columns=['a', 'b', 'c'],
            index=pd.Series(range(1000), name='IndexID')
    )
    
    import sqlite3
    
    # Write dataframes to database
    with sqlite3.connect('sqlite.db') as conn:
        df.to_sql('data', con=conn, if_exists='append')
    

    The code above dumps the contents of df into a SQLite database named sqlite.db in your current working directory. Notice the if_exists option in the to_sql call; change it to 'replace' if you want to overwrite an existing table instead of appending to it.
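
    Because you have 185 files, 'append' lets you build the database incrementally, one file at a time. A minimal sketch, where load_file() is a hypothetical helper standing in for however you currently read one of your files into a DataFrame:

    import sqlite3

    with sqlite3.connect('sqlite.db') as conn:
        for path in list_of_185_paths:      # hypothetical list of your file paths
            df = load_file(path)            # hypothetical: read one file into a DataFrame
            df.to_sql('data', con=conn, if_exists='append')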

    When you want to retrieve specific rows from that database, run the following in the same working directory. The example below retrieves the 200th through 210th rows:

    # How to read from database
    with sqlite3.connect('sqlite.db') as conn:
        # `rowid` is SQLite's implicit row number (starting at 1), not your IndexID column
        query = "SELECT * FROM data WHERE rowid BETWEEN %d AND %d" % (200, 210)
        subset = pd.read_sql(query, con=conn)
    
    print(subset)
    # This prints the following
    #     IndexID    a    b    c
    # 0       199  704    3  423
    # 1       200  590  299  767
    # 2       201   45  953  560
    # 3       202  237  662  746
    # 4       203  123  920  275
    # 5       204  453   10  370
    # 6       205   35  628  602
    # 7       206  957  465  735
    # 8       207  602  810  154
    # 9       208  927  796  352
    # 10      209  969  130  217
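
    Since you actually want to fetch a couple hundred rows at a time by a list of IndexIDs rather than by position, you can also filter on the IndexID column directly. A sketch against the toy table above (with your real data, wanted would hold values like 22861131 and 15760716):

    # Fetch specific rows by a list of IndexID values
    wanted = [200, 205, 209]                          # a few IndexIDs from the toy data above
    placeholders = ', '.join('?' for _ in wanted)     # one '?' placeholder per value
    query = "SELECT * FROM data WHERE IndexID IN (%s)" % placeholders

    with sqlite3.connect('sqlite.db') as conn:
        subset = pd.read_sql(query, con=conn, params=wanted)

    At 30 million rows you'll also want an index on that column (CREATE INDEX idx_indexid ON data(IndexID)) so these lookups stay fast.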
    

    For more info on moving data between sqlite3 and pandas, I recommend https://www.dataquest.io/blog/python-pandas-databases/ and the question "How to open and convert sqlite database to pandas dataframe".