
Does Pandas have a dataframe length limit?


I want to build a system that loads large amounts of data into pandas for analysis, and later writes the results back to .parquet files.

When I test this with a simple example, I see what looks like some kind of built-in limit on the number of rows:

import pandas as pd

# Create file with 100 000 000 rows
contents = """
Tommy;19
Karen;20
"""*50000000

open("person.csv","w").write(
f"""
Name;Age
{contents}
"""
)
print("Test generated")

df = pd.read_csv("person.csv",delimiter=";")
len(df)

This returns 10 000 000, not 100 000 000.


Solution

  • Change the method used to create the file: I think you have too many blank rows, and you don't close the file properly (no context manager or explicit close() call), so buffered output may never be fully flushed to disk. Note the backslash after the opening """ below, which suppresses the leading blank line:

    # Create file with 100 000 000 rows
    contents = """\
    Tommy;19
    Karen;20
    """*50000000
    
    with open('person.csv', 'w') as fp:  # context manager flushes and closes the file
        fp.write('Name;Age\n')           # header row
        fp.write(contents)               # 100 000 000 data rows
    
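    Before involving pandas at all, you can sanity-check the file by counting its lines directly (a quick sketch; the subtraction accounts for the header row):

    with open('person.csv') as fp:
        n_rows = sum(1 for _ in fp) - 1  # subtract the header line
    print(n_rows)  # expected: 100000000
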

    Read the file:

    df = pd.read_csv('person.csv', delimiter=';')
    print(df)
    
    # Output
               Name  Age
    0         Tommy   19
    1         Karen   20
    2         Tommy   19
    3         Karen   20
    4         Tommy   19
    ...         ...  ...
    99999995  Karen   20
    99999996  Tommy   19
    99999997  Karen   20
    99999998  Tommy   19
    99999999  Karen   20
    
    [100000000 rows x 2 columns]
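
    Once the DataFrame loads with all 100 000 000 rows, writing it back to Parquet (the question's stated end goal) is a one-liner. A minimal sketch, assuming a Parquet engine such as pyarrow or fastparquet is installed:

    # Write the DataFrame back to a .parquet file
    df.to_parquet('person.parquet', index=False)

    # Round-trip check: read it back and confirm the row count
    print(len(pd.read_parquet('person.parquet')))  # 100000000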