Search code examples
pythonsqlpandascsvpandasql

Reading Large CSV and splitting it into smaller chunks


I'm trying to read and analyze a large csv file (11.5 GB) using python. And then use Power BI to create some visuals around it. But everytime I run any command line or even making changes to the dataframe in Power BI, it takes about 20-30 mins between each change.

One of the column heading is DeviceID. I would like to split the large CSV into multiple csv files so that each file will have data that belongs to the one unique DeviceID value.

Currently the dataframe looks like this in the single Full.csv file looks like this:

DeviceID    AreaName     Longitude    Latitude
12311       Dubai        55.55431     25.45631
12311       Dubai        55.55432     25.45634
12311       Dubai        55.55433     25.45637
12311       Dubai        55.55431     25.45621
12309       Dubai        55.55427     25.45627
12309       Dubai        55.55436     25.45655
12412       Dubai        55.55441     25.45657
12412       Dubai        55.55442     25.45656

After running the code, the single Full.csv file should produce 3 csv files: 12311.csv , 12309.csv , 12412.csv with each looking like this:

DeviceID    AreaName     Longitude    Latitude
12311       Dubai        55.55431     25.45631
12311       Dubai        55.55432     25.45634
12311       Dubai        55.55433     25.45637
12311       Dubai        55.55431     25.45621

AND

DeviceID    AreaName     Longitude    Latitude
12309       Dubai        55.55427     25.45627
12309       Dubai        55.55436     25.45655

AND

DeviceID    AreaName     Longitude    Latitude
12412       Dubai        55.55441     25.45657
12412       Dubai        55.55442     25.45656

I read that the best way to deal with large files in python is to use the pandasql module. Would I be able to achieve what I have described above using pandsql?

Thanks


Solution

  • One of the column heading is DeviceID. I would like to split the large CSV into multiple csv files so that each file will have data that belongs to the one unique DeviceID value.

    I do not think that this will speed up your process in PowerBI, do you do your calculation in PowerQuery or in PowerBI themself?

    But anyway, you can create a list of your unique values for DeviceID:

    df = pd.read_csv('Full.csv')
    uniquelist = list(df['DeviceID'].unique())
    

    and then split it based on this list and save it into csv files:

    for i in uniquelist:
       i = df.loc[df['DeviceID'] == i]
       i.to_csv