Tags: python, pandas, dataframe, dataset

Best way of reading a large dataset and dividing it into separate files by a specific ID column


I'm trying to read a large dataset (over 6 GB, more than 20 columns) into a pandas DataFrame. The data looks like this (but separated by tabs instead of commas):

0,   -1,  -1, [...], 52FC53D2C1C2, C27BFF0305
1, 30.0, 200, [...], FB490491C2F7, DA55A029E7
1, 20.0, 192, [...], 52FC53D2C1C2, 5E3822B0E0
0,   -1,  -1, [...], 9702BF6338D3, 7D1FB49049

The second-to-last column is an ID (not unique) by which I need the dataset to be split into smaller, separate files. For example:

# File 1
0,   -1,  -1, [...], 52FC53D2C1C2, C27BFF0305
1, 20.0, 192, [...], 52FC53D2C1C2, 5E3822B0E0

# File 2
1, 30.0, 200, [...], FB490491C2F7, DA55A029E7

# File 3
0,   -1,  -1, [...], 9702BF6338D3, 7D1FB49049

What I tried was reading the dataset into a DataFrame with pandas.read_csv:

colnames = ["aaa", "bbb", "ccc", [...], "ID", "zzz"]
data = pd.read_csv(path, engine='python', header=None, names=colnames, sep=r'\t')

Then I tried to get the unique IDs with the DataFrame's .unique() method, select the rows matching each ID, and export them to separate CSV files:

import os

unique_ids = data['ID'].unique().tolist()
for val in unique_ids:
    data[data['ID'] == val].to_csv(os.path.join(path + 'ID' + str(val) + '.csv'),
                                   index=False, na_rep='N/A')

I tested it on a smaller sample and it worked fine, but running it on the full dataset takes forever. Is there any way, in pandas or maybe another library, to read and process the file much faster? Thanks!


Solution

  • You can read the file in chunks (read_csv's chunksize parameter) and use groupby to write the data: each chunk is split by ID and appended to the corresponding file in a single pass, so the whole 6 GB file is never held in memory and the DataFrame is not re-scanned once per unique ID:

    import pandas as pd
    import pathlib
    
    CHUNKSIZE = 10000
    DATA_DIR = pathlib.Path('.')
    
    # Same column names as in the question (placeholder columns omitted here).
    colnames = ["aaa", "bbb", "ccc", "ID", "zzz"]
    
    # Read the tab-separated file CHUNKSIZE rows at a time.
    for chunk in pd.read_csv(DATA_DIR / 'data.csv', sep='\t', chunksize=CHUNKSIZE, header=None, names=colnames):
        # Append each ID's rows to its own file; write the header only when the file is first created.
        for ID, df in chunk.groupby('ID'):
            file = DATA_DIR / f'ID{ID}.csv'
            df.to_csv(file, mode='a', header=not file.exists(), na_rep='N/A', index=False)
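
  Note that because the output files are opened with mode='a', re-running the script without deleting the previously generated ID*.csv files will append the same rows again, so clear out old output files between runs.

  • Alternatively, if pandas is only needed for the split itself, a single streaming pass with the standard-library csv module can also work. This is just a minimal sketch, not a drop-in replacement: the data.csv file name, the ID sitting in the second-to-last column, and the ID{...}.csv naming are assumptions carried over from the example above, it writes no header row, and it keeps one file handle open per distinct ID (which avoids reopening files for every row, but is only practical while the number of IDs stays below the OS limit on open files):

    import csv
    import pathlib
    
    DATA_DIR = pathlib.Path('.')
    
    handles = {}   # one open output file per ID
    writers = {}   # matching csv.writer per ID
    
    # Stream the tab-separated input once, routing each row to its ID's file.
    with open(DATA_DIR / 'data.csv', newline='') as src:
        reader = csv.reader(src, delimiter='\t')
        for row in reader:
            row_id = row[-2]  # the ID is assumed to be the second-to-last column
            if row_id not in writers:
                handles[row_id] = open(DATA_DIR / f'ID{row_id}.csv', 'w', newline='')
                writers[row_id] = csv.writer(handles[row_id])
            writers[row_id].writerow(row)
    
    for handle in handles.values():
        handle.close()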