I'm trying to read a large dataset (over 6 GB, with over 20 columns) into a pandas DataFrame. The data looks like this (but separated by tabs instead of commas):
0, -1, -1, [...], 52FC53D2C1C2, C27BFF0305
1, 30.0, 200, [...], FB490491C2F7, DA55A029E7
1, 20.0, 192, [...], 52FC53D2C1C2, 5E3822B0E0
0, -1, -1, [...], 9702BF6338D3, 7D1FB49049
The second-to-last column is an ID (not unique) by which I need the dataset split into smaller, separate files. For example:
# File 1
0, -1, -1, [...], 52FC53D2C1C2, C27BFF0305
1, 20.0, 192, [...], 52FC53D2C1C2, 5E3822B0E0
# File 2
1, 30.0, 200, [...], FB490491C2F7, DA55A029E7
# File 3
0, -1, -1, [...], 9702BF6338D3, 7D1FB49049
What I tried was to read the dataset into a DataFrame with pandas.read_csv:
import os
import pandas as pd

colnames = ["aaa", "bbb", "ccc", [...], "ID", "zzz"]
data = pd.read_csv(path, engine='python', header=None, names=colnames, sep=r'\t')
Then I tried to get the unique IDs with the DataFrame's .unique() method, filter the rows for each specific ID, and export them to separate CSV files:
unique_ids = data['ID'].unique()
unique_ids = unique_ids.tolist()
for i, val in enumerate(unique_ids):
    data[data['ID'] == val].to_csv(os.path.join(path + 'ID' + str(val) + '.csv'),
                                   index=False, na_rep='N/A')
I tested it on a smaller sample and it worked fine, but running it on the full dataset takes forever. Is there any way, in pandas or maybe another library, to read and process the file much faster? Thanks!
You can read your file in chunks and use groupby to write the data:
import pandas as pd
import pathlib
CHUNKSIZE = 10000
DATA_DIR = pathlib.Path('.')
colnames = ["aaa", "bbb", "ccc", "ID", "zzz"]
# Read the file chunk by chunk so the whole 6 GB file never has to be in memory at once
for chunk in pd.read_csv(DATA_DIR / 'data.csv', sep='\t', chunksize=CHUNKSIZE,
                         header=None, names=colnames):
    # Group each chunk by ID and append each group to that ID's output file,
    # writing the header only when the file is first created
    for ID, df in chunk.groupby('ID'):
        file = DATA_DIR / f'ID{ID}.csv'
        df.to_csv(file, mode='a', header=not file.exists(), na_rep='N/A', index=False)
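One caveat with mode='a': if output files are left over from a previous run, the new rows get appended to them and header=not file.exists() skips the header. A minimal cleanup sketch you could run once before the chunk loop, assuming the same DATA_DIR and the ID*.csv naming used above:
import pathlib

DATA_DIR = pathlib.Path('.')

# Delete output files from a previous run so the append-mode writes start fresh
for old_file in DATA_DIR.glob('ID*.csv'):
    old_file.unlink()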