I am trying to read in a large data set in chunks using pandas, aggregate the rows, append the aggregated chunks to a list, and then concatenate the list. I can't figure out why my list is empty.
Test Data
"Test 1", 1, 1, 1, 1, 1
"Test 1", 1, 2, 2, 2, 2
"Test 2", 2, 3, 3, 3, 3
"Test 2", 2, 4, 4, 3, 4
"Test 3", 0, 1, 2, 3, 4
"Test 4", 0, 1, 2, 3, 4
Code
### Test 2
import pandas as pd

cols_to_keep = [0, 1, 2, 3]
df_test = pd.read_csv("test.txt", sep=",", header=None, chunksize=2, usecols=cols_to_keep)
for chunk in df_test:
    print(chunk)
### Aggregate Chunks
chunk_list = []  # append each aggregated chunk df here

# Each chunk is in df format
for chunk in df_test:
    chunk_agg = chunk.groupby([0, 1]).agg('sum')
    chunk_list.append(chunk_agg)  # append aggregated chunk to list

df_test_concat = pd.concat(chunk_list)
print(df_test_concat)
As bernie mentioned in the comments of your question, you are consuming the contents of the TextFileReader object created when you call pd.read_csv().
This happens because TextFileReader objects exist so you don't have to read the full contents of the CSV file at once (some files may be gigabytes in size); the reader keeps the file open while reading its contents in chunks.
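You can see this with a minimal sketch (assuming the same test.txt from the question): the first loop consumes every chunk, so the body of the second loop never runs. Depending on your pandas version, the second pass may instead raise an error on the closed file.

import pandas as pd

reader = pd.read_csv("test.txt", sep=",", header=None, chunksize=2)

for chunk in reader:  # first pass consumes all the chunks
    pass

for chunk in reader:  # second pass: the reader is already exhausted
    print("never reached")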
When it finishes reading, it closes the file, and the variable df_test now points at the end of the file, not the beginning, so there is nothing left to iterate over. You have to call pd.read_csv() again to "reset" this pointer to the start of the file (it actually creates a new TextFileReader object and discards the old one).
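Applied to the code in your question, the fix is to either remove the first print loop or call pd.read_csv() a second time before aggregating. A minimal sketch of the second option:

import pandas as pd

cols_to_keep = [0, 1, 2, 3]

# Re-create the reader so iteration starts from the beginning of the file again
df_test = pd.read_csv("test.txt", sep=",", header=None, chunksize=2, usecols=cols_to_keep)

chunk_list = []
for chunk in df_test:
    chunk_agg = chunk.groupby([0, 1]).agg('sum')
    chunk_list.append(chunk_agg)

df_test_concat = pd.concat(chunk_list)
print(df_test_concat)

One caveat with chunk-by-chunk aggregation: if a group's rows happen to be split across two chunks, the concatenated result will contain two partial rows for that group, so a final df_test_concat.groupby(level=[0, 1]).sum() may be needed to merge them.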