I'm trying to build a small program that uses the censusgeo package to interact with the US Census Bureau API address batch facility. The API has a limit of 10,000 addresses in any single call, but my dataframe has approximately 3 million rows. I therefore want to split the dataframe into N parts of roughly 10,000 rows each, feed each one into the API call, extract the output, and append it all together.
I found this Stack Overflow post, which has been quite helpful in giving me a function to split my df. It doesn't return dataframes though (e.g. they don't show up if I run %who_ls DataFrame), and I don't know how to call the outputs individually in order to then feed them into an API call.
This is the function I'm using to split the dataframe:
import math

def split_dataframe(df, chunk_size=10000):
    # Collect consecutive row slices of at most chunk_size rows each
    chunks = list()
    num_chunks = math.ceil(len(df) / chunk_size)
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
How do I refer to the chunks that are returned from that function? And is the best way to proceed simply to loop over them and feed them into the API call? I.e. something like:
for i in chunks:
    censusgeocode --csv batch_i.csv
Or is there a smarter/more efficient way to do this?
Any pointers folks can give would be appreciated!
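For what it's worth, the function above returns a plain Python list whose elements are DataFrames, which would explain why %who_ls DataFrame shows nothing: the only variable you could have is the list itself, and only if you assign the return value to a name. A minimal sketch of an alternative that yields the chunks one at a time is below. The name iter_chunks and the toy demo dataframe are hypothetical, made up for illustration.

```python
import pandas as pd

def iter_chunks(df, chunk_size=10000):
    """Yield consecutive row slices of df, each at most chunk_size rows long."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

# Toy data standing in for the real 3-million-row dataframe (assumption).
demo = pd.DataFrame({"address": [f"{n} Main St" for n in range(25)]})

# Each item produced by the generator is itself a DataFrame.
chunk_lengths = [len(chunk) for chunk in iter_chunks(demo, chunk_size=10)]
chunk_types = {type(chunk).__name__ for chunk in iter_chunks(demo, chunk_size=10)}
```

Because the generator hands out one slice at a time, you can loop over it directly without keeping a separate list of all the pieces around.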
I think I've found a solution to my question. If I assign the function's return value to a variable, I can then access the different chunks it creates using standard indexing notation. E.g.
splits = split_dataframe(df, chunk_size=10000)
for i in range(len(splits)):
    print(len(splits[i]))
I'm sure there is a more elegant way to then pass these outputs into the API call, but this works for the time being.
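One possible shape for that next step, as a rough sketch only: loop over the chunks, pass each one to whatever function performs the batch call, and stack the results with pd.concat. The names geocode_in_batches and fake_geocode are hypothetical; fake_geocode is just a stand-in so the example runs without touching the API, and the real version would submit the chunk and return the geocoder's output as a DataFrame.

```python
import pandas as pd

def geocode_in_batches(df, geocode_batch, chunk_size=10000):
    """Apply geocode_batch to each chunk of df and stack the results."""
    results = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        results.append(geocode_batch(chunk))
    if not results:
        # pd.concat raises on an empty list, so handle an empty input here
        return pd.DataFrame()
    return pd.concat(results, ignore_index=True)

def fake_geocode(chunk):
    """Hypothetical stand-in for the real API call (assumption)."""
    out = chunk.copy()
    out["matched"] = True
    return out

# Toy data standing in for the real address dataframe (assumption).
addresses = pd.DataFrame({"id": range(25), "address": ["1 Main St"] * 25})
geocoded = geocode_in_batches(addresses, fake_geocode, chunk_size=10)
```

Collecting the per-chunk results in a list and concatenating once at the end tends to be cheaper than appending to a growing dataframe inside the loop.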